Application of Clustering in Data Science using Real-life Examples

www.edureka.in/data-science
Data Science Webinar Series:
Applications of Clustering in Real Life
View Data Science Courses at : www.edureka.in/data_science
Twitter @edurekaIN, Facebook /edurekaIN, use #AskEdureka for Questions

Meet Your Instructor
Mr. Kumaran Ponnambalam
• Director, Data Engineering & PS, Transera Inc., San Francisco Bay Area

Objectives
At the end of this session, you will be able to:
• Understand Data Science applications and prospects
• Get an overview of Machine Learning
• Understand the difference between Supervised and Unsupervised Learning
• Learn Clustering and K-means Clustering
• Implement K-means clustering in R

Data Science Applications: Wine Recommendation

Data Science Applications: Pizza Hut

Data Science Applications: Netflix

Data Science Applications: Summarize News

How about this?

What’s Common in these Applications?
According to Wikipedia: Data science is the study of the generalizable extraction of knowledge from data, yet the key word is science.
These scenarios involve:
• Storing, organizing and integrating huge amounts of unstructured data
• Processing and analyzing the data
• Extracting knowledge and insights, and predicting the future from the data
Storage of big data is done in Hadoop. For more details on Hadoop, please refer to the Big Data and Hadoop blog: http://www.edureka.in/blog/category/big-data-and-hadoop/
Processing, analyzing, and extracting knowledge and insights are done through Machine Learning.

Data Science: Demand Supply Gap
Filled vs. unfilled jobs in big data, Vacancy/Filled (%)
Role                         Filled   Unfilled
Big Data Analyst             50       50
Big Data Architect           43       57
Big Data Engineer            44       56
Big Data Research Analyst    31       69
Big Data Visualizer          23       77
Data Scientist               18       82
Gartner Says Big Data Creates Big Jobs: 4.4 Million IT Jobs Globally to Support Big Data By 2015
http://www.gartner.com/newsroom/id/2207915

Data Science: Job Trends

Machine Learning Categories
Types of Learning
• Supervised Learning: inferring a function from labelled training data.
• Unsupervised Learning: trying to find hidden structure in unlabelled data.
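
The difference between the two categories can be sketched in a few lines of Python (the session's own demo uses R). This is only an illustration on made-up 2-D points: the helper names `predict_1nn` and `group_by_gap`, the labels and the threshold are all assumptions, and the nearest-neighbour rule is a stand-in for any supervised algorithm.

```python
import math

# Supervised: every training point comes with a label (hypothetical toy data).
labelled = [((1.0, 1.0), "near"), ((1.2, 0.8), "near"),
            ((8.0, 9.0), "far"), ((9.0, 8.5), "far")]

def predict_1nn(point, training):
    """Return the label of the closest labelled training point."""
    nearest = min(training, key=lambda pair: math.dist(point, pair[0]))
    return nearest[1]

# Unsupervised: the same kind of points, but with no labels at all.
unlabelled = [(1.0, 1.0), (1.2, 0.8), (8.0, 9.0), (9.0, 8.5)]

def group_by_gap(points, threshold):
    """Add a point to the first group whose first member is closer than `threshold`."""
    groups = []
    for p in points:
        for g in groups:
            if math.dist(p, g[0]) < threshold:
                g.append(p)
                break
        else:
            groups.append([p])  # no close group found: start a new one
    return groups

predicted = predict_1nn((1.1, 0.9), labelled)      # uses the known labels
groups = group_by_gap(unlabelled, threshold=3.0)   # finds structure without labels
print(predicted)
print(groups)
```

The supervised call needs the labels to answer; the unsupervised call only looks at how the points sit relative to each other.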

Machine Learning Categories
What category do the applications below fall into?
Supervised Learning Supervised Learning
Unsupervised Learning Unsupervised Learning

Common Machine Learning Algorithms
Types of Learning
Supervised Learning algorithms:
• Naïve Bayes
• Support Vector Machines
• Random Forests
• Decision Trees
Unsupervised Learning algorithms:
• K-means
• Fuzzy Clustering
• Hierarchical Clustering

Clustering

Clustering: Scenarios
Clustering can be applied in the following scenarios:
• A telephone company needs to establish its network by placing towers in a region it has acquired. A clustering algorithm can find tower locations such that all its users receive optimum signal strength.
• The Miami DEA wants to make its law enforcement more stringent and has decided to station its patrol vans across the area so that high-crime areas are in the vicinity of the patrol vans.
• A hospital care chain wants to open a series of emergency-care wards, taking into account the most accident-prone areas in a region.

Some More Use-Cases of Clustering
• Organizing data into clusters shows the internal structure of the data
Ex. Clusty and clustering genes
• Sometimes the partitioning is the goal
Ex. Market segmentation
• Prepare for other AI techniques
Ex. Summarize news (cluster and then find centroid)
• Discovery in data
Ex. Underlying rules, recurring patterns, topics, etc.

What is Clustering?
Organizing data into clusters such that there is:
• High intra-cluster similarity
• Low inter-cluster similarity
• Informally, finding natural groupings among objects
http://en.wikipedia.org/wiki/Cluster_analysis
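
One way to make "high intra-cluster similarity, low inter-cluster similarity" concrete is to compare average distances. The Python sketch below (the session's demo uses R) assumes that similarity is measured by Euclidean distance, so a small distance means high similarity; the two groups and the function names are made up for illustration.

```python
import math
from itertools import combinations, product

def mean_intra_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy groups.
group_1 = [(1.0, 1.0), (1.5, 1.0), (1.0, 1.5)]
group_2 = [(8.0, 8.0), (8.5, 8.0), (8.0, 8.5)]

intra = mean_intra_distance(group_1)
inter = mean_inter_distance(group_1, group_2)
print(intra, inter)
```

For a good grouping, the intra-cluster figure should be much smaller than the inter-cluster figure.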

K-Means Clustering

K-Means Clustering
The process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another, but as similar as possible within each group.
The objects in group 1 should be as similar as possible.
But there should be a large difference between an object in group 1 and an object in group 2.
The attributes of the objects are allowed to determine which objects should be grouped together.
[Figure: total population divided into Group 1, Group 2, Group 3 and Group 4]

K-Means: Pizza Hut Clustering Example

Suppose the following points are the delivery locations for pizza.

Let's place three cluster centres (C1, C2 and C3) at random.

Find the distance from each point to each of the centres C1, C2 and C3.

Assign each point to its nearest cluster centre (C1, C2 or C3), based on the distance between the point and each centre.
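
The distance and assignment steps can be sketched in Python as follows (the session's demo uses R). The coordinates are invented stand-ins for the delivery locations, Euclidean distance is assumed, and `assign_to_nearest` is a hypothetical helper name.

```python
import math

def assign_to_nearest(points, centres):
    """Return, for each point, the index of the closest centre."""
    return [min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
            for p in points]

# Hypothetical delivery locations and three starting centres.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 9.0)]
centres = [(2.0, 2.0), (8.0, 9.0), (0.0, 8.0)]  # C1, C2, C3

labels = assign_to_nearest(points, centres)
print(labels)  # index 0 means C1, 1 means C2, 2 means C3
```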

Move each cluster centre (C1, C2, C3) to the mean of the points assigned to it, then locate the nearest points again.
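
A minimal Python sketch of moving the centres (the session's demo uses R). It assumes each centre moves to the mean of its assigned points; leaving the centre of an empty cluster where it was is one possible convention, not something the slides specify. The data and the name `update_centres` are illustrative.

```python
def update_centres(points, labels, centres):
    """Move each centre to the mean of the points currently assigned to it."""
    new_centres = []
    for i, old in enumerate(centres):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centres.append(old)  # assumption: an empty cluster keeps its centre
            continue
        # zip(*members) groups the coordinates axis by axis.
        new_centres.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centres

# Hypothetical points, their current assignments, and the current centres.
points = [(1.0, 1.0), (2.0, 3.0), (8.0, 8.0), (10.0, 10.0)]
labels = [0, 0, 1, 1]
centres = [(0.0, 0.0), (9.0, 9.0), (5.0, 5.0)]

new_centres = update_centres(points, labels, centres)
print(new_centres)
```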

Repeat: re-compute the cluster centres (C1, C2, C3), calculate the distances, and re-assign each point to its nearest centre.

Once the assignments stop changing, the three clusters (C1, C2, C3) are formed.
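
Putting the steps together, the whole procedure might look like the Python sketch below (the session's demo uses R, where the built-in `kmeans` function does this work). The starting centres are drawn at random from the data, Euclidean distance is assumed, and the loop stops when no point changes cluster; the data, the seed and the iteration cap are arbitrary choices for illustration.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: random starting centres, then alternate assign and update."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # pick k data points as starting centres
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [min(range(k), key=lambda i: math.dist(p, centres[i]))
                      for p in points]
        if new_labels == labels:
            break  # nothing moved, so the clusters are formed
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centres[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centres, labels

# Hypothetical delivery locations.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
          (1.0, 9.0), (2.0, 8.5), (1.5, 9.5)]

centres, labels = kmeans(points, k=3)
print(centres)
print(labels)
```

Because the starting centres are random, different seeds can give different clusterings; in practice the algorithm is often run several times and the best result kept.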

The Elbow Curve
Elbow method
[Figure: elbow curve; vertical axis: objective function value, i.e., distortion]
The value of k should be such that, even if we increase k further, the distortion remains nearly constant. This is the ideal value of k for the clusters created.
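
The elbow method can be sketched in Python by running k-means for several values of k and recording the distortion each time (the session's demo uses R). Distortion is taken here to be the sum of squared distances from each point to its cluster centre, which is an assumption about the objective function on the slide. To keep the example repeatable, the first k points are used as starting centres instead of random ones.

```python
import math

def kmeans(points, k, max_iter=100):
    """Compact k-means; starts from the first k points (a simplification)."""
    centres = list(points[:k])
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: math.dist(p, centres[i]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centres[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centres, labels

def distortion(points, centres, labels):
    """Sum of squared distances from each point to its assigned centre."""
    return sum(math.dist(p, centres[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]

distortions = {}
for k in range(1, len(points) + 1):
    centres, labels = kmeans(points, k)
    distortions[k] = distortion(points, centres, labels)
    print(k, round(distortions[k], 2))
```

On this toy data the distortion falls sharply from k = 1 to k = 2 and changes little afterwards, so the elbow sits at k = 2.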

Let's Look at Another Scenario
Now let us consider another clustering scenario: the data from "Google page rank".
Notice that the data given here are sentences, not vectors.
Can we apply K-means clustering to it?
To analyze this type of data we use the "TF-IDF algorithm", which converts these attributes to vectors.
We will take a deep dive into TF-IDF in module 3 of this course.
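
As a rough Python illustration of how sentences become vectors (the session's demo uses R), the sketch below uses one common TF-IDF variant: term frequency is the word count divided by the sentence length, and inverse document frequency is log(N / df). The course may use a different weighting; the sentences and the function name `tfidf_vectors` are made up.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Turn each sentence into a vector with one TF-IDF weight per vocabulary word."""
    docs = [s.lower().split() for s in sentences]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many sentences does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([(counts[w] / len(doc)) * math.log(n_docs / df[w])
                        for w in vocab])
    return vocab, vectors

# Hypothetical sentences.
sentences = ["the web page rank", "the page links", "the pizza delivery"]
vocab, vectors = tfidf_vectors(sentences)
print(vocab)
for vector in vectors:
    print([round(x, 3) for x in vector])
```

A word that appears in every sentence, such as "the" here, gets a weight of zero, while words specific to one sentence get the largest weights. The resulting vectors can then be fed to K-means like any other numeric data.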

Demo
More Information on R setup and applications at:
http://www.edureka.in/blog/category/business-analytics-with-r/

Course Topics
• Module 1
» Introduction to Data Science
• Module 2
» Basic Data Manipulation using R
• Module 3
» Machine Learning Techniques using R Part 1
- Clustering
- TF-IDF and Cosine Similarity
- Association Rule Mining
• Module 4
- Supervised and Unsupervised Learning
- Decision Tree Classifier
• Module 5
- Random Forest Classifier
- Naïve Bayes Classifier
• Module 6
» Introduction to Hadoop Architecture
• Module 7
» Integrating R with Hadoop
• Module 8
» Mahout Introduction and Algorithm Implementation
• Module 9
» Additional Mahout Algorithms and Parallel Processing in R
• Module 10
» Project

Questions?
Enroll for the Complete Course at : www.edureka.in/data_science
Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
Please don't forget to fill in the survey report.
Class Recording and Presentation will be available in 24 hours at:
http://www.edureka.in/blog/application-of-clustering-in-data-science-using-real-life-examples/
