This document outlines an Edureka webinar on applications of clustering in real life. The webinar instructor is Kumaran Ponnambalam. The objectives are to understand data science applications and prospects, machine learning categories, clustering and k-means clustering. Examples of clustering applications include wine recommendation, pizza delivery optimization, and news summarization. K-means clustering is demonstrated on pizza delivery location data. The webinar also discusses data science job trends and covers 10 modules on data science topics including machine learning techniques in R.
Application of Clustering in Data Science using Real-life Examples
1. www.edureka.in/data-science
Data Science Webinar Series:
Applications of Clustering in Real Life
View Data Science Courses at : www.edureka.in/data_science
Twitter @edurekaIN, Facebook /edurekaIN, use #AskEdureka for Questions
2. Slide 2
Meet Your Instructor
Mr. Kumaran Ponnambalam
• Director, Data Engineering & PS, Transera Inc, San Francisco Bay Area
4. Slide 4
Objectives
At the end of this session, you will be able to:
• Understand Data Science Applications and Prospects
• Get an overview of Machine Learning
• Understand the difference between Supervised and Unsupervised Learning
• Learn Clustering and K-means Clustering
• Implement K-means clustering in R
10. Slide 10
What’s Common in these Applications?
According to Wikipedia: Data science is the study of the generalizable extraction of knowledge from data, yet the key word is science.
These scenarios involve:
• Storing, organizing and integrating huge amounts of unstructured data
• Processing and analyzing the data
• Extracting knowledge and insights from the data, and predicting the future from it
Storage of big data is done in Hadoop. For more details on Hadoop, please refer to the Big Data and Hadoop blog: http://www.edureka.in/blog/category/big-data-and-hadoop/
Processing, analyzing, and extracting knowledge and insights are done through Machine Learning.
11. Slide 11
Data Science: Demand Supply Gap
Filled vs. unfilled jobs in big data (%)
Role                          Filled   Unfilled
Big Data Analyst              50       50
Big Data Architect            43       57
Big Data Engineer             44       56
Big Data Research Analyst     31       69
Big Data Visualizer           23       77
Data Scientist                18       82
Gartner Says Big Data Creates Big Jobs: 4.4 Million IT Jobs Globally to Support Big Data By 2015
http://www.gartner.com/newsroom/id/2207915
13. Slide 13
Machine Learning Categories
Types of Learning:
• Supervised Learning: inferring a function from labelled training data.
• Unsupervised Learning: trying to find hidden structure in unlabelled data.
14. Slide 14
Machine Learning Categories
What category do the applications below fall into?
[Figure: four example applications, two labelled Supervised Learning and two labelled Unsupervised Learning]
15. Slide 15
Common Machine Learning Algorithms
Types of Learning:
• Supervised Learning algorithms: Naïve Bayes, Support Vector Machines, Random Forests, Decision Trees
• Unsupervised Learning algorithms: K-means, Fuzzy Clustering, Hierarchical Clustering
17. Slide 17
Clustering: Scenarios
Clustering can be applied in the following scenarios:
• A telephone company needs to set up its network by placing towers in a region it has acquired. A clustering algorithm can find tower locations such that all its users receive optimum signal strength.
• The Miami DEA wants to make its law enforcement more stringent and has decided to station its patrol vans across the area so that high-crime areas are in the vicinity of a patrol van.
• A hospital care chain wants to open a series of emergency-care wards, keeping in mind the most accident-prone areas in a region.
18. Slide 18
Some More Use-Cases of Clustering
• Organizing data into clusters shows the internal structure of the data
  Ex. Clusty and clustering genes
• Sometimes the partitioning is the goal
  Ex. Market segmentation
• Preparing data for other AI techniques
  Ex. Summarize news (cluster and then find centroid)
• Discovery in data
  Ex. Underlying rules, recurring patterns, topics, etc.
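One possible reading of the "cluster and then find centroid" idea is to represent a cluster by the item that lies closest to its centroid. The Python sketch below illustrates that reading only; the webinar's own demo is in R. The headlines, the 2-D feature vectors and the function names `centroid` and `most_central` are all hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length numeric vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def most_central(items, vectors):
    """Return the item whose vector lies closest to the cluster centroid."""
    centre = centroid(vectors)
    best = min(range(len(items)), key=lambda i: math.dist(vectors[i], centre))
    return items[best]

# Hypothetical cluster of three headlines with made-up 2-D feature vectors.
headlines = [
    "Storm hits coast",
    "Coastal storm causes flooding",
    "Flood warnings issued",
]
vectors = [(0.9, 0.1), (0.6, 0.5), (0.1, 0.9)]

# The headline nearest the centroid stands in for the whole cluster.
summary = most_central(headlines, vectors)
print(summary)
```

In practice the vectors would come from a text representation such as TF-IDF rather than being written by hand.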
19. Slide 19
What is Clustering?
Organizing data into clusters such that there is:
• High intra-cluster similarity
• Low inter-cluster similarity
Informally, finding natural groupings among objects
http://en.wikipedia.org/wiki/Cluster_analysis
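The two criteria can be checked numerically by treating distance as the inverse of similarity: a good grouping has small distances within clusters and large distances between them. The following is a minimal Python sketch (the webinar's demo itself uses R); the points and the helper `mean_distance` are hypothetical, and Euclidean distance is an assumed choice of measure.

```python
import math
from itertools import combinations, product

def mean_distance(pairs):
    """Average Euclidean distance over a collection of (p, q) point pairs."""
    distances = [math.dist(p, q) for p, q in pairs]
    return sum(distances) / len(distances)

# Hypothetical 2-D points, already split into two groups for illustration.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)]
cluster_b = [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]

# Intra-cluster: pairs drawn from the same cluster.
intra = mean_distance(
    list(combinations(cluster_a, 2)) + list(combinations(cluster_b, 2))
)
# Inter-cluster: pairs with one point from each cluster.
inter = mean_distance(list(product(cluster_a, cluster_b)))

print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

A small intra-cluster distance relative to the inter-cluster distance corresponds to high intra-cluster similarity and low inter-cluster similarity.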
21. Slide 21
K-Means Clustering
The process by which objects are classified into a number of groups so that they are as dissimilar as possible from one group to another, but as similar as possible within each group.
• The objects in group 1 should be as similar as possible.
• But there should be a large difference between an object in group 1 and an object in group 2.
• The attributes of the objects are allowed to determine which objects should be grouped together.
[Figure: total population divided into Group 1, Group 2, Group 3 and Group 4]
23. Slide 23
K-Means: Pizza Hut Clustering Example
Let us suppose the following points are the delivery locations for pizza.
24. Slide 24
K-Means: Pizza Hut Clustering Example
Let’s place three cluster centres at random.
[Figure: delivery locations with cluster centres C1, C2 and C3]
25. Slide 25
K-Means: Pizza Hut Clustering Example
Find the distance from each point to each cluster centre, as shown.
[Figure: delivery locations with cluster centres C1, C2 and C3]
26. Slide 26
K-Means: Pizza Hut Clustering Example
Assign the points to the nearest cluster centres based on the distance between each centre and the points.
[Figure: delivery locations with cluster centres C1, C2 and C3]
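The assignment step can be sketched in a few lines. This is an illustrative Python version, not the R code from the webinar demo; the coordinates are made-up delivery locations, `assign_to_nearest` is a hypothetical helper, and Euclidean distance is assumed.

```python
import math

def assign_to_nearest(points, centres):
    """Return, for each point, the index of the closest centre (Euclidean)."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centres]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical delivery locations and starting cluster centres.
points = [(1, 1), (2, 1), (8, 9), (9, 8), (5, 1), (6, 2)]
centres = [(1, 2), (9, 9), (5, 2)]  # C1, C2, C3

labels = assign_to_nearest(points, centres)
for p, label in zip(points, labels):
    print(p, "-> C" + str(label + 1))
```

Each label is the position of the nearest centre in the `centres` list, so label 0 corresponds to C1.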
27. Slide 27
K-Means: Pizza Hut Clustering Example
Move each cluster centre to the mean of the points assigned to it, then locate the nearest points again.
[Figure: delivery locations with cluster centres C1, C2 and C3]
28. Slide 28
K-Means: Pizza Hut Clustering Example
Move the cluster centres again, recalculate the distances and locate the nearest points. Repeat until the assignments stop changing.
[Figure: delivery locations with cluster centres C1, C2 and C3]
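Putting the steps together gives the usual k-means loop: pick starting centres, assign points, move the centres, and repeat until nothing changes. The sketch below is in Python rather than the R used in the webinar demo. It assumes Euclidean distance and draws the starting centres from the data points themselves, which is one common way to "locate centres randomly"; the data, the function name `kmeans` and its parameters are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of coordinate tuples; returns (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # assumed initialisation: k random data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centres[j])) for p in points
        ]
        if new_labels == labels:  # no point changed cluster, so stop
            break
        labels = new_labels
        # Update step: move each centre to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centres, labels

# Hypothetical delivery locations forming three loose groups.
points = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9), (5, 1), (6, 2), (6, 1)]
centres, labels = kmeans(points, k=3)
print(centres)
print(labels)
```

The result depends on the random starting centres, so different seeds can give different groupings; running the algorithm several times and keeping the best result is a common precaution.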
31. Slide 31
Let’s Look at Another Scenario
Now let us consider another clustering scenario: the data from “Google page rank”.
Notice that the data given here are sentences, not vectors.
Can we apply K-means clustering to it?
To analyze this type of data we use the “TF-IDF algorithm”, which converts the sentences to vectors.
We will take a deep dive into TF-IDF in module 3 of this course.
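To make the sentence-to-vector step concrete, here is a minimal Python sketch of one common TF-IDF variant (the webinar itself works in R). It assumes whitespace tokenisation, term frequency as a word's share of its sentence, and inverse document frequency as ln(N / df); other weightings exist. The sentences and the function name `tfidf_vectors` are hypothetical, and every sentence is assumed to be non-empty.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return (vocabulary, vectors) with one TF-IDF vector per sentence."""
    docs = [s.lower().split() for s in sentences]
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: number of sentences containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    idf = {word: math.log(n_docs / df[word]) for word in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[word] / len(doc) * idf[word] for word in vocab])
    return vocab, vectors

# Hypothetical sentences standing in for the text data on the slide.
sentences = ["the cat sat", "the dog barked", "the cat chased the dog"]
vocab, vectors = tfidf_vectors(sentences)
print(vocab)
for vector in vectors:
    print([round(weight, 3) for weight in vector])
```

Once every sentence is a numeric vector of the same length, the distance-based steps of K-means can be applied to it. Words that occur in every sentence, such as "the" here, receive a weight of zero under this variant.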
32. Slide 32
Demo
More Information on R setup and applications at:
http://www.edureka.in/blog/category/business-analytics-with-r/
33. Slide 33
Course Topics
Module 1
» Introduction to Data Science
Module 2
» Basic Data Manipulation using R
Module 3
» Machine Learning Techniques using R Part 1
- Clustering
- TF-IDF and Cosine Similarity
- Association Rule Mining
Module 4
» Machine Learning Techniques using R Part 2
- Supervised and Unsupervised Learning
- Decision Tree Classifier
Module 5
» Machine Learning Techniques using R Part 3
- Random Forest Classifier
- Naïve Bayes Classifier
Module 6
» Introduction to Hadoop Architecture
Module 7
» Integrating R with Hadoop
Module 8
» Mahout Introduction and Algorithm Implementation
Module 9
» Additional Mahout Algorithms and Parallel Processing in R
Module 10
» Project
34. Slide 34
Questions?
Enroll for the Complete Course at : www.edureka.in/data_science
Please don’t forget to fill in the survey report.
Class Recording and Presentation will be available in 24 hours at:
http://www.edureka.in/blog/application-of-clustering-in-data-science-using-real-life-examples/
Editor's Notes
Netflix uses 1 petabyte to store the videos for streaming.
BitTorrent Sync has transferred over 30 petabytes of data since its pre-alpha release in January 2013.
The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for the rendering of the 3D CGI effects.
One petabyte of average MP3-encoded songs (for mobile, roughly one megabyte per minute) would take about 2000 years to play.