Hierarchical clustering
in Python & elsewhere
For @PyDataConf London, June 2015, by Frank Kelly
Data Scientist, Engineer @analyticsseo
@norhustla
Hierarchical Clustering
Theory: Origins & definitions; Methods & considerations; Hierarchical theory; Metrics & performance
Practice: My use case; Python libraries; Example
Visualisation: Static; Interactive; Further ideas
All opinions expressed are my own
Who am I?
Attribution: www.alexmaclean.com
Clustering: a recap
Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with one another based on some notion of similarity.
"SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons -
https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg
Origins
1930s: Anthropology & Psychology
http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
Diverse applications
Attribution: stack overflow, wikipedia, scikit-learn.org, http://www.poolparty.biz/
Two main purposes
Exploratory analysis – standalone tool
(Data mining)
As a component of a supervised learning
pipeline (in which distinct classifiers or
regression models are trained for each
cluster).
(Machine Learning)
Clustering considerations
Partitioning criteria (single / multi level)
Separation (exclusive / non-exclusive)
Clustering space (full-space / sub-space)
Similarity measure (distance / connectivity)
Use case: search keywords
[Diagram relating RD, P, CP, CD and KW nodes, with the labels "You", "The competition!" and "Opportunity!"]
CD = Competing domains
CP = Competitor’s pages
RD = Ranking domain
P = Your page
KW = Keyword
….x 100,000 !!
Use case: search keywords
…so we have found 100,000 new KWs – now what?
How do we summarise and present these to a client?
Clients’ questions…
• Do search categories in general
align with my website structure?
• Which categories of opportunity
keywords have the highest
search volume, bring the most
visitors, revenue etc.?
• Which keywords are not
relevant?
Website-like structure
Requirements
• Need: visual insights;
structure
• Allow targeting of
problem in hand
• May develop into a
semi-supervised solution
Options for text clustering?
• High-dimensional and sparse data set
• Values correspond to word frequencies
• Recommended methods include: hierarchical clustering, K-means with an appropriate distance measure, topic modelling (LDA, LSI), co-clustering
Hierarchical Clustering
bringing structure
2 types
Agglomerative
Divisive
Deterministic algorithms!
Attribution: Wikipedia
Agglomerative
Start with many
“singleton” clusters
…
Merge 2 at a time
continuously
…
Build a hierarchy
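The agglomerative loop described on this slide can be sketched in a few lines of plain Python. This is only an illustrative, deliberately naive version: the function name `naive_agglomerative`, the 1-D toy points and the choice of single-link distance are assumptions for the example, not taken from the talk.

```python
from itertools import combinations


def naive_agglomerative(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    # Start with one "singleton" cluster per point (clusters hold point indices).
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)

    def dist(a, b):
        # Single-link distance between clusters a and b (1-D points assumed).
        return min(abs(points[i] - points[j])
                   for i in clusters[a] for j in clusters[b])

    while len(clusters) > 1:
        # Pick the closest pair of clusters; ties go to the first pair found,
        # so the result is deterministic.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        # Merge the pair into a new cluster id: one level of the hierarchy.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


merges = naive_agglomerative([0.0, 1.0, 5.0, 5.5])
```

Each tuple in `merges` records which two clusters were joined and at what distance, which is the information a dendrogram is drawn from. Checking every pair at every step is what makes the naive approach expensive on large datasets.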
Divisive
Start with a huge “macro”
cluster
…
Iteratively split into 2
groups
…
Build a hierarchy
Agglomerative method:
Linkage types
• Single (similarity between the two most similar elements,
i.e. the nearest neighbours, of the two clusters)
• Complete (similarity between the two most dissimilar
elements of the two clusters)
Attribution: https://www.coursera.org/course/clusteranalysis
Agglomerative method:
Linkage types
• Average link (average of the similarity between
all inter-cluster pairs)
Computationally expensive (Na*Nb pairs per cluster pair)
• Trick: Centroid link (similarity
between the centroids of the two clusters)
Attribution: https://www.coursera.org/course/clusteranalysis
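As a hedged illustration of the linkage types above, SciPy's `scipy.cluster.hierarchy.linkage` accepts the method name as a string. The toy data and the choice of cutting each tree into two flat clusters are assumptions made for this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data (assumed for illustration): two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

labels_by_method = {}
for method in ("single", "complete", "average", "centroid"):
    # Z is the (n - 1) x 4 merge table: which clusters merged, at what distance,
    # and how many points the new cluster contains.
    Z = linkage(X, method=method)
    # Cut the hierarchy so that at most two flat clusters remain.
    labels_by_method[method] = fcluster(Z, t=2, criterion="maxclust")
```

On data this cleanly separated all four linkages recover the same two groups; on noisier data the choice of linkage can change the shape of the clusters considerably.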
Ward’s criterion
• Minimise a function: total in-cluster variance
• As defined by, e.g.:
• Once merged, then the SSE will increase
(cluster becomes bigger) by:
https://en.wikipedia.org/wiki/Ward's_method
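The formulas on this slide did not survive extraction. For reference, a standard statement of Ward's criterion under squared Euclidean distance is given below; the symbols (cluster C with mean \mu_C, clusters A and B with sizes n_A and n_B) are notation chosen here and not necessarily what the slide used.

```latex
% Within-cluster sum of squared errors for a cluster C
\mathrm{SSE}(C) = \sum_{x_i \in C} \lVert x_i - \mu_C \rVert^2

% Increase in total SSE caused by merging clusters A and B
\Delta(A, B)
  = \mathrm{SSE}(A \cup B) - \mathrm{SSE}(A) - \mathrm{SSE}(B)
  = \frac{n_A \, n_B}{n_A + n_B} \, \lVert \mu_A - \mu_B \rVert^2
```

At each step the pair of clusters with the smallest \Delta(A, B) is merged.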
Divisive clustering
• Top-down approach
• Criterion to split: Ward’s criterion
• Handling noise: Use a threshold to determine
the termination criteria
Attribution: https://www.coursera.org/course/clusteranalysis
Similarity measures
This will certainly influence the shape of the
clusters!
• Numerical: Use a variant of the Minkowski
distance (e.g. Manhattan / city block, Euclidean)
• Binary: Manhattan, Jaccard coefficient,
Hamming
• Text: Cosine similarity.
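A small sketch of some of the measures listed above, using `scipy.spatial.distance.pdist`. The two toy vectors per data type are assumptions for illustration; note that SciPy returns distances (dissimilarities), so smaller means more similar.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Numerical data: two points in the plane.
numeric = np.array([[0.0, 0.0], [3.0, 4.0]])
manhattan = pdist(numeric, metric="cityblock")[0]   # |3| + |4|
euclidean = pdist(numeric, metric="euclidean")[0]   # sqrt(3^2 + 4^2)

# Binary data: two presence/absence vectors over four features.
binary = np.array([[1, 1, 0, 0],
                   [1, 0, 1, 0]], dtype=bool)
jaccard = pdist(binary, metric="jaccard")[0]   # disagreements / features set in either
hamming = pdist(binary, metric="hamming")[0]   # fraction of positions that differ
```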
Cosine similarity
Represent a document by a bag of terms
Record the frequency of a particular term (word/ topic/ phrase)
If d1 and d2 are two term vectors,
we can calculate the similarity between them as cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
Attribution: https://www.coursera.org/course/clusteranalysis
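A minimal sketch of cosine similarity on bag-of-terms vectors. The three-word vocabulary and the two example phrases are made up for illustration, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import numpy as np


def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-frequency vectors."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    norm = np.linalg.norm(d1) * np.linalg.norm(d2)
    # An empty document has no direction; treat its similarity as 0.
    return float(d1 @ d2 / norm) if norm else 0.0


# Assumed vocabulary: ["cheap", "flights", "hotels"]
d1 = [1, 1, 0]   # "cheap flights"
d2 = [0, 1, 1]   # "flights hotels"
sim = cosine_similarity(d1, d2)   # one shared term out of two each: about 0.5
```

Because only the angle matters, a long document and a short one with the same term proportions score as identical, which is why the measure suits text.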
Text clustering: preparations
Gather word documents = keyword phrases
Aggregate search words with URL “words”
• Add features where possible
o I added URL words to my word set
• Stem words
o Choose the right stemmer – too severe can be bad
• Stop words
o NLTK tokeniser
o Scikit learn TF-IDF tokeniser
• Low frequency cut-off
o 2 => words appearing less than twice in whole corpus
• High frequency cut-off
o 0.5 => words that appear in more than 50% of documents
• N-grams
o Single words, bi-grams, tri-grams
• Beware of foreign languages
o Separate datasets if possible
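Several of the preparation steps above map onto parameters of scikit-learn's `TfidfVectorizer`. The sketch below assumes that the cut-offs 2 and 0.5 correspond to `min_df` and `max_df`; the six keyword phrases are invented, and stemming is not shown. Note that `min_df` counts the number of documents containing a term, which is close to, but not the same as, total occurrences in the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented keyword phrases standing in for the real "documents".
docs = [
    "cheap flights london",
    "cheap flights paris",
    "cheap hotels paris",
    "luxury hotels london",
    "luxury car hire",
    "car hire rome",
]

vectorizer = TfidfVectorizer(
    stop_words="english",   # drop common English stop words
    min_df=2,               # low-frequency cut-off: keep terms in at least 2 documents
    max_df=0.5,             # high-frequency cut-off: drop terms in more than 50% of documents
    ngram_range=(1, 3),     # single words, bi-grams and tri-grams
)
X = vectorizer.fit_transform(docs)   # sparse TF-IDF matrix, one row per phrase
vocabulary = sorted(vectorizer.vocabulary_)
```

Here "rome" is dropped because it occurs in a single phrase, while the bi-gram "cheap flights" survives because it occurs in two.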
Text preparation
Dimensionality
• Get a sparse matrix
o Mostly zeros
• Reduce the number of dimensions
o PCA
o Spectral clustering
• The “curse” of dimensionality
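One way to reduce a sparse matrix without converting it to a dense array is truncated SVD, which is closely related to PCA but skips the centring step. This is a swapped-in technique for illustration rather than the method used in the talk; the random matrix and the choice of 10 components are assumptions.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in for a TF-IDF matrix: 100 documents x 500 terms, about 2% non-zero.
X = sparse_random(100, 500, density=0.02, format="csr", random_state=0)

# Project onto 10 components; the input stays sparse throughout.
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X)

# Share of the variance retained by the 10 components.
retained = svd.explained_variance_ratio_.sum()
```

The dense, low-dimensional `X_reduced` can then be passed to a hierarchical clustering routine.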
Results: reduced dimensions
The dendrogram
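The slide's figure is not available here, so the following is only a sketch of how a dendrogram can be computed with SciPy. The four 1-D points, their labels and the use of Ward linkage are assumptions; `no_plot=True` returns the layout without requiring matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Four toy observations with one feature each.
X = np.array([[0.0], [1.0], [5.0], [5.5]])
Z = linkage(X, method="ward")

# Compute the tree layout only; drop no_plot to draw it with matplotlib.
tree = dendrogram(Z, no_plot=True, labels=["a", "b", "c", "d"])
leaf_order = tree["ivl"]        # leaf labels in left-to-right order
n_links = len(tree["icoord"])   # one U-shaped link per merge
```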
Assess the quality of your
clusters
• External (need ground-truth labels): Purity, completeness &
homogeneity, Adjusted Rand index, Normalised Mutual Information
• Internal (no labels needed): e.g. silhouette coefficient
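A hedged sketch of the label-based scores named above, using `sklearn.metrics`. The tiny hand-made label lists are assumptions; in practice the reference labels would come from a manually categorised sample.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    normalized_mutual_info_score,
)

truth = [0, 0, 0, 1, 1, 1]   # reference categories
good = [1, 1, 1, 0, 0, 0]    # same grouping, different cluster names
poor = [0, 0, 1, 1, 2, 2]    # splits both reference categories

good_scores = {
    "adjusted_rand": adjusted_rand_score(truth, good),
    "nmi": normalized_mutual_info_score(truth, good),
    "homogeneity": homogeneity_score(truth, good),
    "completeness": completeness_score(truth, good),
}
poor_ari = adjusted_rand_score(truth, poor)
```

All four scores ignore how the clusters are named, so `good` scores perfectly even though its label values are swapped relative to `truth`.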
Topic labelling
Hierarchical Clustering
Beyond Python (!?)
Life on the inside:
Elasticsearch
• Why not perform pre-processing and clustering
inside elasticsearch?
• Document store
• TF-IDF and other
• Stop words
• Language specific analysers
Elasticsearch
- try it ! -
• https://www.elastic.co/
• NoSQL document store
• Aggregations and stats
• Fast, distributed
• Quick to set up
Document storage in ES
Lingo 3G algorithm
• Lingo 3G: Hierarchical clustering off-the-shelf
• Built-in part of speech (POS)
• User-defined word/synonym/label dictionaries
• Built-in stemmer / word inflection database
• Multi-lingual support, advanced tuning
• Commercial: costs attached
http://download.carrotsearch.com/lingo3g/manual/#section.es
http://project.carrot2.org/algorithms.html
Elasticsearch with
clustering – Utopia?
Carrot2’s Lingo3G in action :
http://search.carrot2.org/stable/search
Foamtree visualisation example
Visualisation of hierarchical structure possible for
large datasets via “lazy loading”
http://get.carrotsearch.com/foamtree/demo/demos/large.html
Limitations of hierarchical
clustering
• Can’t undo what’s done: the divisive method works on
sub-clusters and cannot re-merge them; likewise the
agglomerative method never splits a cluster again once
it has been merged
• Every split or merge must be refined
• Methods may not scale well: checking all possible
pairs drives the complexity up
There are extensions: BIRCH, CURE and
CHAMELEON
Thank you!
A decent introductory course to clustering:
https://www.coursera.org/course/clusteranalysis
Hierarchical (agglomerative) clustering in Python:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Recent (ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc
Visualisation: http://carrotsearch.com/foamtree-overview
Clustering elsewhere (Lingo, Lingo3G) with
Carrot2: http://download.carrotsearch.com/
Elasticsearch: https://www.elastic.co/
Analytics SEO: http://www.analyticsseo.com/
Me: @norhustla / frank.kelly@cantab.net
Attribution: http://wynway.com/
Extra slide: Why work
inside the database?
1. Sharing data (management of)
Support concurrent access by multiple readers and writers
2. Data Model Enforcement
Make sure all applications see clean, organised data
3. Scale
Work with datasets too large to fit in memory (over a certain size,
need specialised algorithms to deal with the data -> bottleneck)
The database organises and exposes algorithms for you
conveniently
4. Flexibility
Use the data in new, unanticipated ways -> anticipate a broad set
of ways of accessing the data

 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 

Hierarchical clustering in Python and beyond

  • 1. Hierarchical clustering in Python & elsewhere For @PyDataConf London, June 2015, by Frank Kelly Data Scientist, Engineer @analyticsseo @norhustla
  • 2. Hierarchical Clustering: Theory, Practice, Visualisation • Theory: Origins & definitions; Methods & considerations; Hierarchical theory; Metrics & performance • Practice: My use case; Python libraries; Example • Visualisation: Static; Interactive; Further ideas • All opinions expressed are my own
  • 3. Who am I? All opinions expressed are my own
  • 5. Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with one another based on some notion of similarity. Attribution: "SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg
  • 7. Diverse applications Attribution: stack overflow, wikipedia, scikit-learn.org, http://www.poolparty.biz/
  • 8. Two main purposes Exploratory analysis – standalone tool (Data mining) As a component of a supervised learning pipeline (in which distinct classifiers or regression models are trained for each cluster). (Machine Learning)
  • 9. Clustering considerations • Partitioning criteria (single / multi level) • Separation (exclusive / non-exclusive) • Clustering space (full-space / sub-space) • Similarity measure (distance / connectivity)
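  • 10. Use case: search keywords [Diagram: "You" (a ranking domain, RD, with pages, P), "The competition!" (competing domains, CD, with competitors' pages, CP), the keywords (KW) attached to them, and an area marked "Opportunity!"] CD = Competing domains; CP = Competitor’s pages; RD = Ranking domain; P = Your page; KW = Keyword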
  • 12. Use case: search keywords …so we have found 100,000 new KWs. Now what? How do we summarise and present these to a client?
  • 13. Clients’ questions… • Do search categories in general align with my website structure? • Which categories of opportunity keywords have the highest search volume, bring the most visitors, revenue etc.? • Which keywords are not relevant?
  • 15. Requirements • Need: visual insights; structure • Allow targeting of problem in hand • May develop into a semi-supervised solution
  • 16. Options for text clustering? • High-dimensional and sparse data set • Values correspond to word frequencies • Recommended methods include: hierarchical clustering, K-means with an appropriate distance measure, topic modelling (LDA, LSI), co-clustering
  • 18. 2 types: agglomerative and divisive. Deterministic algorithms! Attribution: Wikipedia
  • 19. Agglomerative: Start with many “singleton” clusters … Merge 2 at a time continuously … Build a hierarchy • Divisive: Start with a huge “macro” cluster … Iteratively split into 2 groups … Build a hierarchy
  • 20. Agglomerative method: Linkage types • Single (similarity between the two most similar elements, i.e. nearest neighbours) • Complete (similarity between the two most dissimilar elements) Attribution: https://www.coursera.org/course/clusteranalysis
  • 23. Agglomerative method: Linkage types • Average link (average of the similarities between all inter-cluster pairs) • Computationally expensive (Na*Nb comparisons, where Na and Nb are the sizes of the two clusters) • Trick: Centroid link (similarity between the centroids of the two clusters) Attribution: https://www.coursera.org/course/clusteranalysis
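The linkage types on slides 20 and 23 can be tried side by side with SciPy's `scipy.cluster.hierarchy.linkage`. This is only a minimal sketch on made-up 2-D points, not the code used in the talk; note that SciPy works with distances rather than similarities, so "most similar" becomes "closest".

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated toy groups in 2-D (made-up data for illustration).
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

labels_by_method = {}
for method in ("single", "complete", "average", "centroid"):
    # Each row of `merges` records one merge: the two cluster ids joined,
    # the distance at which they were joined, and the new cluster's size.
    merges = linkage(points, method=method, metric="euclidean")
    # Cut the hierarchy so that at most two flat clusters remain.
    labels_by_method[method] = fcluster(merges, t=2, criterion="maxclust")
    print(method, labels_by_method[method], "final merge distance:", merges[-1, 2])
```

On data this clean all four methods recover the same two groups; the final merge distance differs because each linkage measures the gap between clusters differently.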
  • 24. Ward’s criterion • Minimise a function: total in-cluster variance • As defined by, e.g., the sum of squared errors (SSE) [formula image not preserved] • Once two clusters are merged, the SSE increases (the cluster becomes bigger) by: [formula image not preserved] https://en.wikipedia.org/wiki/Ward's_method
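The formulas on slide 24 did not survive in this transcript. The standard form of Ward's merge cost, which is assumed here, is that merging clusters A and B raises the total SSE by n_A * n_B / (n_A + n_B) times the squared Euclidean distance between their centroids. The sketch below checks that identity numerically on made-up points; the function names `sse` and `ward_cost` are illustrative only.

```python
import numpy as np

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    centroid = cluster.mean(axis=0)
    return float(((cluster - centroid) ** 2).sum())

def ward_cost(a, b):
    """Increase in total SSE caused by merging clusters a and b (Ward)."""
    n_a, n_b = len(a), len(b)
    gap = a.mean(axis=0) - b.mean(axis=0)
    return float(n_a * n_b / (n_a + n_b) * gap.dot(gap))

# Made-up clusters for illustration.
a = np.array([[0.0, 0.0], [0.0, 2.0]])
b = np.array([[4.0, 0.0], [4.0, 2.0], [4.0, 4.0]])

# SSE after the merge minus the SSE of the two separate clusters.
increase = sse(np.vstack([a, b])) - sse(a) - sse(b)
print(increase, ward_cost(a, b))
```

The two printed numbers agree up to floating-point error, which is why Ward's method can pick the cheapest merge without recomputing the full SSE for every candidate pair.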
  • 25. Divisive clustering • Top-down approach • Criterion to split: Ward’s criterion • Handling noise: Use a threshold to determine the termination criteria Attribution: https://www.coursera.org/course/clusteranalysis
  • 26. Similarity measures This will certainly influence the shape of the clusters! • Numerical: Use a variation of the Minkowski distance (e.g. Manhattan / city block, Euclidean) • Binary: Manhattan, Jaccard coefficient, Hamming • Text: Cosine similarity.
  • 27. Cosine similarity • Represent a document by a bag of terms • Record the frequency of each term (word / topic / phrase) • If d1 and d2 are two term vectors, we can calculate the similarity between them [formula image not preserved] Attribution: https://www.coursera.org/course/clusteranalysis
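Slide 27's recipe can be sketched in a few lines of standard-library Python. The formula assumed here is the usual one: the dot product of the two term-frequency vectors divided by the product of their Euclidean norms. Whitespace tokenisation and the example phrases are simplifications for illustration.

```python
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents using raw term counts."""
    # Bag of terms: lower-case, split on whitespace, count occurrences.
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is treated as similar to nothing.
    return dot / (norm_a * norm_b)

print(cosine_similarity("cheap flights london", "cheap hotels london"))
print(cosine_similarity("cheap flights london", "car hire paris"))
```

Because the measure depends only on the angle between the vectors, a long and a short keyword phrase with the same word proportions score as identical.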
  • 29. Gather word documents = keyword phrases
  • 30. Aggregate search words with URL “words”
  • 31. Text clustering: preparations • Add features where possible o I added URL words to my word set • Stem words o Choose the right stemmer: one that is too aggressive can hurt • Stop words o NLTK tokeniser o scikit-learn TF-IDF tokeniser • Low frequency cut-off o 2 => drop words appearing fewer than twice in the whole corpus • High frequency cut-off o 0.5 => drop words that appear in more than 50% of documents • N-grams o Single words, bi-grams, tri-grams • Beware of foreign languages o Separate datasets if possible
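Several of the preparation steps on slide 31 map onto parameters of scikit-learn's `TfidfVectorizer`. The sketch below is an assumption about how they could be wired up, not the talk's actual configuration, and the keyword phrases are made up. Two caveats: `min_df` counts documents containing a term rather than total occurrences in the corpus, and stemming is not built in, so it would need a custom tokeniser.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up keyword phrases standing in for the real search keywords.
keywords = [
    "cheap flights to london",
    "cheap flights to paris",
    "london hotel deals",
    "paris hotel deals",
    "car hire london airport",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # built-in English stop-word list
    ngram_range=(1, 3),    # single words, bi-grams and tri-grams
    min_df=2,              # low cut-off: keep terms found in at least 2 documents
    max_df=0.5,            # high cut-off: drop terms found in more than 50% of documents
)
tfidf = vectorizer.fit_transform(keywords)  # sparse documents-by-terms matrix

print(sorted(vectorizer.vocabulary_))
print(tfidf.shape)
```

In this toy corpus "london" appears in three of five phrases, so the high-frequency cut-off removes it, while terms that occur in only one phrase fall below the low-frequency cut-off.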
  • 33. Dimensionality • Get a sparse matrix o Mostly zeros • Reduce the number of dimensions o PCA o Spectral clustering • The “curse” of dimensionality
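One way to reduce the dimensions of a sparse term matrix is sketched below. It swaps in scikit-learn's `TruncatedSVD` (the basis of latent semantic analysis) rather than the PCA named on slide 33, because `TruncatedSVD` does not centre the data and therefore keeps the matrix sparse. The random matrix and the choice of 5 components are placeholders, not values from the talk.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Made-up sparse "documents x terms" matrix: roughly 90% zeros.
rng = np.random.default_rng(0)
dense = rng.random((20, 50))
dense[dense < 0.9] = 0.0
term_matrix = csr_matrix(dense)

# Project the 50 term dimensions down to 5 components.
svd = TruncatedSVD(n_components=5, random_state=0)
reduced = svd.fit_transform(term_matrix)

print(reduced.shape)
print("variance retained:", svd.explained_variance_ratio_.sum())
```

The reduced dense matrix can then be passed to a clustering routine in place of the original sparse one.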
  • 38. Assess the quality of your clusters • Internal: Purity, completeness & homogeneity • External: Adjusted Rand index, Normalised Mutual Information (NMI)
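Most of the scores named on slide 38 are available in `sklearn.metrics`, and each of the functions used below compares a clustering against reference labels. The label lists are made up for illustration; purity has no dedicated scikit-learn function, so it is left out of this sketch.

```python
from sklearn.metrics import (
    adjusted_rand_score,
    completeness_score,
    homogeneity_score,
    normalized_mutual_info_score,
)

# Made-up reference categories and cluster assignments for six keywords.
truth = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 1, 0, 0, 2]  # the second category is split across two clusters

scores = {
    "homogeneity": homogeneity_score(truth, predicted),
    "completeness": completeness_score(truth, predicted),
    "adjusted_rand": adjusted_rand_score(truth, predicted),
    "nmi": normalized_mutual_info_score(truth, predicted),
}
for name, value in scores.items():
    print(name, round(value, 3))
```

Here every cluster contains a single category, so homogeneity is perfect, while completeness is penalised because one category is spread over two clusters. The cluster ids themselves do not matter, only the grouping.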
  • 41. Life on the inside: Elasticsearch • Why not perform pre-processing and clustering inside elasticsearch? • Document store • TF-IDF and other • Stop words • Language specific analysers
  • 42. Elasticsearch - try it ! - • https://www.elastic.co/ • NoSQL document store • Aggregations and stats • Fast, distributed • Quick to set up
  • 44. Lingo 3G algorithm • Lingo 3G: Hierarchical clustering off-the-shelf • Built-in part of speech (POS) • User-defined word/synonym/label dictionaries • Built-in stemmer / word inflection database • Multi-lingual support, advanced tuning • Commercial: costs attached http://download.carrotsearch.com/lingo3g/manual/#section.es http://project.carrot2.org/algorithms.html
  • 45. Elasticsearch with clustering – Utopia? Carrot2’s Lingo3G in action : http://search.carrot2.org/stable/search Foamtree visualisation example Visualisation of hierarchical structure possible for large datasets via “lazy loading” http://get.carrotsearch.com/foamtree/demo/demos/large.html
  • 46. Limitations of hierarchical clustering • Can’t undo what’s done: the divisive method works on sub-clusters and cannot re-merge them; the same holds for the agglomerative method (once merged, clusters are never split again) • Every split or merge must be refined • Methods may not scale well: checking all possible pairs makes the complexity grow quickly • There are extensions: BIRCH, CURE and CHAMELEON
  • 47. Thank you! • A decent introductory course to clustering: https://www.coursera.org/course/clusteranalysis • Hierarchical (agglomerative) clustering in Python: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html • Recent (ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc • Visualisation: http://carrotsearch.com/foamtree-overview • Clustering elsewhere (Lingo, Lingo3G) with Carrot2: http://download.carrotsearch.com/ • Elasticsearch: https://www.elastic.co/ • Analytics SEO: http://www.analyticsseo.com/ • Me: @norhustla / frank.kelly@cantab.net Attribution: http://wynway.com/
  • 48. Extra slide: Why work inside the database? 1. Sharing data (management of): support concurrent access by multiple readers and writers. 2. Data model enforcement: make sure all applications see clean, organised data. 3. Scale: work with datasets too large to fit in memory (over a certain size, specialised algorithms are needed to deal with the data -> bottleneck); the database organises and exposes algorithms for you conveniently. 4. Flexibility: use the data in new, unanticipated ways -> anticipate a broad set of ways of accessing the data.

Editor's notes

  1. http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
  2. https://upload.wikimedia.org/wikipedia/commons/3/39/Swiss_complete.png