Cluster stability
Nees Jan van Eck and Ludo Waltman
Centre for Science and Technology Studies (CWTS), Leiden University
Workshop “Comparison of Algorithms”, Amsterdam
April 20, 2015
Problem statement
• A clustering technique can be used to obtain highly
detailed clustering results (i.e., a large number of
clusters)
• A clustering technique can be used to force each
publication to be assigned to a cluster
• However, in a highly detailed clustering, is the
assignment of publications to clusters still
meaningful?
• The assignment of a publication to a cluster may be
based on very little information (e.g., a single
citation relation)
Example: Waltman and Van Eck (2012)
Cluster stability
• To ensure that publications are assigned to clusters
in a meaningful way, we introduce the notion of
stable clusters
• Essentially, a cluster is stable if it is insensitive to
small changes in the underlying data
• Bootstrapping is used to make small changes in the
data
• There is no formal statistical framework
• To some extent, this resembles the stability
intervals in the CWTS Leiden Ranking
Identification of stable clusters: Step 1
• Collect the citation network of publications
• Create a large number (e.g., 100) of bootstrap
citation networks
• In each bootstrap citation network, perform
clustering:
– Clustering technique of Waltman and Van Eck (2012)
– User-defined resolution parameter
– Smart local moving algorithm of Waltman and Van Eck (2013)
• For each pair of publications, calculate the
proportion of the bootstrap clustering results in
which the publications are in the same cluster
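The last bullet of Step 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `co_clustering_proportions` and the input format (one dict per bootstrap run, mapping a publication id to a cluster label) are assumptions made here, and the clustering of each bootstrap network is taken as given rather than computed.

```python
from collections import Counter
from itertools import combinations


def co_clustering_proportions(clusterings):
    """Return {(pub_a, pub_b): proportion of runs in which both share a cluster}.

    `clusterings` is a list with one dict per bootstrap run, each mapping a
    publication id to a cluster label (hypothetical input format).
    Pairs that never share a cluster are omitted (their proportion is 0).
    """
    counts = Counter()
    for assignment in clusterings:
        # Group publications by cluster label within this run.
        members = {}
        for pub, label in assignment.items():
            members.setdefault(label, []).append(pub)
        # Count every unordered pair inside each cluster once per run.
        for pubs in members.values():
            for a, b in combinations(sorted(pubs), 2):
                counts[(a, b)] += 1
    n_runs = len(clusterings)
    return {pair: count / n_runs for pair, count in counts.items()}


# Four toy bootstrap clustering results for three publications.
runs = [
    {"p1": 0, "p2": 0, "p3": 1},
    {"p1": 0, "p2": 0, "p3": 0},
    {"p1": 5, "p2": 5, "p3": 7},
    {"p1": 1, "p2": 2, "p3": 2},
]
proportions = co_clustering_proportions(runs)
```

Cluster labels only need to be consistent within a run, not across runs, because only co-membership is counted. Enumerating all pairs inside a cluster is quadratic in cluster size, so this sketch is practical mainly at high resolution, where clusters are small.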
Identification of stable clusters: Step 2
• Create a network of publications with an edge
between two publications if the publications are in
the same cluster in at least a certain proportion
(e.g., 0.9) of the bootstrap clustering results
• Identify connected components in the newly
created network
• Each connected component represents a stable
cluster
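Step 2 amounts to thresholding the pairwise proportions and finding connected components. The sketch below uses a small union-find structure for this. The function name `stable_clusters`, the input format (a dict keyed by publication pairs), and the example values are assumptions made for illustration only.

```python
def stable_clusters(publications, proportions, threshold=0.9):
    """Return the connected components of the thresholded co-clustering network.

    `proportions` maps a pair (pub_a, pub_b) to the share of bootstrap
    clustering results in which the two publications share a cluster.
    """
    parent = {pub: pub for pub in publications}

    def find(pub):
        # Follow parent links to the root, halving the path on the way.
        while parent[pub] != pub:
            parent[pub] = parent[parent[pub]]
            pub = parent[pub]
        return pub

    for (a, b), proportion in proportions.items():
        if proportion >= threshold:  # an edge exists in the new network
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    components = {}
    for pub in publications:
        components.setdefault(find(pub), set()).add(pub)
    # Largest components first; ties are broken by smallest member id.
    return sorted(components.values(), key=lambda c: (-len(c), min(c)))


# Hypothetical proportions for five publications.
example = {
    ("A", "B"): 0.95,
    ("B", "C"): 0.92,
    ("A", "C"): 0.40,
    ("D", "E"): 0.85,
}
clusters = stable_clusters(["A", "B", "C", "D", "E"], example, threshold=0.9)
```

In this example A and C end up in the same stable cluster through B, even though they are directly co-clustered in only 40% of the runs; this chaining is a property of connected components. D and E fall below the threshold and remain as single-publication components.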
Non-parametric bootstrapping
• Sample with replacement from the set of all citation
relations between publications
• Make sure to obtain a sample that is of the same
size as the original set of citation relations
• Some citation relations will occur multiple times in
the sample, others won’t occur in it at all
• Based on the sampled citation relations, create a
bootstrap citation network
• Edges have integer weights in this network
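The non-parametric procedure above can be sketched with the Python standard library. The function name `nonparametric_bootstrap` and the representation of citation relations as (citing, cited) tuples are assumptions of this sketch.

```python
import random
from collections import Counter


def nonparametric_bootstrap(citations, rng):
    """Resample citation relations with replacement.

    Returns a Counter mapping each sampled relation to its integer weight,
    i.e. the number of times it was drawn. Relations that were never drawn
    are simply absent from the bootstrap network.
    """
    # Draw exactly as many relations as the original set contains.
    sample = rng.choices(citations, k=len(citations))
    return Counter(sample)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
citations = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
weights = nonparametric_bootstrap(citations, rng)
```

By construction, the weights in the bootstrap network always sum to the number of citation relations in the original set.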
Parametric bootstrapping
• A bootstrap citation network is a weighted variant
of the original citation network, with each edge
having an integer weight drawn from a Poisson
distribution with mean 1 (cf. Rosvall & Bergstrom,
2009)
• Total edge weight in the bootstrap citation network
will be approximately equal to the number of edges
in the original network
• For large networks, parametric and non-parametric
bootstrapping coincide
• We use parametric bootstrapping
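A minimal standard-library sketch of the parametric variant follows. The names `poisson_draw` and `parametric_bootstrap` are hypothetical, the Poisson sampler is Knuth's textbook multiplication method rather than anything the slides specify, and dropping edges that receive weight 0 is an assumption about how such edges would be handled.

```python
import math
import random


def poisson_draw(mean, rng):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def parametric_bootstrap(citations, rng):
    """Give every original edge an independent Poisson(1) integer weight.

    Assumption of this sketch: edges that draw weight 0 are left out of
    the bootstrap network.
    """
    weights = {}
    for edge in citations:
        weight = poisson_draw(1.0, rng)
        if weight > 0:
            weights[edge] = weight
    return weights


rng = random.Random(42)  # fixed seed only to make the sketch reproducible
citations = [(f"p{i}", f"p{i + 1}") for i in range(10000)]
weights = parametric_bootstrap(citations, rng)
total_weight = sum(weights.values())
dropped = len(citations) - len(weights)
```

Here `total_weight` is only approximately equal to the number of original edges, and roughly a share exp(-1), about 37%, of the edges is dropped. Under non-parametric resampling of m citation relations, the count for a single relation follows a Binomial(m, 1/m) distribution, which approaches Poisson(1) as m grows; this is the usual reason the two schemes agree for large networks.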
Data
• Library & Information Sciences (LIS):
– Time period: 1996-2013
– Publications: 31,534
– Citation links: 131,266
• Astrophysics (Berlin dataset):
– Time period: 2003-2010
– Publications: 101,828
– Citation links: 924,171
Cluster stability LIS
[Figure not reproduced]
Stable clusters LIS (resolution 2)
[Two figures not reproduced]
Cluster stability Berlin
[Figure not reproduced]
Cluster stability
[Figure not reproduced; panels: LIS, Berlin]
Conclusions
• What is a good clustering of publications?
– High accuracy: Publications in the same cluster are topically
related
– High level of detail: It is possible to have a large number of
clusters
– Comprehensiveness: The clustering includes all publications
– Uniformity in cluster size: Clusters are of roughly the same size
• It seems impossible to obtain a clustering that has
all properties listed above
• At least one property needs to be given up
Conclusions
• Why can we not have an accurate and detailed
clustering that includes all publications?
– Consider the field of scientometrics
– We would expect an accurate and detailed clustering to have
clusters dealing with topics such as indicators, science mapping,
collaboration, patents, etc.
– However, many publications in scientometrics (e.g., case studies)
do not neatly belong to one of these topics and therefore cannot
be accurately assigned to a cluster
• If we want to have an accurate and detailed
clustering, we need to be satisfied with a clustering
that doesn’t comprehensively cover all publications
• The clustering covers only publications related to
the main topics in the fields
Conclusions
• Analysis of cluster stability offers an approach to
distinguish between meaningful and non-meaningful
assignments of publications to clusters
• Clustering based on direct citations is
computationally attractive but ignores relevant
information (e.g., bibliographic coupling)
• A post-processing procedure can be developed to
try to assign ‘isolated publications’ to stable
clusters based on additional information
• Cluster stability is a general idea that can be
applied also to other clustering approaches
References
Rosvall, M., & Bergstrom, C.T. (2009). Mapping change
in large networks. PLoS ONE, 5(1), e8694.
http://dx.doi.org/10.1371/journal.pone.0008694
Waltman, L., & Van Eck, N.J. (2012). A new methodology
for constructing a publication-level classification
system of science. Journal of the American Society for
Information Science and Technology, 63(12), 2378-2392.
http://dx.doi.org/10.1002/asi.22748
Waltman, L., & Van Eck, N.J. (2013). A smart local moving
algorithm for large-scale modularity-based community
detection. European Physical Journal B, 86(11), 471.
http://dx.doi.org/10.1140/epjb/e2013-40829-0
