SlideShare a Scribd company logo
1 of 29
Challenging Problems forChallenging Problems for
Scalable Mining ofScalable Mining of
Heterogeneous Social andHeterogeneous Social and
Information NetworksInformation Networks
Jiawei Han
Computer Science , University of Illinois at Urbana-Champaign
Collaborated with many, especially Yizhou Sun, Ming Ji, Chi Wang, Tim Weninger, Xiaoxin Yin, Bo
Zhao
Acknowledgements: ARL, NSF, AFOSR (MURI), NASA, Microsoft, IBM, Yahoo!, Boeing
August 12, 2013
1
2
OutlineOutline
 Why Is Mining Heterogeneous Social and Info NetworksWhy Is Mining Heterogeneous Social and Info Networks
Promising?Promising?
 Homogeneous vs. Heterogeneous Social and Info. NetworksHomogeneous vs. Heterogeneous Social and Info. Networks
 On the Power of Mining Structured, Heterogeneous Social andOn the Power of Mining Structured, Heterogeneous Social and
Info. NetworksInfo. Networks
 Challenges on BigMine: Scalable Mining of MassiveChallenges on BigMine: Scalable Mining of Massive
Heterogeneous Social and Information NetworksHeterogeneous Social and Information Networks
 PathSim: Online, Query-Based Similarity SearchPathSim: Online, Query-Based Similarity Search
 PathPredict: Query-Based Prediction Using Meta-PathPathPredict: Query-Based Prediction Using Meta-Path
 Efficient Hidden Network Discovery: A Scalability ChallengeEfficient Hidden Network Discovery: A Scalability Challenge
 ConclusionsConclusions
Where There Is Information,Where There Is Information,
There Are Networks!There Are Networks!
Social Networking WebsitesSocial Networking Websites Biological Network: Protein InteractionBiological Network: Protein Interaction
Research Collaboration NetworkResearch Collaboration Network Product Recommendation Network via EmailsProduct Recommendation Network via Emails
The Real World: Heterogeneous NetworksThe Real World: Heterogeneous Networks
 Multiple object types and/or multiple link types
VenueVenue PaperPaper AuthorAuthor
DBLP Bibliographic NetworkDBLP Bibliographic Network The IMDB Movie NetworkThe IMDB Movie Network
ActorActor
MovieMovie
DirectorDirector
MovieMovie
StudioStudio
Homogeneous networks are information lossinformation loss projection of
heterogeneous networks!
The Facebook NetworkThe Facebook Network
Directly mining information-richer heterogeneous networksDirectly mining information-richer heterogeneous networks
Structured Heterogeneous Network ModelingStructured Heterogeneous Network Modeling
Leads to the New Power of Data Mining!Leads to the New Power of Data Mining!
 DBLP: A Computer Science bibliographic database
A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), …
5
Power of het. network modeling: Treat Author, Venue, Term, Paper all first-class citizens!
6
OutlineOutline
 Why Is Mining Heterogeneous Social and Info NetworksWhy Is Mining Heterogeneous Social and Info Networks
Promising?Promising?
 Homogeneous vs. Heterogeneous Social and Info. NetworksHomogeneous vs. Heterogeneous Social and Info. Networks
 On the Power of Mining Structured, Heterogeneous Social andOn the Power of Mining Structured, Heterogeneous Social and
Info. NetworksInfo. Networks
 Challenges on BigMine: Scalable Mining of MassiveChallenges on BigMine: Scalable Mining of Massive
Heterogeneous Social and Information NetworksHeterogeneous Social and Information Networks
 PathSim: Online, Query-Based Similarity SearchPathSim: Online, Query-Based Similarity Search
 PathPredict: Query-Based Prediction Using Meta-PathPathPredict: Query-Based Prediction Using Meta-Path
 Efficient Hidden Network Discovery: A Scalability ChallengeEfficient Hidden Network Discovery: A Scalability Challenge
 ConclusionsConclusions
7
On the Power of Mining Structured,On the Power of Mining Structured,
Heterogeneous NetworksHeterogeneous Networks
 Links carry a lot of hidden information in structured,Links carry a lot of hidden information in structured,
heterogeneous social and information networksheterogeneous social and information networks
 Effectiveness of miningEffectiveness of mining
 Clustering in heterogeneous networks: Rank-basedClustering in heterogeneous networks: Rank-based
clustering: (RankClus [EDBT’09] and NetClus [KDD’09]) andclustering: (RankClus [EDBT’09] and NetClus [KDD’09]) and
user-guided, meta-path-based clustering [KDD’12]user-guided, meta-path-based clustering [KDD’12]
 Knowledge propgation through heterogeneous linksKnowledge propgation through heterogeneous links
(GNetMine [ECMLPKDD’10]) and Rank-based classification(GNetMine [ECMLPKDD’10]) and Rank-based classification
(RankClass [KDD’11])(RankClass [KDD’11])
 Meta-path-based similarity search (PathSim [VLDB’11])Meta-path-based similarity search (PathSim [VLDB’11])
 Meta-path-based prediction in heterogeneous networksMeta-path-based prediction in heterogeneous networks
(PathPredict [ASONAM’11])(PathPredict [ASONAM’11])
RankClus:RankClus: Integrated Clustering and RankingIntegrated Clustering and Ranking
 Highly ranked objects are
more important (i.e.,
more weighted) in a
cluster than weakly
ranked ones
 Ranking will make more
sense within one cluster
than in multiple clusters
 Ranking, as the feature of
the cluster, is conditional
to a specific cluster
Sub-Network
Ranking
Clustering
8
 Clustering and ranking mutually enhance each other at each iteration
 RankClus [EDBT’09]: An efficient, EM-like algorithm
9
with Star Network Schemawith Star Network Schema
[KDD’09][KDD’09]
 Beyond bi-typed information network: A Star Network Schema
 Split a network into different layers, each representing by a net-
cluster
10
NetClus: Database System ClusterNetClus: Database System Cluster
database 0.0995511
databases 0.0708818
system 0.0678563
data 0.0214893
query 0.0133316
systems 0.0110413
queries 0.0090603
management 0.00850744
object 0.00837766
relational 0.0081175
processing 0.00745875
based 0.00736599
distributed 0.0068367
xml 0.00664958
oriented 0.00589557
design 0.00527672
web 0.00509167
information 0.0050518
model 0.00499396
efficient 0.00465707
Surajit Chaudhuri 0.00678065
Michael Stonebraker 0.00616469
Michael J. Carey 0.00545769
C. Mohan 0.00528346
David J. DeWitt 0.00491615
Hector Garcia-Molina 0.00453497
H. V. Jagadish 0.00434289
David B. Lomet 0.00397865
Raghu Ramakrishnan 0.0039278
Philip A. Bernstein 0.00376314
Joseph M. Hellerstein 0.00372064
Jeffrey F. Naughton 0.00363698
Yannis E. Ioannidis 0.00359853
Jennifer Widom 0.00351929
Per-Ake Larson 0.00334911
Rakesh Agrawal 0.00328274
Dan Suciu 0.00309047
Michael J. Franklin 0.00304099
Umeshwar Dayal 0.00290143
Abraham Silberschatz 0.00278185
VLDB 0.318495
SIGMOD Conf. 0.313903
ICDE 0.188746
PODS 0.107943
EDBT 0.0436849
Go one-level deeper:
Authors in XML, Xquery cluster
Term Venue Author
Rank-Based Clustering for OthersRank-Based Clustering for Others
11
RankCompete: Organize your photo album automatically!RankCompete: Organize your photo album automatically!
Rank treatments for AIDS from MEDLINERank treatments for AIDS from MEDLINE
12
Classification in Heterogeneous NetworksClassification in Heterogeneous Networks
 GNetMine
[ECMLPKDD'10]:
Knowledge propagation
across heterogeneous links
 RankClass [KDD’11]:
Integration of ranking and
classification in
heterogeneous network
analysis
 Highly ranked objects play
more role in classification An object can only be ranked high in some focused classes
 Class membership and ranking are stat. distributions
 Let ranking and classification mutually enhance each other!
 Output: Classification results + ranking list of objects within each class
Experiments with Very Small Training SetExperiments with Very Small Training Set
 DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network
 Rank objects within each class (with extremely limited label information)
 Obtain High classification accuracy and excellent rankings within each class
Database Data Mining AI IR
Top-5 ranked
conferences
VLDB KDD IJCAI SIGIR
SIGMOD SDM AAAI ECIR
ICDE ICDM ICML CIKM
PODS PKDD CVPR WWW
EDBT PAKDD ECML WSDM
Top-5 ranked
terms
data mining learning retrieval
database data knowledge information
query clustering reasoning web
system classification logic search
xml frequent cognition text
13
Similarity Search: Find Similar Objects in NetworksSimilarity Search: Find Similar Objects in Networks
 Who are most similar to Christos Faloutsos?
 Meta-Path: Meta-level description of a path between
two objects
Christos’s students or close collaborators Similar reputation at similar venues
Meta-Path: Author-Paper-Author (APA) Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)
14
Schema of the
DBLP Network
Different meta-paths lead
to very different results!
 Different meta-paths carry
rather different semantics
Which Similarity Measure Is Better?Which Similarity Measure Is Better?
 Anhai Doan
 CS, Wisconsin
 Database area
 PhD: 2002
Meta-Path: Author-Paper-Venue-Paper-Author (APVPA)
• Jignesh Patel
• CS, Wisconsin
• Database area
• PhD: 1998
• Amol Deshpande
• CS, Maryland
• Database area
• PhD: 2004
• Jun Yang
• CS, Duke
• Database area
• PhD: 2001
15
PathSim [VLDB’11]
PathPredict:PathPredict: Meta-Path Based New Co-authorMeta-Path Based New Co-author
Relationship Prediction in DBLP [ASONAM’11]Relationship Prediction in DBLP [ASONAM’11]
 Co-authorship prediction: Whether two authors are going to
collaborate for the first time
 Co-authorship encoded in meta-path
 Author-Paper-Author (A-P-A)
 Topological features encoded in meta-paths as below:
Meta-paths between authors under length 4Meta-paths between authors under length 4
Meta-Path Semantic Meaning
16
The Success of PathPredict: Exploring Meta-PathsThe Success of PathPredict: Exploring Meta-Paths
 Explain the prediction
power of each meta-path
 Wald Test for logistic
regression
 Higher prediction accuracy
than using projected
homogeneous network
 11% higher in
prediction accuracy
 Citation prediction
 The selected meta-
paths could be rather
different
17
Co-author predictionCo-author prediction for Jian Peifor Jian Pei: Only 42 among 4809: Only 42 among 4809
candidates are true first-time co-authors!candidates are true first-time co-authors!
(Feature collected in [1996, 2002]; Test period in
[2003,2009])
18
OutlineOutline
 Why Is Mining Heterogeneous Social and Info NetworksWhy Is Mining Heterogeneous Social and Info Networks
Promising?Promising?
 Homogeneous vs. Heterogeneous Social and Info. NetworksHomogeneous vs. Heterogeneous Social and Info. Networks
 On the Power of Mining Structured, Heterogeneous Social andOn the Power of Mining Structured, Heterogeneous Social and
Info. NetworksInfo. Networks
 Challenges on BigMine: Scalable Mining of MassiveChallenges on BigMine: Scalable Mining of Massive
Heterogeneous Social and Information NetworksHeterogeneous Social and Information Networks
 PathSim: Online, Query-Based Similarity SearchPathSim: Online, Query-Based Similarity Search
 PathPredict: Query-Based Prediction Using Meta-PathPathPredict: Query-Based Prediction Using Meta-Path
 Efficient Hidden Network Discovery: A Scalability ChallengeEfficient Hidden Network Discovery: A Scalability Challenge
 ConclusionsConclusions
19
Challenges on BigMineChallenges on BigMine
 Scalable mining of massive information networks: Necessity
 Many such networks are gigantic: News, PubMed, …
 DBLP is a small one: 2M papers and 0.8M authors, …
 Meta-path: Potentially long chains of matrix multiplication of
such networks
 APVPA: AP X PV X VP X PA
 Comparative analysis of multi-meta-paths is costly
 Scalable mining of massive information networks: Possibility
 Many functions do not need to compute eigen values
 Top-k computation may save computation cost substantially
 Precomputation may save online computation substantially
 Clustering-based precomputation:
20
Computing Eigen Values: When Need It?Computing Eigen Values: When Need It?
 Computations needed
 Clustering (RankClus), classification (RankClass), similarity
search (PathSim), prediction (PathPredict)
 A small # of interactive processing (e.g., EM-styled)
 Meta-path-based prediction : Selection from a set of “parallel”
meta-paths
Long Meta-Path May Not Carry the Right SemanticsLong Meta-Path May Not Carry the Right Semantics
 Repeat the meta-path 2, 4, and infinite times for conference
similarity query
21
22
Top-K Computation Is What We NeedTop-K Computation Is What We Need
 Similarity search: “Who are similar to Christos?”
 There is no need/interest to calculate and rank the remaining
0.8M authors
 Only top-k (e.g., top-100) authors are needed in practice
 Lots of optimizations can be explored for top-k computation
 Precomputation vs. online computation
 Precomputation of long meta-paths will save online, costly
multi-matrix multiplication
 Clustering-based precomputation
 Example: top-k similarity authors
 Precomputation by clustering: only computing rather
similar author groups
Co-Clustering-Based Pruning AlgorithmCo-Clustering-Based Pruning Algorithm
 General idea:
 Store commuting matrices for short path schemas and
compute top-k queries on line
 Framework
 Generate co-clusters for materialized commuting matrices, for
feature objects and target objects
 Derive upper bound for similarity between object and target
cluster, and between object and object
 Safely pruning target clusters and objects if the upper
bound similarity is lower than current threshold
 Dynamically update top-k threshold
Similarity Search: Experiments on EfficiencySimilarity Search: Experiments on Efficiency
 Searching for top-20 objects vs.
1001th-1020th objects: PathSim-
pruning is more efficient than
PathSim-baseline
 The denser the corresponding
commuting matrix, the more
PathSim-pruning can improve
 The more neighbors of a query, the
more PathSim-pruning can improve
 Then compare the efficiency under
different top-k’s (k = 5, 10, 20) for
PathSim-pruning using query set 1
 A smaller top-k has stronger
pruning power, and thus needs less
execution time
24
PathPredict: Exploring Big Data SpacePathPredict: Exploring Big Data Space
 Scalable computation in
really huge heterogeneous
networks?
 Sampling may lead to
similar judgment on
importance of meta-path
 Query-dependent
prediction can be
“selective” and thus may
not need that much
resources
 Precomputation and
clustering may further
enhance its efficiency
25
26
Mining Query-Relevant “Hidden” NetworksMining Query-Relevant “Hidden” Networks
 Query-relevant hidden networks
 What is the hidden network closely relevant to “SVM”?
 The network should contains weighted network consisting of
papers, terms, authors and venues
 Is “kernel machine” closely relevant to “SVM”? How could we
know it?
 It takes substantial computation to derive such a
“weighted/ranked” hidden heterogeneous network
 Due to the diversity of queries (e.g., SVM + Cloud + SIGMOD), it is
impossible to precompute every possible combinations
 How can we compute such hidden network efficiently on the fly?
 An interesting open problem
27
ConclusionsConclusions
 Heterogeneous social & information networks are ubiquitous
 Most datasets can be “organized” or “transformed” into
“structured” multi-typed heterogeneous info. networks
 Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
 Surprisingly rich knowledge can be mined from structured
heterogeneous info. networks
 Clustering, ranking, classification, path prediction, ……
 Knowledge is power, but knowledge is hidden in massive, but
“relatively structured” nodes and links!
 Challenge to BigMine: How to mining massive, heterogeneous
information networks efficiently
 Some progress/tricks on scalability and efficiency
 Many open problems and much more to be explored!
From Data Mining to Mining Info. NetworksFrom Data Mining to Mining Info. Networks
28
Han, Kamber and Pei,
Data Mining, 3rd
ed. 2011
Yu, Han and Faloutsos (eds.),
Link Mining, 2010
Sun and Han, Mining Heterogeneous
Information Networks, 2012
ReferencesReferences
 M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information
Networks", KDD'11.
 Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies,
Morgan & Claypool Publishers, 2012
 Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous
Information Network Analysis", EDBT’09
 Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with
Star Network Schema", KDD’09
 Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in
Heterogeneous Information Networks”, VLDB'11
 Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship Prediction in
Heterogeneous Bibliographic Networks", ASONAM'11
 Y. Sun, J. Han, C. C. Aggarwal, N. Chawla, “When Will It Happen? Relationship Prediction in
Heterogeneous Information Networks”, WSDM'12
 F. Tao, et al., “EventCube: Multi-Dimensional Search and Mining of Structured and Text Data”,
(system demo) KDD’13
 C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication
Networks", KDD'10
 C. Wang, M. Danilevsky, et al., “A Phrase Mining Framework for Recursive Construction of a
Topical Hierarchy”, KDD’13
29

More Related Content

What's hot

Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesKathirvel Ayyaswamy
 
Data mining & big data presentation 01
Data mining & big data presentation 01Data mining & big data presentation 01
Data mining & big data presentation 01Aseem Chakrabarthy
 
Big data ppt
Big data pptBig data ppt
Big data pptYash Raj
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big DataIndu Khemchandani
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolutionitnewsafrica
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035Neelam Rawat
 
Big Data Analytics - Introduction
Big Data Analytics - IntroductionBig Data Analytics - Introduction
Big Data Analytics - IntroductionAlex Meadows
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabatinabati
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and howbobosenthil
 
Big Data vs Data Warehousing
Big Data vs Data WarehousingBig Data vs Data Warehousing
Big Data vs Data WarehousingThomas Kejser
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big dataRichard Vidgen
 

What's hot (20)

Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research Opportunities
 
A Big Data Concept
A Big Data ConceptA Big Data Concept
A Big Data Concept
 
Data mining & big data presentation 01
Data mining & big data presentation 01Data mining & big data presentation 01
Data mining & big data presentation 01
 
Sina Sohangir Presentation on IWMC 2015
Sina Sohangir Presentation on IWMC 2015Sina Sohangir Presentation on IWMC 2015
Sina Sohangir Presentation on IWMC 2015
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Research paper on big data and hadoop
Research paper on big data and hadoopResearch paper on big data and hadoop
Research paper on big data and hadoop
 
Big data-analytics-cpe8035
Big data-analytics-cpe8035Big data-analytics-cpe8035
Big data-analytics-cpe8035
 
Chapter 1 big data
Chapter 1 big dataChapter 1 big data
Chapter 1 big data
 
Big Data Analytics - Introduction
Big Data Analytics - IntroductionBig Data Analytics - Introduction
Big Data Analytics - Introduction
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabati
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
Big Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning GuruBig Data Hadoop Training by Easylearning Guru
Big Data Hadoop Training by Easylearning Guru
 
What is big data?
What is big data?What is big data?
What is big data?
 
#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"
 
Big data - what, why, where, when and how
Big data - what, why, where, when and howBig data - what, why, where, when and how
Big data - what, why, where, when and how
 
Big Data vs Data Warehousing
Big Data vs Data WarehousingBig Data vs Data Warehousing
Big Data vs Data Warehousing
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 

Viewers also liked

Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos FaloutsosLarge Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos FaloutsosBigMine
 
Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...BigMine
 
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 Big & Personal: the data and the models behind Netflix recommendations by Xa... Big & Personal: the data and the models behind Netflix recommendations by Xa...
Big & Personal: the data and the models behind Netflix recommendations by Xa...BigMine
 
Mining Interesting Meta-Paths from Complex Heterogeneous Information Networks
Mining Interesting Meta-Paths from Complex Heterogeneous Information NetworksMining Interesting Meta-Paths from Complex Heterogeneous Information Networks
Mining Interesting Meta-Paths from Complex Heterogeneous Information NetworksBaoxu Shi
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like YouSalford Systems
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 

Viewers also liked (7)

Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos FaloutsosLarge Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
Large Graph Mining – Patterns, tools and cascade analysis by Christos Faloutsos
 
Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...Processing Reachability Queries with Realistic Constraints on Massive Network...
Processing Reachability Queries with Realistic Constraints on Massive Network...
 
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 Big & Personal: the data and the models behind Netflix recommendations by Xa... Big & Personal: the data and the models behind Netflix recommendations by Xa...
Big & Personal: the data and the models behind Netflix recommendations by Xa...
 
Mining Interesting Meta-Paths from Complex Heterogeneous Information Networks
Mining Interesting Meta-Paths from Complex Heterogeneous Information NetworksMining Interesting Meta-Paths from Complex Heterogeneous Information Networks
Mining Interesting Meta-Paths from Complex Heterogeneous Information Networks
 
9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You9 Data Mining Challenges From Data Scientists Like You
9 Data Mining Challenges From Data Scientists Like You
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 

Similar to Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks by Jiawei Han

Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1DanWooster1
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.pptadmsoyadm4
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfNancykumari47
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataAndre Freitas
 
The technical case for a semantic web
The technical case for a semantic webThe technical case for a semantic web
The technical case for a semantic webTony Dobaj
 
Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating...
Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating...Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating...
Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating...Artificial Intelligence Institute at UofSC
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachAndre Freitas
 
Semantic Interoperability and Information Brokering in Global Information Sys...
Semantic Interoperability and Information Brokering in Global Information Sys...Semantic Interoperability and Information Brokering in Global Information Sys...
Semantic Interoperability and Information Brokering in Global Information Sys...Amit Sheth
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsIJMER
 
Noshir Contractor's view on the future of Linked Data
Noshir Contractor's view on the future of Linked DataNoshir Contractor's view on the future of Linked Data
Noshir Contractor's view on the future of Linked DataCarlos Pedrinaci
 
Family Tree's Journey Toward An Edge Dominant System
Family Tree's Journey Toward An Edge Dominant SystemFamily Tree's Journey Toward An Edge Dominant System
Family Tree's Journey Toward An Edge Dominant SystemRandy Ynchausti
 
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]Shirshanka Das
 

Similar to Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks by Jiawei Han (20)

Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 
Data Mining Intro
Data Mining IntroData Mining Intro
Data Mining Intro
 
data mining
data miningdata mining
data mining
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
UNIT2-Data Mining.pdf
UNIT2-Data Mining.pdfUNIT2-Data Mining.pdf
UNIT2-Data Mining.pdf
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
 
Introduction to question answering for linked data & big data
Introduction to question answering for linked data & big dataIntroduction to question answering for linked data & big data
Introduction to question answering for linked data & big data
 
BrightTALK - Semantic AI
BrightTALK - Semantic AI BrightTALK - Semantic AI
BrightTALK - Semantic AI
 
The technical case for a semantic web
The technical case for a semantic webThe technical case for a semantic web
The technical case for a semantic web
 
Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating...
Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating...Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating...
Relationships at the Heart of Semantic Web: Modeling, Discovering, Validating...
 
ECCS 2010
ECCS 2010ECCS 2010
ECCS 2010
 
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing ApproachCoping with Data Variety in the Big Data Era: The Semantic Computing Approach
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
 
Semantic Interoperability and Information Brokering in Global Information Sys...
Semantic Interoperability and Information Brokering in Global Information Sys...Semantic Interoperability and Information Brokering in Global Information Sys...
Semantic Interoperability and Information Brokering in Global Information Sys...
 
Data Mining: Future Trends and Applications
Data Mining: Future Trends and ApplicationsData Mining: Future Trends and Applications
Data Mining: Future Trends and Applications
 
Noshir Contractor's view on the future of Linked Data
Noshir Contractor's view on the future of Linked DataNoshir Contractor's view on the future of Linked Data
Noshir Contractor's view on the future of Linked Data
 
Data Mining
Data MiningData Mining
Data Mining
 
Family Tree's Journey Toward An Edge Dominant System
Family Tree's Journey Toward An Edge Dominant SystemFamily Tree's Journey Toward An Edge Dominant System
Family Tree's Journey Toward An Edge Dominant System
 
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
The Evolution of Metadata: LinkedIn's Story [Strata NYC 2019]
 

Recently uploaded

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Recently uploaded (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Challenging Problems for Scalable Mining of Heterogeneous Social and Information Networks by Jiawei Han

  • 1. Challenging Problems forChallenging Problems for Scalable Mining ofScalable Mining of Heterogeneous Social andHeterogeneous Social and Information NetworksInformation Networks Jiawei Han Computer Science , University of Illinois at Urbana-Champaign Collaborated with many, especially Yizhou Sun, Ming Ji, Chi Wang, Tim Weninger, Xiaoxin Yin, Bo Zhao Acknowledgements: ARL, NSF, AFOSR (MURI), NASA, Microsoft, IBM, Yahoo!, Boeing August 12, 2013 1
  • 2. 2 OutlineOutline  Why Is Mining Heterogeneous Social and Info NetworksWhy Is Mining Heterogeneous Social and Info Networks Promising?Promising?  Homogeneous vs. Heterogeneous Social and Info. NetworksHomogeneous vs. Heterogeneous Social and Info. Networks  On the Power of Mining Structured, Heterogeneous Social andOn the Power of Mining Structured, Heterogeneous Social and Info. NetworksInfo. Networks  Challenges on BigMine: Scalable Mining of MassiveChallenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information NetworksHeterogeneous Social and Information Networks  PathSim: Online, Query-Based Similarity SearchPathSim: Online, Query-Based Similarity Search  PathPredict: Query-Based Prediction Using Meta-PathPathPredict: Query-Based Prediction Using Meta-Path  Efficient Hidden Network Discovery: A Scalability ChallengeEfficient Hidden Network Discovery: A Scalability Challenge  ConclusionsConclusions
  • 3. Where There Is Information,Where There Is Information, There Are Networks!There Are Networks! Social Networking WebsitesSocial Networking Websites Biological Network: Protein InteractionBiological Network: Protein Interaction Research Collaboration NetworkResearch Collaboration Network Product Recommendation Network via EmailsProduct Recommendation Network via Emails
  • 4. The Real World: Heterogeneous NetworksThe Real World: Heterogeneous Networks  Multiple object types and/or multiple link types VenueVenue PaperPaper AuthorAuthor DBLP Bibliographic NetworkDBLP Bibliographic Network The IMDB Movie NetworkThe IMDB Movie Network ActorActor MovieMovie DirectorDirector MovieMovie StudioStudio Homogeneous networks are information lossinformation loss projection of heterogeneous networks! The Facebook NetworkThe Facebook Network Directly mining information-richer heterogeneous networksDirectly mining information-richer heterogeneous networks
  • 5. Structured Heterogeneous Network ModelingStructured Heterogeneous Network Modeling Leads to the New Power of Data Mining!Leads to the New Power of Data Mining!  DBLP: A Computer Science bibliographic database A sample publication record in DBLP (>2 M papers, >0.7 M authors, >10 K venues), … 5 Power of het. network modeling: Treat Author, Venue, Term, Paper all first-class citizens!
  • 6. 6 OutlineOutline  Why Is Mining Heterogeneous Social and Info NetworksWhy Is Mining Heterogeneous Social and Info Networks Promising?Promising?  Homogeneous vs. Heterogeneous Social and Info. NetworksHomogeneous vs. Heterogeneous Social and Info. Networks  On the Power of Mining Structured, Heterogeneous Social andOn the Power of Mining Structured, Heterogeneous Social and Info. NetworksInfo. Networks  Challenges on BigMine: Scalable Mining of MassiveChallenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information NetworksHeterogeneous Social and Information Networks  PathSim: Online, Query-Based Similarity SearchPathSim: Online, Query-Based Similarity Search  PathPredict: Query-Based Prediction Using Meta-PathPathPredict: Query-Based Prediction Using Meta-Path  Efficient Hidden Network Discovery: A Scalability ChallengeEfficient Hidden Network Discovery: A Scalability Challenge  ConclusionsConclusions
  • 7. 7 On the Power of Mining Structured,On the Power of Mining Structured, Heterogeneous NetworksHeterogeneous Networks  Links carry a lot of hidden information in structured,Links carry a lot of hidden information in structured, heterogeneous social and information networksheterogeneous social and information networks  Effectiveness of miningEffectiveness of mining  Clustering in heterogeneous networks: Rank-basedClustering in heterogeneous networks: Rank-based clustering: (RankClus [EDBT’09] and NetClus [KDD’09]) andclustering: (RankClus [EDBT’09] and NetClus [KDD’09]) and user-guided, meta-path-based clustering [KDD’12]user-guided, meta-path-based clustering [KDD’12]  Knowledge propgation through heterogeneous linksKnowledge propgation through heterogeneous links (GNetMine [ECMLPKDD’10]) and Rank-based classification(GNetMine [ECMLPKDD’10]) and Rank-based classification (RankClass [KDD’11])(RankClass [KDD’11])  Meta-path-based similarity search (PathSim [VLDB’11])Meta-path-based similarity search (PathSim [VLDB’11])  Meta-path-based prediction in heterogeneous networksMeta-path-based prediction in heterogeneous networks (PathPredict [ASONAM’11])(PathPredict [ASONAM’11])
  • 8. RankClus:RankClus: Integrated Clustering and RankingIntegrated Clustering and Ranking  Highly ranked objects are more important (i.e., more weighted) in a cluster than weakly ranked ones  Ranking will make more sense within one cluster than in multiple clusters  Ranking, as the feature of the cluster, is conditional to a specific cluster Sub-Network Ranking Clustering 8  Clustering and ranking mutually enhance each other at each iteration  RankClus [EDBT’09]: An efficient, EM-like algorithm
  • 9. 9 with Star Network Schemawith Star Network Schema [KDD’09][KDD’09]  Beyond bi-typed information network: A Star Network Schema  Split a network into different layers, each representing by a net- cluster
  • 10. 10 NetClus: Database System ClusterNetClus: Database System Cluster database 0.0995511 databases 0.0708818 system 0.0678563 data 0.0214893 query 0.0133316 systems 0.0110413 queries 0.0090603 management 0.00850744 object 0.00837766 relational 0.0081175 processing 0.00745875 based 0.00736599 distributed 0.0068367 xml 0.00664958 oriented 0.00589557 design 0.00527672 web 0.00509167 information 0.0050518 model 0.00499396 efficient 0.00465707 Surajit Chaudhuri 0.00678065 Michael Stonebraker 0.00616469 Michael J. Carey 0.00545769 C. Mohan 0.00528346 David J. DeWitt 0.00491615 Hector Garcia-Molina 0.00453497 H. V. Jagadish 0.00434289 David B. Lomet 0.00397865 Raghu Ramakrishnan 0.0039278 Philip A. Bernstein 0.00376314 Joseph M. Hellerstein 0.00372064 Jeffrey F. Naughton 0.00363698 Yannis E. Ioannidis 0.00359853 Jennifer Widom 0.00351929 Per-Ake Larson 0.00334911 Rakesh Agrawal 0.00328274 Dan Suciu 0.00309047 Michael J. Franklin 0.00304099 Umeshwar Dayal 0.00290143 Abraham Silberschatz 0.00278185 VLDB 0.318495 SIGMOD Conf. 0.313903 ICDE 0.188746 PODS 0.107943 EDBT 0.0436849 Go one-level deeper: Authors in XML, Xquery cluster Term Venue Author
  • 11. Rank-Based Clustering for OthersRank-Based Clustering for Others 11 RankCompete: Organize your photo album automatically!RankCompete: Organize your photo album automatically! Rank treatments for AIDS from MEDLINERank treatments for AIDS from MEDLINE
  • 12. 12 Classification in Heterogeneous NetworksClassification in Heterogeneous Networks  GNetMine [ECMLPKDD'10]: Knowledge propagation across heterogeneous links  RankClass [KDD’11]: Integration of ranking and classification in heterogeneous network analysis  Highly ranked objects play more role in classification An object can only be ranked high in some focused classes  Class membership and ranking are stat. distributions  Let ranking and classification mutually enhance each other!  Output: Classification results + ranking list of objects within each class
  • 13. Experiments with Very Small Training SetExperiments with Very Small Training Set  DBLP: 4-fields data set (DB, DM, AI, IR) forming a heterog. info. network  Rank objects within each class (with extremely limited label information)  Obtain High classification accuracy and excellent rankings within each class Database Data Mining AI IR Top-5 ranked conferences VLDB KDD IJCAI SIGIR SIGMOD SDM AAAI ECIR ICDE ICDM ICML CIKM PODS PKDD CVPR WWW EDBT PAKDD ECML WSDM Top-5 ranked terms data mining learning retrieval database data knowledge information query clustering reasoning web system classification logic search xml frequent cognition text 13
  • 14. Similarity Search: Find Similar Objects in NetworksSimilarity Search: Find Similar Objects in Networks  Who are most similar to Christos Faloutsos?  Meta-Path: Meta-level description of a path between two objects Christos’s students or close collaborators Similar reputation at similar venues Meta-Path: Author-Paper-Author (APA) Meta-Path: Author-Paper-Venue-Paper-Author (APVPA) 14 Schema of the DBLP Network Different meta-paths lead to very different results!  Different meta-paths carry rather different semantics
  • 15. Which Similarity Measure Is Better?Which Similarity Measure Is Better?  Anhai Doan  CS, Wisconsin  Database area  PhD: 2002 Meta-Path: Author-Paper-Venue-Paper-Author (APVPA) • Jignesh Patel • CS, Wisconsin • Database area • PhD: 1998 • Amol Deshpande • CS, Maryland • Database area • PhD: 2004 • Jun Yang • CS, Duke • Database area • PhD: 2001 15 PathSim [VLDB’11]
  • 16. PathPredict:PathPredict: Meta-Path Based New Co-authorMeta-Path Based New Co-author Relationship Prediction in DBLP [ASONAM’11]Relationship Prediction in DBLP [ASONAM’11]  Co-authorship prediction: Whether two authors are going to collaborate for the first time  Co-authorship encoded in meta-path  Author-Paper-Author (A-P-A)  Topological features encoded in meta-paths as below: Meta-paths between authors under length 4Meta-paths between authors under length 4 Meta-Path Semantic Meaning 16
  • 17. The Success of PathPredict: Exploring Meta-PathsThe Success of PathPredict: Exploring Meta-Paths  Explain the prediction power of each meta-path  Wald Test for logistic regression  Higher prediction accuracy than using projected homogeneous network  11% higher in prediction accuracy  Citation prediction  The selected meta- paths could be rather different 17 Co-author predictionCo-author prediction for Jian Peifor Jian Pei: Only 42 among 4809: Only 42 among 4809 candidates are true first-time co-authors!candidates are true first-time co-authors! (Feature collected in [1996, 2002]; Test period in [2003,2009])
  • 18. 18 OutlineOutline  Why Is Mining Heterogeneous Social and Info NetworksWhy Is Mining Heterogeneous Social and Info Networks Promising?Promising?  Homogeneous vs. Heterogeneous Social and Info. NetworksHomogeneous vs. Heterogeneous Social and Info. Networks  On the Power of Mining Structured, Heterogeneous Social andOn the Power of Mining Structured, Heterogeneous Social and Info. NetworksInfo. Networks  Challenges on BigMine: Scalable Mining of MassiveChallenges on BigMine: Scalable Mining of Massive Heterogeneous Social and Information NetworksHeterogeneous Social and Information Networks  PathSim: Online, Query-Based Similarity SearchPathSim: Online, Query-Based Similarity Search  PathPredict: Query-Based Prediction Using Meta-PathPathPredict: Query-Based Prediction Using Meta-Path  Efficient Hidden Network Discovery: A Scalability ChallengeEfficient Hidden Network Discovery: A Scalability Challenge  ConclusionsConclusions
  • 19. 19 Challenges on BigMineChallenges on BigMine  Scalable mining of massive information networks: Necessity  Many such networks are gigantic: News, PubMed, …  DBLP is a small one: 2M papers and 0.8M authors, …  Meta-path: Potentially long chains of matrix multiplication of such networks  APVPA: AP X PV X VP X PA  Comparative analysis of multi-meta-paths is costly  Scalable mining of massive information networks: Possibility  Many functions do not need to compute eigen values  Top-k computation may save computation cost substantially  Precomputation may save online computation substantially  Clustering-based precomputation:
  • 20. 20 Computing Eigen Values: When Need It?Computing Eigen Values: When Need It?  Computations needed  Clustering (RankClus), classification (RankClass), similarity search (PathSim), prediction (PathPredict)  A small # of interactive processing (e.g., EM-styled)  Meta-path-based prediction : Selection from a set of “parallel” meta-paths
  • 21. Long Meta-Path May Not Carry the Right SemanticsLong Meta-Path May Not Carry the Right Semantics  Repeat the meta-path 2, 4, and infinite times for conference similarity query 21
  • 22. 22 Top-K Computation Is What We NeedTop-K Computation Is What We Need  Similarity search: “Who are similar to Christos?”  There is no need/interest to calculate and rank the remaining 0.8M authors  Only top-k (e.g., top-100) authors are needed in practice  Lots of optimizations can be explored for top-k computation  Precomputation vs. online computation  Precomputation of long meta-paths will save online, costly multi-matrix multiplication  Clustering-based precomputation  Example: top-k similarity authors  Precomputation by clustering: only computing rather similar author groups
  • 23. Co-Clustering-Based Pruning AlgorithmCo-Clustering-Based Pruning Algorithm  General idea:  Store commuting matrices for short path schemas and compute top-k queries on line  Framework  Generate co-clusters for materialized commuting matrices, for feature objects and target objects  Derive upper bound for similarity between object and target cluster, and between object and object  Safely pruning target clusters and objects if the upper bound similarity is lower than current threshold  Dynamically update top-k threshold
  • 24. Similarity Search: Experiments on EfficiencySimilarity Search: Experiments on Efficiency  Searching for top-20 objects vs. 1001th-1020th objects: PathSim- pruning is more efficient than PathSim-baseline  The denser the corresponding commuting matrix, the more PathSim-pruning can improve  The more neighbors of a query, the more PathSim-pruning can improve  Then compare the efficiency under different top-k’s (k = 5, 10, 20) for PathSim-pruning using query set 1  A smaller top-k has stronger pruning power, and thus needs less execution time 24
  • 25. PathPredict: Exploring Big Data SpacePathPredict: Exploring Big Data Space  Scalable computation in really huge heterogeneous networks?  Sampling may lead to similar judgment on importance of meta-path  Query-dependent prediction can be “selective” and thus may not need that much resources  Precomputation and clustering may further enhance its efficiency 25
  • 26. 26 Mining Query-Relevant “Hidden” NetworksMining Query-Relevant “Hidden” Networks  Query-relevant hidden networks  What is the hidden network closely relevant to “SVM”?  The network should contains weighted network consisting of papers, terms, authors and venues  Is “kernel machine” closely relevant to “SVM”? How could we know it?  It takes substantial computation to derive such a “weighted/ranked” hidden heterogeneous network  Due to the diversity of queries (e.g., SVM + Cloud + SIGMOD), it is impossible to precompute every possible combinations  How can we compute such hidden network efficiently on the fly?  An interesting open problem
  • 27. 27 ConclusionsConclusions  Heterogeneous social & information networks are ubiquitous  Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks  Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …  Surprisingly rich knowledge can be mined from structured heterogeneous info. networks  Clustering, ranking, classification, path prediction, ……  Knowledge is power, but knowledge is hidden in massive, but “relatively structured” nodes and links!  Challenge to BigMine: How to mining massive, heterogeneous information networks efficiently  Some progress/tricks on scalability and efficiency  Many open problems and much more to be explored!
  • 28. From Data Mining to Mining Info. NetworksFrom Data Mining to Mining Info. Networks 28 Han, Kamber and Pei, Data Mining, 3rd ed. 2011 Yu, Han and Faloutsos (eds.), Link Mining, 2010 Sun and Han, Mining Heterogeneous Information Networks, 2012
  • 29. ReferencesReferences  M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11.  Y. Sun and J. Han, Mining Heterogeneous Information Networks: Principles and Methodologies, Morgan & Claypool Publishers, 2012  Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT’09  Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD’09  Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks”, VLDB'11  Y. Sun, R. Barber, M. Gupta, C. Aggarwal and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11  Y. Sun, J. Han, C. C. Aggarwal, N. Chawla, “When Will It Happen? Relationship Prediction in Heterogeneous Information Networks”, WSDM'12  F. Tao, et al., “EventCube: Multi-Dimensional Search and Mining of Structured and Text Data”, (system demo) KDD’13  C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication Networks", KDD'10  C. Wang, M. Danilevsky, et al., “A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy”, KDD’13 29

Editor's Notes

  1. May replace with Jure Lescovec