This document provides an overview of link mining and collective classification algorithms. It discusses how link mining can be used for tasks like node labeling, link prediction, entity resolution, and group detection on graph-structured data. It presents relational classifiers and collective classification as two common link mining algorithms. Relational classifiers extend traditional classifiers by incorporating relational features between linked nodes, while collective classification iteratively propagates predictions between linked nodes. The document provides examples of how these algorithms have been applied to problems like predicting ad click-through rates and friendships. It also discusses entity resolution and how relational clustering algorithms can leverage links between entities to improve resolution.
2. Alternate Title…
What
Machine Learning/Statistics/Data Mining
can do for YOU!
Supervised Learning:
1. Predict future values
2. Fill in missing values
3. Identify anomalies
Unsupervised Learning:
4. Find patterns
5. Identify clusters
What are some common machine learning algorithms?
3. So, what’s Link Mining???
Machine learning when you have graphs (or networks)
Nodes are entities
• People
• Places
• Organizations
• Text
Links are relationships
• Friends
• MemberOf
• LivesIn
• Tweeted
• Posted
e.g., heterogeneous multi-relational data, multimodal data, etc.
4. Ex: Social Media Relationships
User-User (Ua–Ub):
Friends
Collaborators
Family
Fan/Follower
Replies
Co-Edits
Co-Mentions, etc.
User-Doc (U–Doc1):
Comments
Edits, etc.
User-Query-Click (U–Q–URL)
User-Tag-Doc (U–Tag–Doc)
5. Link Mining Tasks
Node Labeling
Link Prediction
Entity Resolution
Group Detection
6. Node Labeling
What is Harry’s political persuasion?
Harry
Natasha
11. Deduplication Problem Statement
Cluster the records/mentions that correspond to
the same entity
Intensional Variant: Compute cluster representative
16. Link Mining Algorithms
Node Labeling
Link Prediction
Entity Resolution
Group Detection
17. Link Mining Algorithms
Node Labeling:
1. Relational Classifiers
2. Collective Classifiers
Link Prediction
Entity Resolution
Group Detection
18. Relational Classifiers
Given: a graph with entities (a–e), relations (1–5), and attribute values (w–z)
Task: Predict an attribute of some of the entities
Alternate task: Predict existence of a relationship between entities
Local features: attributes of the entity itself
Relational features, e.g.:
same-attribute-value
avg value of neighbors
number of shared neighbors
number of neighbors participating in a relation
19. Relational Classifiers
Values are represented as a fixed-length feature
vector
Instances are treated independently of each other
Relational features are computed by aggregating
over related entities
Any classification or regression model can be used
for learning and prediction
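The feature-construction idea above can be sketched in a few lines. The graph, attribute values, and feature choices below are made-up toy data, not the talk's examples; the point is that each node becomes one fixed-length row mixing local and neighbor-aggregated features, which any standard classifier can then consume.

```python
# Sketch (not the authors' code): build a fixed-length feature vector per
# node by aggregating over its neighbors, then feed the rows to any
# off-the-shelf classifier or regressor.
from statistics import mean

# Hypothetical toy graph: node -> attribute value, plus an adjacency list.
attrs = {"a": 1.0, "b": 2.0, "c": 2.0, "d": 4.0, "e": 5.0}
edges = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
         "d": ["c", "e"], "e": ["d"]}

def feature_vector(node):
    """Local feature plus relational features aggregated over neighbors."""
    nbrs = edges[node]
    return [
        attrs[node],                                  # local feature
        mean(attrs[n] for n in nbrs),                 # avg value of neighbors
        len(nbrs),                                    # degree
        sum(attrs[n] == attrs[node] for n in nbrs),   # same-attribute-value count
    ]

rows = {v: feature_vector(v) for v in attrs}
print(rows["c"])  # [2.0, 2.3333333333333335, 3, 1]
```

Because every node ends up with the same number of columns, the instances can be treated independently by the learner, exactly as the slide describes.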
20. Application Case Studies
Two example applications that use relational
classifiers
Focus is on types of relational features used
Case Study 1: Predicting click-through rate of
search result ads
Case Study 2: Predicting friendships in a social
network
21. Case Study 1:
Predicting Ad Click-Through Rate
Task: Predict the click-through rate (CTR) of an
online ad, given that it is seen by the user, where
the ad is described by:
URL to which user is sent when clicking on ad
Bid terms used to determine when to display ad
Title and text of ad
Our description is based on approach by
[Richardson et al., WWW07]
22. Relational Features Used
[Figure: ads Ad1–Ad6 linked to bid terms BT1–BT6; predict CTR of one ad]
Relational features aggregate (average CTR, count) over ads related through:
contains-bid-term (according to search engine)
related-bid-term (containing subsets or supersets of the term)
queried-bid-term
23. Case Study 2:
Predicting Friendships
Task: Predict new friendships among users, based
on their descriptive attributes, their existing
friendships, and their family ties.
Our description is based on approach by
[Zheleva et al., SNAKDD08]
24. Relational Features Used
“Petworks” - social networks of pets
[Figure: pet network P1–P11; task is to predict Friends? for a candidate pair]
Features: count, density, proportion, Jaccard coefficient
in-family
same-breed
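The Jaccard coefficient listed among the features is simple to compute: the overlap of the two candidates' neighbor sets divided by their union. The neighbor lists below are an invented stand-in, not the slide's data.

```python
# Sketch of the Jaccard-coefficient feature for a candidate friend pair:
# |N(u) & N(v)| / |N(u) | N(v)| over the pair's neighbor sets.
def jaccard(neighbors, u, v):
    nu, nv = set(neighbors[u]), set(neighbors[v])
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

# Hypothetical neighbor lists for the candidate pair.
neighbors = {
    "P1": ["P4", "P5", "P7"],
    "P2": ["P5", "P7", "P10"],
}
print(jaccard(neighbors, "P1", "P2"))  # 2 shared / 4 total = 0.5
```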
25. Key Idea: Feature Construction
Feature informativeness is key to the success of a
relational classifier
Features can be:
Attributes of entity/entities
Match predicate on attributes of entities
Attributes of related entities
Encode structural features
Based on overlap in sets
26. Link Mining Algorithms
Node Labeling:
1. Relational Classifiers
2. Collective Classifiers
Link Prediction
Entity Resolution
Group Detection
27. Collective Classification
[Neville & Jensen, SRL00; Lu & Getoor, ICML03, Sen et al. AI Mag08]
Extends relational classifiers by allowing relational
features to be functions of predicted attributes/relations
of neighbors
At training time, these features are computed based on
observed values in the training set
At inference time, the algorithm iterates, computing
relational features based on the current prediction for
any unobserved attributes
In the first, bootstrap, iteration, only local features are
used
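The bootstrap-then-iterate loop described above can be sketched in a few lines. This is an ICA-style toy, not the authors' implementation: the learned local and relational models are stubbed with simple rules, and the five-node graph and labels are invented.

```python
# Sketch of collective classification inference: bootstrap unobserved
# nodes from local features only, then iterate, re-predicting each
# unobserved node from its neighbors' current labels.
from collections import Counter

edges = {"P1": ["P2", "P4"], "P2": ["P1", "P3"],
         "P3": ["P2", "P5"], "P4": ["P1"], "P5": ["P3"]}
observed = {"P4": "red", "P5": "red"}                     # known labels
local_guess = {"P1": "blue", "P2": "blue", "P3": "blue"}  # bootstrap output

labels = {**observed, **local_guess}  # iteration 0: local features only

for _ in range(10):  # iterate until predictions stabilize
    updated = dict(labels)
    for node in local_guess:  # re-predict only the unobserved nodes
        counts = Counter(labels[n] for n in edges[node])
        # stand-in "relational model": majority label of neighbors,
        # ties broken deterministically by label name
        updated[node] = max(counts, key=lambda c: (counts[c], c))
    if updated == labels:
        break
    labels = updated

print(labels)  # the observed "red" labels propagate to all five nodes
```

The batch (synchronous) update shown here is one of the variation points the talk mentions; incremental updates and different node orderings change which fixed point the loop reaches.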
28. CC: Learning
[Figure: fully labeled training network, nodes P1–P10]
Learn models (local and relational) from
fully labeled training set
29. CC: Inference (1)
[Figure: unlabeled test network, nodes P1–P5]
Step 1: Bootstrap using entity attributes only
30. CC: Inference (2)
[Figure: test network, nodes P1–P5, with predicted labels]
Step 2: Iteratively update the category of each entity,
based on related entities’ categories
31. CC Key Idea
Rather than make predictions independently, begin
with a relational classifier, and then ‘propagate’
classifications
Variations:
Propagate probabilities, rather than mode (related to
Gibbs Sampling)
Batch vs. Incremental updates
Ordering strategies
Active area of research: active learning,
semi-supervised learning, more principled joint
probabilistic models, etc.
32. Link Mining Algorithms
Node Labeling
Link Prediction
Entity Resolution
Group Detection
33. The Entity Resolution Problem
Entities: John Smith, James Smith, Jonathan Smith
Mentions: “John Smith”, “Jim Smith”, “J Smith”,
“James Smith”, “Jon Smith”, “Jonthan Smith”
Issues:
1. Identification
2. Disambiguation
37. P1: “JOSTLE: Partitioning of Unstructured Meshes for
Massively Parallel Machines”, C. Walshaw, M. Cross,
M. G. Everett, S. Johnson
P2: “Partitioning Mapping of Unstructured Meshes to
Parallel Machine Topologies”, C. Walshaw, M. Cross, M.
G. Everett, S. Johnson, K. McManus
P3: “Dynamic Mesh Partitioning: A Unified Optimisation and
Load-Balancing Algorithm”, C. Walshaw, M. Cross, M.
G. Everett
P4: “Code Generation for Machines with Multiregister
Operations”, Alfred V. Aho, Stephen C. Johnson,
Jeffrey D. Ullman
P5: “Deterministic Parsing of Ambiguous Grammars”, A.
Aho, S. Johnson, J. Ullman
P6: “Compilers: Principles, Techniques, and Tools”, A. Aho,
R. Sethi, J. Ullman
39. Relational Clustering (RC-ER)
P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: C. Walshaw, M. Cross, M. Everett, S. Johnson, K. McManus
P4: Alfred V. Aho, Jeffrey D. Ullman, Stephen C. Johnson
P5: A. Aho, J. Ullman, S. Johnson
43. Cut-based Formulation of RC-ER
[Figure: two alternative clusterings of the Johnson references]
Left: good separation of attributes, but many cluster-cluster
relationships (Aho-Johnson1, Aho-Johnson2,
Everett-Johnson1, Everett-Johnson2)
Right: worse in terms of attributes, but fewer cluster-cluster
relationships (Aho-Johnson1, Everett-Johnson2)
44. Objective Function
Minimize:
Σ_{i,j} [ w_A · sim_A(c_i, c_j) + w_R · sim_R(c_i, c_j) ]
(w_A: weight for similarity of attributes; w_R: weight for similarity
based on relational edges between c_i and c_j)
Greedy clustering algorithm: merge cluster pair with max
reduction in objective function:
sim(c_i, c_j) = w_A · sim_A(c_i, c_j) + w_R · |N(c_i) ∩ N(c_j)|
(similarity of attributes plus common cluster neighborhood)
45. Relational Clustering Algorithm
1. Find similar references using ‘blocking’
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert into
priority queue
4. Repeat until priority queue is empty
5. Find ‘closest’ cluster pair
6. Stop if similarity below threshold
7. Merge to create new cluster
8. Update similarity for ‘related’ clusters
O(nk log n) algorithm w/ efficient implementation
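The greedy merge loop in the steps above can be sketched with a priority queue. This is a toy, not the authors' implementation: blocking is skipped, bootstrap clusters are singletons, and attribute similarity is stubbed as exact-name match; the names and threshold are invented.

```python
# Sketch of greedy agglomerative RC-ER-style clustering: seed with
# singleton clusters, keep candidate pairs in a max-heap, and merge the
# closest pair while its similarity stays above a threshold.
import heapq
import itertools

names = {0: "S. Johnson", 1: "S. Johnson", 2: "Stephen C. Johnson"}
w_attr, threshold = 1.0, 0.5

def sim(ci, cj, clusters):
    # Stand-in attribute similarity: 1 if any two member names match.
    return w_attr * max(
        (1.0 if names[a] == names[b] else 0.0)
        for a in clusters[ci] for b in clusters[cj])

clusters = {i: {i} for i in names}          # bootstrap: one ref per cluster
heap = [(-sim(i, j, clusters), i, j)
        for i, j in itertools.combinations(clusters, 2)]
heapq.heapify(heap)

while heap:
    neg, i, j = heapq.heappop(heap)
    if i not in clusters or j not in clusters:
        continue                            # stale entry: cluster was merged
    if -neg < threshold:
        break                               # closest pair too dissimilar
    clusters[i] |= clusters.pop(j)          # merge j into i
    for k in clusters:                      # refresh pairs for merged cluster
        if k != i:
            heapq.heappush(heap, (-sim(i, k, clusters), i, k))

print(list(clusters.values()))  # refs 0 and 1 merge; ref 2 stays separate
```

In the full algorithm the refresh step would also fold in the relational term (common cluster neighborhood), which is what makes the resolution collective rather than purely attribute-based.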
46. Evaluation Datasets
CiteSeer
1,504 citations to machine learning papers (Lawrence et al.)
2,892 references to 1,165 author entities
arXiv
29,555 publications from High Energy Physics (KDD Cup’03)
58,515 refs to 9,200 authors
Elsevier BioBase
156,156 Biology papers (IBM KDD Challenge ’05)
831,991 author refs
Keywords, topic classifications, language, country and affiliation
of corresponding author, etc.
47. Baselines
A: Pair-wise duplicate decisions w/ attributes only
Names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler
Other textual attributes: TF-IDF
A*: Transitive closure over A
A+N: Add attribute similarity of co-occurring refs
A+N*: Transitive closure over A+N
Evaluate pair-wise decisions over references
F1-measure (harmonic mean of precision and recall)
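The pairwise evaluation above treats every pair of references as a duplicate/non-duplicate decision. A minimal sketch, with made-up gold and predicted clusterings standing in for real data:

```python
# Sketch of pairwise F1 for entity resolution: compare the same-cluster
# pairs implied by the predicted clustering against the gold clustering.
from itertools import combinations

def pairs(clustering):
    """All same-cluster reference pairs implied by a clustering."""
    return {frozenset(p) for c in clustering for p in combinations(c, 2)}

gold = [{"r1", "r2", "r3"}, {"r4"}]   # true entities (toy data)
pred = [{"r1", "r2"}, {"r3", "r4"}]   # system output (toy data)

tp = len(pairs(pred) & pairs(gold))   # correctly predicted duplicate pairs
precision = tp / len(pairs(pred))
recall = tp / len(pairs(gold))
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.4
```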
48. ER over Entire Dataset
Algorithm   CiteSeer   arXiv   BioBase
A           0.980      0.976   0.568
A*          0.990      0.971   0.559
A+N         0.973      0.938   0.710
A+N*        0.984      0.934   0.753
RC-ER       0.995      0.985   0.818
RC-ER outperforms baselines in all datasets
Collective resolution better than naïve relational resolution
51. Privacy breaches in OSNs
Identity disclosure:
a mapping from a record to a specific individual (“Who is …?”)
Attribute disclosure:
find an attribute value that the user intended to stay private (“Is … liberal?”)
Social link disclosure:
participation in a sensitive relationship or communication (“Friends?”)
Affiliation link disclosure:
participation in a group revealing a sensitive attribute value (“Supports gay marriage?”)
52. Other Linqs Projects
Key Opinion Leader Identification
Active Surveying in Social Networks
Ontology Alignment and Folksonomy construction
Label Acquisition & Active Learning in Network Data
Inference & Search in Camera Networks
Identifying Roles in Social Networks
Group Recommendation in Social Networks
Social Search
Analysis of Dynamic Networks: loyalty, stability, diversity
Ranking and Retrieval in Biological Networks
Discourse-level sentiment analysis
Bilingual Word Sense Disambiguation
Visual Analytics:
D-Dupe, C-Group, G-Pare
Others …
http://www.cs.umd.edu/linqs
54. Conclusion
Link mining algorithms can be useful tools for social
media
Need algorithms that can handle the multi-modal,
multi-relational, temporal nature of social media
Collective algorithms make use of
structure to define features and propagate
information, allowing us to improve the overall accuracy
While there are important pitfalls to take into
account (confidence and privacy), there are
many potential benefits and payoffs (improved
personalization and context-aware predictions!)
55. http://www.cs.umd.edu/linqs
Work sponsored by the National Science Foundation,
Maryland Industrial Partners (MIPS), National Geospatial Agency,
Air Force Research Laboratory, DARPA,
Google, Microsoft, and Yahoo!