This document provides an overview of link mining and collective classification algorithms. It discusses how link mining can be used for tasks like node labeling, link prediction, entity resolution, and group detection on graph-structured data. It presents relational classifiers and collective classification as two common link mining algorithms. Relational classifiers extend traditional classifiers by incorporating relational features between linked nodes, while collective classification iteratively propagates predictions between linked nodes. The document provides examples of how these algorithms have been applied to problems like predicting ad click-through rates and friendships. It also discusses entity resolution and how relational clustering algorithms can leverage links between entities to improve resolution.
2. Alternate Title…
What
Machine Learning/Statistics/Data Mining
can do for YOU!
Supervised Learning:
1. Predict future values
2. Fill in missing values
3. Identify anomalies
Unsupervised Learning:
4. Find patterns
5. Identify clusters
What are some common machine learning algorithms?
3. So, what’s Link Mining???
Machine learning when you have graphs (or networks)
Nodes are entities
• People
• Places
• Organizations
• Text
Links are relationships
• Friends
• MemberOf
• LivesIn
• Tweeted
• Posted
e.g., heterogeneous multi-relational data, multimodal data, etc.
4. Ex: Social Media Relationships
User-User (Ua–Ub):
Friends
Collaborators
Family
Fan/Follower
Replies
Co-Edits
Co-Mentions, etc.
User-Doc (U–Doc1):
Comments
Edits, etc.
User-Query-Click (U–Q–URL)
User-Tag-Doc (U–Tag–Doc)
5. Link Mining Tasks
Node Labeling
Link Prediction
Entity Resolution
Group Detection
6. Node Labeling
What is Harry’s political persuasion?
Harry
Natasha
11. Deduplication Problem Statement
Cluster the records/mentions that correspond to
the same entity
Intensional Variant: Compute cluster representative
16. Link Mining Algorithms
Node Labeling
Link Prediction
Entity Resolution
Group Detection
17. Link Mining Algorithms
Node Labeling:
1. Relational Classifiers
2. Collective Classifiers
Link Prediction
Entity Resolution
Group Detection
18. Relational Classifiers
Given: a graph with entities (a–e), relations (1–5), and attribute values (w–z)
Task: Predict an attribute of some of the entities
Alternate task: Predict existence of a relationship between entities
Local features: attributes of the entity itself
Relational features, e.g.:
same-attribute-value
avg value of neighbors
number of shared neighbors
number of neighbors participating in a relation
19. Relational Classifiers
Values are represented as a fixed-length feature
vector
Instances are treated independently of each other
Relational features are computed by aggregating
over related entities
Any classification or regression model can be used
for learning and prediction
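The feature-construction idea above can be sketched in a few lines. The graph, attribute values, and feature choices below are made-up toy data, not the talk's examples; the point is that each node becomes one fixed-length row mixing local and neighbor-aggregated features, which any standard classifier can then consume.

```python
# Sketch (not the authors' code): build a fixed-length feature vector per
# node by aggregating over its neighbors, then feed the rows to any
# off-the-shelf classifier or regressor.
from statistics import mean

# Hypothetical toy graph: node -> attribute value, plus an adjacency list.
attrs = {"a": 1.0, "b": 2.0, "c": 2.0, "d": 4.0, "e": 5.0}
edges = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
         "d": ["c", "e"], "e": ["d"]}

def feature_vector(node):
    """Local feature plus relational features aggregated over neighbors."""
    nbrs = edges[node]
    return [
        attrs[node],                                  # local feature
        mean(attrs[n] for n in nbrs),                 # avg value of neighbors
        len(nbrs),                                    # degree
        sum(attrs[n] == attrs[node] for n in nbrs),   # same-attribute-value count
    ]

rows = {v: feature_vector(v) for v in attrs}
print(rows["c"])  # [2.0, 2.3333333333333335, 3, 1]
```

Because every node ends up with the same number of columns, the instances can be treated independently by the learner, exactly as the slide describes.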
20. Application Case Studies
Two example applications that use relational
classifiers
Focus is on types of relational features used
Case Study 1: Predicting click-through rate of
search result ads
Case Study 2: Predicting friendships in a social
network
21. Case Study 1:
Predicting Ad Click-Through Rate
Task: Predict the click-through rate (CTR) of an
online ad, given that it is seen by the user, where
the ad is described by:
URL to which user is sent when clicking on ad
Bid terms used to determine when to display ad
Title and text of ad
Our description is based on approach by
[Richardson et al., WWW07]
22. Relational Features Used
[Figure: ads Ad1–Ad6 linked to bid terms BT1–BT6; predict CTR of one ad]
Relational features aggregate (average CTR, count) over ads related through:
contains-bid-term (according to search engine)
related-bid-term (containing subsets or supersets of the term)
queried-bid-term
23. Case Study 2:
Predicting Friendships
Task: Predict new friendships among users, based
on their descriptive attributes, their existing
friendships, and their family ties.
Our description is based on approach by
[Zheleva et al., SNAKDD08]
24. Relational Features Used
“Petworks” - social networks of pets
[Figure: pet network P1–P11; task is to predict Friends? for a candidate pair]
Features: count, density, proportion, Jaccard coefficient
in-family
same-breed
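The Jaccard coefficient listed among the features is simple to compute: the overlap of the two candidates' neighbor sets divided by their union. The neighbor lists below are an invented stand-in, not the slide's data.

```python
# Sketch of the Jaccard-coefficient feature for a candidate friend pair:
# |N(u) & N(v)| / |N(u) | N(v)| over the pair's neighbor sets.
def jaccard(neighbors, u, v):
    nu, nv = set(neighbors[u]), set(neighbors[v])
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

# Hypothetical neighbor lists for the candidate pair.
neighbors = {
    "P1": ["P4", "P5", "P7"],
    "P2": ["P5", "P7", "P10"],
}
print(jaccard(neighbors, "P1", "P2"))  # 2 shared / 4 total = 0.5
```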
25. Key Idea: Feature Construction
Feature informativeness is key to the success of a
relational classifier
Features can be:
Attributes of entity/entities
Match predicate on attributes of entities
Attributes of related entities
Encode structural features
Based on overlap in sets
26. Link Mining Algorithms
Node Labeling:
1. Relational Classifiers
2. Collective Classifiers
Link Prediction
Entity Resolution
Group Detection
27. Collective Classification
[Neville & Jensen, SRL00; Lu & Getoor, ICML03, Sen et al. AI Mag08]
Extends relational classifiers by allowing relational
features to be functions of predicted attributes/relations
of neighbors
At training time, these features are computed based on
observed values in the training set
At inference time, the algorithm iterates, computing
relational features based on the current prediction for
any unobserved attributes
In the first, bootstrap, iteration, only local features are
used
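The bootstrap-then-iterate loop described above can be sketched in a few lines. This is an ICA-style toy, not the authors' implementation: the learned local and relational models are stubbed with simple rules, and the five-node graph and labels are invented.

```python
# Sketch of collective classification inference: bootstrap unobserved
# nodes from local features only, then iterate, re-predicting each
# unobserved node from its neighbors' current labels.
from collections import Counter

edges = {"P1": ["P2", "P4"], "P2": ["P1", "P3"],
         "P3": ["P2", "P5"], "P4": ["P1"], "P5": ["P3"]}
observed = {"P4": "red", "P5": "red"}                     # known labels
local_guess = {"P1": "blue", "P2": "blue", "P3": "blue"}  # bootstrap output

labels = {**observed, **local_guess}  # iteration 0: local features only

for _ in range(10):  # iterate until predictions stabilize
    updated = dict(labels)
    for node in local_guess:  # re-predict only the unobserved nodes
        counts = Counter(labels[n] for n in edges[node])
        # stand-in "relational model": majority label of neighbors,
        # ties broken deterministically by label name
        updated[node] = max(counts, key=lambda c: (counts[c], c))
    if updated == labels:
        break
    labels = updated

print(labels)  # the observed "red" labels propagate to all five nodes
```

The batch (synchronous) update shown here is one of the variation points the talk mentions; incremental updates and different node orderings change which fixed point the loop reaches.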
28. CC: Learning
[Figure: fully labeled training network, nodes P1–P10]
Learn models (local and relational) from
fully labeled training set
29. CC: Inference (1)
[Figure: unlabeled test network, nodes P1–P5]
Step 1: Bootstrap using entity attributes only
30. CC: Inference (2)
[Figure: test network, nodes P1–P5, with predicted labels]
Step 2: Iteratively update the category of each entity,
based on related entities’ categories
31. CC Key Idea
Rather than make predictions independently, begin
with a relational classifier, and then ‘propagate’
classifications
Variations:
Propagate probabilities, rather than mode (related to
Gibbs Sampling)
Batch vs. Incremental updates
Ordering strategies
Active area of research: active learning,
semi-supervised learning, more principled joint
probabilistic models, etc.
32. Link Mining Algorithms
Node Labeling
Link Prediction
Entity Resolution
Group Detection
33. The Entity Resolution Problem
Entities: John Smith, James Smith, Jonathan Smith
Mentions: “John Smith”, “Jim Smith”, “J Smith”,
“James Smith”, “Jon Smith”, “Jonthan Smith”
Issues:
1. Identification
2. Disambiguation
37. P1: “JOSTLE: Partitioning of Unstructured Meshes for
Massively Parallel Machines”, C. Walshaw, M. Cross,
M. G. Everett, S. Johnson
P2: “Partitioning Mapping of Unstructured Meshes to
Parallel Machine Topologies”, C. Walshaw, M. Cross, M.
G. Everett, S. Johnson, K. McManus
P3: “Dynamic Mesh Partitioning: A Unified Optimisation and
Load-Balancing Algorithm”, C. Walshaw, M. Cross, M.
G. Everett
P4: “Code Generation for Machines with Multiregister
Operations”, Alfred V. Aho, Stephen C. Johnson,
Jeffrey D. Ullman
P5: “Deterministic Parsing of Ambiguous Grammars”, A.
Aho, S. Johnson, J. Ullman
P6: “Compilers: Principles, Techniques, and Tools”, A. Aho,
R. Sethi, J. Ullman
39. Relational Clustering (RC-ER)
P1: C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: C. Walshaw, M. Cross, M. Everett, S. Johnson, K. McManus
P4: Alfred V. Aho, Jeffrey D. Ullman, Stephen C. Johnson
P5: A. Aho, J. Ullman, S. Johnson
43. Cut-based Formulation of RC-ER
[Figure: two alternative clusterings of the Johnson references]
Left: good separation of attributes, but many cluster-cluster
relationships (Aho-Johnson1, Aho-Johnson2,
Everett-Johnson1, Everett-Johnson2)
Right: worse in terms of attributes, but fewer cluster-cluster
relationships (Aho-Johnson1, Everett-Johnson2)
44. Objective Function
Minimize:
Σ_{i,j} [ w_A · sim_A(c_i, c_j) + w_R · sim_R(c_i, c_j) ]
(w_A: weight for similarity of attributes; w_R: weight for similarity
based on relational edges between c_i and c_j)
Greedy clustering algorithm: merge cluster pair with max
reduction in objective function:
sim(c_i, c_j) = w_A · sim_A(c_i, c_j) + w_R · |N(c_i) ∩ N(c_j)|
(similarity of attributes plus common cluster neighborhood)
45. Relational Clustering Algorithm
1. Find similar references using ‘blocking’
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert into
priority queue
4. Repeat until priority queue is empty
5. Find ‘closest’ cluster pair
6. Stop if similarity below threshold
7. Merge to create new cluster
8. Update similarity for ‘related’ clusters
O(nk log n) algorithm w/ efficient implementation
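The greedy merge loop in the steps above can be sketched with a priority queue. This is a toy, not the authors' implementation: blocking is skipped, bootstrap clusters are singletons, and attribute similarity is stubbed as exact-name match; the names and threshold are invented.

```python
# Sketch of greedy agglomerative RC-ER-style clustering: seed with
# singleton clusters, keep candidate pairs in a max-heap, and merge the
# closest pair while its similarity stays above a threshold.
import heapq
import itertools

names = {0: "S. Johnson", 1: "S. Johnson", 2: "Stephen C. Johnson"}
w_attr, threshold = 1.0, 0.5

def sim(ci, cj, clusters):
    # Stand-in attribute similarity: 1 if any two member names match.
    return w_attr * max(
        (1.0 if names[a] == names[b] else 0.0)
        for a in clusters[ci] for b in clusters[cj])

clusters = {i: {i} for i in names}          # bootstrap: one ref per cluster
heap = [(-sim(i, j, clusters), i, j)
        for i, j in itertools.combinations(clusters, 2)]
heapq.heapify(heap)

while heap:
    neg, i, j = heapq.heappop(heap)
    if i not in clusters or j not in clusters:
        continue                            # stale entry: cluster was merged
    if -neg < threshold:
        break                               # closest pair too dissimilar
    clusters[i] |= clusters.pop(j)          # merge j into i
    for k in clusters:                      # refresh pairs for merged cluster
        if k != i:
            heapq.heappush(heap, (-sim(i, k, clusters), i, k))

print(list(clusters.values()))  # refs 0 and 1 merge; ref 2 stays separate
```

In the full algorithm the refresh step would also fold in the relational term (common cluster neighborhood), which is what makes the resolution collective rather than purely attribute-based.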
46. Evaluation Datasets
CiteSeer
1,504 citations to machine learning papers (Lawrence et al.)
2,892 references to 1,165 author entities
arXiv
29,555 publications from High Energy Physics (KDD Cup’03)
58,515 refs to 9,200 authors
Elsevier BioBase
156,156 Biology papers (IBM KDD Challenge ’05)
831,991 author refs
Keywords, topic classifications, language, country and affiliation
of corresponding author, etc.
47. Baselines
A: Pair-wise duplicate decisions w/ attributes only
Names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler
Other textual attributes: TF-IDF
A*: Transitive closure over A
A+N: Add attribute similarity of co-occurring refs
A+N*: Transitive closure over A+N
Evaluate pair-wise decisions over references
F1-measure (harmonic mean of precision and recall)
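The pairwise evaluation above treats every pair of references as a duplicate/non-duplicate decision. A minimal sketch, with made-up gold and predicted clusterings standing in for real data:

```python
# Sketch of pairwise F1 for entity resolution: compare the same-cluster
# pairs implied by the predicted clustering against the gold clustering.
from itertools import combinations

def pairs(clustering):
    """All same-cluster reference pairs implied by a clustering."""
    return {frozenset(p) for c in clustering for p in combinations(c, 2)}

gold = [{"r1", "r2", "r3"}, {"r4"}]   # true entities (toy data)
pred = [{"r1", "r2"}, {"r3", "r4"}]   # system output (toy data)

tp = len(pairs(pred) & pairs(gold))   # correctly predicted duplicate pairs
precision = tp / len(pairs(pred))
recall = tp / len(pairs(gold))
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.4
```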
48. ER over Entire Dataset
Algorithm   CiteSeer   arXiv   BioBase
A           0.980      0.976   0.568
A*          0.990      0.971   0.559
A+N         0.973      0.938   0.710
A+N*        0.984      0.934   0.753
RC-ER       0.995      0.985   0.818
RC-ER outperforms baselines in all datasets
Collective resolution better than naïve relational resolution
51. Privacy breaches in OSNs
Identity disclosure:
a mapping from a record to a specific individual (“Who is …?”)
Attribute disclosure:
find an attribute value that the user intended to stay private (“Is … liberal?”)
Social link disclosure:
participation in a sensitive relationship or communication (“Friends?”)
Affiliation link disclosure:
participation in a group revealing a sensitive attribute value (“Supports gay marriage?”)
52. Other Linqs Projects
Key Opinion Leader Identification
Active Surveying in Social Networks
Ontology Alignment and Folksonomy construction
Label Acquisition & Active Learning in Network Data
Inference & Search in Camera Networks
Identifying Roles in Social Networks
Group Recommendation in Social Networks
Social Search
Analysis of Dynamic Networks: loyalty, stability, diversity
Ranking and Retrieval in Biological Networks
Discourse-level sentiment analysis
Bilingual Word Sense Disambiguation
Visual Analytics:
D-Dupe, C-Group, G-Pare
Others …
http://www.cs.umd.edu/linqs
54. Conclusion
Link mining algorithms can be useful tools for social
media
Need algorithms that can handle the multi-modal,
multi-relational, temporal nature of social media
Collective algorithms make use of
structure to define features and propagate
information, allowing us to improve the overall accuracy
While there are important pitfalls to take into
account (confidence and privacy), there are
many potential benefits and payoffs (improved
personalization and context-aware predictions!)
55. http://www.cs.umd.edu/linqs
Work sponsored by the National Science Foundation,
Maryland Industrial Partners (MIPS), National Geospatial Agency,
Air Force Research Laboratory, DARPA,
Google, Microsoft, and Yahoo!