Declarative Analysis of Noisy Information Networks
Walaa Eldin Moustafa
Galileo Namata
Amol Deshpande
Lise Getoor
University of Maryland

Outline

Motivations/Contributions
Framework
Declarative Language
Implementation
Results
Related and Future Work

Motivation
• Users/objects are modeled as nodes,
relationships as edges
• The observed networks are noisy and
incomplete.
– Some users may have more than one account
– Communication may contain a lot of spam
• Attributes and links may be missing, and multiple
references may point to the same entity
• Need to extract the underlying information
network.

Inference Operations
• Attribute Prediction
– To predict values of missing attributes
• Link Prediction
– To predict missing links
• Entity Resolution
– To predict if two references refer to the same entity
• These prediction tasks can use:
– Local node information
– Relational information surrounding the node

Attribute Prediction
Task: Predict the topic of the paper
• Use links between nodes (collective attribute prediction) [Sen et al., AI Magazine 2008]
[Figure: citation graph over the papers "A Statistical Model for Multilingual Entity Detection and Tracking", "Language Model Based Arabic Word Segmentation", "Automatic Rule Refinement for Information Extraction", "Why Not?", "Join Optimization of Information Extraction Output: Quality Matters!", "An Annotation Management System for Relational Databases", and "Tracing Lineage Beyond Relational Operators". Legend: D, NL, B, ?]

Link Prediction
• Goal: Predict new links
• Using local similarity
• Using relational similarity [Liben-Nowell et al., CIKM 2003]
[Figure: co-authorship graph over Graham Cormode, Flip Korn, Lukasz Golab, Divesh Srivastava, Avishek Saha, Vladislav Shkapenyuk, Theodore Johnson, and Nick Koudas]

Entity Resolution
• Goal: to deduce that two references refer to
the same entity
• Can be based on node attributes (local)
– e.g. string similarity between titles or author
names
• Local information only may not be enough

[Figure: two references, both named "Jian Li"]

Entity Resolution

• Use links between the nodes (collective entity resolution) [Bhattacharya et al., TKDD 2007]
[Figure: two "Jian Li" references shown with their neighbors Petre Stoica, Prabhu Babu, Amol Deshpande, Barna Saha, William Roberts, and Samir Khuller]

Joint Inference
• Each task helps others get better predictions.
• How to combine the tasks?
– One after the other (pipelined), or interleaved?
• GAIA:
– A Java library for joint learning and inference across
attribute prediction (AP), link prediction (LP), and entity
resolution (ER) tasks [Namata et al., MLG 2009; Namata et al., KDUD 2009]
– Inference can be pipelined or interleaved.

Our Goal and Contributions
• Motivation: To support declarative network
inference
• Desiderata:
– User declaratively specifies the prediction features
• Local features
• Relational features
– Declaratively specify tasks
• Attribute prediction, Link prediction, Entity resolution
– Specify arbitrary interleaving or pipelining
– Support for complex prediction functions
– Handle all of the above efficiently

Unifying Framework

[Diagram: Specify the domain → Compute features → Make predictions, and compute confidence in the predictions → Choose which predictions to apply]
• Specify the domain:
– For attribute prediction, the domain is a subset of the graph nodes.
– For link prediction and entity resolution, the domain is a subset of pairs of nodes.

Unifying Framework

• Compute features:
– Local: word frequency, income, etc.
– Relational: degree, clustering coefficient, number of neighbors with each attribute value, common neighbors between pairs of nodes, etc.

Unifying Framework

• Make predictions, and compute confidence in the predictions:
– Attribute prediction: the missing attribute
– Link prediction: add link or not?
– Entity resolution: merge two nodes or not?

Unifying Framework

• Apply the chosen predictions. After predictions are made, the graph changes:
– Attribute prediction changes local attributes.
– Link prediction changes the graph links.
– Entity resolution changes both local attributes and graph links.
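
The four framework steps can be read as one generic pass over the graph. The Python sketch below is illustrative only: `run_task`, the callback names, the dictionary-based graph, and the majority-vote predictor are all assumptions made for this example, not part of the system described in these slides.

```python
def run_task(graph, domain, features, predict, apply, threshold=0.9):
    """One pass of the four-step framework over a mutable graph.

    domain(graph)      -> iterable of prediction elements (nodes or node pairs)
    features(graph, e) -> feature vector for element e
    predict(fv)        -> (prediction, confidence)
    apply(graph, e, p) -> mutates the graph with an accepted prediction
    Returns the number of predictions applied.
    """
    scored = []
    for element in domain(graph):
        prediction, confidence = predict(features(graph, element))
        scored.append((element, prediction, confidence))
    # Choose which predictions to apply (here: a simple confidence threshold).
    accepted = [(e, p) for e, p, c in scored if c >= threshold]
    for e, p in accepted:
        apply(graph, e, p)
    return len(accepted)

# Toy attribute prediction: label a node with the majority label of its neighbors.
graph = {
    "labels": {"a": "DB", "b": "DB", "c": "NL", "d": None},
    "edges": {"a": {"d"}, "b": {"d"}, "c": {"d"}, "d": {"a", "b", "c"}},
}

def unlabeled_nodes(g):
    return [n for n, label in g["labels"].items() if label is None]

def neighbor_label_counts(g, node):
    counts = {}
    for nb in g["edges"][node]:
        label = g["labels"][nb]
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts

def majority_vote(counts):
    if not counts:
        return None, 0.0
    best = max(counts, key=counts.get)
    return best, counts[best] / sum(counts.values())

def set_label(g, node, label):
    g["labels"][node] = label

applied = run_task(graph, unlabeled_nodes, neighbor_label_counts,
                   majority_vote, set_label, threshold=0.6)
```

All predictions in a pass are scored against the graph as it was before the pass, and only then applied, which is one possible reading of the "predict, then choose, then apply" ordering.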

Datalog
• Use Datalog to express:
– Domains
– Local and relational features
• Extend Datalog with operational semantics
(vs. fix-point semantics) to express:
– Predictions (in the form of updates)
– Iteration

Specifying Features

Degree:
Degree(X, COUNT<Y>) :- Edge(X, Y)

Number of neighbors with attribute 'A':
NumNeighbors(X, COUNT<Y>) :- Edge(X, Y), Node(Y, Att='A')

Clustering coefficient:
NeighborCluster(X, COUNT<Y,Z>) :- Edge(X,Y), Edge(X,Z), Edge(Y,Z)
ClusteringCoeff(X, C) :- NeighborCluster(X,N), Degree(X,D), C=2*N/(D*(D-1))

Jaccard coefficient:
IntersectionCount(X, Y, COUNT<Z>) :- Edge(X, Z), Edge(Y, Z)
UnionCount(X, Y, D) :- Degree(X,D1), Degree(Y,D2), IntersectionCount(X, Y, D3), D=D1+D2-D3
Jaccard(X, Y, J) :- IntersectionCount(X, Y, N), UnionCount(X, Y, D), J=N/D
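
For readers who want to check the feature definitions on a small graph, here is a minimal Python sketch of degree, clustering coefficient, and Jaccard coefficient. It assumes an undirected graph stored as adjacency sets and counts each linked neighbor pair once; the function names and the example graph are hypothetical.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (x, y) edges."""
    adj = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj

def degree(adj, x):
    return len(adj[x])

def clustering_coeff(adj, x):
    d = degree(adj, x)
    if d < 2:
        return 0.0
    # Count unordered neighbor pairs {y, z} that are themselves linked.
    links = sum(1 for y, z in combinations(adj[x], 2) if z in adj[y])
    return 2 * links / (d * (d - 1))

def jaccard(adj, x, y):
    # |N(x) ∩ N(y)| / |N(x) ∪ N(y)|, where the union size equals D1 + D2 - D3.
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = build_adjacency(edges)
```

In this graph node `c` has three neighbors, of which only the pair (`a`, `b`) is linked, so its clustering coefficient is 2·1/(3·2) = 1/3.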

Specifying Domains
• Domains are used to restrict the space of
computation for the prediction elements.
• Without a domain, the space for this feature is |V|^2:
Similarity(X, Y, S) :- Node(X, Att=V1), Node(Y, Att=V2), S=f(V1, V2)
• Using this domain, the space becomes |E|:
DOMAIN D(X,Y) :- Edge(X, Y)
• Other DOMAIN predicates:
– Equality
– Locality sensitive hashing
– String similarity joins
– Traverse edges
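
The effect of a domain on the candidate space can be illustrated with a short Python sketch. The helper names and the toy data are assumptions; the sketch enumerates unordered pairs and shows only the edge-based and equality-based domains from the list above.

```python
from itertools import combinations

def all_pairs(nodes):
    """Unrestricted candidate space: every unordered pair of nodes."""
    return list(combinations(sorted(nodes), 2))

def edge_domain(edges):
    """Pairs joined by an edge, in the spirit of DOMAIN D(X,Y) :- Edge(X,Y)."""
    return sorted({tuple(sorted(e)) for e in edges})

def equality_domain(attrs):
    """Equality-style blocking: pairs of nodes that share an attribute value."""
    blocks = {}
    for node, value in attrs.items():
        blocks.setdefault(value, []).append(node)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(sorted(members), 2))
    return sorted(pairs)

nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
attrs = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "x"}

unrestricted = all_pairs(nodes)
by_edge = edge_domain(edges)
by_equality = equality_domain(attrs)
```

With five nodes the unrestricted space has ten pairs, while the edge domain keeps three and the equality domain keeps four, so features are computed for far fewer elements.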

Feature Vector
• Features of prediction elements are combined in
a single predicate to create the feature vector:
DOMAIN D(X, Y) :- …
{
P1(X, Y, F1) :- …
…
Pn(X, Y, Fn) :- …
Features(X, Y, F1, …, Fn) :- P1(X, Y, F1), …, Pn(X, Y, Fn)
}

Update Operation
DEFINE Merge(X, Y)
{
INSERT Edge(X, Z) :- Edge(Y, Z)
DELETE Edge(Y, Z)
UPDATE Node(X, A=ANew) :- Node(X,A=AX), Node(Y,A=AY), ANew=(AX+AY)/2
UPDATE Node(X, B=BNew) :- Node(X,B=BX), Node(Y,B=BY), BNew=max(BX,BY)
DELETE Node(Y)
}
Merge(X, Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) > 0.95
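
The effect of the Merge update on a small graph can be traced with the following Python sketch. It is an approximation under stated assumptions: edges are undirected and stored as two-element frozensets, there are no self-loops, and the edge between the two merged nodes is dropped rather than turned into a self-loop. The data layout and function name are hypothetical.

```python
def merge(nodes, edges, x, y):
    """Merge node y into node x, mirroring the Merge(X, Y) update block.

    nodes: dict node -> {"A": number, "B": number}
    edges: set of frozenset({u, v}) undirected edges without self-loops
    """
    # INSERT Edge(X, Z) :- Edge(Y, Z), then DELETE Edge(Y, Z)
    for edge in [e for e in edges if y in e]:
        edges.discard(edge)
        (z,) = edge - {y}
        if z != x:  # assumption: drop the x-y edge instead of creating a self-loop
            edges.add(frozenset({x, z}))
    # UPDATE: A becomes the average, B becomes the maximum of the two values
    nodes[x]["A"] = (nodes[x]["A"] + nodes[y]["A"]) / 2
    nodes[x]["B"] = max(nodes[x]["B"], nodes[y]["B"])
    # DELETE Node(Y)
    del nodes[y]

nodes = {
    "x": {"A": 2.0, "B": 5},
    "y": {"A": 4.0, "B": 9},
    "z": {"A": 1.0, "B": 1},
}
edges = {frozenset({"x", "y"}), frozenset({"y", "z"})}
merge(nodes, edges, "x", "y")
```

After the call, `y` is gone, its edge to `z` has been redirected to `x`, and `x` carries the averaged `A` and the maximum `B`.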

Prediction and Confidence Functions

• The prediction and confidence functions are
user defined functions
• Can be based on logistic regression, Bayes
classifier, or any other classification algorithm
• The confidence is the class membership value
– In logistic regression, the confidence can be the
value of the logistic function
– In a Bayes classifier, the confidence can be the
posterior probability value

Iteration
• Iteration is supported by the ITERATE construct.
• It takes the number of iterations as a
parameter, or * to iterate until no more
predictions are made.
• Example:
ITERATE(*)
{
Merge(X,Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) IN TOP 10%
}
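
The control flow of ITERATE with an IN TOP 10% filter can be sketched in Python as follows. This is a hedged illustration, not the system's evaluation strategy: the `iterate` function, its callbacks, and the rule that at least one prediction is applied per round are assumptions chosen so that the loop always makes progress.

```python
def iterate(candidates, score, apply, top_percent=10, max_iters=None):
    """Repeat: rank pending predictions, apply the top share, stop when none remain.

    candidates() -> list of elements still predicted positive
    score(e)     -> confidence of the prediction for e
    apply(e)     -> commit the prediction (must change what candidates() returns)
    max_iters=None plays the role of ITERATE(*).
    """
    rounds = 0
    while max_iters is None or rounds < max_iters:
        pending = candidates()
        if not pending:
            break
        # Ceiling of top_percent% of the pending elements, at least one.
        k = max(1, (len(pending) * top_percent + 99) // 100)
        for element in sorted(pending, key=score, reverse=True)[:k]:
            apply(element)
        rounds += 1
    return rounds

confidences = {f"pair{i}": i / 20 for i in range(20)}
committed = []

def remaining():
    return [e for e in confidences if e not in committed]

# One round, as in ITERATE(1): the two most confident of 20 pairs are applied.
rounds_fixed = iterate(remaining, confidences.get, committed.append, max_iters=1)
first_round = list(committed)
# Then run to exhaustion, as in ITERATE(*).
rounds_star = iterate(remaining, confidences.get, committed.append)
```

In a real run the confidences would be recomputed from the updated graph in every round; the fixed dictionary here only keeps the example small.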

Pipelining
DOMAIN ER(X,Y) :- …
{
ER1(X,Y,F1) :- …
ER2(X,Y,F1) :- …
Features-ER(X,Y,F1,F2) :- …
}
DOMAIN LP(X,Y) :- …
{
LP1(X,Y,F1) :- …
LP2(X,Y,F1) :- …
Features-LP(X,Y,F1,F2) :- …
}

ITERATE(*)
{
INSERT Edge(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%
}
ITERATE(*)
{
Merge(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%
}

Interleaving
DOMAIN ER(X,Y) :- …
{
ER1(X,Y,F1) :- …
ER2(X,Y,F1) :- …
Features-ER(X,Y,F1,F2) :- …
}
DOMAIN LP(X,Y) :- …
{
LP1(X,Y,F1) :- …
LP2(X,Y,F1) :- …
Features-LP(X,Y,F1,F2) :- …
}

ITERATE(*)
{
INSERT Edge(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%
Merge(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%
}
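
The difference between pipelining and interleaving is only in how the two update rules are grouped into ITERATE blocks. The Python sketch below makes that concrete with a deliberately artificial toy state in which one merge exposes one additional candidate link; every name in it is hypothetical.

```python
def run_until_stable(steps):
    """Run all steps in order, repeatedly, until a full round changes nothing."""
    while sum(step() for step in steps) > 0:
        pass

def pipelined(lp_step, er_step):
    run_until_stable([lp_step])           # first ITERATE(*) block: link prediction
    run_until_stable([er_step])           # second ITERATE(*) block: entity resolution

def interleaved(lp_step, er_step):
    run_until_stable([lp_step, er_step])  # single ITERATE(*) block with both rules

def make_toy():
    """Toy state: one link to add, one merge to do; the merge exposes one more link."""
    state = {"links_to_add": 1, "merges_to_do": 1}
    trace = []

    def lp_step():
        if state["links_to_add"] > 0:
            state["links_to_add"] -= 1
            trace.append("LP")
            return 1
        return 0

    def er_step():
        if state["merges_to_do"] > 0:
            state["merges_to_do"] -= 1
            state["links_to_add"] += 1  # the merge exposes a new candidate link
            trace.append("ER")
            return 1
        return 0

    return lp_step, er_step, trace

lp, er, pipelined_trace = make_toy()
pipelined(lp, er)

lp, er, interleaved_trace = make_toy()
interleaved(lp, er)
```

In the pipelined run, link prediction has already finished when the merge happens, so the newly exposed link is never added; in the interleaved run it is picked up in the next round.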

Implementation
• Prototype based on Java Berkeley DB
• Implemented a query parser, plan generator,
query evaluation engine
• Incremental maintenance:
– Aggregate/non-aggregate incremental
maintenance
– DOMAIN maintenance

Incremental Maintenance
• Predicates in the program correspond to materialized tables
(key/value maps).
• Every set of changes made by AP, LP, or ER is logged into two
change tables, ΔNodes and ΔEdges.
– Insertions: (Record, +1)
– Deletions: (Record, -1)
– Updates: a deletion followed by an insertion
• Aggregate maintenance is performed by aggregating the
change table then refreshing the old table.
• DOMAIN: a program of the form
DOMAIN L(X) :- Subgoals of L
{
P1(X,Y) :- Subgoals of P1
}
corresponds to
L(X) :- Subgoals of L
P1’(X) :- L(X), Subgoals of P1
P1(X) :- L(X) >> Subgoals of P1
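
Aggregate maintenance from a change table can be illustrated with the Degree feature. The Python sketch below is a minimal example under assumptions: edges are undirected, the change table is a list of (edge, +1 or -1) records, and the materialized table is a plain dictionary. The function names are hypothetical and this is not the prototype's Berkeley DB code.

```python
from collections import Counter

def aggregate_delta(delta_edges):
    """Aggregate ((x, y), +1/-1) change records into per-node degree changes."""
    delta = Counter()
    for (x, y), sign in delta_edges:
        delta[x] += sign
        delta[y] += sign
    return delta

def refresh_degree(degree, delta_edges):
    """Refresh the materialized Degree table in place from the aggregated changes."""
    for node, change in aggregate_delta(delta_edges).items():
        new_value = degree.get(node, 0) + change
        if new_value:
            degree[node] = new_value
        else:
            degree.pop(node, None)  # a node with no edges has no Degree row
    return degree

# Degree table materialized from the edges a-b and a-c.
degree = {"a": 2, "b": 1, "c": 1}
# Change table: delete a-c, insert b-c and c-d.
delta_edges = [(("a", "c"), -1), (("b", "c"), +1), (("c", "d"), +1)]
refresh_degree(degree, delta_edges)
```

Only the nodes touched by the change table are visited, which is the point of maintaining the feature incrementally instead of recomputing it over all edges.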

Synthetic Experiments
• Synthetic graphs generated using the forest fire and
preferential attachment generation models.
• Three tasks:
– Attribute Prediction, Link Prediction and Entity Resolution
• Two approaches:
– Recomputing features after every iteration
– Incremental maintenance
• Varied parameters:
– Graph size
– Graph density
– Confidence threshold (update size)

Changing Graph Size
• Varied the graph size from 20K nodes and
200K edges to 100K nodes and 1M edges

Comparison with Derby
• Compared the evaluation of 4 features:
degree, clustering coefficient, common
neighbors and Jaccard.

Real-world Experiment
• Real-world PubMed graph
– Set of publications from the medical domain, their
abstracts, and citations
• 50,634 publications, 115,323 citation edges
• Task: Attribute prediction
– Predict if the paper is categorized as Cognition, Learning,
Perception or Thinking
• Choose top 10% predictions after each iteration, for
10 iterations
• Incremental: 28 minutes. Recompute: 42 minutes

Program
DOMAIN Uncommitted(X) :- Node(X, Committed='no')
{
ThinkingNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Thinking')
PerceptionNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Perception')
CognitionNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Cognition')
LearningNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Learning')
Features-AP(X,A,B,C,D,Abstract) :- ThinkingNeighbors(X,A), PerceptionNeighbors(X,B), CognitionNeighbors(X,C), LearningNeighbors(X,D), Node(X,Abstract,_,_)
}
ITERATE(10)
{
UPDATE Node(X,_,P,'yes') :- Features-AP(X,A,B,C,D,Text), P = predict-AP(X,A,B,C,D,Text), confidence-AP(X,A,B,C,D,Text) IN TOP 10%
}

Related Work
• Dedupalog [Arasu et al., ICDE 2009]:
– Datalog-based entity resolution
• User defines hard and soft rules for deduplication
• System satisfies hard rules and minimizes violations of
soft rules when deduplicating references
• Swoosh [Benjelloun et al., VLDBJ 2008]:
– Generic entity resolution
• Match function for pairs of nodes (based on a set of
features)
• Merge function determines which pairs should be
merged

Conclusions and Ongoing Work
• Conclusions:
– We built a declarative system to specify graph
inference operations
– We implemented the system on top of Berkeley DB
and implemented incremental maintenance
techniques
• Future work:
– Direct computation of top-k predictions
– Multi-query evaluation (especially on graphs)
– Employing a graph DB engine (e.g. Neo4j)
– Support recursive queries and recursive view
maintenance

References
• [Sen et al., AI Magazine 2008]
– P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective Classification in Network Data. AI Magazine, 29(3):93–106, 2008.
• [Liben-Nowell et al., CIKM 2003]
– D. Liben-Nowell and J. M. Kleinberg. The link prediction problem for social networks. In CIKM, 2003.
• [Bhattacharya et al., TKDD 2007]
– I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM TKDD, 1:1–36, 2007.
• [Namata et al., MLG 2009]
– G. Namata and L. Getoor. A Pipeline Approach to Graph Identification. MLG Workshop, 2009.
• [Namata et al., KDUD 2009]
– G. Namata and L. Getoor. Identifying Graphs From Noisy and Incomplete Data. SIGKDD Workshop on Knowledge Discovery from Uncertain Data, 2009.
• [Arasu et al., ICDE 2009]
– A. Arasu, C. Re, and D. Suciu. Large-scale deduplication with constraints using Dedupalog. In ICDE, 2009.
• [Benjelloun et al., VLDBJ 2008]
– O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: a generic approach to entity resolution. The VLDB Journal, 2008.
