Declarative Analysis of Noisy
  Information Networks
         Walaa Eldin Moustafa
            Galileo Namata
           Amol Deshpande
              Lise Getoor
         University of Maryland
Outline

Motivations/Contributions
       Framework
  Declarative Language
    Implementation
         Results
Related and Future Work
Motivation
• Users/objects are modeled as nodes,
  relationships as edges
• The observed networks are noisy and
  incomplete.
  – Some users may have more than one account
  – Communication may contain a lot of spam
• Problems include missing attributes, missing links, and
  multiple references to the same entity
• Need to extract the underlying information
  network.
Inference Operations
• Attribute Prediction
   – To predict values of missing attributes
• Link Prediction
   – To predict missing links
• Entity Resolution
   – To predict if two references refer to the same entity
• These prediction tasks can use:
   – Local node information
   – Relational information surrounding the node
Attribute Prediction
Task: Predict the topic of the paper
• Use links between nodes (collective attribute prediction) [Sen et al., AI Magazine 2008]
[Figure: graph of linked papers: "A Statistical Model for Multilingual Entity Detection and Tracking", "Language Model Based Arabic Word Segmentation", "Automatic Rule Refinement for Information Extraction", "Why Not?", "Join Optimization of Information Extraction Output: Quality Matters!", "An Annotation Management System for Relational Databases", "Tracing Lineage Beyond Relational Operators". Legend: DB, NL, ?]
Link Prediction
• Goal: Predict new links
• Using local similarity
• Using relational similarity [Liben-Nowell et al., CIKM 2003]
[Figure: graph with nodes Graham Cormode, Flip Korn, Lukasz Golab, Divesh Srivastava, Avishek Saha, Vladislav Shkapenyuk, Theodore Johnson, Nick Koudas]
Entity Resolution
• Goal: to deduce that two references refer to
  the same entity
• Can be based on node attributes (local)
  – e.g. string similarity between titles or author
    names
• Local information alone may not be enough
[Figure: two references, both named "Jian Li"]
Entity Resolution

• Use links between the nodes (collective entity
  resolution) [Bhattacharya et al., TKDD 2007]
[Figure: two "Jian Li" references linked to Petre Stoica, Prabhu Babu, Amol Deshpande, Barna Saha, William Roberts, Samir Khuller]
Joint Inference
• Each task helps others get better predictions.
• How to combine the tasks?
  – One after the other (pipelined), or interleaved?
• GAIA:
  – A Java library for applying multiple joint AP, LP, ER
    learning and inference tasks: [Namata et al., MLG
    2009, Namata et al., KDUD 2009]
  – Inference can be pipelined or interleaved.
Our Goal and Contributions
• Motivation: To support declarative network
  inference
• Desiderata:
   – User declaratively specifies the prediction features
      • Local features
      • Relational features
   – Declaratively specify tasks
      • Attribute prediction, Link prediction, Entity resolution
   – Specify arbitrary interleaving or pipelining
   – Support for complex prediction functions

                    Handle all that efficiently
Unifying Framework
• Specify the domain
   – For attribute prediction, the domain is a subset of the graph nodes.
   – For link prediction and entity resolution, the domain is a subset of pairs of nodes.
• Compute features
   – Local: word frequency, income, etc.
   – Relational: degree, clustering coeff., no. of neighbors with each attribute value, common neighbors between pairs of nodes, etc.
• Make predictions, and compute confidence in the predictions
   – Attribute prediction: the missing attribute
   – Link prediction: add link or not?
   – Entity resolution: merge two nodes or not?
• Choose which predictions to apply
   – After predictions are made, the graph changes:
      • Attribute prediction changes local attributes.
      • Link prediction changes the graph links.
      • Entity resolution changes both local attributes and graph links.
Datalog
• Use Datalog to express:
  – Domains
  – Local and relational features
• Extend Datalog with operational semantics
  (vs. fix-point semantics) to express:
  – Predictions (in the form of updates)
  – Iteration
Specifying Features

Degree:
Degree(X, COUNT<Y>) :- Edge(X, Y)

Number of neighbors with attribute 'A':
NumNeighbors(X, COUNT<Y>) :- Edge(X, Y), Node(Y, Att='A')

Clustering coefficient:
NeighborCluster(X, COUNT<Y,Z>) :- Edge(X, Y), Edge(X, Z), Edge(Y, Z)
ClusteringCoeff(X, C) :- NeighborCluster(X, N), Degree(X, D), C=2*N/(D*(D-1))

Jaccard coefficient:
IntersectionCount(X, Y, COUNT<Z>) :- Edge(X, Z), Edge(Y, Z)
UnionCount(X, Y, D) :- Degree(X, D1), Degree(Y, D2), IntersectionCount(X, Y, D3), D=D1+D2-D3
Jaccard(X, Y, J) :- IntersectionCount(X, Y, N), UnionCount(X, Y, D), J=N/D
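
The feature rules above can be mirrored in ordinary code to check what they compute. The following is a minimal Python sketch, not the authors' implementation; it assumes an undirected graph stored as adjacency sets, and the helper names (build_adjacency, clustering_coeff, jaccard) are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree(adj, x):
    """Degree(X, COUNT<Y>) :- Edge(X, Y)"""
    return len(adj[x])


def clustering_coeff(adj, x):
    """Fraction of neighbor pairs of x that are themselves linked."""
    d = degree(adj, x)
    if d < 2:
        return 0.0
    # Count unordered neighbor pairs (y, z) that share an edge.
    n = sum(1 for y, z in combinations(adj[x], 2) if z in adj[y])
    return 2 * n / (d * (d - 1))


def jaccard(adj, x, y):
    """Common neighbors divided by the size of the neighbor union."""
    inter = len(adj[x] & adj[y])
    union = len(adj[x] | adj[y])
    return inter / union if union else 0.0


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = build_adjacency(edges)
```

Here the union is taken directly over neighbor sets, which gives the same value as D1+D2-D3 in the UnionCount rule.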
Specifying Domains
• Domains are used to restrict the space of
  computation for the prediction elements.
• Space for this feature is |V|²:
     Similarity(X, Y, S) :- Node(X, Att=V1), Node(Y, Att=V2),
                            S=f(V1, V2)
• Using this domain, the space becomes |E|:
             DOMAIN D(X,Y) :- Edge(X, Y)
• Other DOMAIN predicates:
–   Equality
–   Locality sensitive hashing
–   String similarity joins
–   Traverse edges
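
To illustrate how a domain shrinks the space of prediction elements, here is a small Python sketch under stated assumptions: the similarity function is a hypothetical stand-in for f(V1, V2) (token overlap), pairs are treated as unordered, and the node values are made up for the example.

```python
from itertools import combinations


def similarity(v1, v2):
    """Hypothetical stand-in for f(V1, V2): Jaccard overlap of tokens."""
    t1, t2 = set(v1.split()), set(v2.split())
    return len(t1 & t2) / len(t1 | t2)


def all_pairs(nodes):
    """No domain: every unordered pair of nodes is a candidate."""
    return list(combinations(sorted(nodes), 2))


def edge_domain(edges):
    """DOMAIN D(X,Y) :- Edge(X, Y): only linked pairs are candidates."""
    return sorted({tuple(sorted(e)) for e in edges})


def similarity_over(domain, nodes):
    """Evaluate the Similarity feature only on pairs in the domain."""
    return {(x, y): similarity(nodes[x], nodes[y]) for x, y in domain}


nodes = {
    "n1": "graph mining",
    "n2": "graph mining",
    "n3": "query optimization",
    "n4": "entity resolution",
}
edges = [("n1", "n2"), ("n2", "n3")]
scores = similarity_over(edge_domain(edges), nodes)
```

With four nodes the unrestricted space has six unordered pairs, while the edge domain evaluates the feature on only two.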
Feature Vector
• Features of prediction elements are combined in
  a single predicate to create the feature vector:
  DOMAIN D(X, Y) :- …
  {
    P1(X, Y, F1) :- …
    …
    Pn(X, Y, Fn) :- …
    Features(X, Y, F1, …, Fn) :- P1(X, Y, F1), …, Pn(X, Y, Fn)
  }
Update Operation
DEFINE Merge(X, Y)
{
  INSERT Edge(X, Z) :- Edge(Y, Z)
  DELETE Edge(Y, Z)
  UPDATE Node(X, A=ANew) :- Node(X, A=AX), Node(Y, A=AY), ANew=(AX+AY)/2
  UPDATE Node(X, B=BNew) :- Node(X, B=BX), Node(Y, B=BY), BNew=max(BX,BY)
  DELETE Node(Y)
}
Merge(X, Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true,
  confidence-ER(F1,…,Fn) > 0.95
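
As a rough illustration of what the Merge update does to the graph, the following Python sketch folds node y into node x. It is an assumption-laden sketch, not the system's code: edges are stored as a set of directed pairs, incoming edges of y are redirected as well, and self-loops created by the merge are dropped; the slide specifies neither of the last two choices.

```python
def merge(nodes, edges, x, y):
    """Merge node y into node x and return the new (nodes, edges).

    nodes: dict mapping node id -> {"A": number, "B": number}
    edges: set of (u, v) pairs
    """
    merged_edges = set()
    for u, v in edges:
        # Redirect every endpoint that was y to x (assumption: both directions).
        u2 = x if u == y else u
        v2 = x if v == y else v
        if u2 != v2:  # assumption: drop self-loops produced by the merge
            merged_edges.add((u2, v2))

    ax, ay = nodes[x], nodes[y]
    # DELETE Node(Y): copy every node except y.
    merged_nodes = {k: dict(v) for k, v in nodes.items() if k != y}
    # UPDATE Node(X, ...): A is averaged, B takes the maximum.
    merged_nodes[x] = {
        "A": (ax["A"] + ay["A"]) / 2,
        "B": max(ax["B"], ay["B"]),
    }
    return merged_nodes, merged_edges


nodes = {
    "x": {"A": 2.0, "B": 5},
    "y": {"A": 4.0, "B": 9},
    "z": {"A": 1.0, "B": 1},
}
edges = {("x", "y"), ("y", "z"), ("z", "x")}
new_nodes, new_edges = merge(nodes, edges, "x", "y")
```

The function returns new structures instead of mutating its inputs, which makes it easy to compare the graph before and after a merge.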
Prediction and Confidence Functions

• The prediction and confidence functions are
  user defined functions
• Can be based on logistic regression, Bayes
  classifier, or any other classification algorithm
• The confidence is the class membership value
  – In logistic regression, the confidence can be the
    value of the logistic function
  – In Bayes classifier, the confidence can be the
    posterior probability value
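
A prediction and confidence pair based on logistic regression might look like the Python sketch below. The weights, bias, and function names are made-up assumptions for illustration; in practice they would come from a trained model. The confidence here is the logistic value for the positive class.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def confidence_er(features, weights, bias):
    """Logistic value for the positive ("merge") class."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return logistic(z)


def predict_er(features, weights, bias):
    """Predict a merge when the positive class is at least as likely."""
    return confidence_er(features, weights, bias) >= 0.5


# Hypothetical model parameters, chosen only for the example.
WEIGHTS = [4.0, 2.0]
BIAS = -3.0

strong = confidence_er([1.0, 1.0], WEIGHTS, BIAS)
weak = confidence_er([0.0, 0.0], WEIGHTS, BIAS)
```

A rule such as confidence-ER(F1,…,Fn) > 0.95 would then keep the first pair and reject the second.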
Iteration
• Iteration is supported by the ITERATE construct.
• Takes the number of iterations as a
   parameter, or * to iterate until no more
   predictions are made.
• Example:
  ITERATE(*)
  {
     MERGE(X,Y) :- Features(X, Y, F1,…,Fn),
                predict-ER(F1,…,Fn) = true,
                confidence-ER(F1,…,Fn) IN TOP 10%
  }
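
The control flow of ITERATE can be sketched in Python as below. This is one possible reading, stated as an assumption: "IN TOP 10%" is taken to mean the top 10% of positive predictions by confidence in each round (at least one), and the callables candidates, score, and apply are hypothetical placeholders.

```python
import math


def iterate(candidates, score, apply, fraction=0.10, max_iters=None):
    """Apply the most confident predictions round by round.

    candidates: callable returning the current prediction elements
    score:      callable element -> (predicted: bool, confidence: float)
    apply:      callable that applies one prediction (may change state)
    max_iters:  number of rounds, or None to run until no predictions remain
    Returns the number of rounds in which predictions were applied.
    """
    rounds = 0
    while max_iters is None or rounds < max_iters:
        positives = []
        for c in candidates():
            predicted, conf = score(c)
            if predicted:
                positives.append((conf, c))
        if not positives:
            break  # no more predictions: the ITERATE(*) stopping condition
        positives.sort(key=lambda p: p[0], reverse=True)
        k = max(1, math.ceil(fraction * len(positives)))
        for _, c in positives[:k]:
            apply(c)
        rounds += 1
    return rounds


# Toy state: each pending element carries a fixed confidence.
pending = {"p1": 0.9, "p2": 0.8, "p3": 0.3}
applied = []


def apply_prediction(c):
    applied.append(c)
    del pending[c]


rounds = iterate(
    candidates=lambda: list(pending),
    score=lambda c: (pending[c] >= 0.5, pending[c]),
    apply=apply_prediction,
)
```

In the real system the features, and therefore the scores, change after each round because applied predictions modify the graph; the toy scores here are fixed to keep the example small.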
Pipelining
DOMAIN ER(X,Y) :- …
{
  ER1(X,Y,F1) :- …
  ER2(X,Y,F2) :- …
  Features-ER(X,Y,F1,F2) :- …
}
DOMAIN LP(X,Y) :- …
{
  LP1(X,Y,F1) :- …
  LP2(X,Y,F2) :- …
  Features-LP(X,Y,F1,F2) :- …
}

ITERATE(*)
{
   INSERT EDGE(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2),
   confidence-LP(X,Y,F1,F2) IN TOP 10%
}
ITERATE(*)
{
   MERGE(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2),
   confidence-ER(X,Y,F1,F2) IN TOP 10%
}
Interleaving
DOMAIN ER(X,Y) :- …
{
  ER1(X,Y,F1) :- …
  ER2(X,Y,F2) :- …
  Features-ER(X,Y,F1,F2) :- …
}
DOMAIN LP(X,Y) :- …
{
  LP1(X,Y,F1) :- …
  LP2(X,Y,F2) :- …
  Features-LP(X,Y,F1,F2) :- …
}

ITERATE(*)
{
   INSERT EDGE(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2),
   confidence-LP(X,Y,F1,F2) IN TOP 10%

   MERGE(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2),
   confidence-ER(X,Y,F1,F2) IN TOP 10%
}
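
The difference between the pipelined and interleaved programs is only in how the two update steps are scheduled. The Python sketch below shows that scheduling with toy steps; the step functions and the state dictionary are hypothetical, and each step returns whether it made a prediction.

```python
def run_to_fixpoint(step, state):
    """Repeat one task until it makes no more predictions (ITERATE(*))."""
    while step(state):
        pass


def pipelined(lp_step, er_step, state):
    """Two ITERATE(*) blocks in sequence: finish LP, then run ER."""
    run_to_fixpoint(lp_step, state)
    run_to_fixpoint(er_step, state)


def interleaved(lp_step, er_step, state):
    """One ITERATE(*) block holding both updates: alternate each round."""
    while True:
        changed_lp = lp_step(state)
        changed_er = er_step(state)
        if not (changed_lp or changed_er):
            break


def make_step(name):
    """Toy task that has a fixed number of predictions left to make."""
    def step(state):
        if state[name] > 0:
            state[name] -= 1
            state["log"].append(name)
            return True
        return False
    return step


lp, er = make_step("LP"), make_step("ER")

s1 = {"LP": 2, "ER": 2, "log": []}
pipelined(lp, er, s1)

s2 = {"LP": 2, "ER": 2, "log": []}
interleaved(lp, er, s2)
```

In the pipelined run all link predictions are applied before any merge, while the interleaved run alternates, so each task can use the other's latest changes.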
Implementation
• Prototype based on Java Berkeley DB
• Implemented a query parser, plan generator,
  query evaluation engine
• Incremental maintenance:
  – Aggregate/non-aggregate incremental
    maintenance
  – DOMAIN maintenance
Incremental Maintenance
• Predicates in the program correspond to materialized tables
  (key/value maps).
• Every set of changes made by AP, LP, or ER is logged into two
  change tables, ΔNodes and ΔEdges.
   – Insertions: | Record | +1 |
   – Deletions: | Record | -1 |
   – Updates: a deletion followed by an insertion
• Aggregate maintenance is performed by aggregating the
  change table and then refreshing the old table.
• DOMAIN: a program of the form
   DOMAIN L(X) :- Subgoals of L
   {
     P1(X,Y) :- Subgoals of P1
   }
  corresponds to the rules
   L(X) :- Subgoals of L
   P1'(X) :- L(X), Subgoals of P1
   P1(X) :- L(X) >> Subgoals of P1
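
Aggregate maintenance from a change table can be illustrated with the Degree feature. The Python sketch below is an assumption-based illustration, not the prototype's code: the materialized table is a Counter keyed by source node, and the change table is a list of (edge, sign) records with sign +1 for insertions and -1 for deletions.

```python
from collections import Counter


def degree_table(edges):
    """Materialize Degree(X, COUNT<Y>) from scratch."""
    deg = Counter()
    for x, _ in edges:
        deg[x] += 1
    return deg


def apply_delta(deg, delta_edges):
    """Refresh the degree table from a change table of (edge, sign) records."""
    # Step 1: aggregate the change table per key.
    change = Counter()
    for (x, _), sign in delta_edges:
        change[x] += sign
    # Step 2: refresh only the affected rows of the old table.
    for x, c in change.items():
        deg[x] += c
        if deg[x] == 0:
            del deg[x]
    return deg


edges = [("a", "b"), ("a", "c"), ("b", "c")]
deg = degree_table(edges)

# One insertion for "a", one deletion for "b", one insertion for "c".
delta = [(("a", "d"), +1), (("b", "c"), -1), (("c", "a"), +1)]
incremental = apply_delta(deg, delta)

# Recomputing from the updated edge list should give the same table.
updated_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("c", "a")]
recomputed = degree_table(updated_edges)
```

Only the rows touched by the change table are updated, which is where the saving over full recomputation comes from.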
Synthetic Experiments
• Synthetic graphs, generated using the forest fire and
  preferential attachment generation models.
• Three tasks:
   – Attribute Prediction, Link Prediction and Entity Resolution
• Two approaches:
   – Recomputing features after every iteration
   – Incremental maintenance
• Varied parameters:
   – Graph size
   – Graph density
   – Confidence threshold (update size)
Changing Graph Size
• Varied the graph size from 20K nodes and
  200K edges to 100K nodes and 1M edges
Comparison with Derby
• Compared the evaluation of 4 features:
  degree, clustering coefficient, common
  neighbors and Jaccard.
Real-world Experiment
• Real-world PubMed graph
   – Set of publications from the medical domain, their
     abstracts, and citations
• 50,634 publications, 115,323 citation edges
• Task: Attribute prediction
   – Predict if the paper is categorized as Cognition, Learning,
     Perception or Thinking
• Choose top 10% predictions after each iteration, for
  10 iterations
• Incremental: 28 minutes. Recompute: 42 minutes
Program
DOMAIN Uncommitted(X) :- Node(X, Committed='no')
{
  ThinkingNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Thinking')
  PerceptionNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Perception')
  CognitionNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Cognition')
  LearningNeighbors(X, COUNT<Y>) :- Edge(X,Y), Node(Y, Label='Learning')
  Features-AP(X,A,B,C,D,Abstract) :- ThinkingNeighbors(X,A),
    PerceptionNeighbors(X,B), CognitionNeighbors(X,C),
    LearningNeighbors(X,D), Node(X,Abstract,_,_)
}
ITERATE(10)
{
  UPDATE Node(X,_,P,'yes') :- Features-AP(X,A,B,C,D,Text),
    P = predict-AP(X,A,B,C,D,Text), confidence-AP(X,A,B,C,D,Text) IN TOP 10%
}
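
The relational part of the feature vector in this program is a count of neighbors per label, computed only for uncommitted nodes. A small Python sketch of that computation follows; the dictionary layout and the function name are assumptions for illustration, and the abstract text feature is left out.

```python
from collections import Counter

LABELS = ("Thinking", "Perception", "Cognition", "Learning")


def neighbor_label_features(nodes, edges):
    """Return per-label neighbor counts for every uncommitted node.

    nodes: dict id -> {"label": str or None, "committed": bool}
    edges: iterable of (x, y) pairs
    """
    features = {}
    for x, info in nodes.items():
        if info["committed"]:
            continue  # DOMAIN Uncommitted(X): skip committed nodes
        counts = Counter(nodes[y]["label"] for u, y in edges if u == x)
        features[x] = tuple(counts[label] for label in LABELS)
    return features


nodes = {
    "p1": {"label": None, "committed": False},
    "p2": {"label": "Thinking", "committed": True},
    "p3": {"label": "Learning", "committed": True},
    "p4": {"label": "Thinking", "committed": True},
}
edges = [("p1", "p2"), ("p1", "p3"), ("p1", "p4"), ("p2", "p3")]
features = neighbor_label_features(nodes, edges)
```

Each tuple lists the counts in the order Thinking, Perception, Cognition, Learning, matching the arguments A, B, C, D of Features-AP.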
Related Work
• Dedupalog [Arasu et al., ICDE 2009]:
  – Datalog-based entity resolution
     • User defines hard and soft rules for deduplication
     • System satisfies hard rules and minimizes violations to
       soft rules when deduplicating references
• Swoosh [Benjelloun et al., VLDBJ 2008]:
  – Generic Entity resolution
     • Match function for pairs of nodes (based on a set of
       features)
     • Merge function determines which pairs should be
       merged
Conclusions and Ongoing Work
• Conclusions:
  – We built a declarative system to specify graph
    inference operations
  – We implemented the system on top of Berkeley DB
    and implemented incremental maintenance
    techniques
• Future work:
  –   Direct computation of top-k predictions
  –   Multi-query evaluation (especially on graphs)
  –   Employing a graph DB engine (e.g. Neo4j)
  –   Support recursive queries and recursive view
      maintenance
References
•   [Sen et al., AI Magazine 2008]
     – Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, Tina Eliassi-Rad:
       Collective Classification in Network Data. AI Magazine 29(3): 93-106 (2008)
•   [Liben-Nowell et al., CIKM 2003]
     – David Liben-Nowell, Jon M. Kleinberg: The link prediction problem for social networks. CIKM
       2003.
•   [Bhattacharya et al., TKDD 2007]
     – I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM TKDD, 1:1–
       36, 2007.
•   [Namata et al., MLG 2009]
     – G. Namata and L. Getoor: A Pipeline Approach to Graph Identification. MLG Workshop, 2009.
•   [Namata et al., KDUD 2009]
     – G. Namata and L. Getoor: Identifying Graphs From Noisy and Incomplete Data. SIGKDD
       Workshop on Knowledge Discovery from Uncertain Data, 2009.
•   [Arasu et al., ICDE 2009]
     – A. Arasu, C. Re, and D. Suciu. Large-scale deduplication with constraints using dedupalog. In
       ICDE, 2009
•   [Benjelloun et al., VLDBJ 2008]
     – O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: a
       generic approach to entity resolution. The VLDB Journal, 2008.

Show observe and tell giang nguyen
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Automated BI Modernizations
Automated BI ModernizationsAutomated BI Modernizations
Automated BI Modernizations
 
Emulex and the Evaluator Group Present Why I/O is Strategic for Big Data
Emulex and the Evaluator Group Present Why I/O is Strategic for Big Data Emulex and the Evaluator Group Present Why I/O is Strategic for Big Data
Emulex and the Evaluator Group Present Why I/O is Strategic for Big Data
 
Tech Talk SQL Server 2012 Business Intelligence
Tech Talk SQL Server 2012 Business IntelligenceTech Talk SQL Server 2012 Business Intelligence
Tech Talk SQL Server 2012 Business Intelligence
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Selectivity Estimation for Hybrid Queries over Text-Rich Data GraphsSelectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
Selectivity Estimation for Hybrid Queries over Text-Rich Data Graphs
 
Evolutionary Design Solid
Evolutionary Design SolidEvolutionary Design Solid
Evolutionary Design Solid
 
How to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the CloudHow to Build Modern Data Architectures Both On Premises and in the Cloud
How to Build Modern Data Architectures Both On Premises and in the Cloud
 

More from University of New South Wales (10)

Gremlin
Gremlin Gremlin
Gremlin
 
DHHT - Modeling beyond plain graphs
DHHT - Modeling beyond plain graphsDHHT - Modeling beyond plain graphs
DHHT - Modeling beyond plain graphs
 
Dex
DexDex
Dex
 
Ontological Conjunctive Query Answering over Large Knowledge Bases
Ontological Conjunctive Query Answering over Large Knowledge BasesOntological Conjunctive Query Answering over Large Knowledge Bases
Ontological Conjunctive Query Answering over Large Knowledge Bases
 
Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
Key-Key-Value Stores for Efficiently Processing Graph Data in the CloudKey-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
 
Allegograph
AllegographAllegograph
Allegograph
 
Neo4j
Neo4jNeo4j
Neo4j
 
Dependable Cardinality Forecast for XQuery
Dependable Cardinality Forecast for XQueryDependable Cardinality Forecast for XQuery
Dependable Cardinality Forecast for XQuery
 
GraphREL: A Relational Graph Query Processor
GraphREL: A Relational Graph Query ProcessorGraphREL: A Relational Graph Query Processor
GraphREL: A Relational Graph Query Processor
 
XML Compression Benchmark
XML Compression BenchmarkXML Compression Benchmark
XML Compression Benchmark
 

Declarative analysis of noisy information networks

  • 1. Declarative Analysis of Noisy Information Networks Walaa Eldin Moustafa Galileo Namata Amol Deshpande Lise Getoor University of Maryland
  • 2. Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work
  • 4. Motivation • Users/objects are modeled as nodes, relationships as edges • The observed networks are noisy and incomplete. – Some users may have more than one account – Communication may contain a lot of spam • Missing attributes, links, having multiple references to the same entity • Need to extract underlying information network.
  • 5. Inference Operations • Attribute Prediction – To predict values of missing attributes • Link Prediction – To predict missing links • Entity Resolution – To predict if two references refer to the same entity • These prediction tasks can use: – Local node information – Relational information surrounding the node
  • 6. Attribute Prediction Task: Predict the topic of the paper. Example papers (figure): "A Statistical Model for Multilingual Entity Detection and Tracking"; "Language Model Based Arabic Word Segmentation"; "Automatic Rule Refinement for Information Extraction"; "Why Not?"; "Join Optimization of Information Extraction Output: Quality Matters!"; "An Annotation Management System for Relational Databases"; "Tracing Lineage Beyond Relational Operators". Legend: DB, NL, ?. Use links between nodes (collective attribute prediction) [Sen et al., AI Magazine 2008]
  • 7. Attribute Prediction Task: Predict the topic of the paper. Same example papers as on the previous slide, with figure labels P1 and P2. Legend: DB, NL, ?.
  • 9. Link Prediction • Goal: Predict new links • Using local similarity • Using relational similarity [Liben-Nowell et al., CIKM 2003]. Example co-authorship graph (figure): Graham Cormode, Flip Korn, Lukasz Golab, Divesh Srivastava, Avishek Saha, Vladislav Shkapenyuk, Theodore Johnson, Nick Koudas
  • 10. Entity Resolution • Goal: to deduce that two references refer to the same entity • Can be based on node attributes (local) – e.g. string similarity between titles or author names • Local information alone may not be enough. Example (figure): two references named Jian Li
  • 11. Entity Resolution Use links between the nodes (collective entity resolution) [Bhattacharya et al., TKDD 2007]. Example (figure): two references named Jian Li, with neighbors Petre Stoica, Prabhu Babu, Amol Deshpande, Barna Saha, William Roberts, Samir Khuller
  • 12. Joint Inference • Each task helps the others get better predictions. • How to combine the tasks? – One after the other (pipelined), or interleaved? • GAIA: – A Java library for applying multiple joint AP, LP, ER learning and inference tasks: [Namata et al., MLG 2009, Namata et al., KDUD 2009] – Inference can be pipelined or interleaved.
  • 13. Our Goal and Contributions • Motivation: To support declarative network inference • Desiderata: – User declaratively specifies the prediction features • Local features • Relational features – Declaratively specify tasks • Attribute prediction, Link prediction, Entity resolution – Specify arbitrary interleaving or pipelining – Support for complex prediction functions • Handle all that efficiently
  • 14. Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work
  • 15. Unifying Framework Steps: (1) Specify the domain; (2) Compute features; (3) Make predictions, and compute confidence in the predictions; (4) Choose which predictions to apply. For attribute prediction, the domain is a subset of the graph nodes. For link prediction and entity resolution, the domain is a subset of pairs of nodes.
  • 16. Unifying Framework Steps: (1) Specify the domain; (2) Compute features; (3) Make predictions, and compute confidence in the predictions; (4) Choose which predictions to apply. Local features: word frequency, income, etc. Relational features: degree, clustering coeff., no. of neighbors with each attribute value, common neighbors between pairs of nodes, etc.
  • 17. Unifying Framework Steps: (1) Specify the domain; (2) Compute features; (3) Make predictions, and compute confidence in the predictions; (4) Choose which predictions to apply. Attribute prediction: the missing attribute. Link prediction: add link or not? Entity resolution: merge two nodes or not?
  • 18. Unifying Framework Steps: (1) Specify the domain; (2) Compute features; (3) Make predictions, and compute confidence in the predictions; (4) Choose which predictions to apply. After predictions are made, the graph changes: attribute prediction changes local attributes; link prediction changes the graph links; entity resolution changes both local attributes and graph links.
  • 19. Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work
  • 20. Datalog • Use Datalog to express: – Domains – Local and relational features • Extend Datalog with operational semantics (vs. fix-point semantics) to express: – Predictions (in the form of updates) – Iteration
  • 21. Specifying Features
      Degree:
        Degree(X, COUNT<Y>) :- Edge(X, Y)
      Number of neighbors with attribute 'A':
        NumNeighbors(X, COUNT<Y>) :- Edge(X, Y), Node(Y, Att='A')
      Clustering coefficient:
        NeighborCluster(X, COUNT<Y,Z>) :- Edge(X,Y), Edge(X,Z), Edge(Y,Z)
        ClusteringCoeff(X, C) :- NeighborCluster(X,N), Degree(X,D), C=2*N/(D*(D-1))
      Jaccard coefficient:
        IntersectionCount(X, Y, COUNT<Z>) :- Edge(X, Z), Edge(Y, Z)
        UnionCount(X, Y, D) :- Degree(X,D1), Degree(Y,D2), IntersectionCount(X, Y, D3), D=D1+D2-D3
        Jaccard(X, Y, J) :- IntersectionCount(X, Y, N), UnionCount(X, Y, D), J=N/D
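For readers who want to check the feature rules on a small graph, here is a minimal Python sketch of the same three features. It assumes an undirected graph stored as adjacency sets and counts each connected neighbor pair once; the function names and the example edge list are illustrative and not part of the slides.

```python
from itertools import combinations

def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree(adj, x):
    """Number of neighbors of x, as in the Degree rule."""
    return len(adj[x])

def clustering_coeff(adj, x):
    """Fraction of neighbor pairs of x that are themselves linked."""
    d = len(adj[x])
    if d < 2:
        return 0.0
    # Each unordered neighbor pair (y, z) is counted once here.
    links = sum(1 for y, z in combinations(adj[x], 2) if z in adj[y])
    return 2 * links / (d * (d - 1))

def jaccard(adj, x, y):
    """Common neighbors divided by the size of the neighbor union."""
    union = adj[x] | adj[y]
    if not union:
        return 0.0
    return len(adj[x] & adj[y]) / len(union)

# Hypothetical example graph.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = build_adjacency(edges)
```

The size of the union equals D1 + D2 - D3, which is the quantity the UnionCount rule computes from the two degrees and the intersection count.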
  • 22. Specifying Domains • Domains are used to restrict the space of computation for the prediction elements. • Without a domain, the space for this feature is |V|^2: Similarity(X, Y, S) :- Node(X, Att=V1), Node(Y, Att=V2), S=f(V1, V2) • Using this domain, the space becomes |E|: DOMAIN D(X,Y) :- Edge(X, Y) • Other DOMAIN predicates: – Equality – Locality sensitive hashing – String similarity joins – Traverse edges
  • 23. Feature Vector • Features of prediction elements are combined in a single predicate to create the feature vector:
      DOMAIN D(X, Y) :- …
      {
        P1(X, Y, F1) :- …
        …
        Pn(X, Y, Fn) :- …
        Features(X, Y, F1, …, Fn) :- P1(X, Y, F1), …, Pn(X, Y, Fn)
      }
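The effect of a domain on the number of evaluated pairs can be sketched in Python as follows. This is only an illustration under assumed data: the attribute values, the edge list, and the similarity function `f` are hypothetical.

```python
from itertools import combinations

def all_pairs(nodes):
    """Unrestricted candidate space: every unordered node pair (quadratic in |V|)."""
    return list(combinations(sorted(nodes), 2))

def edge_domain(edges):
    """Domain restricted to pairs joined by an edge, so at most |E| candidates."""
    return sorted({tuple(sorted(e)) for e in edges})

def similarity(attrs, pairs, f):
    """Evaluate the similarity feature only on the supplied candidate pairs."""
    return {(x, y): f(attrs[x], attrs[y]) for x, y in pairs}

# Hypothetical node attributes and edges.
attrs = {"a": 1.0, "b": 2.0, "c": 4.0, "d": 8.0}
edges = [("a", "b"), ("c", "d")]

def f(v1, v2):
    # Hypothetical similarity: closer attribute values give a higher score.
    return 1.0 / (1.0 + abs(v1 - v2))

unrestricted = similarity(attrs, all_pairs(attrs), f)
restricted = similarity(attrs, edge_domain(edges), f)
```

With four nodes and two edges, the unrestricted feature is computed for six pairs and the domain-restricted one for two.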
  • 24. Update Operation
      DEFINE Merge(X, Y) {
        INSERT Edge(X, Z) :- Edge(Y, Z)
        DELETE Edge(Y, Z)
        UPDATE Node(X, A=ANew) :- Node(X, A=AX), Node(Y, A=AY), ANew=(AX+AY)/2
        UPDATE Node(X, B=BNew) :- Node(X, B=BX), Node(Y, B=BY), BNew=max(BX, BY)
        DELETE Node(Y)
      }
      Merge(X, Y) :- Features(X, Y, F1, …, Fn), predict-ER(F1, …, Fn) = true, confidence-ER(F1, …, Fn) > 0.95
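The Merge update can be read as the following Python sketch. It assumes nodes are stored as a dictionary of attribute dictionaries and edges as a set of (source, target) pairs. Rewiring edges that point into Y and dropping the resulting self-loop are assumptions added here; the slide only lists edges leaving Y.

```python
def merge(nodes, edges, x, y):
    """Merge node y into node x, mutating nodes and edges in place."""
    rewired = set()
    for u, v in edges:
        # Redirect every endpoint that was y to x.
        u2 = x if u == y else u
        v2 = x if v == y else v
        if u2 != v2:  # assumption: drop the self-loop an x-y edge would become
            rewired.add((u2, v2))
    edges.clear()
    edges.update(rewired)
    # Attribute A is averaged and attribute B takes the maximum, as in the rules.
    nodes[x]["A"] = (nodes[x]["A"] + nodes[y]["A"]) / 2
    nodes[x]["B"] = max(nodes[x]["B"], nodes[y]["B"])
    del nodes[y]

# Hypothetical example data.
nodes = {
    "x": {"A": 2.0, "B": 5},
    "y": {"A": 4.0, "B": 9},
    "z": {"A": 1.0, "B": 1},
}
edges = {("x", "y"), ("y", "z")}
merge(nodes, edges, "x", "y")
```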
  • 25. Prediction and Confidence Functions • The prediction and confidence functions are user defined functions • Can be based on logistic regression, Bayes classifier, or any other classification algorithm • The confidence is the class membership value – In logistic regression, the confidence can be the value of the logistic function – In Bayes classifier, the confidence can be the posterior probability value
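As one possible reading of the logistic regression case, the sketch below builds a prediction function and a confidence function from a weight vector. The weights, the bias, and the 0.5 decision threshold are hypothetical choices, and `make_logistic_model` is an illustrative name rather than part of the system.

```python
import math

def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))

def make_logistic_model(weights, bias):
    """Return (predict, confidence) functions over a feature vector."""
    def confidence(features):
        # Confidence is the value of the logistic function itself.
        z = bias + sum(w * f for w, f in zip(weights, features))
        return logistic(z)

    def predict(features):
        # Assumed decision rule: positive when the logistic value is at least 0.5.
        return confidence(features) >= 0.5

    return predict, confidence

# Hypothetical weights for a two-feature entity-resolution classifier.
predict_er, confidence_er = make_logistic_model([2.0, 1.0], -1.0)
```

A rule such as the Merge rule on the Update Operation slide would then fire only for pairs where `predict_er` returns true and `confidence_er` exceeds the chosen threshold.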
  • 26. Iteration • Iteration is supported by the ITERATE construct. • It takes the number of iterations as a parameter, or * to iterate until no more predictions are made.
      ITERATE (*) {
        MERGE(X,Y) :- Features(X, Y, F1,…,Fn), predict-ER(F1,…,Fn) = true, confidence-ER(F1,…,Fn) IN TOP 10%
      }
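The control flow of ITERATE with an IN TOP 10% filter can be sketched in Python as below. This is an assumed interpretation: candidates are recomputed at the start of every round, at least one prediction is applied per round, and `max_iters=None` plays the role of `*`. All names and the example data are hypothetical.

```python
import math

def iterate(candidates_fn, confidence_fn, apply_fn, fraction=0.10, max_iters=None):
    """Apply the most confident fraction of predictions each round.

    Stops when no candidates remain or after max_iters rounds.
    Returns the number of rounds that applied predictions.
    """
    rounds = 0
    while max_iters is None or rounds < max_iters:
        candidates = candidates_fn()  # recomputed, since the graph may have changed
        if not candidates:
            break
        ranked = sorted(candidates, key=confidence_fn, reverse=True)
        k = max(1, math.ceil(fraction * len(ranked)))
        for c in ranked[:k]:
            apply_fn(c)
        rounds += 1
    return rounds

# Hypothetical pending predictions with their confidences.
pending = {"a": 0.9, "b": 0.5, "c": 0.7}
applied = []

def apply_prediction(c):
    applied.append(c)
    del pending[c]

rounds = iterate(lambda: list(pending), lambda c: pending[c], apply_prediction)
```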
  • 27. Pipelining
      DOMAIN ER(X,Y) :- …
      {
        ER1(X,Y,F1) :- …
        ER2(X,Y,F2) :- …
        Features-ER(X,Y,F1,F2) :- …
      }
      DOMAIN LP(X,Y) :- …
      {
        LP1(X,Y,F1) :- …
        LP2(X,Y,F2) :- …
        Features-LP(X,Y,F1,F2) :- …
      }
      ITERATE(*) {
        INSERT EDGE(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%
      }
      ITERATE(*) {
        MERGE(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%
      }
  • 28. Interleaving
      DOMAIN ER(X,Y) :- …
      {
        ER1(X,Y,F1) :- …
        ER2(X,Y,F2) :- …
        Features-ER(X,Y,F1,F2) :- …
      }
      DOMAIN LP(X,Y) :- …
      {
        LP1(X,Y,F1) :- …
        LP2(X,Y,F2) :- …
        Features-LP(X,Y,F1,F2) :- …
      }
      ITERATE(*) {
        INSERT EDGE(X,Y) :- Features-LP(X,Y,F1,F2), predict-LP(X,Y,F1,F2), confidence-LP(X,Y,F1,F2) IN TOP 10%
        MERGE(X,Y) :- Features-ER(X,Y,F1,F2), predict-ER(X,Y,F1,F2), confidence-ER(X,Y,F1,F2) IN TOP 10%
      }
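The difference between the two schedules is only in how the ITERATE blocks are nested, which the following Python sketch makes explicit. Each step is a hypothetical callable that applies one batch of predictions and reports whether it changed anything; `make_step` is a stand-in for a real link prediction or entity resolution round.

```python
def run_until_fixpoint(steps):
    """Repeat all steps in order until a full pass applies no prediction."""
    while True:
        changed = False
        for step in steps:
            if step():
                changed = True
        if not changed:
            return

def pipelined(lp_step, er_step):
    """Two separate ITERATE(*) blocks: finish link prediction, then resolve entities."""
    run_until_fixpoint([lp_step])
    run_until_fixpoint([er_step])

def interleaved(lp_step, er_step):
    """One ITERATE(*) block containing both rules: alternate them every round."""
    run_until_fixpoint([lp_step, er_step])

def make_step(name, batches, trace):
    """Hypothetical task that has a fixed number of prediction batches to apply."""
    remaining = [batches]
    def step():
        if remaining[0] == 0:
            return False
        remaining[0] -= 1
        trace.append(name)
        return True
    return step

trace_pipelined, trace_interleaved = [], []
pipelined(make_step("LP", 2, trace_pipelined), make_step("ER", 2, trace_pipelined))
interleaved(make_step("LP", 2, trace_interleaved), make_step("ER", 2, trace_interleaved))
```

In a real run the interleaved schedule lets each task see the graph changes made by the other in the previous round, which the fixed batch counts here do not model.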
  • 29. Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work
  • 30. Implementation • Prototype based on Java Berkeley DB • Implemented a query parser, plan generator, query evaluation engine • Incremental maintenance: – Aggregate/non-aggregate incremental maintenance – DOMAIN maintenance
  • 31. Incremental Maintenance • Predicates in the program correspond to materialized tables (key/value maps). • Every set of changes made by AP, LP, or ER is logged into two change tables, ΔNodes and ΔEdges. – Insertions: |Record | +1 | – Deletions: |Record | -1 | – Updates: a deletion followed by an insertion • Aggregate maintenance is performed by aggregating the change table and then refreshing the old table. • DOMAIN: DOMAIN L(X):- Subgoals of L L(X) :- Subgoals of L { P1’(X) :- L(X), Subgoals of P1 P1(X,Y) :- Subgoals of P1 P1(X) :- L(X) >> Subgoals of P1 }
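Aggregate maintenance from a change table can be illustrated with the Degree feature. The Python sketch below assumes a ΔEdges table represented as a list of (edge, +1 or -1) records and a materialized Degree table keyed on the first node of each edge; these representations are assumptions for illustration, not the prototype's storage layout.

```python
from collections import Counter

def aggregate_changes(delta_edges):
    """Collapse the change table into a net degree change per node."""
    net = Counter()
    for (x, _y), sign in delta_edges:
        net[x] += sign
    return net

def refresh_degree(degree, delta_edges):
    """Refresh the materialized Degree table using only the aggregated changes."""
    for x, change in aggregate_changes(delta_edges).items():
        new_value = degree.get(x, 0) + change
        if new_value:
            degree[x] = new_value
        else:
            degree.pop(x, None)  # no remaining edges, so the row disappears
    return degree

# Hypothetical materialized table and change log.
degree = {"a": 2, "b": 1, "c": 1}
delta_edges = [
    (("a", "d"), +1),  # insertion
    (("a", "b"), -1),  # deletion
    (("c", "a"), -1),  # deletion
    (("b", "e"), +1),  # insertion
]
refresh_degree(degree, delta_edges)
```

Only nodes that appear in the change table are touched, which is the source of the savings compared with recomputing the feature over the whole graph.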
  • 32. Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work
  • 33. Synthetic Experiments • Synthetic graphs, generated using the forest fire and preferential attachment generation models. • Three tasks: – Attribute Prediction, Link Prediction and Entity Resolution • Two approaches: – Recomputing features after every iteration – Incremental maintenance • Varied parameters: – Graph size – Graph density – Confidence threshold (update size)
  • 34. Changing Graph Size • Varied the graph size from 20K nodes and 200K edges to 100K nodes and 1M edges
  • 35. Comparison with Derby • Compared the evaluation of 4 features: degree, clustering coefficient, common neighbors and Jaccard.
  • 36. Real-world Experiment • Real-world PubMed graph – Set of publications from the medical domain, their abstracts, and citations • 50,634 publications, 115,323 citation edges • Task: Attribute prediction – Predict if the paper is categorized as Cognition, Learning, Perception or Thinking • Choose top 10% predictions after each iteration, for 10 iterations • Incremental: 28 minutes. Recompute: 42 minutes
  • 37. Program
      DOMAIN Uncommitted(X) :- Node(X, Committed='no')
      {
        ThinkingNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Thinking')
        PerceptionNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Perception')
        CognitionNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Cognition')
        LearningNeighbors(X, Count<Y>) :- Edge(X,Y), Node(Y, Label='Learning')
        Features-AP(X,A,B,C,D,Abstract) :- ThinkingNeighbors(X,A), PerceptionNeighbors(X,B), CognitionNeighbors(X,C), LearningNeighbors(X,D), Node(X,Abstract,_,_)
      }
      ITERATE(10) {
        UPDATE Node(X,_,P,'yes') :- Features-AP(X,A,B,C,D,Text), P = predict-AP(X,A,B,C,D,Text), confidence-AP(X,A,B,C,D,Text) IN TOP 10%
      }
  • 38. Outline Motivations/Contributions Framework Declarative Language Implementation Results Related and Future Work
  • 39. Related Work • Dedupalog [Arasu et al., ICDE 2009]: – Datalog-based entity resolution • User defines hard and soft rules for deduplication • System satisfies hard rules and minimizes violations to soft rules when deduplicating references • Swoosh [Benjelloun et al., VLDBJ 2008]: – Generic Entity resolution • Match function for pairs of nodes (based on a set of features) • Merge function determines which pairs should be merged
  • 40. Conclusions and Ongoing Work • Conclusions: – We built a declarative system to specify graph inference operations – We implemented the system on top of Berkeley DB and implemented incremental maintenance techniques • Future work: – Direct computation of top-k predictions – Multi-query evaluation (especially on graphs) – Employing a graph DB engine (e.g. Neo4j) – Support recursive queries and recursive view maintenance
  • 41. References
      • [Sen et al., AI Magazine 2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, Tina Eliassi-Rad: Collective Classification in Network Data. AI Magazine 29(3): 93-106 (2008).
      • [Liben-Nowell et al., CIKM 2003] David Liben-Nowell, Jon M. Kleinberg: The link prediction problem for social networks. CIKM 2003.
      • [Bhattacharya et al., TKDD 2007] I. Bhattacharya and L. Getoor: Collective entity resolution in relational data. ACM TKDD, 1:1–36, 2007.
      • [Namata et al., MLG 2009] G. Namata and L. Getoor: A Pipeline Approach to Graph Identification. MLG Workshop, 2009.
      • [Namata et al., KDUD 2009] G. Namata and L. Getoor: Identifying Graphs From Noisy and Incomplete Data. SIGKDD Workshop on Knowledge Discovery from Uncertain Data, 2009.
      • [Arasu et al., ICDE 2009] A. Arasu, C. Re, and D. Suciu: Large-scale deduplication with constraints using Dedupalog. ICDE, 2009.
      • [Benjelloun et al., VLDBJ 2008] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom: Swoosh: a generic approach to entity resolution. The VLDB Journal, 2008.