ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
  
Motivation
• Linked Open Data (LOD)
• Linking entities from text to LOD
– Rich snippets
– Entity-centric search
• Linking
– Algorithmic
– Manual (NYT articles)
[Figure label: HTML+RDFa pages]
22-Apr-12, Gianluca Demartini, eXascale Infolab
  
[Figure: LOD entities http://dbpedia.org/resource/Facebook and http://dbpedia.org/resource/Instagram; fbase:Instagram connected via owl:sameAs; further nodes: Google, Android]
HTML:
<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>
RDFa enrichment:
<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
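The RDFa enrichment step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `annotate_rdfa` and the mention-to-URI dictionary are hypothetical, the input is assumed to be plain text without HTML special characters, and the sketch wraps only the mention itself (the slide wraps the whole sentence in the `span`).

```python
import html
import re


def annotate_rdfa(text: str, links: dict[str, str]) -> str:
    """Wrap the first occurrence of each linked mention in RDFa markup.

    `links` maps an entity mention to the LOD URI chosen for it
    (hypothetical input format, not taken from the slides).
    """
    if not links:
        return text
    # Longest mentions first, so a longer mention wins over its prefix.
    mentions = sorted(links, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(m) for m in mentions))
    seen: set[str] = set()

    def wrap(match: re.Match) -> str:
        mention = match.group(0)
        if mention in seen:  # annotate only the first occurrence
            return mention
        seen.add(mention)
        uri = html.escape(links[mention], quote=True)
        return (f'<span about="{uri}">'
                f'<cite property="rdfs:label">{mention}</cite></span>')

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(annotate_rdfa(
        "The social network has purchased Instagram.",
        {"Instagram": "http://dbpedia.org/resource/Instagram"},
    ))
```

A single regular-expression pass is used so that text inserted for one mention is never re-scanned when annotating another.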
  
Crowdsourcing
• Exploit human intelligence to solve
– Tasks simple for humans, complex for machines
– With a large number of humans (the Crowd)
– Small problems: micro-tasks (Amazon MTurk)
• Examples
– Wikipedia, Image tagging
• Incentives
– Financial, fun, visibility
  
ZenCrowd
• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a probabilistic reasoning framework
[Figure labels: Crowd, Algorithms, Machines]
  
Outline
• Related approaches
• ZenCrowd: System architecture
• Probabilistic model to combine automatic matching and crowdsourcing results
• Experimental evaluation
  
Related Approaches
• Entity Linking
• Ad-hoc Object Retrieval
– Keyword queries
– Looking for a specific entity
– IR indexing and ranking over RDF data
• Crowdsourcing
– Training set, tagging, annotation
– IR evaluation
– CrowdDB
  
ZenCrowd Architecture
[Figure: ZenCrowd architecture. Input: HTML Pages. Output: HTML+RDFa Pages. ZenCrowd components: Entity Extractors, Algorithmic Matchers, LOD Index (Get Entity, over the LOD Open Data Cloud), Probabilistic Network, Decision Engine, Micro-Task Manager. Exchanged with the Crowdsourcing Platform: Micro Matching Tasks and Workers Decisions.]
  
Algorithmic Matching
• Inverted index over LOD entities
– DBpedia, Freebase, Geonames, NYT
• TF-IDF (IR ranking function)
• Top ranked URIs linked to entities in docs
• Threshold on the ranking function or top N
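The matching steps above can be sketched as follows. This is a minimal illustration, assuming a toy in-memory index over entity labels with a simple TF-IDF score and no length normalization. The function names, the `ex:` URIs, and the labels are hypothetical, and the exact ranking function used by ZenCrowd is not given on the slide.

```python
import math
from collections import Counter, defaultdict


def build_index(labels: dict[str, str]):
    """Build an inverted index (term -> {uri: tf}) and IDF weights.

    `labels` maps an entity URI to its textual label (toy data format).
    """
    index: dict[str, dict[str, int]] = defaultdict(dict)
    for uri, label in labels.items():
        for term, tf in Counter(label.lower().split()).items():
            index[term][uri] = tf
    n = len(labels)
    idf = {term: math.log(1 + n / len(post)) for term, post in index.items()}
    return index, idf


def match(mention: str, index, idf, top_n: int = 3, threshold: float = 0.0):
    """Rank candidate URIs for a mention; keep the top N above a threshold."""
    scores: Counter = Counter()
    for term in mention.lower().split():
        for uri, tf in index.get(term, {}).items():
            scores[uri] += tf * idf[term]
    return [(uri, s) for uri, s in scores.most_common(top_n) if s > threshold]


if __name__ == "__main__":
    toy_labels = {  # hypothetical entities, for illustration only
        "ex:Facebook": "Facebook",
        "ex:FacebookPlatform": "Facebook Platform",
        "ex:Instagram": "Instagram",
    }
    idx, weights = build_index(toy_labels)
    print(match("Facebook Platform", idx, weights, top_n=2))
```

Both cut-offs from the last bullet are exposed: `top_n` limits how many candidates are returned, and `threshold` drops weakly scored ones.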
  
Entity Factor Graphs
• Graph components
– Workers, links, clicks
– Prior probabilities
– Link Factors
– Constraints
• Probabilistic Inference
– Select all links with posterior prob > τ
[Figure: example entity factor graph with 2 workers (w1, w2), 6 clicks (c11, c12, c13, c21, c22, c23), and 3 candidate links (l1, l2, l3). Worker priors: pw1(), pw2(). Link priors: pl1(), pl2(), pl3(). Link factors: lf1(), lf2(), lf3(). Observed variables: the clicks. SameAs constraint: sa1-2(). Dataset unicity constraint: u2-3().]
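To make the inference step concrete, here is a deliberately simplified sketch. It assumes each worker click is conditionally independent given the link's correctness, treats a worker's reliability as the probability that their click (or non-click) is right, and ignores the SameAs and unicity constraints. This is a naive-Bayes approximation chosen for illustration, not the factor graph defined by the authors; all names are hypothetical.

```python
def link_posterior(prior: float, votes: list[tuple[float, bool]]) -> float:
    """Posterior probability that one candidate link is correct.

    `prior` is the link prior; `votes` holds one
    (worker_reliability, clicked) pair per worker who saw the link.
    """
    p_true, p_false = prior, 1.0 - prior
    for reliability, clicked in votes:
        # If the link is correct, a reliable worker tends to click it.
        p_true *= reliability if clicked else (1.0 - reliability)
        # If the link is wrong, a reliable worker tends not to click it.
        p_false *= (1.0 - reliability) if clicked else reliability
    total = p_true + p_false
    return prior if total == 0 else p_true / total


def select_links(candidates: dict, tau: float = 0.5) -> list[str]:
    """Keep every candidate URI whose posterior exceeds the threshold tau."""
    return [uri for uri, (prior, votes) in candidates.items()
            if link_posterior(prior, votes) > tau]


if __name__ == "__main__":
    toy = {  # hypothetical candidates: uri -> (link prior, worker votes)
        "l1": (0.5, [(0.8, True), (0.8, True)]),
        "l2": (0.5, [(0.8, True), (0.8, False)]),
    }
    print(select_links(toy, tau=0.5))
```

With two equally reliable workers who disagree, the posterior stays at the prior, so the link is selected only if the prior alone clears τ.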
  
Entity Factor Graphs
• Training phase
– Initialize worker priors
– with k matches on known answers
• Updating worker priors
– Use link decisions as new observations
– Compute new worker probabilities
• Identify (and discard) unreliable workers
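The training and update loop above could look roughly like this. The sketch assumes a worker's probability is estimated as a smoothed fraction of answers that agree with known answers (training) or with the final link decision (afterwards). The class name, the Laplace smoothing, and the 0.5 cut-off are illustrative assumptions, not details from the slides.

```python
from dataclasses import dataclass


@dataclass
class WorkerStats:
    """Running reliability estimate for one crowd worker (hypothetical)."""
    correct: int = 0
    total: int = 0

    def observe(self, agreed: bool) -> None:
        """Record whether the worker agreed with the reference decision."""
        self.total += 1
        self.correct += int(agreed)

    @property
    def reliability(self) -> float:
        # Laplace smoothing (an assumption): a new worker starts at 0.5.
        return (self.correct + 1) / (self.total + 2)


def is_unreliable(stats: WorkerStats, min_reliability: float = 0.5) -> bool:
    """Flag workers whose estimated reliability falls below the cut-off."""
    return stats.reliability < min_reliability


if __name__ == "__main__":
    worker = WorkerStats()
    # Training phase: k = 3 tasks with known answers.
    for agreed in (True, True, False):
        worker.observe(agreed)
    # Afterwards: each final link decision is a new observation.
    worker.observe(True)
    print(worker.reliability, is_unreliable(worker))
```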
  
Experimental Evaluation
• Datasets
– 25 news articles from
  • CNN.com (Global news)
  • NYTimes.com (Global news)
  • Washington-post.com (US local news)
  • Timesofindia.indiatimes.com (India local news)
  • Swissinfo.com (Switzerland local news)
– 40M entities (Freebase, DBpedia, Geonames, NYT)
• 80 workers on Amazon MTurk, $0.01 per entity
• Standard evaluation measures: Precision (P), Recall (R), Accuracy (A)
• Gold Standard: editorial selection
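For reference, the three measures can be computed over candidate links as sketched below. The sketch uses the standard textbook definitions and assumes links are represented as (entity, URI) pairs with the predicted and gold sets both drawn from the candidate set; the function name and data layout are hypothetical.

```python
def evaluate(predicted: set, gold: set, candidates: set) -> dict[str, float]:
    """Precision, recall, and accuracy of selected links.

    `candidates` is every (entity, URI) pair that was considered,
    `predicted` the pairs selected as links, `gold` the correct pairs.
    """
    tp = len(predicted & gold)               # correctly selected links
    fp = len(predicted - gold)               # wrongly selected links
    fn = len(gold - predicted)               # correct links that were missed
    tn = len(candidates - predicted - gold)  # correctly rejected links
    total = tp + fp + fn + tn
    return {
        "P": tp / (tp + fp) if tp + fp else 0.0,
        "R": tp / (tp + fn) if tp + fn else 0.0,
        "A": (tp + tn) / total if total else 0.0,
    }


if __name__ == "__main__":
    cands = {("Instagram", "a"), ("Instagram", "b"),
             ("Facebook", "c"), ("Facebook", "d")}
    gold = {("Instagram", "a"), ("Facebook", "c")}
    pred = {("Instagram", "a"), ("Facebook", "d")}
    print(evaluate(pred, gold, cands))
```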
  
The micro-task
  
Experimental Evaluation
• Automatic Approach: P/R trade-off
[Figure: Precision (P) and Recall (R) of the automatic approach against the number of Top N Results, N = 1 to 5]
  
Experimental Evaluation
• Entity Linking with Crowdsourcing and agreement vote (at least 2 out of 5 workers select the same URI)
Top-1 precision: 0.70
[Excerpt from the paper shown on the slide:]
… the URIs with at least 2 votes are selected as valid links (we tried various thresholds and manually picked 2 in the end since it leads to the highest precision scores while keeping good recall values for our experiments). We report on the performance of this crowdsourcing technique in Table 2. The values are averaged over all linkable entities for different document types and worker communities.
Table 2: Performance results for crowdsourcing with agreement vote over linkable entities.
            US Workers          Indian Workers
            P     R     A       P     R     A
GL News     0.79  0.85  0.77    0.60  0.80  0.60
US News     0.52  0.61  0.54    0.50  0.74  0.47
IN News     0.62  0.76  0.65    0.64  0.86  0.63
SW News     0.69  0.82  0.69    0.50  0.69  0.56
All News    0.74  0.82  0.73    0.57  0.78  0.59
The first question we examine is whether there is a difference in reliability between the various populations of work…
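The agreement vote described on this slide (a URI is kept when at least 2 of the 5 workers select it) can be sketched as follows. The function name and the input layout (one set of selected URIs per worker, for a single entity) are assumptions made for illustration.

```python
from collections import Counter


def agreement_vote(clicks: list[set[str]], min_votes: int = 2) -> set[str]:
    """Return the URIs selected by at least `min_votes` workers.

    `clicks` holds, for one entity, the set of URIs each worker selected.
    """
    votes = Counter(uri for selected in clicks for uri in selected)
    return {uri for uri, n in votes.items() if n >= min_votes}


if __name__ == "__main__":
    five_workers = [{"A"}, {"A", "B"}, {"C"}, set(), {"B"}]
    print(sorted(agreement_vote(five_workers)))
```

Unlike the probabilistic model, every vote counts equally here, which is why unreliable workers affect this baseline more directly.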
  
Experimental Evaluation
• Worker community and textual context
[Figure: Precision and Recall (both 0 to 1) for US and India workers]
[Figure: Precision per Document (documents 1 to 10), Simple vs. Snippet context]
  
Experimental Evaluation
• Worker Selection
[Figure: Worker Precision (0 to 1) against Number of Tasks (0 to 500) for US Workers and IN Workers, with the Top US Worker marked]
[Figure: Precision (0.6 to 0.8) against Top K workers, K = 1 to 9]
  
Experimental Evaluation
• Entity Linking with ZenCrowd
– Training with first 5 entities + 5% afterwards
– 3 consecutive bad answers lead to blacklisting
[Excerpt from the paper shown on the slide:]
… clicks on the links, hence generating noise in our system. In our experiments, we consider that 3 consecutive bad answers in the training phase is enough to identify the worker as a spammer and to blacklist him/her. We report the average results of ZenCrowd when exploiting the training phase, constraints, and blacklisting in Table 3. As we can observe, precision and accuracy values are higher in all cases when compared to the agreement vote approach.
Table 3: Performance results for crowdsourcing with ZenCrowd over linkable entities.
            US Workers          Indian Workers
            P     R     A       P     R     A
GL News     0.84  0.87  0.90    0.67  0.64  0.78
US News     0.64  0.68  0.78    0.55  0.63  0.71
IN News     0.84  0.82  0.89    0.75  0.77  0.80
SW News     0.72  0.80  0.85    0.61  0.62  0.73
All News    0.80  0.81  0.88    0.64  0.62  0.76
Finally, we compare ZenCrowd to the state of the art…
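The blacklisting rule (3 consecutive bad answers in the training phase) amounts to tracking a streak, as in the sketch below. The function name and the boolean answer history are hypothetical; only the threshold of 3 comes from the slide.

```python
from typing import Iterable


def should_blacklist(answers: Iterable[bool], max_consecutive_bad: int = 3) -> bool:
    """True if the history contains `max_consecutive_bad` wrong answers in a row.

    `answers` lists, in order, whether each training answer was correct.
    """
    streak = 0
    for correct in answers:
        # A correct answer resets the streak; a wrong one extends it.
        streak = 0 if correct else streak + 1
        if streak >= max_consecutive_bad:
            return True
    return False


if __name__ == "__main__":
    print(should_blacklist([True, False, False, False]))   # three misses in a row
    print(should_blacklist([False, False, True, False]))   # streak was reset
```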
  
Comparing 3 matching techniques
• ZenCrowd best for 75% of documents
[Figure: Precision per Document (documents 1 to 25) for three techniques: Agr. Vote (simple crowdsourcing), ZenCrowd, and Top 1 (automatic)]
  
Lessons Learnt
• Crowdsourcing + Prob reasoning works!
• But
– Different worker communities perform differently
– Many low quality workers
– No differences w/ different contexts
– Completion time may vary (based on reward)
  
Conclusions
• ZenCrowd: Probabilistic reasoning over automatic and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over automatic
• 4% - 35% improvement over standard crowdsourcing
• 14% average improvement over automatic approaches
• Next steps
– Long-term worker behavior analysis
– More efficient and effective linking by LOD dataset pre-selection
http://diuf.unifr.ch/xi/zencrowd/
  
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 

ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking

  • 1. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking
      Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux
      eXascale Infolab, University of Fribourg, Switzerland
  • 2. Motivation
      • Linked Open Data (LOD)
      • Linking entities from text to LOD
          – Rich snippets
          – Entity-centric search
      • Linking
          – Algorithmic
          – Manual (NYT articles)
      [Figure: HTML+RDFa pages]
      22-Apr-12, Gianluca Demartini, eXascale Infolab
  • 3. [Figure: entities linked to LOD URIs: http://dbpedia.org/resource/Facebook; http://dbpedia.org/resource/Instagram, owl:sameAs fbase:Instagram; Google; Android]
      HTML:
      <p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>
      RDFa enrichment:
      <p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
  • 4. Crowdsourcing
      • Exploit human intelligence to solve
          – Tasks simple for humans, complex for machines
          – With a large number of humans (the Crowd)
          – Small problems: micro-tasks (Amazon MTurk)
      • Examples
          – Wikipedia, image tagging
      • Incentives
          – Financial, fun, visibility
  • 5. ZenCrowd
      • Combine both algorithmic and manual linking
      • Automate manual linking via crowdsourcing
      • Dynamically assess human workers with a probabilistic reasoning framework
      [Figure labels: Crowd, Algorithms, Machines]
  • 6. Outline
      • Related approaches
      • ZenCrowd: system architecture
      • Probabilistic model to combine automatic matching and crowdsourcing results
      • Experimental evaluation
  • 7. Related Approaches
      • Entity linking
      • Ad-hoc object retrieval
          – Keyword queries
          – Looking for a specific entity
          – IR indexing and ranking over RDF data
      • Crowdsourcing
          – Training set, tagging, annotation
          – IR evaluation
          – CrowdDB
  • 8. ZenCrowd Architecture
      [Figure: architecture diagram. Input: HTML pages. Output: HTML+RDFa pages. ZenCrowd components: Entity Extractors, Algorithmic Matchers, LOD Index, Decision Engine, Probabilistic Network, Micro-Task Manager. External: LOD Open Data Cloud; Crowdsourcing Platform (micro matching tasks, workers' decisions)]
  • 9. Algorithmic Matching
      • Inverted index over LOD entities
          – DBpedia, Freebase, Geonames, NYT
      • TF-IDF (IR ranking function)
      • Top-ranked URIs linked to entities in docs
      • Threshold on the ranking function or top N
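As a rough illustration of the matching step above, here is a minimal sketch of TF-IDF ranking over an inverted index of entity labels. The function names, the whitespace tokenizer, the exact IDF formula, and the third example URI are assumptions made for illustration; the slides do not specify how ZenCrowd implements its index.

```python
import math
from collections import Counter, defaultdict

def build_index(entities):
    """Build an inverted index: term -> {uri: term frequency in the label}."""
    index = defaultdict(dict)
    for uri, label in entities.items():
        for term, tf in Counter(label.lower().split()).items():
            index[term][uri] = tf
    return index, len(entities)

def rank(mention, index, n_entities, top_n=3, min_score=0.0):
    """Score candidate URIs for a mention by summed TF-IDF, keep the top N above a threshold."""
    scores = defaultdict(float)
    for term in mention.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_entities / len(postings))  # assumed IDF variant
        for uri, tf in postings.items():
            scores[uri] += tf * idf
    # Sort by descending score; break ties by URI so the output is deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(uri, s) for uri, s in ranked if s > min_score][:top_n]

# Toy entity collection; the example.org URI is hypothetical.
entities = {
    "http://dbpedia.org/resource/Facebook": "Facebook",
    "http://dbpedia.org/resource/Instagram": "Instagram",
    "http://example.org/Facebook_Platform": "Facebook Platform",
}
index, n = build_index(entities)
candidates = rank("Instagram", index, n, top_n=1)
```

Both cut-offs from the slide are exposed here: `top_n` limits the number of candidates and `min_score` acts as the threshold on the ranking function.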
  • 10. Entity Factor Graphs
      • Graph components
          – Workers, links, clicks
          – Prior probabilities
          – Link factors
          – Constraints
      • Probabilistic inference
          – Select all links with posterior prob > τ
      [Figure: example entity factor graph with 2 workers (w1, w2), 6 clicks (c11, c12, c13, c21, c22, c23), and 3 candidate links (l1, l2, l3). Worker priors: pw1(), pw2(). Link priors: pl1(), pl2(), pl3(). Link factors: lf1(), lf2(), lf3(). SameAs constraint: sa1-2(). Dataset unicity constraint: u2-3(). Clicks are the observed variables.]
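The selection rule "posterior prob > τ" can be illustrated with a deliberately simplified model. The sketch below treats each candidate link independently and combines a link prior with worker reliabilities in a naive Bayes fashion; it omits the SameAs and unicity constraints and is not the factor-graph inference used in ZenCrowd. All names and the data layout are assumptions.

```python
def link_posterior(link_prior, votes):
    """Posterior that a link is valid.

    votes: list of (worker_reliability, clicked_valid) pairs, where
    worker_reliability is the assumed probability that the worker answers correctly.
    """
    p_valid, p_invalid = link_prior, 1.0 - link_prior
    for reliability, clicked_valid in votes:
        if clicked_valid:
            p_valid *= reliability
            p_invalid *= 1.0 - reliability
        else:
            p_valid *= 1.0 - reliability
            p_invalid *= reliability
    total = p_valid + p_invalid
    return p_valid / total if total > 0 else 0.0

def select_links(candidates, tau=0.5):
    """Keep every candidate link whose posterior exceeds the threshold tau."""
    return [uri for uri, (prior, votes) in candidates.items()
            if link_posterior(prior, votes) > tau]

# Two candidate links judged by a reliable (0.9) and a weaker (0.6) worker.
candidates = {
    "l1": (0.5, [(0.9, True), (0.6, True)]),
    "l2": (0.5, [(0.9, False), (0.6, True)]),
}
selected = select_links(candidates, tau=0.5)
```

In this toy setting the reliable worker's rejection of `l2` outweighs the weaker worker's click, which is the intuition behind weighting clicks by worker priors rather than counting them equally.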
  • 11. Entity Factor Graphs
      • Training phase
          – Initialize worker priors with k matches on known answers
      • Updating worker priors
          – Use link decisions as new observations
          – Compute new worker probabilities
      • Identify (and discard) unreliable workers
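One simple way to picture the prior update is a running estimate of how often a worker's clicks agree with the known or decided answer. The sketch below uses a Laplace-smoothed agreement rate; the class name and the smoothing are assumptions chosen for illustration, not the update rule from the ZenCrowd model.

```python
class WorkerPrior:
    """Running reliability estimate for one worker (hypothetical helper)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def observe(self, agreed):
        """Record whether the worker's click agreed with the known or decided answer."""
        self.total += 1
        self.correct += int(agreed)

    @property
    def reliability(self):
        # Laplace smoothing: a worker with no history starts at 0.5.
        return (self.correct + 1) / (self.total + 2)

# Training phase: k = 5 tasks with known answers, 4 of them answered correctly.
worker = WorkerPrior()
for agreed in [True, True, False, True, True]:
    worker.observe(agreed)
prior_after_training = worker.reliability

# Later link decisions are fed back in as new observations.
worker.observe(False)
prior_after_update = worker.reliability
```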
  • 12. Experimental Evaluation
      • Datasets
          – 25 news articles from
              • CNN.com (global news)
              • NYTimes.com (global news)
              • Washington-post.com (US local news)
              • Timesofindia.indiatimes.com (India local news)
              • Swissinfo.com (Switzerland local news)
          – 40M entities (Freebase, DBpedia, Geonames, NYT)
      • 80 workers on Amazon MTurk, $0.01 per entity
      • Standard evaluation measures: precision, recall, accuracy
      • Gold standard: editorial selection
  • 13. The micro-task
  • 14. Experimental Evaluation
      • Automatic approach: P/R trade-off
      [Figure: precision (P) and recall (R) versus top N results, N = 1 to 5]
  • 15. Experimental Evaluation
      • Entity linking with crowdsourcing and agreement vote (at least 2 out of 5 workers select the same URI)
      Top-1 precision: 0.70
      Paper excerpt: [...] the URIs with at least 2 votes are selected as valid links (we tried various thresholds and manually picked 2 in the end since it leads to the highest precision scores while keeping good recall values for our experiments). We report on the performance of this crowdsourcing technique in Table 2. The values are averaged over all linkable entities for different document types and worker communities.
      Table 2: Performance results for crowdsourcing with agreement vote over linkable entities (P = precision, R = recall, A = accuracy).
                    US Workers          Indian Workers
                    P     R     A       P     R     A
      GL News       0.79  0.85  0.77    0.60  0.80  0.60
      US News       0.52  0.61  0.54    0.50  0.74  0.47
      IN News       0.62  0.76  0.65    0.64  0.86  0.63
      SW News       0.69  0.82  0.69    0.50  0.69  0.56
      All News      0.74  0.82  0.73    0.57  0.78  0.59
      The first question we examine is whether there is a difference in reliability between the various populations of work- [...]
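The agreement vote described on this slide is straightforward to sketch: count how many workers selected each URI and keep those with at least 2 votes. The function name and the input layout (one set of selected URIs per worker) are assumptions for illustration.

```python
from collections import Counter

def agreement_vote(selections, min_votes=2):
    """Return the URIs selected by at least min_votes workers.

    selections: one set of URIs per worker, so each worker counts once per URI.
    """
    counts = Counter(uri for selected in selections for uri in selected)
    return sorted(uri for uri, votes in counts.items() if votes >= min_votes)

# Five workers judge the candidate links for one entity mention.
selections = [
    {"http://dbpedia.org/resource/Instagram"},
    {"http://dbpedia.org/resource/Instagram", "http://dbpedia.org/resource/Facebook"},
    {"http://dbpedia.org/resource/Facebook"},
    set(),  # this worker selected no link
    {"http://dbpedia.org/resource/Android"},
]
valid_links = agreement_vote(selections, min_votes=2)
```

Unlike the probabilistic approach, every click carries the same weight here regardless of the worker's track record.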
  • 16. Experimental Evaluation
      • Worker community and textual context
      [Figure: recall versus precision for US and Indian workers]
      [Figure: precision per document (documents 1 to 10) for the simple and snippet contexts]
  • 17. Experimental Evaluation
      • Worker selection
      [Figure: worker precision versus number of tasks for US and IN workers, with the top US worker highlighted]
      [Figure: precision versus top K workers, K = 1 to 9]
  • 18. Experimental Evaluation
      • Entity linking with ZenCrowd
          – Training with first 5 entities + 5% afterwards
          – 3 consecutive bad answers lead to blacklisting
      Paper excerpt: [...] clicks on the links, hence generating noise in our system. In our experiments, we consider that 3 consecutive bad answers in the training phase is enough to identify the worker as a spammer and to blacklist him/her. We report the average results of ZenCrowd when exploiting the training phase, constraints, and blacklisting in Table 3. As we can observe, precision and accuracy values are higher in all cases when compared to the agreement vote approach.
      Table 3: Performance results for crowdsourcing with ZenCrowd over linkable entities (P = precision, R = recall, A = accuracy).
                    US Workers          Indian Workers
                    P     R     A       P     R     A
      GL News       0.84  0.87  0.90    0.67  0.64  0.78
      US News       0.64  0.68  0.78    0.55  0.63  0.71
      IN News       0.84  0.82  0.89    0.75  0.77  0.80
      SW News       0.72  0.80  0.85    0.61  0.62  0.73
      All News      0.80  0.81  0.88    0.64  0.62  0.76
      Finally, we compare ZenCrowd to the state of the art [...]
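The blacklisting rule (3 consecutive bad answers in the training phase) can be sketched as a streak counter over a worker's training answers. The function name and the boolean encoding of answers are assumptions for illustration.

```python
def is_blacklisted(training_answers, max_consecutive_bad=3):
    """Return True once a worker gives max_consecutive_bad wrong answers in a row.

    training_answers: iterable of booleans, True when the answer matched the known one.
    """
    streak = 0
    for correct in training_answers:
        streak = 0 if correct else streak + 1
        if streak >= max_consecutive_bad:
            return True
    return False

# A correct answer in between resets the streak, so only the first worker is flagged.
spammer = is_blacklisted([True, False, False, False])
honest = is_blacklisted([False, False, True, False])
```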
  • 19. Comparing 3 matching techniques
      • ZenCrowd best for 75% of documents
      [Figure: precision per document (documents 1 to 25) for agreement vote (simple crowdsourcing), ZenCrowd, and top-1 (automatic)]
  • 20. Lessons Learnt
      • Crowdsourcing + probabilistic reasoning works!
      • But
          – Different worker communities perform differently
          – Many low-quality workers
          – No differences with different contexts
          – Completion time may vary (based on reward)
  • 21. Conclusions
      • ZenCrowd: probabilistic reasoning over automatic and crowdsourcing methods for entity linking
      • Standard crowdsourcing improves 6% over automatic
      • 4% to 35% improvement over standard crowdsourcing
      • 14% average improvement over automatic approaches
      • Next steps
          – Long-term worker behavior analysis
          – More efficient and effective linking by LOD dataset pre-selection
      http://diuf.unifr.ch/xi/zencrowd/