Human Computation for Big Data

Human Computation for Big Data
Gianluca Demartini
eXascale Infolab
University of Fribourg, Switzerland
gianlucademartini.net
exascale.info
CUSO Seminar on Big Data – May 23, 2014 – Fribourg

Gianluca Demartini
• M.Sc. at University of Udine, Italy
• Ph.D. at University of Hannover, Germany
– Entity Retrieval
• Worked for UC Berkeley (on Crowdsourcing), Yahoo! Research
(Spain), L3S Research Center (Germany)
• Post-doc at the eXascale Infolab, Uni Fribourg, Switzerland.
• Lecturer for Social Computing in Fribourg
• Tutorial on Entity Search at ECIR 2012, on Crowdsourcing at
ESWC 2013 and ISWC 2013
• Research Interests
– Information Retrieval, Semantic Web, Human Computation
2
demartini@exascale.info
Gianluca Demartini

Web of Data
• Freebase
– Acquired by Google in July 2010.
– Knowledge Graph launched in May 2012.
• Schema.org
– Driven by major search engine companies
– Machine-readable annotations of Web pages
• Linked Open Data
– 31 billion triples, Sept. 2011
• Volume and Variety
Gianluca Demartini 3

Linked Open Data
Z. Kaoudi and I. Manolescu, ICDE seminar 2013 4

LOD data is an enormous graph
• Subject – Predicate – Object
– Barack Obama – marriedTo – Michelle Obama
• Specific scalable DB systems exist
e1
e2
e3
p1 p2
p3
e4

I will talk about
• Micro-task Crowdsourcing
• Hybrid Human-Machine systems
• Entity Linking/Disambiguation
– On the Web using crowdsourcing
• Improving Crowdsourcing Platform Quality
– Pushing tasks to workers
• Research directions
– Crowdsourced Query Understanding
– Transactive Search

Crowdsourcing
• Exploit human intelligence to solve
– Tasks simple for humans, complex for machines
– With a large number of humans (the Crowd)
– Small problems: micro-tasks (Amazon MTurk)
• Examples
– Wikipedia, Image tagging
• Incentives
– Financial, fun, visibility

Case-Study: Amazon MTurk
• Micro-task crowdsourcing marketplace
• On-demand, scalable, real-time workforce
• Different crowd motivation (not just money)
• Online since 2005 (still in “beta”)
• Currently the most popular platform
• Developer’s API as well as GUI
8Gianluca Demartini

Amazon MTurk
9Gianluca Demartini

A Task on MTurk

Amazon Mturk Workflow
• Requesters create tasks (HITs)
• Workers preview, accept, submit HITs
• Requesters approve, download results
11Gianluca Demartini

Example: Hybrid Image Search
Yan, Kumar, Ganesan, CrowdSearch: Exploiting Crowds for Accurate Real-time Image
Search on Mobile Phones, Mobisys 2010.

Not sure
Example: Hybrid Data Integration
paper conf
Data integration VLDB-01
Data mining SIGMOD-02
title author email
OLAP Mike mike@a
Social media Jane jane@b
 Generate plausible matches
– paper = title, paper = author, paper = email, paper = venue
– conf = title, conf = author, conf = email, conf = venue
 Ask users to verify
paper conf
Data integration VLDB-01
Data mining SIGMOD-02
title author email venue
OLAP Mike mike@a ICDE-02
Social media Jane jane@b PODS-05
Does attribute paper match attribute author?
NoYes
McCann, Shen, Doan: Matching Schemas in Online Communities. ICDE, 2008 13

Hybrid Systems: Key Issues
• The role of machine (i.e., algorithm) and
humans
– use only humans? both? who’s doing what?
• Quality control
• Payment
• Optimization: What to crowdsource
• Scalability: How much to crowdsource

http://dbpedia.org/resource/Facebook
http://dbpedia.org/resource/Instagram
fbase:Instagram
owl:sameAs
Google
Android
Facebook is not waiting for its initial
public offering to make its first big
purchase.In its largest
acquisition to date, the social network
has purchased Instagram, the popular
photo-sharing application, for about $1
billion in cash and stock, the company
said Monday.
<cit
e property=”rdfs:label">Facebook</cite> is not
waiting for its initial public offering to make its first
big purchase.In
its largest acquisition to date, the social network has
purchased <cite
property=”rdfs:label">Instagram</cite> , the popular
photo-sharing application, for about $1 billion in cash
and stock, the company said Monday.
RDFa
enrichment
HTML:

ZenCrowd
• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a
probabilistic reasoning framework
17
Crowd
AlgorithmsMachines
Gianluca Demartini

ZenCrowd Architecture
Micro
Matching
Tasks
HTML
Pages
HTML+ RDFa
Pages
LOD Open Data Cloud
Crowdsourcing
Platform
Z enCrowd
Entity
Extractors
LOD Index Get Entity
Input Output
Probabilistic
Network
Decision Engine
Micro-
TaskManager
Workers Decisions
Algorithmic
Matchers
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic
Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on
World Wide Web (WWW 2012).

Entity Factor Graphs
• Graph components
– Workers, links, clicks
– Prior probabilities
– Link Factors
– Constraints
• Probabilistic
Inference
– Select all links with
posterior prob >τ
w1
w2
l1
l2
pw1( ) pw2( )
lf1( ) lf2( )
pl1( ) pl2( )
l3
lf3( )
pl3( )
c11
c22
c12
c21
c13
c23
u2-3( )sa1-2( )
2 workers, 6 clicks, 3 candidate links
Link priors
Worker
priors
Observed
variables
Link
factors
SameAs
constraints
Dataset
Unicity
constraints

Experimental Evaluation
• Datasets
– 25 news articles from
• CNN.com (Global news)
• NYTimes.com (Global news)
• Washington-post.com (US local news)
• Timesofindia.indiatimes.com (India news)
• Swissinfo.com (Switzerland local news)
– 40M entities (Freebase, DBPedia, Geonames, NYT)

Worker Selection
Top US
Worker
0
0.5
1
0 250 500
WorkerPrecision
Number of Tasks
US Workers
IN Workers
0.6
0.62
0.64
0.66
0.68
0.7
0.72
0.74
0.76
0.78
0.8
1 2 3 4 5 6 7 8 9Precision
Top K workers

Lessons Learnt
• Crowdsourcing + Prob reasoning works!
• But
– Different worker communities perform differently
– Many low quality workers
– Completion time may vary (based on reward)
• Need to find the right workers for your task
(see WWW13 paper)

ZenCrowd Summary
• ZenCrowd: Probabilistic reasoning over automatic
and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over automatic
• 4% - 35% improvement over standard crowdsourcing
• 14% average improvement over automatic
approaches
http://exascale.info/zencrowd/

Blocking for Instance Matching
• Find the instances about the same real-world
entity within two datasets
• Avoid Comparison of all possible pairs
– Step 1: cluster similar items using a cheap
similarity measure
– Step 2: n*n comparison within the clusters with
an expensive measure

Three-stage blocking with the Crowd
for Data Integration
• 1. Cheap clustering/inverted index selection of
candidates
• 2. Expensive similarity measure
• 3. Crowdsource low confidence matches
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. Large-Scale Linked
Data Integration Using Probabilistic Reasoning and Crowdsourcing. In: VLDB Journal, Volume 22,
Issue 5 (2013), Page 665-687, Special issue on Structured, Social and Crowd-sourced Data on the
Web. October 2013.

Improving Crowdsourcing
Platforms

Pull (Traditional) Crowdsourcing
• In MTurk HITs are published on the market
• The first worker willing to do it can take it
• Pro: Fast
• Con: Not necessarily optimal / not the best
worker for the task

Push Crowdsourcing
• Pick-A-Crowd: A system architecture that uses
Task-to-Worker matching:
– The worker’s social profile
– The task context
• Workers can provide higher quality answers
on tasks they relate to
28
Djellel Eddine Difallah, Gianluca Demartini, and Philippe Cudré-Mauroux. Pick-A-Crowd: Tell Me
What You Like, and I'll Tell You What to Do. In: 22nd International Conference on World Wide
Web (WWW 2013), Rio de Janeiro, Brazil, May 2013.

Matching Models–
Expert Finding
• Build an inverted index on the pages’ titles and description
• Use the title/description of the tasks as a key word query on the
inverted index and get a subset of pages
• Rank the workers by the number of liked pages in the subset
29

Discussion
• Pull vs. Push methodologies in Crowdsourcing
• Pick-A-Crowd system architecture with Task-
to-Worker recommendation
• Experimental comparison with AMT shows a
consistent quality improvement
“Workers Know what they Like”
31
www.openturk.com

OpenTurk
• Yet another a platform? Build on top of Mturk!
• Chrome Extension for push / notification
• 400+ users
• http://bit.ly/openturk-extension
• Open source:
https://github.com/openturk/extension

CrowdQ: Crowdsourced Query
Understanding

birthdate of the mayor of the capital city of italy

capital city of italy

mayor of rome

birthdate of ignazio marino

Motivation
• Web Search Engines can answer simple factual
queries directly on the result page
• Users with complex information needs are
often unsatisfied
• Purely automatic techniques are not enough
• We want to solve it with Crowdsourcing!

CrowdQ
• CrowdQ is the first system that uses
crowdsourcing to
– Understand the intended meaning
– Build a structured query template
– Answer the query over Linked Open Data
Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ:
Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems
Research (CIDR 2013).

Hybrid Human-Machine Pipeline
Q= birthdate of actors of forrest gump
Query annotation Noun Noun Named entity
Verification
Entity Relations
Is forrest gump this entity in the query?
Which is the relation between: actors and forrest gump starring
Schema element Starring <dbpedia-owl:starring>
Verification Is the relation between:
Indiana Jones – Harrison Ford
Back to the Future – Michael J. Fox
of the same type as
Forrest Gump - actors

Structured query generation
SELECT ?y ?x
WHERE { ?y <dbpedia-owl:birthdate> ?x .
?z <dbpedia-owl:starring> ?y .
?z <rdfs:label> ‘Forrest Gump’ }
Results from BTC09:
Q= birthdate of actors of forrest gump

Transactive Search

Transactive Search
• What if the data to answer your query is not
stored on any digital support?
• What if the data is just in people minds?
• Big Data No Data

Transactive Search
• Search using Transactive (group) Memories
• “Who attended the WWW 2014 conference?”
• Machines: Harvest the Web + Data Mining
• Crowd: Search twitter, look at event pictures
• Transactive Memories: Remember who I met
Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, and
Philippe Cudré-Mauroux. Hippocampus: Answering Memory Queries using Transactive Search.
In: 23rd International Conference on World Wide Web (WWW 2014), Web Science Track. Seoul,
South Korea, April 2014.

Transactive Search (2)

Transactive Search (3)

Discussion
• Sometime data is not on the Web
• The right group of people can still answer
– Collaboratively
– Using Transactive Search
– Better than machines or anonymous crowds
• Open challenges
– Incentives
– Repeatability
– SNA

Research Directions for
Micro-task Crowdsourcing

State of Micro-task Crowdsourcing
• Platform side
– Pull platforms
– Batch processing
• Worker side
– Work flexibility
– Anonymity
• Requester side
– Web/API

The Future for Requesters
• Push Platforms
– RecSys, User Modeling, Trust
• Mobile Access
• Quality and Time guarantees
• Worker API (enable novel worker UI)

51
The Future of the Worker side
• Reputation system for workers
• More than financial incentives
• Recognize worker potential (badges)
– Paid for their expertise
• Train less skilled workers (tutoring system)
Aniket Kittur et al. The Future of Crowd Work.
CSCW 2013. Gianluca Demartini

Crowdsourcing Ethics
• People work full-time as crowd workers
• Chinese crowdsourcing platform with 5.5M workers
• Pros
– Help developing countries
– Provide cash fast to people == short-term satisfaction
– Job Flexibility
• Cons
– No job security
– No social security
– Long term satisfaction? Career plans?
Dagstuhl Seminar on “Crowdsourcing: From Theory to Practice and Long-Term Perspectives”,
September 2013.

Conclusions
• Structured Data makes the Web better
• It’s growing fast
– Large volume
– Large heterogeneity
• Crowds can help understanding data semantics
• Hybrid human-machine systems (ZenCrowd)
• Research opportunities:
– Exploit Human Intelligence at Scale (CrowdQ)
– Pick the right crowd (Pick-A-Crowd, Transactive Search)
gianlucademartini.net
demartini@exascale.infoGianluca Demartini 53

Human Computation for Big Data

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (10)

Ähnlich wie Human Computation for Big Data

Ähnlich wie Human Computation for Big Data (20)

Mehr von eXascale Infolab

Mehr von eXascale Infolab (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Human Computation for Big Data

Hinweis der Redaktion