Data management efforts such as Master Data Management (MDM) are a popular approach to achieving high-quality enterprise data. However, MDM can be heavily centralized and labour intensive, and its cost and effort can become prohibitively high. The concentration of data management and stewardship onto a few highly skilled individuals, such as developers and data experts, can be a significant bottleneck. This talk explores how to effectively involve a wider community of users in collaborative data management activities. The bottom-up approach of involving crowds in the creation and management of data has been demonstrated by projects like Freebase, Wikipedia, and DBpedia. The talk discusses how collaborative data management can be applied within an enterprise context using platforms such as Amazon Mechanical Turk, MobileWorks, and internal enterprise human computation platforms.
Topics covered include:
- Introduction to Crowdsourcing and Human Computation for Data Management
- Crowds vs. Communities: when to use them and why
- Push vs. Pull methods of crowdsourcing data management
- Setting up and running a collaborative data management process
- Modelling the expertise of communities
2. Digital Enterprise Research Institute www.deri.ie
Enabling networked knowledge
Overview
- Problems with Data
  - Master Data Management
- Crowdsourcing
- Collaborative Data Management
- Setting up a CDM Process
- Future Directions
3. The Problems with Data
Knowledge workers need:
- Access to the right data
- Confidence in that data
Flawed data affects 25% of critical data in the world’s top companies.
Data quality’s role in the recent financial crisis:
- “Assets are defined differently in different programs”
- “Numbers did not always add up”
- “Departments do not trust each other’s figures”
- “Figures … not worth the pixels they were made of”
4. Master Data Management
- Master Data Management is a process that can improve data quality
- What is Data Quality?
  - Desirable characteristics for an information resource
  - Described as a series of quality dimensions
    - Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
5. Master Data Management
[Diagram: Data Quality in the Master Data Management process]
- Steps: Profile Sources, Define Mappings, Cleanse, Enrich, De-duplicate, Define Rules
- Roles: Data Developer, Data Steward, Data Governance, Business Users
- Other labels: Master Data, Product Data, Applications
6. Data Quality
Source A
ID    PNAME      PCOLOR  PRICE
APNR  iPod Nano  Red     150
APNS  iPod Nano  Silver  160
Source B
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
[Diagram: the records from both sources side by side]
- APNR: iPod Nano, Red, 150
- APNS: iPod Nano, Silver, 160
- IPN890: iPod Nano, 150, generation 5
Questions: Schema Difference? Value Conflicts? Entity Duplication?
Roles: Data Developer (Technical), Data Steward (Domain (Technical)), Business Users (Domain)
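The three questions on this slide can be made concrete with a small sketch. It is illustrative only: the common schema (`id`, `name`, `colour`, `price`, `generation`) and the helper names are assumptions, not part of the talk. The schema mapping is the kind of work a data developer does once; the pairs it produces are the kind of question a data steward or business user would then answer.

```python
import xml.etree.ElementTree as ET

# Source A: the two table rows from the slide.
SOURCE_A = [
    {"ID": "APNR", "PNAME": "iPod Nano", "PCOLOR": "Red", "PRICE": 150},
    {"ID": "APNS", "PNAME": "iPod Nano", "PCOLOR": "Silver", "PRICE": 160},
]

# Source B: the XML fragment from the slide.
SOURCE_B = """
<Product name="iPod Nano">
  <Items>
    <Item code="IPN890">
      <price>150</price>
      <generation>5</generation>
    </Item>
  </Items>
</Product>
"""


def from_source_a(row):
    """Map a Source A row onto the common schema (hypothetical field names)."""
    return {
        "id": row["ID"],
        "name": row["PNAME"],
        "colour": row["PCOLOR"],
        "price": float(row["PRICE"]),
        "generation": None,  # Source A has no generation column
    }


def from_source_b(xml_text):
    """Map each <Item> in a Source B document onto the common schema."""
    product = ET.fromstring(xml_text.strip())
    return [
        {
            "id": item.get("code"),
            "name": product.get("name"),
            "colour": None,  # Source B has no colour attribute
            "price": float(item.findtext("price")),
            "generation": int(item.findtext("generation")),
        }
        for item in product.iter("Item")
    ]


def review_queue(a_records, b_records):
    """Pair records that share a name and flag price conflicts for a human to judge."""
    return [
        {"a": a["id"], "b": b["id"], "price_conflict": a["price"] != b["price"]}
        for a in a_records
        for b in b_records
        if a["name"].casefold() == b["name"].casefold()
    ]


queue = review_queue([from_source_a(r) for r in SOURCE_A], from_source_b(SOURCE_B))
for pair in queue:
    print(pair)
```

The code only proposes candidate pairs. Whether IPN890 is the same product as APNR or APNS is a domain judgment, which is the part the talk suggests handing to people.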
7. Master Data Management
- Pros
  - Can create a single version of truth
  - Standardized information creation and management
  - Improves data quality
- Cons
  - Significant upfront costs and effort
  - Participation limited to a few (mostly) technical experts
  - Difficult to scale to large data sources
    - Extended Enterprise, e.g. partners, data vendors
  - Small % of data under management (e.g. CRM, Product, …)
8. Enterprise Data Landscape
[Diagram: nested layers of enterprise data]
- The Managed (MDM): reference data managed through well-defined policies and a governance council
- The Known (Enterprise Data): data directly managed by the enterprise and its departments
- The Reality (Relevant External Data): all data relevant to the enterprise and its operations
10. Crowdsourcing Industry Landscape
11. Introduction to Crowdsourcing
- Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user can’t)
- A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals
- Related Areas
  - Collective Intelligence
  - Social Computing
  - Human Computation
  - Data Mining
12. When Computers Were Human
- Maskelyne, 1760
  - Used human computers to create an almanac of moon positions
    - Used for shipping/navigation
  - Quality assurance
    - Do calculations twice
    - Compare to third verifier
14. Human vs Machine Affordances
Human
- Visual perception
- Visuospatial thinking
- Audiolinguistic ability
- Sociocultural awareness
- Creativity
- Domain knowledge
Machine
- Large-scale data manipulation
- Collecting and storing large amounts of data
- Efficient data movement
- Bias-free analysis
15. When to Crowdsource?
- Computers cannot do the task
- A single person cannot do the task
- Work can be split into smaller tasks
19. ReCaptcha
- OCR
  - ~1% error rate
  - 20%-30% for 18th and 19th century books
- 40 million ReCAPTCHAs every day (2008)
  - Fixing 40,000 books a day
20. Generic Architecture
[Diagram: Requestors, Platform/Marketplace (Publish Task, Task Management), and Workers, connected by steps 1 to 4]
23. COLLABORATIVE DATA MANAGEMENT
24.
- Collaborative knowledge base maintained by a community of web users
- Users create entity types and their meta-data according to guidelines
- Requires administrative approvals for schema changes by end users
25.
26.
27.
28. Wikipedia
- Collaboratively built by a large community
  - More than 19,000,000 articles, 270+ languages, 3,200,000+ articles in English
  - More than 157,000 active contributors
- Accuracy and stylistic formality are equivalent to expert-based resources
  - i.e. Columbia and Britannica encyclopedias
- MediaWiki
  - Software behind Wikipedia
  - Widely used inside organizations
  - Intellipedia: 16 U.S. Intelligence agencies
  - Wiki Proteins: curated protein data for knowledge discovery
29. DBpedia Knowledge Base
- DBpedia provides direct access to data
  - Indirectly uses the wiki as a data curation platform
  - Inherits a massive volume of curated Wikipedia data
  - 3.4 million entities and 1 billion RDF triples
  - Comprehensive data infrastructure
    - Concept URIs
    - Definitions
    - Basic types
30.
31. A Bottom-up Approach to MDM
Engage more human workers to collaboratively manage enterprise data
[Chart: Data Control (Top-down to Bottom-up) against Number of Participants]
- MDM: top-down, 10s-100s of participants
- Collaborative Enterprise Data Management: bottom-up, 10,000s-100,000s of participants
32. Emerging Enterprise Data Landscape
[Diagram: nested layers of enterprise data, with a Collaboratively Managed layer added]
- The Managed (MDM): reference data managed through well-defined policies and a governance council
- The Known (Enterprise Data): data directly managed by the enterprise and its departments
- The Reality (Relevant External Data): all data relevant to the enterprise and its operations
- Collaboratively Managed
33. Algorithm + Crowd
[Diagram: Data Sources pass through Data Quality Algorithms and Human Computation to produce Clean Data]
- People involved: Developers, Data Governance, Internal Community, External Crowd
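One common way to combine the two halves of this diagram is to let the algorithm keep the cases it is confident about and send the rest to people. The sketch below assumes that pattern; the confidence threshold, the toy colour-normalising algorithm, and the canned `crowd_stub` are all hypothetical stand-ins, not something the slides specify.

```python
def clean(records, algorithm, ask_crowd, threshold=0.9):
    """Accept confident algorithmic fixes; route the rest to human workers."""
    cleaned, escalated = [], []
    for record in records:
        suggestion, confidence = algorithm(record)
        if confidence >= threshold:
            cleaned.append(suggestion)
        else:
            escalated.append(record)
            cleaned.append(ask_crowd(record, suggestion))
    return cleaned, escalated


KNOWN_COLOURS = {"red": "Red", "silver": "Silver"}


def colour_algorithm(raw):
    """Toy data quality algorithm: exact matches are trusted, anything else is a guess."""
    key = raw.strip().lower()
    if key in KNOWN_COLOURS:
        return KNOWN_COLOURS[key], 1.0
    return raw.strip().title(), 0.3


def crowd_stub(raw, suggestion):
    """Stand-in for a human computation platform; the answers are canned."""
    answers = {"slvr": "Silver"}
    return answers.get(raw, suggestion)


cleaned, escalated = clean([" red", "SILVER", "slvr"], colour_algorithm, crowd_stub)
print(cleaned)
print(escalated)
```

In a real deployment `ask_crowd` would publish a task and wait for an aggregated answer, so only the escalated share of the data incurs human cost and latency.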
34. Examples of CDM Tasks
- Understanding customer sentiment for the launch of a new product around the world
- Implemented a 24/7 sentiment analysis system with workers from around the world
- Categorize millions of products in eBay’s catalog with accurate and complete attributes
- Combine the crowd with machine learning to create an affordable and flexible catalog quality system
35. Examples of CDM Tasks
- Natural Language Processing
  - Dialect Identification, Spelling Correction, Machine Translation, Word Similarity
- Computer Vision
  - Image Similarity, Image Annotation/Analysis
- Classification
  - Data attributes, Improving taxonomy, search results
- Verification
  - Entity consolidation, de-duplicate, cross-check, validate data
- Enrichment
  - Judgments, annotation
37. Core Design Questions of CDM
- What (Goal)
- Who (Workers)
- Why (Incentives)
- How (Process)
Malone, T. W., Laubacher, R., & Dellarocas, C. N. Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09, (2009).
38. Who is doing it? (Workers)
- Hierarchy (Assignment)
  - Someone in authority assigns a particular person or group of people to perform the task
  - Within the Enterprise
- Crowd (Choice)
  - Anyone in a large group who chooses to do so
  - Internal or External Crowds
39. Why are they doing it? (Incentives)
- Motivation
  - Money ($$££)
  - Glory (reputation/prestige)
  - Love (altruism, socializing, enjoyment)
  - Unintended by-product (e.g. re-Captcha, captured in workflow)
  - Self-serving resources (e.g. Wikipedia, product/customer data)
- Determine pay and time for each task
  - Marketplace: Delicate balance
    - Money does not improve quality but can increase participation
  - Internal Hierarchy: Engineering opportunities for recognition
    - Performance review, prizes for top contributors, badges, leaderboards, etc.
40. Effect of Payment on Quality
- Cost does not affect quality [Mason and Watts, 2009, AdSafe]
- Similar results for bigger tasks [Ariely et al, 2009]
[Panos Ipeirotis. WWW2011 tutorial]
41. What is being done? (Goal)
- Creation Tasks
  - Create/Generate
  - Find
  - Improve / Edit / Fix
- Decision (Vote) Tasks
  - Accept / Reject
  - Thumbs Up / Thumbs Down
  - Vote for Best
42. How is it being done? (Process)
- Tasks integrated into the normal workflow of those creating and managing data
  - Can be as simple as vetting or “rating” the results of an algorithm
- Task Design
  - Task Interface
  - Task Assignment/Routing
  - Task Quality Assurance
43. Task Design
[Diagram: Input, then Task Router (before computation), Task Interface (during computation), and Output Aggregation (after computation), then Output]
* Edith Law and Luis von Ahn, Human Computation - Core Research Questions and State of the Art
44. Pull Routing
- Workers seek tasks and assign them to themselves
  - Search and discovery of tasks supported by the platform
  - Task Recommendation
  - Peer Routing
[Diagram labels: Algorithm, Tasks, Search & Browse Interface, Workers, Select, Result]
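Pull routing can be pictured as a task board that workers search and claim from. The following is a minimal in-memory sketch; the `TaskBoard` class, its method names, and the sample tasks are hypothetical and do not correspond to any particular platform's API.

```python
class TaskBoard:
    """In-memory stand-in for a platform's search-and-browse interface."""

    def __init__(self, tasks):
        self.open_tasks = dict(tasks)  # task id -> description
        self.assigned = {}             # task id -> worker name

    def search(self, keyword):
        """Return ids of open tasks whose description mentions the keyword."""
        needle = keyword.lower()
        return sorted(
            task_id
            for task_id, description in self.open_tasks.items()
            if needle in description.lower()
        )

    def claim(self, task_id, worker):
        """A worker assigns an open task to themselves."""
        description = self.open_tasks.pop(task_id)  # KeyError if already taken
        self.assigned[task_id] = worker
        return description


board = TaskBoard({
    "t1": "Check whether two product records are duplicates",
    "t2": "Label the sentiment of a customer review",
    "t3": "Confirm the colour of a product",
})
matches = board.search("product")
board.claim(matches[0], "ann")
print(matches, board.assigned)
```

The platform's role here is limited to making tasks discoverable; the choice of who does what stays with the workers.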
45. Push Routing
- System assigns tasks to workers based on:
  - Past performance
  - Expertise
  - Cost
  - Latency
[Diagram labels: Algorithm, Tasks, Task Interface, Workers, Assign, Result]
* www.mobileworks.com
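The four criteria for push routing can be combined into a single score per worker. The sketch below uses a simple weighted sum, which is an assumption made for illustration; the `Worker` fields, the weights, and the sample workers are hypothetical, and real platforms may route quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    accuracy: float                              # past performance, share of correct answers
    expertise: set = field(default_factory=set)  # topics the worker knows
    cost: float = 0.0                            # price per task
    latency: float = 0.0                         # typical response time in minutes


def score(worker, topic, weights):
    """Higher is better: reward accuracy and expertise, penalise cost and latency."""
    expertise_match = 1.0 if topic in worker.expertise else 0.0
    return (
        weights["accuracy"] * worker.accuracy
        + weights["expertise"] * expertise_match
        - weights["cost"] * worker.cost
        - weights["latency"] * worker.latency
    )


def assign(topic, workers, weights):
    """Push the task to the best-scoring worker."""
    return max(workers, key=lambda w: score(w, topic, weights))


WEIGHTS = {"accuracy": 1.0, "expertise": 1.0, "cost": 1.0, "latency": 0.01}
WORKERS = [
    Worker("alice", accuracy=0.90, expertise={"products"}, cost=0.10, latency=5),
    Worker("bob", accuracy=0.95, expertise={"customers"}, cost=0.05, latency=2),
]

print(assign("products", WORKERS, WEIGHTS).name)
print(assign("customers", WORKERS, WEIGHTS).name)
```

Changing the weights shifts the trade-off: a larger latency weight favours fast responders, a larger cost weight favours cheap ones.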
46. Managing Task Quality Assurance
- Redundancy: Quorum Votes
  - Replicate the task (e.g. 3 times)
  - Use majority voting to determine the right value (% agreement)
  - Weighted majority vote
- Gold Data / Honey Pots
  - Inject trap questions to test quality
  - Worker fatigue check (habit of saying no all the time)
- Estimation of Worker Quality
  - Redundancy plus gold data
- Qualification Test
  - Use test tasks to determine a user’s ability for such tasks
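The redundancy and gold-data techniques above fit in a few lines. This is a minimal sketch under simple assumptions: answers are exact-match labels, worker quality is the share of gold questions answered correctly, and workers with no gold history get a hypothetical default weight of 0.5.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most common answer and its share of the votes (% agreement)."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


def worker_accuracy(gold, responses):
    """Estimate worker quality from gold questions.

    gold: {task id: correct answer}
    responses: {worker: {task id: answer}}
    """
    accuracy = {}
    for worker, given in responses.items():
        graded = [task for task in given if task in gold]
        if graded:
            correct = sum(given[task] == gold[task] for task in graded)
            accuracy[worker] = correct / len(graded)
    return accuracy


def weighted_vote(votes, weights, default_weight=0.5):
    """Majority vote where each worker's vote counts in proportion to their weight."""
    totals = defaultdict(float)
    for worker, answer in votes.items():
        totals[answer] += weights.get(worker, default_weight)
    return max(totals, key=totals.get)


GOLD = {"g1": "yes", "g2": "no"}
RESPONSES = {
    "ann": {"g1": "yes", "g2": "no", "t1": "Red"},
    "bob": {"g1": "no", "g2": "no", "t1": "Silver"},
    "cat": {"g1": "no", "g2": "yes", "t1": "Silver"},
}

quality = worker_accuracy(GOLD, RESPONSES)
t1_votes = {worker: given["t1"] for worker, given in RESPONSES.items()}

print(majority_vote(list(t1_votes.values())))
print(weighted_vote(t1_votes, quality))
```

In this toy data the plain majority picks Silver while the weighted vote picks Red, because the one worker who answered every gold question correctly outweighs the two who did not.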
47. Future Directions
- Task Management
  - Task assignment, payment, routing
    - Optimizing for Cost, Quality, Completion Time
- Human–Computer Interaction
  - Payment / incentives
  - User interface and interaction design
  - Worker reputation, recruitment, retention
- Quality Control
  - Trust, reliability, spam detection, consensus
48. Summary
- Collaborative Data Management
  - Emerging trend for data management in the Enterprise
  - Crowdsourcing + Micro Tasks
  - A number of emerging platforms to assist
[Diagram: Dirty Data passes through Data Quality Algorithms and Human Computation to become Clean Data]
49. About the Presenter
Edward is a research scientist at the Digital Enterprise Research Institute. His areas of research include green IT/IS, energy informatics, linked data, integrated reporting, and cloud computing.
He has worked extensively with industry and government advising on the adoption patterns, practicalities and benefits of new technologies. He has published in leading journals and books, and has spoken at international conferences including the MIT CIO Symposium.
URL: www.edwardcurry.org
Email: edcurry@acm.org
Twitter: @EdwardACurry
Slides: slideshare.net/edwardcurry
50. Selected References
- Big Data & Data Quality
  - S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the Path from Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, pp. 21–32, 2011.
  - A. Haug and J. S. Arlbjørn, “Barriers to master data quality,” Journal of Enterprise Information Management, vol. 24, no. 3, pp. 288–303, 2011.
  - R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, “Managing one master data – challenges and preconditions,” Industrial Management & Data Systems, vol. 111, no. 1, pp. 146–162, 2011.
  - E. Curry, S. Hasan, and S. O’Riain, “Enterprise Energy Management using a Linked Dataspace for Energy Intelligence,” in Second IFIP Conference on Sustainable Internet and ICT for Sustainability, 2012.
  - D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.
  - B. Otto and A. Reichert, “Organizing Master Data Management: Findings from an Expert Survey,” in Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 2010, pp. 106–110.
51. Selected References
- Collective Intelligence, Crowdsourcing & Human Computation
  - A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web,” Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011.
  - E. Law and L. von Ahn, “Human Computation,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 5, no. 3, pp. 1–121, Jun. 2011.
  - M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “CrowdDB: Answering Queries with Crowdsourcing,” in Proceedings of the 2011 international conference on Management of data - SIGMOD ’11, 2011, p. 61.
  - P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, “Exploring the ‘Crowd’ as Enabler of Better Information Quality,” in Proceedings of the 16th International Conference on Information Quality, 2011, pp. 302–312.
  - W. A. Mason and D. J. Watts, “Financial incentives and the ‘performance of crowds’,” SIGKDD Explorations, vol. 11, no. 2, pp. 100–108, 2009.
  - P. Ipeirotis, “Managing Crowdsourced Human Computation,” WWW2011 Tutorial.
  - O. Alonso and M. Lease, “Crowdsourcing 101: Putting the WSDM of Crowds to Work for You,” WSDM, Hong Kong, 2011.
  - When Computers Were Human: http://www.youtube.com/watch?v=YwqltwvPnkw
52. Selected References
- Collaborative Data Management
  - E. Curry, A. Freitas, and S. O’Riain, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25–47.
  - U. ul Hassan, S. O’Riain, and E. Curry, “Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers,” in 17th International Conference on Information Quality (ICIQ 2012), Paris, France, 2012.
  - U. ul Hassan, S. O’Riain, and E. Curry, “Effects of Expertise Assessment on the Quality of Task Routing in Human Computation,” in 2nd International Workshop on Social Media for Crowdsourcing and Human Computation, Paris, France, 2013.
  - U. ul Hassan, S. O’Riain, and E. Curry, “Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications,” in 9th International Workshop on Information Integration on the Web (IIWeb2012), Scottsdale, Arizona: ACM, 2012.