This presentation describes two modes of web-based knowledge acquisition in the domain of bioinformatics: "pull" models, such as social tagging systems, that engage passive altruism, and "push" models, such as the Mechanical Turk, that actively guide and incentivise the knowledge acquisition process.
1. bioLogical
mass collaboration
Benjamin Good
University of British Columbia
Symposium on (Bio)semantics for complex
systems biology, Leiden University Medical Center
12 March 2009.
9. More data captured
[diagram: anatomy of a tagging event]
• Tagger: Jane
• Event date: 2007-8-29
• Resource tagged: http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg
• Associated tags: hippocampus, mri, image, wikipedia
• Tagging context
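The tagging-event model above can be captured in a small record; a minimal Python sketch (the class and field names are illustrative, not from the talk):

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TaggingEvent:
    """One tagging event: who tagged which resource, when, with which tags."""
    tagger: str
    resource: str
    when: date
    tags: list = field(default_factory=list)
    context: str = ""  # e.g. the site or tool where the tag was applied

# The example event from the slide above.
event = TaggingEvent(
    tagger="Jane",
    resource="http://upload.wikimedia.org/wikipedia/commons/c/c9/Hippocampus-mri.jpg",
    when=date(2007, 8, 29),
    tags=["hippocampus", "mri", "image", "wikipedia"],
)
```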
10. Tags
• Not the same as either professionally
or automatically generated keywords.
- (Al-Khalifa & Davis 2007)
• Can be used to improve Web search
- (Morrison 2008)
11. Tagging in science?
• How does social tagging compare to
professional indexing in the life
sciences?
• (Good, Tennis, Wilkinson in
preparation)
12. “Tuned responses of astrocytes and their influence on
hemodynamic signals in the visual cortex”
16. open social tagging -
in science
➡ low numbers of tags per post
➡ low numbers of posts per document
➡ low value of tags as descriptors
17. adding value to each tag
• social semantic tagging
➡ tagging with encoded concepts instead of strings of letters
➡ = the Entity Describer (E.D.)
Good, Kawas, Wilkinson (2007) Bridging the gap between social
tagging and semantic annotation. Nature Precedings
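The shift from strings of letters to encoded concepts can be illustrated in a few lines; the concept URI below is hypothetical, standing in for a term resolved from an OWL ontology:

```python
# A plain tag is just a string of letters, open to ambiguity.
plain_tag = "hippocampus"

# A semantic tag binds the same label to an encoded concept,
# e.g. a term identifier from an OWL ontology (URI is hypothetical).
semantic_tag = {
    "label": "hippocampus",
    "concept": "http://ontology.example.org/brain#Hippocampus",
}
```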
24. E.D. can be customized
• Tag with:
genes, gene ontology terms, terms from
OWL ontologies
• Recently used to conduct a successful
experiment in BioMoby Web service
annotation
25. but!
• Does not address the volume problem -
more participation is needed to make
social tagging a useful source of
bioLogical knowledge.
26. The plan for today
Mostly-manual strategies for creating
bioLogical knowledge
• pull
➡ social tagging
• push
➡ frames and games
27. push
• Key difference from pull model is
that system designers push specific
requests to users
• many incentive options:
financial, psychological...
28. Pushy pattern
1. design frame for knowledge to be collected
2. choose incentive system
3. design interface
4. collect knowledge
5. aggregate knowledge
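Steps 4 and 5 of the pattern can be sketched as a generic collect-and-aggregate loop; the `ask` callable stands in for the designed interface and incentive, and all names are illustrative:

```python
from collections import Counter

def run_push_pattern(questions, workers, ask, aggregate):
    # Step 4: collect one answer per (worker, question) pair.
    answers = {q: [ask(w, q) for w in workers] for q in questions}
    # Step 5: aggregate each question's answers into one decision.
    return {q: aggregate(votes) for q, votes in answers.items()}

def majority(votes):
    # Simplest aggregator: take the most common answer.
    return Counter(votes).most_common(1)[0][0]
```

A toy run with three workers answering one question shows the majority answer winning out over a single dissenter.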
29. Mechanical Turk:
pushing with money
• A “marketplace for work” hosted by Amazon Inc.
• “artificial artificial intelligence”
30. Mechanical Turk and
NLP
• Snow et al (2008)
- used workers on the AMT to label
text for use in training/testing NLP
algorithms.
- word sense disambiguation, affect
recognition and several more.
Snow et al (2008) Cheap and Fast—But is it Good? Evaluating Non-Expert Annotations for
Natural Language Tasks, In Empirical Methods in Natural Language Processing, p 254--263
31. Snow et al (2008) cont.
Results for affect recognition
• labels = 7000
• cost = $2
• time = 5.9 hours
• when aggregated, results were equal to or better
than those of expert labelers in most cases.
32. ESP game, pushing with fun
Von Ahn and Dabbish (2004) Labeling Images with a Computer Game
http://www.cs.cmu.edu/~biglou/ESP.pdf
33. ESP game results (2004)
• >4 million images labeled
• >23,000 players
• Given 5,000 players online
simultaneously, could label all of the
images accessible to Google in a month
• (See the “Google image labeling
game”…)
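The ESP game's core mechanic, in which a label counts only when both players independently produce it (and it is not on the image's "taboo" list of already-known labels), can be sketched as:

```python
def esp_round(labels_a, labels_b, taboo=frozenset()):
    """Two players label the same image independently; a label is
    accepted only if both produced it and it is not already taboo."""
    return (set(labels_a) & set(labels_b)) - taboo
```

Agreement between two strangers is what makes the labels trustworthy: a word both players typed for the same image is very likely an accurate description of it.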
34. iCAPTURer: assessing
push for bioLogical
knowledge
• Can we acquire bio-ontological
knowledge from untrained volunteers
in a scalable, Web-based manner?
• 2 experiments in the context of
scientific conferences
Good et al. 2006. Fast, cheap, and out of control: a zero-curation model for ontology development.
Good and Wilkinson 2007. Ontology engineering using volunteer labor
35. iCAPTURer 1
Goals
1. Identify concepts from text
2. Link concepts to synonyms and to
hyponyms (‘x is_a y’) rooted in the
UMLS Semantic Network
Good et al. 2006. Fast, cheap, and out of control: a zero-curation model for ontology development.
44. Initial acquisition versus evaluation
[chart: number of assertions gathered]
• ~11,000 during knowledge capture at the YI forum
• ~1,000 during the evaluation, conducted via email request
45. Initial acquisition versus evaluation
• Knowledge capture at YI forum (~11,000 assertions gathered)
- prompt: “I assert that t cell activation is a kind of immune response”
- forms, tree navigation
- conference setting, 3 days, 68 people
• Evaluation conducted via email request (~1,000 assertions gathered)
- prompt: “I agree that t cell activation is a kind of immune response”
- multiple choice (voting)
- home setting, 2 days, 65 people
46. iCAPTURer 2 pattern
1. Infer complete ontology
2. Present each edge as a multiple choice
question {true, false, I don’t know}
3. Aggregate votes to decide on each
triple
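Step 3 of the pattern, aggregating {true, false, I don't know} votes on each candidate edge, might look like the following sketch; the minimum-vote count and agreement threshold are illustrative parameters, not the values used in the experiments:

```python
def decide_edge(votes, min_votes=3, threshold=2 / 3):
    """votes: iterable of True (agree), False (disagree), None (don't know).
    Accept or reject an inferred is_a edge only when enough informative
    votes lean strongly one way; otherwise leave it undecided."""
    informative = [v for v in votes if v is not None]
    if len(informative) < min_votes:
        return "undecided"
    frac_true = sum(informative) / len(informative)
    if frac_true >= threshold:
        return "accept"
    if frac_true <= 1 - threshold:
        return "reject"
    return "undecided"
```

Discarding "I don't know" votes before thresholding keeps uncertain volunteers from diluting the signal of confident ones.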
48. iCAPTURer2 results
• Same pattern of participation
[chart: fraction of subclass judgments made by each of 25 volunteers]
• Only 66% correct overall in assessing subClass assertions
• highly biased towards saying ‘yes’.
49. iCAPTURer summary
• Scientifically relevant tasks are harder
- the population pool is smaller but, in my experience, generally very willing.
• Engaging the competitive instinct was
helpful in obtaining the responses we
did.
• Much room for further investigation.
51. Filling in Freebase with Typewriter
• a fill-in-the-blanks interface: complete “X is a Y” assertions
http://typewriter.freebaseapps.com/
March 9, 2009
53. To achieve mass collaborative bioLogical
knowledge assembly, make it possible for
people to contribute in multiple modes
- as creators
- as evaluators
- as system builders (open APIs are crucial)
and for multiple reasons
- personal information management
- fun, competition
- finance
56. “...how you envision future developments...”
Automation + Human computation
= increasingly high-throughput bioLogical knowledge representation
57. “...how your own expertise would fit into this realm...”
[concept map: more bioLogical knowledge representation requires more analyses and machine learning; ben knows a bit about community action]
http://biordf.net/~bgood/
58. Thanks to
• developers: Eddie Kawas, Paul Lu
• advisor: Mark Wilkinson
• Barend Mons for the invitation and
Marco Roos for the accommodation!
http://biordf.net/~bgood/