Keynote presentation at the ISMB Bio-ontologies SIG (Vienna, Austria) on July 15, 2011.
(Apologies, I occasionally use animations that obscure some slide content, so feel free to download the PowerPoint version to see what's underneath...)
6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
7. The Long Tail is a prolific source of content. [Chart: content produced vs. contributors (sorted), Short Head vs. Long Tail] Short Head vs. Long Tail examples: Publishing: Newspapers vs. Blogs; Video: TV/Hollywood vs. YouTube; Product reviews: Consumer reports vs. Amazon reviews; Food reviews: Food critics vs. Yelp; Judging: Olympics vs. American Idol.
9. Wikipedia has breadth and depth. [Table: articles, words (millions), and words/article for Wikipedia vs. Britannica Online] Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
11. 10,000 gene “stubs” within Wikipedia. Stub contents: protein structure; gene summary; symbols and identifiers; Gene Ontology annotations; protein interactions; tissue expression pattern; linked references; links to structured databases.
12. Wiki success depends on positive feedback. [Diagram: Gene Wiki page utility, number of users, number of contributors]
14. A review article for every gene is powerful. Reelin: 68 editors, 543 edits since July 2002; Heparin: 175 editors, 320 edits since June 2003; AMPK: 44 editors, 84 edits since March 2004; RNAi: 232 editors, 708 edits since October 2002. References to the literature; hyperlinks to related concepts.
15. Gene Wiki has a diverse critical mass of readers. Rank 1-10 (general society): Insulin, Titin, Human chorionic gonadotropin, Vasopressin, ANKH, CLOCK, Catalase, Erythropoietin, Glucagon, Parathyroid hormone. Rank 101-110 (scientists): Tau protein, Interleukin 10, APC, C-Met, Factor V, Interleukin 8, CD44, Histamine H1 receptor, Kappa Opioid receptor, Dihydrofolate reductase. Rank 1001-1010 (specialists): CSDA, CNTNAP2, IGSF8, Adenosine A3 receptor, RYR1, ETV6, Small heterodimer partner, 5-HT1D receptor, TRPC6, Interleukin-6 receptor. Total: 5.0 million views / month. [Diagram: utility, users, contributors]
17. The Gene Wiki has a critical mass of editors. [Plot: editor count vs. edit count; diagram: utility, users, contributors] In Jan – Jun 2010 … 7474 edits were made by 2109 unique users … total increase in text ≈ 20 PLoS Biology research articles
18. Making the Gene Wiki more reliable. Example text: The company name is derived from old Greek, and means "destroyer of birds". Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
19. Making the Gene Wiki more reliable. The example text from slide 18, marked up by author trust (http://www.wikitrust.net/). Edit counts shown: 36211 total edits; 36 total edits. Legend: high-trust author; low-trust author.
20. Making the Gene Wiki more computable. Structured annotations; free text!
21. Example text from 5-HT1A receptor. Snippet from article on 5-HT1A receptor: “…5-HT1A receptor agonists decrease blood pressure and heart rate or cause hypotension via a central mechanism, by inducing peripheral vasodilation, and by stimulating the vagus nerve…” Highlighted terms: Agonists, Heart rate, Receptor, Blood pressure, Vasodilation, Hypotension, Vagus nerve.
22. Example text from 5-HT1A receptor. [Diagram: 5-HT1A receptor with the terms Agonists, Heart rate, Receptor, Blood pressure, Vasodilation, Hypotension, Vagus nerve]
24. Re-discovering common knowledge. Candidate assertion: NCBI Entrez Gene 3362, GO:0004993. [Diagram labels: Gene Wiki mapping, wikilink, GO exact synonym]
25. Mining the most recent literature. Candidate assertion: NCBI Entrez Gene 57620, GO:0030154. [Diagram labels: Gene Wiki mapping, wikilink, GO related concept]
26. Filling the gaps in gene annotation. Candidate assertion: NCBI Entrez Gene 334, GO:0006897. [Diagram labels: Gene Wiki mapping, wikilink, GO exact match]
27. Disease associations mined from the Gene Wiki. Flowchart labels: Gene Wiki articles (10,271); filter out seeded text; NCBO Annotator; compare to DO database; matched Disease Ontology terms (2983); 2147 candidate annotations. Comparison results: 23% exact match; 5% match parent; 2% match child; 70% have no match.
28. Disease associations mined from the Gene Wiki. Expert curation (correct / maybe / incorrect); chart values: 86%, 10%, 4%. Overall specificity: 90-93%.
29. GO associations mined from the Gene Wiki. Flowchart labels: Gene Wiki articles (10,271); filter out seeded text; NCBO Annotator; compare to GO database; matched Gene Ontology terms (11,022); 6319 candidate annotations. Comparison results: 17% exact match; 26% match parent; 2% match child; 55% have no match.
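The comparison step on slides 27 and 29 sorts each mined term into exact match, match parent, match child, or no match against a gene's existing annotations. A minimal Python sketch of one way to do that bucketing follows. It assumes "match parent" means the mined term is a more general ancestor of an existing annotation and "match child" means it is a more specific descendant; the slides do not define these, and the term names and the `PARENTS` hierarchy are hypothetical toy data, not real GO or DO content.

```python
from typing import Dict, Set

# Hypothetical toy is-a hierarchy (child -> direct parents); not real GO/DO terms.
PARENTS: Dict[str, Set[str]] = {
    "T:root": set(),
    "T:signaling": {"T:root"},
    "T:signal_transduction": {"T:signaling"},
    "T:gpcr_signaling": {"T:signal_transduction"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable by following is-a links upward from `term`."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def classify(candidate: str, known: Set[str], parents: Dict[str, Set[str]]) -> str:
    """Bucket a mined term against the annotations a gene already has."""
    if candidate in known:
        return "exact match"
    # Assumption: candidate is more general than an existing annotation.
    if any(candidate in ancestors(k, parents) for k in known):
        return "match parent"
    # Assumption: candidate is more specific than an existing annotation.
    if any(k in ancestors(candidate, parents) for k in known):
        return "match child"
    return "no match"


known_annotations = {"T:signal_transduction"}
for term in ["T:signal_transduction", "T:signaling", "T:gpcr_signaling", "T:other"]:
    print(term, "->", classify(term, known_annotations, PARENTS))
```

Under these assumptions, terms in the "no match" bucket are the ones that would need expert review as potentially new annotations.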
30. GO associations mined from the Gene Wiki. Expert curation (correct / maybe / incorrect); chart values: 14%, 26%, 60%. Overall specificity: 48-64%.
31. Common sources of error in GO associations. 1) Incorrect concept recognition. OR2F1: “Olfactory receptors … are responsible for the recognition and G protein-mediated transduction of odorant signals.” Transduction (GO:0009293): The transfer of genetic information to a bacterium from a bacteriophage or between bacterial or yeast cells mediated by a phage vector. Signal transduction (GO:0007165): The cellular process in which a signal is conveyed to trigger a change in the activity or state of a cell. Signal transduction begins with reception of a signal, e.g. a ligand binding to a receptor or receptor activation by a stimulus such as light, and ends with regulation of a downstream cellular process…
32. Common sources of error in GO associations. 2) Incorrect sentence context. MEF2C: “Several post translational modifications have been identified including phosphorylation on serine-59 …” Terms listed: Dephosphorylation, Excretion, Gene expression, Glycosylation, Localization, Methylation, Phosphorylation, Proteolysis, Secretion, Transport, Transcription, Translation. [Diagram labels: MEF2C, Neurogenesis, Myelination]
33. Is 48-64% specificity useful? Enrichment analysis for the GO term muscle contraction (GO:0006936). Gene list: 87 genes. Concept recognition on PubMed abstracts (5449 articles) + Gene Wiki (87 articles). Linked genes by PubMed only: P = 1.0. Linked genes by PubMed + Gene Wiki: P = 1.22E-09.
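The enrichment P-values on slide 33 can be illustrated with a one-sided hypergeometric test, a common choice for gene-list enrichment. The slide does not say which test was used, so treat this as a sketch under that assumption. The function name and all counts in the usage lines are hypothetical and are not the numbers behind the slide.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, overlap: int) -> float:
    """One-sided enrichment P-value, P(X >= overlap).

    population: total genes considered (hypothetical background)
    annotated:  genes in the background linked to the GO term
    sample:     size of the gene list being tested
    overlap:    genes in the list that are linked to the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ..., upper.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts, for illustration only.
print(hypergeom_enrichment_p(population=10, annotated=3, sample=4, overlap=0))
print(hypergeom_enrichment_p(population=10, annotated=3, sample=4, overlap=3))
```

When none of the genes in the list are linked to the term, the tail covers every possible outcome and the P-value is 1.0. Adding term-to-gene links can therefore only lower it.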
35. “Like the image of the [mammoth] hairball, it is equally unhelpful in understanding the object’s properties. You can guess that the network is large and its connectivity is complex, but not more. At best, the visualization is merely decorative.” - Martin Krzywinski http://mkweb.bcgsc.ca/linnet/talks/linnet-informatics2010.pdf
38. Semantic representation: From text mining to a Semantic Gene Wiki. Community contributions Semantics Semantic querying ✗ ✓ ✓ Home-grown wiki ✓ ✓ ✗ ? Gene Wiki/ Wikipedia ✓ ✓ – Semantic Gene Wiki
39. Semantic Wiki Links. Gene Wiki (based on MediaWiki) → mirror and translate → Semantic Gene Wiki (based on Semantic MediaWiki (SMW)) → semantic queries, RDF, etc. Markup shown: [[apoptosis]]; {{SWL|target=apoptosis|type=promotes}}; [[repress::apoptosis]]; [[promote::apoptosis]]; [[modulate::apoptosis]]. Rendered text in each case: apoptosis.
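The "mirror and translate" step on slide 39 can be sketched as a rewrite from the SWL template to a Semantic MediaWiki typed link. This is an illustration under assumptions, not the Gene Wiki's actual translator: the regular expression and the `swl_to_smw` helper are hypothetical, and the sketch copies the template's `type` value verbatim as the property name. The slide shows `type=promotes` alongside a `promote` property, and how those names are reconciled is not stated.

```python
import re

# Hypothetical pattern for templates of the form {{SWL|target=...|type=...}}.
SWL_PATTERN = re.compile(
    r"\{\{SWL\|target=(?P<target>[^|}]+)\|type=(?P<type>[^|}]+)\}\}"
)


def swl_to_smw(wikitext: str) -> str:
    """Rewrite each SWL template as an SMW typed link, [[property::target]]."""

    def replace(match: re.Match) -> str:
        prop = match.group("type").strip()  # assumption: used verbatim as the property
        target = match.group("target").strip()
        return f"[[{prop}::{target}]]"

    return SWL_PATTERN.sub(replace, wikitext)


source = "This gene {{SWL|target=apoptosis|type=promotes}} and is linked to [[apoptosis]]."
print(swl_to_smw(source))
```

Plain wiki links such as `[[apoptosis]]` pass through unchanged, so a page's untyped links survive the translation as they are.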
40. For community-based science, data is king. Data without structure is valuable, but structure without data is not.
41. For community-based science, data is king. Data without structure is valuable, but structure without data is not. Wikipedia: copy-editing (WP:MCB, Boghog); figures (artists and illustrators); structure (wiki links, infoboxes); citations (DOI bot, CitationBot); provenance (WikiTrust). [Diagram labels: domain expert, information scientist]
42. The Gene Wiki successfully harnesses the Long Tail of scientists for community annotation of gene function.
43. Collaborators: Doug Howe, ZFIN; Salvatore Loguercio (*), TU Dresden; John Hogenesch, U Penn; Jon Huss, GNF; Angel Pizzaro, U Penn; Faramarz Valafar, SDSU; Pierre Lindenbaum, Fondation Jean Dausset; Michael Martone, Rush; Konrad Koehler, Karo Bio; Warren Kibbe, Simon Lim, Northwestern; many Wikipedia editors; WP:MCB Project. Group members: Erik Clarke; Ben Good (*); Ian Macleod; Chunlei Wu. (*) See talk on SNPedia mashup at 1:55 PM. WikiTrust (UCSC): Luca de Alfaro; Bo Adler; Ian Pye. Contact: http://sulab.org; asu@scripps.edu; @andrewsu; +Andrew Su. ISMB travel support. Funding and Support (BioGPS: GM83924, Gene Wiki: GM089820)
Editor's notes
We are very early in our efforts to comprehensively annotate human gene function. Why important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. 59% have 5 or fewer references; 38% have one or no references.
If you believe that greater than 1.5% of articles have relevance to gene function, then it says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.