3. What are the functions of Fibronectin?
37186 articles
What are the functions of the 238 ‘significant’ genes that came up in my high-throughput screen?
4. What are the functions of Fibronectin?
37186 articles
…
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
“knowledge integration”
“curation”
“knowledge base”
Answers
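A minimal sketch of how a knowledge base of Gene / Property / Value rows can answer a question about a whole gene list rather than one gene at a time. The Fibronectin rows are taken from the slide; `GeneX`, `build_index`, and `summarize` are hypothetical names introduced only for illustration.

```python
from collections import Counter, defaultdict

# (gene, property, value) rows. The Fibronectin rows come from the slide;
# "GeneX" is a made-up placeholder, not real annotation data.
TRIPLES = [
    ("Fibronectin", "Biological Process", "Angiogenesis"),
    ("Fibronectin", "Cellular Localization", "Extracellular matrix"),
    ("Fibronectin", "Related Disease", "Glomerulopathy"),
    ("GeneX", "Biological Process", "Angiogenesis"),  # hypothetical row
]


def build_index(triples):
    """Index rows as gene -> property -> set of values."""
    index = defaultdict(lambda: defaultdict(set))
    for gene, prop, value in triples:
        index[gene][prop].add(value)
    return index


def summarize(index, genes, prop):
    """Count how many genes in the list carry each value of one property."""
    counts = Counter()
    for gene in genes:
        if gene not in index:
            continue  # genes with no curated rows contribute nothing
        for value in index[gene].get(prop, ()):
            counts[value] += 1
    return counts


if __name__ == "__main__":
    idx = build_index(TRIPLES)
    print(summarize(idx, ["Fibronectin", "GeneX", "NotCurated"], "Biological Process"))
```

Genes that nobody has curated are silently absent from the summary, which is one way the "we don't know what we are missing" problem shows up in practice.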
15. We don’t know what we are missing
[Figure: an “Interesting Gene List” annotated with the terms inflammatory response, defense response, response to wounding, and immune response; “Serotonin receptor activity?” appears with a question mark]
16. “Gene Ontology, it’s great, right?”
• “It sucks”
• “I only use it out of desperation”
18. Process of building knowledge bases
1. Do science
2. Publish it
3. Manually extract the knowledge
Gene | Property | Value
Fibronectin | Biological Process | Angiogenesis
Fibronectin | Cellular Localization | Extracellular matrix
Fibronectin | Related Disease | Glomerulopathy
22. Professional biocuration does not scale up to the rate of production
28. Global Knowledge Platform
What would happen if everyone was literally working on the same database?
1. Split up work more effectively
2. Make integration the default behavior
29. Wikidata is to data as Wikipedia is to text
“Giving more people more access to more knowledge”
A free and open repository of knowledge
Managed by the Wikimedia Foundation, which also operates Wikipedia
33. Item: Q414043 (RELN)
Statement
Claim: Property: Encodes | Value (item): Reelin (protein)
Qualifiers
References: Stated in: NCBI homo sapiens annotation release 107 | Retrieved: 19 January 2016
https://www.wikidata.org/wiki/Q414043
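The anatomy of a statement shown on this slide can be sketched as a small data structure. This is an illustrative model only, not Wikidata's actual storage format or API; the class and field names (`Statement`, `Reference`, `is_sourced`) are assumptions made for the example, and the values are the ones on the slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reference:
    """Where a claim came from and when it was retrieved."""
    stated_in: str
    retrieved: str


@dataclass
class Statement:
    """One statement about an item: a property/value claim plus its context."""
    item: str                                        # item the statement is about
    prop: str                                        # property label
    value: str                                       # value (here, another item)
    qualifiers: dict = field(default_factory=dict)   # none shown on the slide
    references: list = field(default_factory=list)

    def is_sourced(self):
        """A statement is sourced if it carries at least one reference."""
        return len(self.references) > 0


# The RELN example from the slide.
reln_encodes = Statement(
    item="Q414043",
    prop="encodes",
    value="Reelin (protein)",
    references=[
        Reference(
            stated_in="NCBI homo sapiens annotation release 107",
            retrieved="19 January 2016",
        )
    ],
)

if __name__ == "__main__":
    print(reln_encodes.is_sourced())
```

Keeping references attached to each statement is what lets a consumer filter out unsourced claims instead of trusting the whole database equally.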
34. A Giant Global Graph
These statements link together into a queryable graph
https://query.wikidata.org
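As a hedged illustration of querying that graph, the sketch below builds a SPARQL request asking what the RELN item (Q414043) encodes and extracts the labels from a standard SPARQL JSON response. The property ID `P688` ("encodes") and the `/sparql` endpoint path are stated from memory and should be checked against Wikidata before use; the helper names and the `User-Agent` string are made up for this example. The network call is wrapped in a function and is not executed on import.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"  # assumed endpoint path


def encodes_query(item_id):
    """SPARQL asking which items the given gene item encodes (P688 assumed)."""
    return (
        "SELECT ?product ?productLabel WHERE { "
        f"wd:{item_id} wdt:P688 ?product . "
        'SERVICE wikibase:label { bd:serviceParam wikibase:language "en". } '
        "}"
    )


def query_url(sparql):
    """Encode the query as a GET URL requesting JSON results."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    return ENDPOINT + "?" + params


def run_query(sparql):
    """Send the query over the network and return the parsed JSON."""
    request = urllib.request.Request(
        query_url(sparql),
        headers={"User-Agent": "slide-example/0.1"},  # placeholder identifier
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def product_labels(result):
    """Pull the label strings out of a SPARQL JSON result document."""
    return [row["productLabel"]["value"] for row in result["results"]["bindings"]]


if __name__ == "__main__":
    print(product_labels(run_query(encodes_query("Q414043"))))
```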
35. We are seeding it with biomedical data
• All human and mouse genes and proteins
• All Gene Ontology terms
• All FDA approved drugs
• 9,000+ human diseases
Burgstaller et al (2016) Database (preprint in BioRxiv)
Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)
36. Our seeds are largely concepts linked to many identifier systems
N identifiers per item
• Genes: 8
• Drugs: 18
• Diseases: 11
Burgstaller et al (2016) Database (preprint in BioRxiv)
Mitraka et al (2015) Semantic Web Applications for the Life Sciences (best paper) (preprint in BioRxiv)
Facilitate integration with key external knowledge bases
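To make the integration point concrete, here is a small sketch of how an item that carries several external identifiers can act as a crosswalk between databases. All item names, system names, and identifier values below are invented placeholders, not real Wikidata content.

```python
# Each item lists its identifier in several external systems.
# Every name and ID here is a made-up placeholder.
ITEMS = {
    "item-1": {"system_a": "A001", "system_b": "B-17"},
    "item-2": {"system_a": "A002", "system_c": "C9"},
}


def build_crosswalk(items):
    """Map (system, external ID) -> item, so any identifier finds its item."""
    lookup = {}
    for item_id, external_ids in items.items():
        for system, external_id in external_ids.items():
            lookup[(system, external_id)] = item_id
    return lookup


def translate(items, lookup, system_from, external_id, system_to):
    """Convert an ID from one system to another via the shared item.

    Returns None when the source ID is unknown or the item has no ID
    in the target system.
    """
    item_id = lookup.get((system_from, external_id))
    if item_id is None:
        return None
    return items[item_id].get(system_to)


if __name__ == "__main__":
    crosswalk = build_crosswalk(ITEMS)
    print(translate(ITEMS, crosswalk, "system_a", "A001", "system_b"))
```

The more identifier systems each item carries, the more pairs of external knowledge bases can be joined through it without a dedicated mapping file for each pair.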
38. A Platform for knowledge integration and curation
[Diagram labels: Open data; Wikipedia(s); “Your Apps Here!” (shown four times)]
39. Application #1 (of many)
Burgstaller et al (2016) Database (preprint in BioRxiv)
40. Impact of Wikidata on Wikipedia
Gene Wiki Version 1.
{{GNF_Protein_box | Name = Reelin| image = |
image_source = | PDB = {{PDB2|4AD9}} | HGNCid = 18512 |
MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 |
IUPHAR = | ChEMBL = | OMIM = None | ECnumber = |
Homologene = 9349 | GeneAtlas_image1 = |
GeneAtlas_image2 = | GeneAtlas_image3 = |
Protein_domain_image = | Function =
{{GNF_GO|id=GO:0005515 |text = protein binding}}
{{GNF_GO|id=GO:0016787 |text = hydrolase activity}}
{{GNF_GO|id=GO:0046872 |text = metal ion binding}} |
Component = {{GNF_GO|id=GO:0005739 |text =
mitochondrion}} | Process = {{GNF_GO|id=GO:0008152
|text = metabolic process}} | Hs_EntrezGene = 51110 |
Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA =
NM_016027 | Hs_RefseqProtein = NP_057111 |
Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 |
Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174
| Hs_Uniprot = Q53H82 | Mm_EntrezGene = 212442 |
Mm_Ensembl = ENSMUSG00000025937 |
Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein =
NP_663356 | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr =
1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end =
13660546 | Mm_Uniprot = Q99KR3 | path = PBB/51110}}
(1 of these for every gene)
Gene Wiki Version 2.
{{Infobox gene}}
• All data in Wikidata
• 1 Lua script works for all genes
41. Application #2 Web Apollo Genome Browser
• Genome annotation data retrieved from Wikidata via SPARQL queries to https://query.wikidata.org
• Prototype achieved at recent San Diego hackathon
Putman et al (2016) (under review) (preprint in BioRxiv)
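The slides do not describe how the prototype hands query results to the genome browser. As one plausible sketch, assuming a GFF3-style feature line is an acceptable exchange format, the function below turns one row of a SPARQL JSON result into a tab-separated gene feature. The variable names `gene`, `geneLabel`, `start`, and `end` are assumptions about how such a query might be written, and the coordinates in the usage example are invented.

```python
def binding_to_gff3(binding, seqid, source="Wikidata"):
    """Convert one SPARQL JSON result row into a GFF3-style gene line.

    Assumes the query bound ?gene (item URI), ?geneLabel, ?start and ?end.
    Strand is not retrieved in this sketch, so it is written as ".".
    """
    start = int(binding["start"]["value"])
    end = int(binding["end"]["value"])
    if start > end:
        # GFF3 requires start <= end regardless of strand.
        start, end = end, start
    item_id = binding["gene"]["value"].rsplit("/", 1)[-1]
    name = binding["geneLabel"]["value"]
    columns = [
        seqid,         # sequence / chromosome name
        source,        # where the annotation came from
        "gene",        # feature type
        str(start),
        str(end),
        ".",           # score: not available
        ".",           # strand: not retrieved here
        ".",           # phase: not applicable to genes
        f"ID={item_id};Name={name}",
    ]
    return "\t".join(columns)


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    row = {
        "gene": {"value": "http://www.wikidata.org/entity/Q414043"},
        "geneLabel": {"value": "RELN"},
        "start": {"value": "200"},
        "end": {"value": "100"},
    }
    print(binding_to_gff3(row, "chr7"))
```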
42. Microbial Genetic Data
• Widely distributed
• Difficult to query
• Not structured in a meaningful way
• A lot of interest from this community!
44. Microbial genomes in Wikidata
• Loading genes, proteins, and annotations for 120 reference genomes
• Completed 21 genomes so far
Putman et al (2016) (under review) (preprint in BioRxiv)
45. Microbiome modeling in Wikidata
Putman et al (2016) (under review) (preprint in BioRxiv)
47. Centralizing content while distributing labor
[Diagram labels: Open data; Wikipedia(s); “Your Apps Here!” (shown four times)]
48. Thanks!
Gene Wikidata Team
Andra Waagmeester (Micelio)
* Sebastian Burgstaller (Scripps)
* Tim Putman (Scripps)
* Elvira Mitraka (U Maryland)
Julia Turner (Scripps)
Justin Leong (UBC)
Lynn Schriml (U Maryland)
Paul Pavlidis (UBC)
Andrew Su (Scripps)
Ginger Tsueng (Scripps)
Contact
bgood@scripps.edu
* First author on manuscript cited in this presentation
[Photo: Ben, Tim, Andra, Elvira, and Sebastian. Some Gene Wiki team members enjoying their best paper award at SWAT4LS, Dec. 2015]
Adapted logo
Editor's notes
Databases. Obviously much more flexible. You can ask them questions (and make pretty pictures that are dynamic).
“known unknowns” ??
If I want X, what Y should I test?
Though it is a child of the more generic GO annotation to ‘G protein coupled receptor activity’
Kohen 1996, J Neurochem.
Given a list of active genes produced from an experiment:
• what key biological processes are happening in the cells?
• what diseases are these genes associated with?
Given a list of genetic variations:
• what diseases is a patient more susceptible to?
• what drugs should they take/avoid?
etc.
Knowledge is either not shared (stuck in your head or your notebook) or it is shared as text and images in journal articles.
There are more than 1 million articles added to PubMed each year
Divide and conquer algorithm for creating the knowledge base of everything. Splitting is hard because it's very hard to know what other groups are doing, there is no centralized coordination, and decisions about what should be curated are made based on what gets funded rather than what is most useful for the collective.
The principal problem of knowledge integration is establishing which entities are shared between different systems.
Methadone
N0000002109
(Opioid-Related Disorders)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3422823/
It would be much easier to see what other people were doing
By operating in the same database, it is much more likely that you will end up re-using entities that already exist rather than creating new ones and merging them later. Just like in your own local database.
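A rough sketch of that re-use behavior, under the assumption that contributors look entities up by an external identifier before adding anything. `SharedStore`, `get_or_create`, and the identifiers in the example are hypothetical and do not correspond to any real Wikidata API.

```python
class SharedStore:
    """Toy shared database: entities are found by external ID before creation."""

    def __init__(self):
        self._items = {}            # item ID -> label
        self._by_external_id = {}   # (system, external ID) -> item ID
        self._next_number = 1

    def get_or_create(self, system, external_id, label):
        """Return (item_id, created). Re-uses an existing item when one matches."""
        key = (system, external_id)
        if key in self._by_external_id:
            return self._by_external_id[key], False
        item_id = f"item-{self._next_number}"
        self._next_number += 1
        self._items[item_id] = label
        self._by_external_id[key] = item_id
        return item_id, True

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    store = SharedStore()
    # Two groups independently try to add the same drug (placeholder ID).
    first, created_first = store.get_or_create("example_system", "X123", "Methadone")
    second, created_second = store.get_or_create("example_system", "X123", "methadone")
    print(first == second, created_first, created_second, len(store))
```

With separate databases, both groups would have created their own record and someone would have to merge them later; in the shared store the second call simply finds the first group's item.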
This is the first application of the work that we have done