
EFO tools - the good, the great, and the evil

  • 1. EFO tools – the good, the great, and the evil Tomasz Adamusiak MD PhD
  • 2. Huge ontology developed by a tiny team
  • 3. We have means to assign blame when things go wrong (definition_editor)
  • 4. We need richness and consistency for EFO-based query expansion
  • 5. New terms come from GXA and external users (diagram labels: GXA, Zooma, OLS, BioPortal, similarity_match.pl, OWL::Simple::Parser, MeSH::Parser::ASCII)
  • 6. Xrefs are acquired by lexical cross-match to other ontologies (diagram labels: similarity_match.pl, OWL::Simple::Parser, MeSH::Parser::ASCII)
  • 7. Definitions and synonyms are pulled in from external ontologies via NCBO BioPortal + provenance (diagram labels: BioPortal, metadata, xrefs, BioportalImporter)
  • 8. Regression testing is essential as these are massive updates
  • 9. We need better concept recognition because clean_ontology_terms.pl is evil
  • 10. We need fuzziness because input data is extremely dirty
  • 11. There are different levels of fuzziness (diagram labels: similarity_match.pl, metaphone & double metaphone, Levenshtein distance, n-grams, clean_ontology_terms.pl)
  • 12. N-grams are a simple and relatively unknown method of string approximation
  • 13. N-grams are extremely effective in practice. Template: “The quick brown fox”. A. “brown quick The fox”: Levenshtein 18%, n-grams 90%. B. “The quiet swine flu”: Levenshtein 19%, n-grams 40%.
  • 14. The King is dead. Long live the Queen. 
  • 15. OntoCAT is a great success and generated a lot of interest within the community
  • 16. Natalja & Misha hit the mother lode
  • 17. Which diseases affect heart components? Kurbatova N et al. Bioinformatics 2011;27:2468-2470
  • 18. Acknowledgments: Morris A. Swertz’s group at the Genomics Coordination Center (GCC), University of Groningen: K Joeri van der Velde, Despoina Antonakaki, Dasha Zhernakova. James Malone, Helen Parkinson, Emma Hastings, Niran Abeygunawardena, Ele Holloway, Tim Rayner. Zooma: Tony Burdett. Bioconductor/R package: Natalja Kurbatova, Pavel Kurnosov, Misha Kapushesky. This work was supported by the European Community's Seventh Framework Programmes GEN2PHEN [grant number 200754], SLING [grant number 226073], and SYBARIS [grant number 242220], the European Molecular Biology Laboratory, the Netherlands Organisation for Scientific Research [NWO/Rubicon grant number 825.09.008], and the Netherlands Bioinformatics Centre [BioAssist/Biobanking platform and BioRange grant SP1.2.3]. OntoCAT logo courtesy of Eamonn Maguire. Special thanks go to the NCBO BioPortal and EBI OLS support teams for all the comprehensive help they provide.

Editor's Notes

  1. The Experimental Factor Ontology (EFO) is a great application ontology, hugely popular among internal and external collaborators and featured among the top 10 most accessed ontologies within NCBO BioPortal, which provides access to hundreds of different ontology resources. It is a pleasure to be involved in this project.
  2. I joined the EFO team around January 2008, working in parallel on GEN2PHEN, into which some of this work was fed back. My first task was designing and implementing a workflow for pulling in metadata (synonyms and definitions) for xrefed terms from external ontologies. We now have nearly 5,000 classes and 20,000 synonyms, and growth continues steadily.
  3. Venn diagram representing who edited or added which class. Where the sets overlap, the same class was touched by more than one person. Three people interact with the ontology directly: Helen, James and I. Ele and Jie would submit large term requests, so they added those classes indirectly through one of us.
  4. This is how we’re leveraging all the rich metadata within the ontology. Here is an example of querying ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) for CML and getting all experiments also annotated with chronic myeloid leukemia and chronic myelogenous leukemia. Querying for leukemia or blood cancer would also give you these results. Anything inconsistent in the ontology would negatively influence this outcome.
  5. Here’s a typical workflow. Annotations in the Gene Expression Atlas (http://www.ebi.ac.uk/gxa/) that are not mapped to EFO are discovered by Zooma (zooma.sf.net). Zooma first checks whether a mapping already exists within the Atlas; if not, it tries to map the annotation to EFO or to other ontologies in OLS and BioPortal via OntoCAT. The output is then fed into the similarity_match.pl script to double-check that no similar terms are in EFO already (as Zooma performs only exact matching), and the vetted terms are finally added to EFO via James’ tab_to_owl script or manually. Another source of new terms is requests from external users. They usually supply a flat list of terms they would like to see in the ontology. These are mapped via similarity_match.pl to check whether they are already in EFO, and then added. similarity_match.pl has custom dedicated dependencies for parsing OWL ontologies and MeSH.
  6. Before metadata from external resources can be imported into EFO, we need to add appropriate xrefs. These are stored in a dedicated annotation, ‘definition_citation’, on the mapped term within EFO. The xrefs are discovered by using similarity_match.pl to align other ontologies (e.g. MeSH, OMIM, NCI Thesaurus, Brenda, Cell Type, etc.) lexically to EFO. Note that other tools in this domain rely on information content to align the ontologies. As far as I know they use exact matching only, so our approach could in fact be more efficient, and in my experience the information content approach does not add much value to the alignment.
  7. Once we have the xrefs in, we can use a separate application, BioportalImporter, which follows all the xrefs to the respective external terms via BioPortal and imports all the missing synonyms and definitions into EFO, recording the source in a dedicated ‘bioportal_provenance’ annotation. With OWL2 it would also be possible to annotate the annotation directly.
  8. Part of the BioportalImporter code base is consistency checking, which performs 13 different tests once the import is completed. Most importantly, it reports any changes in external resources by cross-referencing provenance information between two versions of the import, and it alerts on any potentially duplicated terms by checking for metadata shared between two distinct terms within EFO. BioportalImporter is not in the public domain as it’s tied quite heavily to EFO specifics, but most of the ontology handling code is actually in OntoCAT. Overview of the tests: malformed EFO URIs; changed ontology annotations; changed classes; obsoleted classes; renamed classes; duplicated xrefs; duplicated synonyms or labels; duplicated xrefs same as URI; local EFO URIs on external classes; changed features; changed external classes; circular references; non-English characters in annotations.
  9. clean_ontology_terms.pl relies on the metaphone and double metaphone algorithms. Metaphone was developed by Lawrence Philips as a response to deficiencies in the Soundex algorithm. It uses a larger set of rules for English pronunciation. The aim of Metaphone is to match words or names that are pronounced similarly, according to a criterion of similarity that ignores any non-initial vowels and treats voiced and unvoiced versions of consonants as the same. Its latest version, Metaphone 3, achieves an unparalleled level of accuracy in producing correct lookup keys for English words, non-English words familiar to English speakers, and names commonly found in the United States, within the criterion of similarity as defined above, but it is not designed to match words that are clearly pronounced differently. Recently published: “Anatomy ontologies and potential users: bridging the gap” by Ravensara S Travillian, Tomasz Adamusiak, Tony Burdett, Michael Gruenberger, John Hancock, Ann-Marie Mallon, James Malone, Paul Schofield and Helen Parkinson. While the original aim of the article was to show how difficult it is to align the two anatomy ontologies, FMA and Uberon, the other conclusion that can be reached is that metaphone algorithms are inapplicable to this particular use case. Most importantly, clean_ontology_terms.pl performed only marginally better than Zooma doing exact matching, with an enormous hit to precision (~0.07), because for lack of better matches the script would present all the phrases that merely start with the same word (a side effect of double metaphone being misapplied to a whole phrase rather than to individual words; this is a different behaviour from classic metaphone).
  10. Our input data is rarely about differences in spelling, such as British tumour and American tumor, but rather about different grammatical number (cell vs. cells), digits, typos, and differently ordered words in similar phrases. Here the left-hand column shows example unmapped annotations from the Atlas; the right-hand column shows existing EFO terms to which we would like to map them semi-automatically. The ontology is too big to handle manually, and it is no longer possible to remember whether a particular term has already been added; that’s why we need to automate this.
  11. First of all, clean_ontology_terms.pl is not that fuzzy at all. Tim Rayner, the original developer of clean_ontology_terms.pl, had already considered a fuzzier approach, and there is a comment in the code suggesting the use of Levenshtein distance. Rather than extending the script further, I rewrote it from scratch as similarity_match.pl. The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965. The algorithm is an example of bottom-up dynamic programming, a method for solving complex problems by breaking them down into simpler subproblems. Similar approaches have already been studied extensively in DNA sequence alignment, and the edit distance approach is further generalised by the local and global alignment algorithms, Smith–Waterman and Needleman–Wunsch, but they don’t offer much improvement for transpositions, i.e. different ordering of words in a phrase. And this is where n-grams excel.
  12. An n-gram is basically a fragment of length n from a given sequence. This idea can be traced to Claude Shannon's work in information theory in the late 1940s, but it was Gravano et al. who first suggested it for string querying in database applications.
  13. N-grams work particularly well for transpositions. This surprisingly simple and easy-to-implement approach allows some powerful fuzzy matching. The general idea is that you split the two strings in question into all the possible 2-character fragments (2-grams) and treat the number of n-grams shared between the two strings as their similarity metric. This can be easily normalised by dividing the shared number by the total number of n-grams in the longer string. Here we have three strings, each 19 characters long. Two things are surprising about Levenshtein distance in this case: not only do both strings score low on similarity, but the completely different one actually scores higher. N-grams, on the other hand, deliver exactly the result we expect: sentence A is the most similar to the template, almost identical, sharing 18 out of 20 possible 2-grams. Note that there is a variation of Levenshtein distance called Damerau–Levenshtein, but it only allows for transposition of two adjacent characters.
  14. clean_ontology_terms.pl is being retired in favour of similarity_match.pl. Emma (emma@ebi.ac.uk) refactored all the code and repackaged it, for easier integration and reuse, into a dedicated set of modules, EBI::FGPT::FuzzyRecogniser (http://search.cpan.org/dist/EBI-FGPT-FuzzyRecogniser/), available on CPAN.
  15. Blowing my own trumpet here. The OntoCAT article was featured in the top 10 most accessed articles at BMC Bioinformatics a few months ago. The website (http://www.ontocat.org) sees about 1,000 pageviews monthly.
  16. But it was Natalja and Misha who stole the show with the ontoCAT R package included in Bioconductor. Googling for ‘ontology R’ will return the wiki page for the package as the first hit, and the actual article as the fourth. This is no small feat considering the prevalence of dedicated Gene Ontology R packages that otherwise dominate this space.
  17. An example of a directed acyclic graph representing all the relationships in an ontology for a particular EFO term, ‘EFO_0000815’ (heart). Edges are labelled according to the relationship. Organism part classes are represented as ellipses and disease classes are shown as rectangles. The ontoCAT package was used to compute the relationships, which were later processed in Cytoscape (Cline et al., 2007). Converting the whole ontology to what is effectively RDF triples is a computationally intensive task, and takes about 30 minutes when run on 200 cluster nodes and parallelised by multiprocessing. It is demonstrated in Example 16 in the online documentation (http://www.ontocat.org/browser/trunk/ontoCAT/src/uk/ac/ebi/ontocat/examples/Example16.java).
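
The query expansion described in note 4 can be pictured with a small sketch. This is not ArrayExpress or EFO code: the dictionaries below are a hand-made toy fragment, and the hierarchy (blood cancer above leukemia above chronic myeloid leukemia) as well as the function names `resolve` and `expand` are assumptions made purely for illustration.

```python
# Toy ontology fragment. Labels, synonyms and hierarchy are illustrative
# assumptions, not data extracted from EFO.
SYNONYMS = {
    "chronic myeloid leukemia": {"CML", "chronic myelogenous leukemia"},
    "leukemia": set(),
    "blood cancer": set(),
}
CHILDREN = {
    "blood cancer": {"leukemia"},
    "leukemia": {"chronic myeloid leukemia"},
}


def resolve(query):
    """Return the preferred label whose label or synonym matches the query."""
    wanted = query.lower()
    for label, synonyms in SYNONYMS.items():
        if wanted == label.lower() or wanted in {s.lower() for s in synonyms}:
            return label
    return None


def expand(query):
    """Expand a query to the matched term, its descendants and all their synonyms."""
    root = resolve(query)
    if root is None:
        return {query}  # unknown term: search for it unchanged
    expanded, stack = set(), [root]
    while stack:
        label = stack.pop()
        if label in expanded:
            continue
        expanded.add(label)
        expanded |= SYNONYMS.get(label, set())
        stack.extend(CHILDREN.get(label, ()))
    return expanded
```

In this toy setting a search for CML also retrieves records annotated with either long form, and a search for a parent term picks up everything beneath it, which is why a misplaced class or a wrong synonym in the ontology would directly distort search results.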
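
One of the checks listed in note 8, duplicated synonyms or labels, might look roughly like the sketch below. BioportalImporter itself is not public, so this is not its implementation; the function name `find_shared_names`, the input shape (term identifier mapped to a list of labels and synonyms) and the case-insensitive comparison are all assumptions.

```python
from collections import defaultdict


def find_shared_names(terms):
    """Report names (labels or synonyms) used by more than one term.

    `terms` maps a term identifier to an iterable of its label and synonyms.
    Names are compared case-insensitively after trimming whitespace, which is
    an assumption of this sketch rather than documented behaviour.
    """
    index = defaultdict(set)
    for term_id, names in terms.items():
        for name in names:
            index[name.strip().lower()].add(term_id)
    # Keep only names that point at two or more distinct terms.
    return {name: sorted(ids) for name, ids in index.items() if len(ids) > 1}
```

Two distinct terms that share a name are only candidates for being duplicates; a curator would still need to decide whether to merge them.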
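
The Levenshtein distance defined in note 11 can be computed with the standard bottom-up dynamic programming table. The sketch below is a generic textbook version in Python, not the code used in similarity_match.pl; it keeps only two rows of the table at a time.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string `a` into string `b`."""
    # previous[j] is the distance between the prefix of `a` processed so far
    # (initially empty) and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters of `a` into an empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

Small differences such as cell vs. cells or tumour vs. tumor cost a single edit, whereas reordering the words of a phrase costs many edits, which is the weakness the note points out.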
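
The n-gram similarity outlined in note 13 can be sketched as follows. This is an illustrative Python version, not the FuzzyRecogniser implementation. One assumption is made loudly here: each string is padded with a marker character on both sides, which is what gives a 19-character string 20 possible 2-grams, the figure quoted in the note. The pad character `#` and the function names are hypothetical.

```python
from collections import Counter


def ngrams(text, n=2, pad="#"):
    """Count the n-grams of `text`, padded with n-1 marker characters
    on each side (the padding is an assumption of this sketch)."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_similarity(a, b, n=2):
    """Shared n-grams divided by the n-gram count of the longer string."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    longest = max(sum(grams_a.values()), sum(grams_b.values()))
    return shared / longest if longest else 1.0


template = "The quick brown fox"
print(ngram_similarity(template, "brown quick The fox"))  # sentence A
print(ngram_similarity(template, "The quiet swine flu"))  # sentence B
```

Under these assumptions sentence A shares 18 of 20 2-grams with the template (0.9) and sentence B shares 8 of 20 (0.4), which agrees with the 90% and 40% shown on the slide.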