The Language of the Gene Ontology
Robert Stevens
Bio-Health Informatics Group
The University of Manchester
Manchester
United Kingdom
Robert.Stevens@manchester.ac.uk
Overview
• Annotation of biological data using ontologies
• Communicating between annotator and user
• Least effort, power laws and quality metrics
• The analysis of GOA corpora
• Interpreting the results
Names in biology
Some category of protein
U2-type nuclear mRNA 5' splice site recognition
spliceosomal E complex formation
spliceosomal E complex biosynthesis
spliceosomal CC complex formation
U2-type nuclear mRNA 5'-splice site recognition
De facto Integration with Ontologies
• Agreement on the entities in biology to be
described
• Describe those entities with ontologies
• Label data entities with terms from those
ontologies
• Ontology building now a mainstream activity
within biology
• Ontologies built by and for biologists
Genotype Phenotype
Sequence
Proteins
Gene products Transcript
Pathways
Cell type
BRENDA tissue /
enzyme source
Development
Anatomy
Phenotype
Plasmodium
life cycle
-Sequence types
and features
-Genetic Context
- Molecule role
- Molecular Function
- Biological process
- Cellular component
-Protein covalent bond
-Protein domain
-UniProt taxonomy
-Pathway ontology
-Event (INOH pathway
ontology)
-Systems Biology
-Protein-protein
interaction
-Arabidopsis development
-Cereal plant development
-Plant growth and developmental stage
-C. elegans development
-Drosophila development
-Human developmental anatomy, abstract
version
-Human developmental anatomy, timed version
-Mosquito gross anatomy
-Mouse adult gross anatomy
-Mouse gross anatomy and development
-C. elegans gross anatomy
-Arabidopsis gross anatomy
-Cereal plant gross anatomy
-Drosophila gross anatomy
-Dictyostelium discoideum anatomy
-Fungal gross anatomy FAO
-Plant structure
-Maize gross anatomy
-Medaka fish anatomy and development
-Zebrafish anatomy and development
-NCI Thesaurus
-Mouse pathology
-Human disease
-Cereal plant trait
-PATO (attribute and value)
-Mammalian phenotype
-Habronattus courtship
-Loggerhead nesting
-Animal natural history and life history
eVOC (Expressed
Sequence Annotation
for Humans)
Gene Ontology http://www.geneontology.org
• “a dynamic controlled vocabulary that can be applied to all eukaryotes”
• Built by the community for the community
• Three organising principles:
– Molecular function, Biological process, Cellular component
• Describes kinds of things and parts of things
• Describes ~25,000 things
Annotating Biological Data
• Some 40 species genome databases are now annotated with GO
• http://www.geneontology.org/GO.current.annotations.sh
• 395,173 species-specific, non-redundant genes/gene products annotated
• 7,718,253 annotations in total
GO associations
• CYP 51
– GO:0020037: heme binding
– GO:0005506: iron ion binding
– GO:0004497: monooxygenase activity
GO Evidence Codes
• Each annotation is given an evidence code
• Codes broadly divide into “computational inference” and “experimental inference”
• GO-annotated data can be partitioned into “high” and “low” confidence annotations
• An evidence code is not a direct measure of quality
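As a rough illustration of the partition described above, the sketch below splits annotation records by evidence code. The choice of `EXPERIMENTAL_CODES` as the "high" confidence set, the function name, and the example records are all assumptions made for illustration; the talk does not state here which codes define its two sets.

```python
from collections import defaultdict

# Assumption: treat the GO experimental evidence codes as "high" confidence
# and every other code as "low" confidence. The talk's exact split may differ.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def partition_by_confidence(annotations):
    """Split (gene_product, go_id, evidence_code) tuples into two lists."""
    groups = defaultdict(list)
    for gene_product, go_id, evidence_code in annotations:
        label = "high" if evidence_code in EXPERIMENTAL_CODES else "low"
        groups[label].append((gene_product, go_id, evidence_code))
    return groups["high"], groups["low"]


# Hypothetical records: the evidence codes below are invented for the example.
example = [
    ("CYP51", "GO:0020037", "IDA"),
    ("CYP51", "GO:0005506", "IEA"),
    ("CYP51", "GO:0004497", "IEA"),
]
high, low = partition_by_confidence(example)
```

Each of the two lists can then be analysed as a separate corpus.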
Zipf’s Law (1934, 1949)
• The frequency of a word in a corpus is inversely proportional to its rank
– The most frequent word occurs about twice as often as the second most frequent, which itself occurs about twice as often as the fourth
– Power law distributions are seen in many natural and social situations
• This distribution is characteristic of human language
– Plot log frequency against log rank
– The slope β gives information about the language used in the corpus
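The log-log plot described above can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it counts how often each token occurs, ranks the counts, and takes an ordinary least-squares slope of log frequency against log rank. The synthetic corpus and the names `rank_frequency` and `loglog_slope` are assumptions for the example.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return term frequencies sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic corpus (assumption): the term of rank r is used about 1200 / r times,
# so frequency is inversely proportional to rank by construction.
tokens = []
for rank in range(1, 51):
    tokens.extend([f"term{rank}"] * round(1200 / rank))

slope = loglog_slope(rank_frequency(tokens))
```

For this constructed corpus the slope should come out close to -1. Note that this slope describes the rank-frequency curve; an exponent defined on the frequency distribution P(f), as used later in the talk, takes different numerical values (a rank-frequency slope of -1 corresponds to an exponent of about 2).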
The Communication Process
[Figure: communication model about biology: Source → Encoding → Encoded Message → Channel → Decoding → Decoded Message → Receiver]
Source = Annotators
Receiver = User of annotation
Principle of Least Effort
• Effort is expended in passing a message from encoder to decoder
• Aim: maximum information transfer with minimum effort
• A rich language that precisely defines the message is hard work to encode and should be easier to decode
• The steeper the slope (β), the richer the message and the more effort involved
• A value of β of about 2 is roughly optimal
• Does GO annotation behave like messages in a language?
• Looking at β might tell us about annotation quality: how well is the message transferred?
[Figure: Effort in encoding and decoding annotation for Integrin Alpha8 protein. Speaker and listener effort, from low to high, for the terms “Cell” and “Integrin Complex”]
Values of Power Law Exponent
• For single-author sources in English, β is about 2 (Ferrer i Cancho and Sole, 2001)
• For young children, β is around 1.6 (Piotrowski and Pashkovskii, 1994)
• β > 2 for sets of nouns in sophisticated, single-authored texts (Balasubrahmanyan and Naranan, 1996)
• English texts fall in the range 1.6 < β < 2.4 (Ferrer i Cancho, 2005b)
• Low values favour the speaker and are low effort for the speaker
• High values favour the listener and are high effort for the speaker
The Questions
Does the Gene Ontology act like a language?
Are GO annotations utterances in that language?
That is, do GO annotations follow a power law?
What is the quality of that communication?
What is the exponent?
What is the effort involved in that communication, both in encoding and in decoding?
Materials and Methods
• GOA and Ensembl annotations
• Species: human, gorilla, mouse, rat, yeast, fly, cow, fish
• Divided into “high” and “low” confidence sets using evidence codes
• Plot log cumulative frequency against log rank
• Fit to a power law (Clauset et al., 2009)
• Compare the exponents of the fitted lines across samples
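The fitting step can be illustrated with a maximum-likelihood estimate of the exponent. This is only a sketch under stated assumptions, not the authors' code: it uses the continuous-data estimator with a fixed lower cut-off `x_min`, whereas Clauset et al. (2009) also describe choosing `x_min`, a separate treatment of discrete data such as term counts, and a goodness-of-fit p-value. The function names and the synthetic data are hypothetical.

```python
import math
import random


def fit_power_law_exponent(values, x_min):
    """Continuous maximum-likelihood estimate of beta for P(x) ~ x^(-beta), x >= x_min."""
    tail = [x for x in values if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0.0:
        raise ValueError("need observations strictly above x_min")
    beta = 1.0 + len(tail) / log_sum
    std_err = (beta - 1.0) / math.sqrt(len(tail))
    return beta, std_err


def sample_power_law(beta, x_min, n, rng):
    """Draw n values from a continuous power law by inverse transform sampling."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (beta - 1.0)) for _ in range(n)]


# Synthetic check (assumption): data generated with beta = 2 should give an
# estimate close to 2.
rng = random.Random(0)
synthetic = sample_power_law(beta=2.0, x_min=1.0, n=20000, rng=rng)
beta_hat, std_err = fit_power_law_exponent(synthetic, x_min=1.0)
```

Checking the estimator against data with a known exponent, as here, is a cheap way to confirm the fitting code before applying it to annotation counts.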
The Equation
If $P(f)$ is the proportion of words in a text with frequency $f$, Zipf's law is given as:
$$P(f) \sim f^{-\beta}$$
where $f$ refers to the frequency of a word and $\beta$ indicates the exponent or scaling parameter of the power law model.
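One way to connect this form to the rank-frequency picture of Zipf's law is sketched below. It assumes that frequencies can be treated as continuous and that $\beta > 1$; the rank $r$ of a word with frequency $f$ is then proportional to the share of words whose frequency is at least $f$:

$$r(f) \propto \int_f^{\infty} P(f')\,df' \propto f^{-(\beta - 1)} \quad\Longrightarrow\quad f(r) \propto r^{-1/(\beta - 1)}$$

Under these assumptions $\beta = 2$ gives $f(r) \propto r^{-1}$, that is, frequency inversely proportional to rank.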
Power law behavior for GO gene annotation of Biological Process
within Human GOA
Power law behavior for GO gene annotation of Molecular Function
within Human GOA
Power law behavior for the GO gene annotation of Cellular
Component within Human GOA
Values for β
• For human GOA:
– Biological process 2.04
– Molecular function 1.83
– Cellular component 1.73
• Across species, most values fall in the range 1.6 < β < 2.4, which is normal for language
– Mean for BP 2.14
– Mean for MF 1.80
– Mean for CC 1.75
• BP differs from MF and CC, which do not differ from each other
Species  Sub-ontology  GOA β  GOA P-value  Ensembl β  Ensembl P-value
Hs       CC            1.73   0.63         1.73       0.61
Hs       MF            1.83   0.55         1.68       0
Hs       BP            2.04   0.65         2          0.2
Mm       CC            1.69   0.74         1.73       0.38
Mm       MF            1.76   0.36         1.79       0.29
Mm       BP            2.08   0.97         2.12       0.46
Dr       CC            1.62   0.74         1.73       0.93
Dr       MF            1.69   0.91         1.82       0.84
Dr       BP            1.88   0.11         1.88       0.67
Bt       CC            1.72   0.25         1.75       0.33
Bt       MF            1.72   0.36         1.71       0.01
Bt       BP            2.04   0.56         2.11       0.89
Sc       CC            1.86   0.29         1.89       0.89
Sc       MF            1.88   0.78         1.81       0.79
Sc       BP            2.27   0.42         2.26       0.78
Rn       CC            1.68   0.24         1.76       0.58
Rn       MF            1.91   0.85         1.71       0
Rn       BP            2.38   0.76         2.07       0.17
Dm       CC            1.94   0.13         1.8        0.61
Dm       MF            1.84   0.01         1.69       0.01
Dm       BP            2.31   0.06         2.13       0.58
The table above shows the results obtained from the power law analysis of each of the data sets characterized
in supplementary Table 1. β is the Zipf’s law exponent and P-value is a statistic used to determine how good a
model the power law is of the data. Statistically significant values are denoted in bold. H. sapiens (Hs), M.
musculus (Mm), D. rerio (Dr), B. taurus (Bt), S. cerevisiae (Sc), R. norvegicus (Rn), D. melanogaster (Dm)
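As a quick arithmetic check, the sketch below recomputes the per-sub-ontology means quoted on the “Values for β” slide from the GOA column of the table above, and tests whether each value lies in the 1.6 < β < 2.4 range. The variable names are illustrative only.

```python
from statistics import mean

# GOA β values copied from the table above, in the species order
# Hs, Mm, Dr, Bt, Sc, Rn, Dm.
goa_beta = {
    "CC": [1.73, 1.69, 1.62, 1.72, 1.86, 1.68, 1.94],
    "MF": [1.83, 1.76, 1.69, 1.72, 1.88, 1.91, 1.84],
    "BP": [2.04, 2.08, 1.88, 2.04, 2.27, 2.38, 2.31],
}

# Mean exponent per sub-ontology, rounded to two decimals as on the slide.
goa_means = {aspect: round(mean(values), 2) for aspect, values in goa_beta.items()}

# True when every species value for that sub-ontology lies in 1.6 < β < 2.4.
in_language_range = {
    aspect: all(1.6 < value < 2.4 for value in values)
    for aspect, values in goa_beta.items()
}
```

The rounded means agree with the quoted values of 2.14 (BP), 1.80 (MF) and 1.75 (CC).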
Species  Sub-ontology  GOA HC β  GOA HC P-value  GOA LC β  GOA LC P-value  Ensembl HC β  Ensembl HC P-value  Ensembl LC β  Ensembl LC P-value
Hs       CC            1.88      0.37            1.62      0.11            1.89          0.3                 1.87          0.64
Hs       MF            2.05      0.18            1.75      0.16            2.06          0.83                1.77          0.02
Hs       BP            2.12      0.37            2.04      0.62            2.11          0.34                1.86          0.04
Mm       CC            1.9       0.43            1.5       0.71            1.91          0.2                 1.6           0.86
Mm       MF            2.15      0.65            1.67      0.03            2.15          0.65                1.8           0.00
Mm       BP            2.6       0.61            1.62      0.00            2.62          0.3                 1.8           0.08
The table above shows the results obtained from the power law analysis of each of the data
sets characterized in supplementary Table 2. β is the Zipf’s law exponent and P-value
is a statistic used to determine how good a model the power law is of the data.
Statistically significant values are denoted in bold. The GO evidence codes used to
define the High Confidence (HC) and Low Confidence (LC) data sets are described in
the materials and methods
[Figure: β (y-axis, 0.0 to 2.5) by species (x-axis: Hs, Mm, Dr, Bt, Sc, Rn, Dm) for BP, MF and CC]
Comparison of calculated Zipf’s Law exponents for various sub-ontologies
chosen from GOA for different species: H. sapiens (Hs), M. musculus (Mm),
D. rerio (Dr), B. taurus (Bt), S. cerevisiae (Sc), R. norvegicus (Rn), D.
melanogaster (Dm)
Power law behavior for the GO gene annotation of Molecular
Function within Human Ensembl (with two-regime behavior)
Findings
• Most annotations with GO behave like utterances in a language
• We can say something about the quality of those utterances
• Failure to fit a power law suggests non-language-like communication
• Low confidence data fits a power law less well
– Utterances in biological process are about right (2.14)
– Utterances in cellular component and molecular function are biased towards the speaker
• Why might this be? The quality is lower (β of about 1.7 and 1.8)
• Is it just because BP is that much bigger, and thus it is easier to be more specific?
• Bias towards the speaker for lower quality annotations
[Figure: β (y-axis, 1.5 to 2.5) against number of distinct GO identifiers (x-axis, 0 to 6000) for CC, MF and BP]
The power law exponent, β, as a function of the total number of
distinct GO identifiers in each of the GO sub-ontologies
More Findings
• This effect is independent of size
• Why are BP and MF/CC different?
• Can speculate about what we know: We know
a lot about processes, much less about
specific functions of proteins; there is simply
much less to know about components and
location is also a bit “tricky”
Power law behaviour in dataset 1 from EFO (Experimental Factor Ontology)
Power law behaviour in dataset 2 from EFO (Experimental Factor Ontology)
Conclusions and the Future
• Rapid assessment of the language-like qualities of
GO annotations
• Gives some idea of the quality of those
annotations – what effort is involved
• Need to make it more analytic
• Look at many more annotations
Acknowledgements
• Leila R. Kalankesh
• Andy Brass
• Robert Stevens

The Language of the Gene Ontology

  • 1. The Language of the Gene Ontology Robert Stevens Bio-Health Informatics Group The University of Manchester Manchester United Kingdom Robert.Stevens@manchester.ac.uk
  • 2. Overview • Annotation of biological data using ontologies; • Communicating between annotator and user • Least effort, power laws and quality metrics • The analysis of GOA corpora • Interpreting the results
  • 3. Names in biology Some category of protein U2-type nuclear mRNA 5' splice site recognition spliceosomal E complex formation spliceosomal E complex biosynthesis spliceosomal CC complex formation U2-type nuclear mRNA 5'-splice site recognition
  • 4. De facto Integration with Ontologies • Agreement on the entities in biology to be described • Describe those entities with ontologies • Label those data entities with those ontology’s terms • Ontology building now a mainstream activity within biology • Ontologies built by and for biologists
  • 5. Genotype Phenotype Sequence Proteins Gene products Transcript Pathways Cell type BRENDA tissue / enzyme source Development Anatomy Phenotype Plasmodium life cycle -Sequence types and features -Genetic Context - Molecule role - Molecular Function - Biological process - Cellular component -Protein covalent bond -Protein domain -UniProt taxonomy -Pathway ontology -Event (INOH pathway ontology) -Systems Biology -Protein-protein interaction -Arabidopsis development -Cereal plant development -Plant growth and developmental stage -C. elegans development -Drosophila development FBdv fly development.obo OBO yes yes -Human developmental anatomy, abstract version -Human developmental anatomy, timed version -Mosquito gross anatomy -Mouse adult gross anatomy -Mouse gross anatomy and development -C. elegans gross anatomy -Arabidopsis gross anatomy -Cereal plant gross anatomy -Drosophila gross anatomy -Dictyostelium discoideum anatomy -Fungal gross anatomy FAO -Plant structure -Maize gross anatomy -Medaka fish anatomy and development -Zebrafish anatomy and development -NCI Thesaurus -Mouse pathology -Human disease -Cereal plant trait -PATO PATO attribute and value.obo -Mammalian phenotype -Habronattus courtship -Loggerhead nesting -Animal natural history and life history eVOC (Expressed Sequence Annotation for Humans)
  • 6. Gene Ontology http://www.geneontology.org • “a dynamic controlled vocabulary that can be applied to all eukaryotes” • Built by the community for the community • Three organising principles: Molecular function, Biological process, Cellular component • Describes kinds of things and parts of things • Describes ~25,000 things
  • 7. Annotating Biological Data • Some 40 species genome DBs now annotated with GO • http://www.geneontology.org/GO.current.annotations.sh • 395,173 species-specific, non-redundant genes/gene products annotated • 7,718,253 annotations in total
  • 9. GO associations • CYP 51: GO:0020037 : heme binding; GO:0005506 : iron ion binding; GO:0004497 : monooxygenase activity
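As a rough illustration of how associations like these become a corpus, the sketch below stores gene products with their GO identifiers and counts how often each identifier is used. The CYP 51 terms come from the slide; the second gene product and its terms are hypothetical, added only so that the counts differ.

```python
from collections import Counter

# Gene product -> GO identifiers. The CYP51 terms are from the slide;
# "HYPOTHETICAL_P450" is an invented entry for illustration only.
associations = {
    "CYP51": ["GO:0020037", "GO:0005506", "GO:0004497"],
    "HYPOTHETICAL_P450": ["GO:0020037", "GO:0004497"],
}


def term_frequencies(assoc):
    """Count how many gene products are annotated with each GO identifier."""
    counts = Counter()
    for terms in assoc.values():
        counts.update(set(terms))  # count each term once per gene product
    return counts


freqs = term_frequencies(associations)
ranked = freqs.most_common()  # (GO identifier, count), most used first
```

The ranked list is the raw material for a rank-frequency analysis: GO identifiers play the role of words, and annotations the role of utterances.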
  • 10. GO Evidence Codes • Each annotation is given an evidence code • These broadly divide into “computational inference” and “experimental inference” • Can partition GO annotated data into “high” and “low” confidence annotations • Not directly a measure of quality
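A minimal sketch of such a partition follows. It assumes that the experimental evidence codes (EXP, IDA, IPI, IMP, IGI, IEP) mark "high" confidence and every other code marks "low" confidence; the grouping actually used in the talk is not given on this slide and may differ. The sample records and the evidence codes attached to them are hypothetical.

```python
# Experimental evidence codes defined by the GO Consortium.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def partition_by_confidence(annotations):
    """Split (gene_product, go_id, evidence_code) tuples into two lists.

    Assumption: experimental codes count as "high" confidence and every
    other code as "low"; the talk's exact grouping may differ.
    """
    high, low = [], []
    for record in annotations:
        evidence_code = record[2]
        (high if evidence_code in EXPERIMENTAL_CODES else low).append(record)
    return high, low


# Hypothetical records: the evidence codes attached here are illustrative.
sample = [
    ("CYP51", "GO:0020037", "IDA"),
    ("CYP51", "GO:0005506", "IEA"),
    ("CYP51", "GO:0004497", "ISS"),
]
high_conf, low_conf = partition_by_confidence(sample)
```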
  • 11. Zipf’s Law (1934, 1949) • Frequency of a word in a corpus is inversely proportional to its rank • The most popular word occurs twice as frequently as the second most popular word, which itself occurs twice as frequently as the fourth • Power law distributions are seen in many natural and social situations • This distribution is a characteristic of human language • Plot of log frequency against log rank • The slope β gives information about the language used in the corpus
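The log-log plot can be sketched in a few lines. The example below ranks tokens by frequency and estimates the slope of log frequency against log rank by ordinary least squares. This is a simplification chosen for illustration, not the fitting procedure cited later in the talk (Clauset et al., 2009), and the toy corpus is invented so that frequency is exactly 12 / rank.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent token first."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def loglog_slope(pairs):
    """Ordinary least-squares slope of log(frequency) on log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Toy corpus built so that frequency = 12 / rank exactly.
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
pairs = rank_frequency(tokens)
slope = loglog_slope(pairs)  # close to -1 for this idealised corpus
```

A slope of -1 on this plot is the textbook inverse-proportional case: the top word is twice as frequent as the second and four times as frequent as the fourth.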
  • 12. The Communication Process [Diagram: Biology; Source → Encoding → Encoded Message → Channel → Decoding → Decoded Message → Receiver] • Source = Annotators • Receiver = User of Annotation
  • 13. Principle of Least Effort • In the process of message passing from encoder to decoder, effort is expended • Maximum information transfer with minimum effort • A rich language precisely defining the message is hard work to encode and should be easier to decode • The steeper the slope (β), the richer the message and the more effort involved • A value for β of about 2 is near the optimum • Does GO annotation behave like messages in a language? • Looking at β might tell us about annotation quality – how well is the message transferred?
  • 14. Effort in Encoding and Decoding • Annotation for Integrin Alpha8 protein [Figure: speaker and listener effort, from high to low, for the annotations “Cell” and “Integrin Complex”]
  • 15. Values of Power Law Exponent • For single-author sources in English, β is about 2 (Ferrer i Cancho and Sole, 2001) • For young children, β is around 1.6 (Piotrowski and Pashkovskii, 1994) • β > 2 for sets of nouns in sophisticated, single-authored texts (Balasubrahmanyan and Naranan, 1996) • English texts are in the range 1.6 < β < 2.4 (Ferrer i Cancho, 2005b) • Low values favour the speaker and are low effort for the speaker • High values favour the listener and are high effort for the speaker
  • 16. The Questions • Does the Gene Ontology act like a language? Are GO annotations utterances in that language? That is, do GO annotations follow a power law? • What is the quality of that communication? What is the exponent? • What is the effort involved in that communication? What is the effort involved in encoding and decoding?
  • 17. Materials and Methods • GOA and ENSEMBL annotations • For species: Human, gorilla, mouse, rat, yeast, fly, cow, fish • Divided into “high” and “low” confidence using evidence codes • Plot log cumulative frequency against log rank • Fit to power law (Clauset et al., 2009) • Look at exponent of lines for various samples
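The fitting step can be sketched as follows. This is not the full procedure of Clauset et al. (2009), which also selects the lower cut-off by minimising a Kolmogorov-Smirnov distance and computes a goodness-of-fit p-value by bootstrapping. It shows only their approximate maximum-likelihood estimator for integer data, with the cut-off `x_min` supplied by hand. The function name and the sample frequencies are hypothetical.

```python
import math


def estimate_exponent(frequencies, x_min=1):
    """Approximate maximum-likelihood exponent for discrete power-law data.

    Uses beta ~ 1 + n / sum(ln(x / (x_min - 0.5))) over values x >= x_min,
    the approximation given by Clauset et al. (2009) for integer data.
    The approximation is rough for small x_min and improves as x_min grows.
    """
    tail = [x for x in frequencies if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum


# Hypothetical usage counts for GO terms: many rare terms, a few common ones.
term_counts = [1, 1, 1, 1, 1, 1, 2, 2, 3, 5, 8, 20, 75]
beta_hat = estimate_exponent(term_counts, x_min=1)
```

A heavier tail (a few terms used very often) pulls the estimate down, while a corpus dominated by rarely used terms pushes it up.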
  • 18. The Equation… If $P(f)$ is the proportion of words in a text with frequency $f$, Zipf’s law is given as $P(f) \sim f^{-\beta}$, where $f$ refers to the frequency of a word and $\beta$ indicates the exponent or scaling parameter of the power law model.
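One step the slide leaves implicit is how this frequency-distribution form relates to the rank-frequency statement of Zipf’s law. A sketch of the standard argument follows; the symbol $\alpha$ for the rank-frequency exponent and $r$ for rank are introduced here and do not appear in the slides.

```latex
% Rank-frequency form: the word of rank r has frequency
f(r) \propto r^{-\alpha}
% The rank of a word equals the number of words at least as frequent,
% so inverting gives the cumulative count
N(F \ge f) = r(f) \propto f^{-1/\alpha}
% Differentiating with respect to f gives the proportion with frequency f
P(f) \propto f^{-(1 + 1/\alpha)}
\quad\Longrightarrow\quad
\beta = 1 + \frac{1}{\alpha}
```

Under this reading, the classic inverse-proportional case $\alpha = 1$ corresponds to $\beta = 2$, which is consistent with the values near 2 quoted for English texts.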
  • 19. Power law behavior for GO gene annotation of Biological Process within Human GOA
  • 20. Power law behavior for GO gene annotation of Molecular Function within Human GOA
  • 21. Power law behavior for the GO gene annotation of Cellular Component within Human GOA
  • 22. Values for β • For human GOA: – Biological process 2.04 – Molecular function 1.83 – Cellular component 1.73 • Across species, most fit 1.6 < β < 2.4, which is normal for language • Mean for BP 2.14 • Mean for MF 1.80 • Mean for CC 1.75 • BP differs from MF and CC, which do not differ from each other
  • 23. The table shows the results obtained from the power law analysis of each of the data sets characterized in supplementary Table 1. β is the Zipf’s law exponent and P-value is a statistic used to determine how good a model the power law is of the data. Statistically significant values are denoted in bold. H. sapiens (Hs), M. musculus (Mm), D. rerio (Dr), B. taurus (Bt), S. cerevisiae (Sc), R. norvegicus (Rn), D. melanogaster (Dm)
      Species  Sub-ontology  GOA β  GOA P-value  Ensembl β  Ensembl P-value
      Hs       CC            1.73   0.63         1.73       0.61
      Hs       MF            1.83   0.55         1.68       0
      Hs       BP            2.04   0.65         2          0.2
      Mm       CC            1.69   0.74         1.73       0.38
      Mm       MF            1.76   0.36         1.79       0.29
      Mm       BP            2.08   0.97         2.12       0.46
      Dr       CC            1.62   0.74         1.73       0.93
      Dr       MF            1.69   0.91         1.82       0.84
      Dr       BP            1.88   0.11         1.88       0.67
      Bt       CC            1.72   0.25         1.75       0.33
      Bt       MF            1.72   0.36         1.71       0.01
      Bt       BP            2.04   0.56         2.11       0.89
      Sc       CC            1.86   0.29         1.89       0.89
      Sc       MF            1.88   0.78         1.81       0.79
      Sc       BP            2.27   0.42         2.26       0.78
      Rn       CC            1.68   0.24         1.76       0.58
      Rn       MF            1.91   0.85         1.71       0
      Rn       BP            2.38   0.76         2.07       0.17
      Dm       CC            1.94   0.13         1.8        0.61
      Dm       MF            1.84   0.01         1.69       0.01
      Dm       BP            2.31   0.06         2.13       0.58
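As a quick check on the summary figures quoted for β, the sketch below averages the GOA β values from this table across the seven species and tests whether each falls in the 1.6 < β < 2.4 range cited for English texts. The numbers are transcribed from the table; the variable names are illustrative.

```python
from statistics import mean

# GOA beta values transcribed from the table on this slide,
# in species order Hs, Mm, Dr, Bt, Sc, Rn, Dm.
goa_beta = {
    "CC": [1.73, 1.69, 1.62, 1.72, 1.86, 1.68, 1.94],
    "MF": [1.83, 1.76, 1.69, 1.72, 1.88, 1.91, 1.84],
    "BP": [2.04, 2.08, 1.88, 2.04, 2.27, 2.38, 2.31],
}

# Mean exponent per sub-ontology, rounded to two decimals.
means = {name: round(mean(values), 2) for name, values in goa_beta.items()}

# Does every species fall in the range quoted for English texts?
in_language_range = {
    name: all(1.6 < b < 2.4 for b in values)
    for name, values in goa_beta.items()
}
```

The rounded means come out at 2.14 for BP, 1.80 for MF and 1.75 for CC, matching the values on the "Values for β" slide.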
  • 24. The table shows the results obtained from power law analysis of each of the data sets characterized in supplementary Table 2. β is the Zipf’s law exponent and P-value is a statistic used to determine how good a model the power law is of the data. Statistically significant values are denoted in bold. The GO evidence codes used to define the High Confidence (HC) and Low Confidence (LC) data sets are described in the materials and methods
      Species  Sub-ontology  GOA HC β  GOA HC P  GOA LC β  GOA LC P  Ensembl HC β  Ensembl HC P  Ensembl LC β  Ensembl LC P
      Hs       CC            1.88      0.37      1.62      0.11      1.89          0.3           1.87          0.64
      Hs       MF            2.05      0.18      1.75      0.16      2.06          0.83          1.77          0.02
      Hs       BP            2.12      0.37      2.04      0.62      2.11          0.34          1.86          0.04
      Mm       CC            1.9       0.43      1.5       0.71      1.91          0.2           1.6           0.86
      Mm       MF            2.15      0.65      1.67      0.03      2.15          0.65          1.8           0.00
      Mm       BP            2.6       0.61      1.62      0.00      2.62          0.3           1.8           0.08
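The high versus low confidence comparison can be read off this table with a short sketch. The values are transcribed from the slide. The cut-off of 0.1 for calling a power-law fit poor is an assumption borrowed from the convention in Clauset et al. (2009); the slide does not state which threshold was used.

```python
# (species, sub-ontology, source): (HC beta, HC p, LC beta, LC p)
rows = {
    ("Hs", "CC", "GOA"): (1.88, 0.37, 1.62, 0.11),
    ("Hs", "MF", "GOA"): (2.05, 0.18, 1.75, 0.16),
    ("Hs", "BP", "GOA"): (2.12, 0.37, 2.04, 0.62),
    ("Mm", "CC", "GOA"): (1.90, 0.43, 1.50, 0.71),
    ("Mm", "MF", "GOA"): (2.15, 0.65, 1.67, 0.03),
    ("Mm", "BP", "GOA"): (2.60, 0.61, 1.62, 0.00),
    ("Hs", "CC", "Ensembl"): (1.89, 0.30, 1.87, 0.64),
    ("Hs", "MF", "Ensembl"): (2.06, 0.83, 1.77, 0.02),
    ("Hs", "BP", "Ensembl"): (2.11, 0.34, 1.86, 0.04),
    ("Mm", "CC", "Ensembl"): (1.91, 0.20, 1.60, 0.86),
    ("Mm", "MF", "Ensembl"): (2.15, 0.65, 1.80, 0.00),
    ("Mm", "BP", "Ensembl"): (2.62, 0.30, 1.80, 0.08),
}

P_THRESHOLD = 0.1  # assumed cut-off; not stated on the slide

# Rows where the high-confidence exponent exceeds the low-confidence one.
hc_steeper = [k for k, (hc_b, _, lc_b, _) in rows.items() if hc_b > lc_b]

# Rows whose p-value falls below the assumed cut-off (poor power-law fit).
hc_poor_fits = [k for k, (_, hc_p, _, _) in rows.items() if hc_p < P_THRESHOLD]
lc_poor_fits = [k for k, (_, _, _, lc_p) in rows.items() if lc_p < P_THRESHOLD]
```

With these numbers, the high-confidence exponent is larger in all twelve rows. Under the assumed cut-off, none of the high-confidence sets and six of the low-confidence sets fit poorly.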
  • 25. [Figure: β (y-axis, 0.0 to 2.5) by species (x-axis), with series BP, MF, CC] Comparison of calculated Zipf’s Law exponents for various sub-ontologies chosen from GOA for different species: H. sapiens (Hs), M. musculus (Mm), D. rerio (Dr), B. taurus (Bt), S. cerevisiae (Sc), R. norvegicus (Rn), D. melanogaster (Dm)
  • 26. Power law behavior for the GO gene annotation of Molecular Function within Human Ensembl (with two-regime behavior)
  • 27. Findings • Most annotations with GO behave like utterances in a language • We can say something about the quality of those utterances • Non-fitting to power law suggests non-language-like communication • Low confidence data fits less well to power law – Utterances in biological process about right (2.14) – Utterances in cellular component and molecular function biased towards the speaker; the quality is lower (1.7 and 1.8) • Why might this be? Is it just because BP is that much bigger and thus it is easier to be more specific? • Bias towards speaker for lower quality annotations
  • 28. [Figure: β (y-axis, 1.5 to 2.5) against number of distinct GO identifiers (x-axis, 0 to 6000), with series CC, MF, BP] The power law exponent, β, as a function of the total number of distinct GO identifiers in each of the GO sub-ontologies
  • 29. More Findings • This effect is independent of size • Why are BP and MF/CC different? • Can speculate about what we know: We know a lot about processes, much less about specific functions of proteins; there is simply much less to know about components and location is also a bit “tricky”
  • 30. Power law behaviour in dataset 1 from EFO (Experimental Factor Ontology)
  • 31. Power law behaviour in dataset 2 from EFO (Experimental Factor Ontology)
  • 32. Conclusions and the Future • Rapid assessment of language-like qualities of GO annotations • Gives some idea of the quality of those annotations – what effort is involved • Need to make it more analytic • Look at many more annotations
  • 33. Acknowledgements • Leila R. Kalankesh • Andy Brass • Robert Stevens