The Big Picture: The Industrial Revolution
A talk in Berlin, 2008, about industrialising bioinformatics data analysis
1. The Big Picture: The Industrial Revolution
Robert Stevens
Robert.stevens@manchester.ac.uk
The University of Manchester, UK
2. Industrialisation
• Biology has industrialised data production
• Beginning to industrialise data analysis
• Need to automate experimentation
• Need to join them all together
3. Data Integration
• Data integration is possible
• We know how to do it (technically)
• We know how to do plumbing
• What is left is a social issue
4. Classic and Modern Biology
Diagram: Genotype and Phenotype, with the labels "Modern biology" and "Classic biology"
5. Semantic Knowledge Base
Semantic Systems Biology Cycle (diagram labels):
• Experimentation, data generation
• Consistency checking
• Querying
• Automated reasoning
• Hypothesis formulation
• Experimental design
• Information extraction, knowledge formalization
6. What’s in a Lab?
• People
• Equipment, reagents, etc.
• Protocols
• Policy, governance
• All there to facilitate and manage investigation
7. What’s in an e-Lab?
Diagram labels: People, Data, Process, Investigation
8. Data: BioGateway
• Uses Virtuoso Open Server
– Open Source software that can host a triple store
– Can build this from RDF files
– Has a DB backend
• Supports the SPARQL* language, which allows querying of RDF data (graphs)
• SPARQL syntax is similar to that of SQL
*http://www.w3.org/TR/rdf-sparql-query/
http://www.openlinksw.com/virtuoso/
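As a minimal sketch of what querying such a triple store can look like, the Python below composes a generic SPARQL SELECT query and the matching HTTP GET URL, using the `query` parameter defined by the SPARQL protocol. The endpoint address `http://example.org/sparql` and the function name `build_sparql_url` are assumptions for illustration, not BioGateway's actual endpoint or API. Nothing is sent over the network.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: a placeholder, not the real BioGateway address.
ENDPOINT = "http://example.org/sparql"

# A generic query that lists a few triples from the store.
# The SELECT ... WHERE ... LIMIT shape is what makes SPARQL resemble SQL.
QUERY = """
SELECT ?subject ?predicate ?object
WHERE { ?subject ?predicate ?object }
LIMIT 10
""".strip()


def build_sparql_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(build_sparql_url(ENDPOINT, QUERY))
```

The pattern `?subject ?predicate ?object` matches every triple in the graph. A real question would replace the variables in that pattern with the identifiers used by the data being queried.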
10. Data as Input: Asking Questions
• Cancer: what candidate genes are involved in cell cycle control, S-phase to G2 transition, DNA damage response and skin cancer?
• Gastrin: what genes correlate with cancer and the use of antacids, are involved in the gastrin response, and are associated with cell cycle control?
• Inflammation: give me genes that are mentioned in the context of high carbohydrate intake, play a role in (process #1 to be named), and are within x steps of a Gene Ontology (GO) term related to inflammation
15. Data & Processes: Hypotheses
• Run workflow
• Make new data to put in repository
• Also generate hypotheses
• Generate plan from hypothesis
• Execute plan and make more data
• Automated?
Slide Title: G 2 P
Slide contains two semicircles labelled Genotype and Phenotype
Text says: Classic Biology; Modern Biology
Slide Title: Genotype to Pathway
QTL to Pathway workflow
This workflow:
• Identifies all the genes, and their Ensembl IDs, in a QTL region using BioMart
• Cross-references the gene IDs to Entrez and UniProt IDs
• Maps the Entrez and UniProt IDs onto KEGG gene IDs
• Uses the KEGG gene IDs to identify KEGG pathways, including a description and an ID for each
• Returns these lists of descriptions and IDs to the user
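The chain of identifier mappings in this workflow can be sketched in Python as a series of lookups. This is an illustration only: the dictionaries stand in for the BioMart, cross-reference and KEGG services, every identifier in them is a made-up placeholder, and the function name `qtl_to_pathways` is hypothetical rather than part of the original workflow.

```python
# All identifiers below are made-up placeholders, not real database records.

# Stands in for the BioMart query: QTL region -> Ensembl gene IDs.
GENES_IN_QTL = {"qtl_toy": ["ENS_TOY_1", "ENS_TOY_2"]}

# Stands in for cross-referencing: Ensembl ID -> Entrez and UniProt IDs.
ENSEMBL_TO_XREFS = {
    "ENS_TOY_1": ["ENTREZ_TOY_10", "UNIPROT_TOY_A"],
    "ENS_TOY_2": ["ENTREZ_TOY_20"],
}

# Stands in for the mapping of Entrez and UniProt IDs onto KEGG gene IDs.
XREF_TO_KEGG_GENE = {
    "ENTREZ_TOY_10": "kegg_gene_toy_1",
    "UNIPROT_TOY_A": "kegg_gene_toy_1",
    "ENTREZ_TOY_20": "kegg_gene_toy_2",
}

# Stands in for the KEGG lookup: KEGG gene ID -> (pathway ID, description).
KEGG_GENE_TO_PATHWAYS = {
    "kegg_gene_toy_1": [("path_toy_1", "Toy pathway one")],
    "kegg_gene_toy_2": [
        ("path_toy_1", "Toy pathway one"),
        ("path_toy_2", "Toy pathway two"),
    ],
}


def qtl_to_pathways(qtl_id):
    """Follow QTL -> genes -> cross-references -> KEGG genes -> pathways."""
    pathways = {}
    for ensembl_id in GENES_IN_QTL.get(qtl_id, []):
        for xref in ENSEMBL_TO_XREFS.get(ensembl_id, []):
            kegg_gene = XREF_TO_KEGG_GENE.get(xref)
            if kegg_gene is None:
                continue  # no KEGG gene known for this cross-reference
            for pathway_id, description in KEGG_GENE_TO_PATHWAYS.get(kegg_gene, []):
                pathways[pathway_id] = description  # de-duplicate by pathway ID
    return sorted(pathways.items())


if __name__ == "__main__":
    for pathway_id, description in qtl_to_pathways("qtl_toy"):
        print(pathway_id, description)
```

Collecting results in a dictionary keyed by pathway ID means a pathway reached through several genes is reported once.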
Slide Title: Pathway to Phenotype
Pathways to PubMed abstracts workflow
This workflow:
• Takes in a list of KEGG pathway descriptions
• Appends a search string to the end of each description
• Searches PubMed using the NCBI eUtils Web Services
• For each article found, identified by its PubMed ID, returns the abstract along with the date of publication
• Returns these abstracts to the user as a single file
Those abstracts, coupled with abstracts for the phenotype, provide evidence linking those pathways to the phenotype
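The search and retrieval steps of this workflow can be sketched as two request URLs. The sketch below swaps in the URL-based NCBI E-utilities interface (`esearch.fcgi` and `efetch.fcgi`) in place of the Web Services the workflow used; the function names, the search suffix and the example inputs are assumptions for illustration. It only builds the URLs and sends nothing over the network.

```python
from urllib.parse import urlencode

# Base URLs of the NCBI E-utilities search and fetch endpoints.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def esearch_url(pathway_description, search_suffix, retmax=20):
    """Build a PubMed search URL for one pathway description.

    The suffix is appended to the description, mirroring the workflow step
    that adds a search string to the end of each description.
    """
    term = f"{pathway_description} {search_suffix}"
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


def efetch_url(pubmed_ids):
    """Build a URL that requests plain-text abstracts for the given PubMed IDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pubmed_ids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical description and suffix, for illustration only.
    print(esearch_url("Toy pathway", "AND toy_phenotype"))
    # Placeholder PubMed IDs, for illustration only.
    print(efetch_url(["1", "2"]))
```

In use, the IDs returned by the search request would be passed to `efetch_url`, and the fetched abstracts concatenated into a single file for the user.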
Screenshot of the BioCatalogue homepage
Screenshot of the myExperiment front page
Screenshot of the workflows index page on myExperiment
Screenshot of one of Paul’s workflows on myExperiment