Data integration
1. Primer for Predocs
17-19 January 2011
Rafael Jimenez
rafael@ebi.ac.uk
EnCORE
presentation
Data integration
2. Table of contents
• Data integration
Why do we need it?
What is it?
Problems
Suggestions
Different approaches
Important variables
Tools
3. Molecular Biology Database resources
[Pie chart: ~1440 resources by category]
• Human Genes and Diseases: 13%
• Proteomics Resources: 1%
• Other Molecular Biology Databases: 3%
• Immunological databases: 2%
• Plant databases: 7%
• Organelle databases: 2%
• Human and other Vertebrate Genomes: 8%
• Nucleotide Sequence Databases: 9%
• RNA sequence databases: 5%
• Protein sequence databases: 13%
• Structure Databases: 9%
• Genomics Databases (non-vertebrate): 19%
• Metabolic and Signaling Pathways: 9%
Source: MY Galperin, GR Cochrane. Nucleic Acids Research annual Database Issue and the NAR online Molecular Biology Database Collection in 2009. Nucleic Acids Research.
http://www.oxfordjournals.org/nar/database/c
5. Why so many data sources?
• Many data types
• Many communities
• Different ways to structure data
• Control
• Reputation
• Easy publication
6. Data collection: ideally vs. reality
[Diagram: annotators (A) feed databases (DB); the user reaches each database through its GUI, API or WS. Two panels: "Ideally" and "Reality".]
Legend: A = Annotator; DB = Database; GUI = Graphical User Interface; API = Application programming interface; WS = Web Services
7. [Figure: scientific impact vs. utility of bioinformatics. Labels: "Too little bioinformatics"; "Too many databases, too diverse interfaces". Credit: Tim Hubbard]
8. Data integration
[Diagram: two panels, "NO" and "YES", showing a user, databases (DB) with their GUI, API and WS, and a query interface (QI) placed over several databases.]
Legend: DB = Database; QI = Query Interface
Combining data residing in different sources
… providing users with a unified view of these data.
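A minimal sketch of the "unified view" idea, assuming two hypothetical in-memory sources (`SOURCE_A`, `SOURCE_B`) with made-up records. A real source would sit behind its own GUI, API or WS; the function names here are illustrative, not part of any existing tool.

```python
from typing import Callable, Dict

# Hypothetical sources with made-up records (assumption, not real data).
SOURCE_A = {"P12345": {"name": "Protein one"}}
SOURCE_B = {
    "P12345": {"pathway": "Example pathway"},
    "Q99999": {"pathway": "Other pathway"},
}


def make_lookup(table: Dict[str, dict]) -> Callable[[str], dict]:
    """Wrap a table so that every source is queried the same way."""
    def lookup(identifier: str) -> dict:
        return dict(table.get(identifier, {}))
    return lookup


def unified_query(identifier: str, sources: Dict[str, Callable[[str], dict]]) -> dict:
    """Ask every source for the identifier and merge the answers into one record."""
    record = {"id": identifier, "sources": []}
    for name, lookup in sources.items():
        hit = lookup(identifier)
        if hit:
            record["sources"].append(name)
            record.update(hit)
    return record


SOURCES = {"source_a": make_lookup(SOURCE_A), "source_b": make_lookup(SOURCE_B)}
result = unified_query("P12345", SOURCES)
print(result)
```

The user calls one function and gets one record, without needing to know how many sources were consulted.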
9. [Figure: scientific impact vs. utility of bioinformatics. Labels: "Too little bioinformatics"; "Too many databases, too diverse interfaces"]
Integration of
10. Problems
Many data sources
• Many sources to maintain
• New sources appearing
• Just 20% have a sustained future*
• How to find them?
Different query interfaces
data integration?
Variable results
• Formats
• Schemas
• Controlled vocabularies
• Minimum information guidelines
Redundant results
* Merali Z. et al. Databases in peril. Nature 2005.
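To illustrate the "variable results" and "redundant results" problems, here is a small sketch that assumes two hypothetical sources returning the same kind of data (interacting pairs) in different formats. The field names and identifiers are invented for the example.

```python
import csv
import io

# Hypothetical results from two sources, in two different formats.
TSV_RESULT = "interactor_a\tinteractor_b\nP1\tP2\nP2\tP3\n"
RECORD_RESULT = [
    {"source": "P2", "target": "P1"},  # same pair as P1/P2 above
    {"source": "P3", "target": "P4"},
]


def from_tsv(text):
    """Parse tab-separated rows into unordered pairs."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [frozenset((row["interactor_a"], row["interactor_b"])) for row in reader]


def from_records(records):
    """Convert dictionary records into the same unordered-pair form."""
    return [frozenset((r["source"], r["target"])) for r in records]


def merge_unique(*result_sets):
    """Concatenate results, keeping the first copy of each pair."""
    seen = set()
    merged = []
    for results in result_sets:
        for pair in results:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged


merged = merge_unique(from_tsv(TSV_RESULT), from_records(RECORD_RESULT))
print(len(merged))
```

Every extra format needs its own converter, which is the practical cost that common formats and schemas are meant to remove.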
11. Suggestions
– Scientific and political independence of the databases
– Cross-database queries spanning domain and
organizational boundaries
– Sharing and adoption rather than reinventing
– Adoption of standards
– Coordination to avoid redundant content
– Infrastructure to avoid volatile resources
– Registries to find resources and services
12. Data centralization (approach 1)
[Diagram layers: Curators / Annotators; Original data sources; Third party implementations; Users. Markers: QI = query interface; i = integration; S = standardization]
Examples:
• UniProt
• GenBank
• IntAct
14. Data warehousing (approach 2)
[Diagram layers: Curators / Annotators; Original data sources; Third party implementations; Users. Markers: QI = query interface; i = integration; S = standardization]
Examples:
• Pathway Commons
• STRING
• Atlas
16. Dataset integration (approach 3)
[Diagram layers: Curators / Annotators; Original data sources; Third party implementations; Users. Markers: QI = query interface; i = integration; S = standardization]
Examples:
• Your own script
• Workflows
18. Hyperlinks (approach 4)
[Diagram layers: Curators / Annotators; Original data sources; Third party implementations; Users. Markers: QI = query interface (one per source); i = integration; S = standardization]
Examples:
• SRS
• Entrez
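20. Federated databases (approach 5)
[Diagram layers: Curators / Annotators; Original data sources; Third party implementations; Users. Markers: QI = query interface (one per source, plus a common one); SP on each source; i = integration; S = standardization]
Examples:
• DAS
• PSICQUIC
• EnCORE
• RDF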
22. View integration (approach 6)
[Diagram layers: Curators / Annotators; Original data sources; Third party implementations; Users. Markers: QI = query interface (one per source, plus a common one); i = integration; S = standardization]
Examples:
• BioZon
• TAMBIS
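26. Integrating different domains
[Diagram: integration per domain (sources with SP behind one QI each for Domain 1, Domain 2, Domain …) and, on top, a QI integrating the different domains; steps labelled 1 and 2, with the label "leverage"]
SP = Common identifiers, Controlled vocabularies, Common formats, Common schemas, Minimum information guidelines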
27. Domain
Standards
• Standardization per domain
• Common identifiers
• Controlled vocabularies
• Common formats
• Common schemas
• Minimum information guidelines
• Common query interfaces
30. Architecture
• Data warehousing
– Pull data from several resources into one resource.
– Main features:
• Data centralization
• High maintenance
• Data out of date
• Modifications (schema, format, content, …)
• Federation
– Data residing in different sources with a common standard
protocol and query system.
– Main features:
• Fresh data (original)
• Data redundancy
• Data inconsistency
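The trade-off between the two architectures can be sketched as follows. This is an illustration under stated assumptions: `sources`, `Warehouse` and `Federation` are hypothetical names, the records are invented, and a real federation would query remote services rather than a shared dictionary.

```python
import copy

# Hypothetical original sources with made-up annotations.
sources = {
    "db_one": {"G1": "kinase"},
    "db_two": {"G1": "kinase", "G2": "transporter"},
}


class Warehouse:
    """Pulls a copy of every source into one resource at load time."""

    def __init__(self, originals):
        self.snapshot = copy.deepcopy(originals)

    def query(self, key):
        return {name: table[key] for name, table in self.snapshot.items() if key in table}


class Federation:
    """Leaves data at the sources and queries them live through one interface."""

    def __init__(self, originals):
        self.originals = originals  # a reference, not a copy

    def query(self, key):
        return {name: table[key] for name, table in self.originals.items() if key in table}


warehouse = Warehouse(sources)
federation = Federation(sources)

# One source is updated after the warehouse was loaded.
sources["db_one"]["G1"] = "tyrosine kinase"

stale = warehouse.query("G1")
fresh = federation.query("G1")
print(stale)
print(fresh)
```

The warehouse answer is out of date until the next reload. The federated answer is fresh, but it returns the same gene twice with two different annotations, which is the redundancy and inconsistency listed above.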
38. Minimum information guidelines
• PSI: Proteomics Standards Initiative
– Work group of the Human Proteome Organization
– Defines community standards for data in proteomics
• … facilitating data comparison, exchange and verification
• MIAPE: The Minimum Information About a Proteomics Experiment
– Data and metadata from proteomics experiments
• Data: results
• Metadata: data about the data (where the samples came from, how the analyses were performed)
39. Minimum information guidelines
MIMIx
• MIAPE document guideline for molecular interactions
• 1. Manuscript information
• 2. Experiment
• 3. Interaction
• 4. Confidence
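A checklist like this one can be applied mechanically. The sketch below only checks that the four sections named on the slide are present and non-empty; the record layout and key names are assumptions made for the example, not the MIMIx specification.

```python
# Section names taken from the slide; the dictionary layout is hypothetical.
MIMIX_SECTIONS = ("manuscript", "experiment", "interaction", "confidence")


def missing_sections(record):
    """Return the checklist sections that are absent or empty in a record."""
    return [section for section in MIMIX_SECTIONS if not record.get(section)]


submission = {
    "manuscript": {"title": "Example title"},
    "experiment": {"method": "example method"},
    "interaction": {"participants": ["A", "B"]},
    # "confidence" has not been reported
}

missing = missing_sections(submission)
print(missing)
```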
40. ID Mapping services
PICR: Protein Identifier Cross-Reference Service (Richard Cote)
http://www.ebi.ac.uk/Tools/picr/
[Diagram labels: logical xref (hyperlinked); active xref (hyperlinked); inactive xref; secondary identifier]
Web services!
• REST
• SOAP
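The core of an ID mapping service is a table of cross-references that can be filtered by status. The sketch below is not the PICR API: the table, the accessions and the `map_identifier` function are hypothetical, and only the active/inactive distinction is borrowed from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Xref:
    database: str
    accession: str
    active: bool


# Hypothetical mapping table keyed by the query accession.
XREFS = {
    "ACC_1": [
        Xref("db_x", "X100", True),
        Xref("db_y", "Y200", True),
        Xref("db_y", "Y150", False),  # an entry that is no longer active
    ]
}


def map_identifier(accession: str, only_active: bool = True) -> List[Xref]:
    """Return cross-references for an accession, optionally including inactive ones."""
    hits = XREFS.get(accession, [])
    return [xref for xref in hits if xref.active or not only_active]


active_hits = map_identifier("ACC_1")
all_hits = map_identifier("ACC_1", only_active=False)
print([xref.accession for xref in active_hits])
print(len(all_hits))
```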
As a biologist, I would prefer to see all the information in one single database.
Centralized databases have this mission.
They aim to collect all the information for one specific domain.
However …
Medium-size labs and organizations are capable of producing large amounts of data.
It then becomes harder to submit data to centralized repositories.
Moreover, data producers like to control and structure their own databases, developing their own GUIs and access protocols.
For us, the users, it becomes harder to access the information.
For one specific domain we might find different databases using different GUIs. We might end up downloading data in different formats, which complicates the integration of results. After integration, we might find high redundancy in our results.
This workflow searches for genes that reside in a QTL (Quantitative Trait Locus) region in the mouse, Mus musculus. The workflow requires three inputs: a chromosome name or number, a QTL start base pair position and a QTL end base pair position. Data is then extracted from BioMart to annotate each of the genes found in this region. The Entrez and UniProt identifiers are then sent to KEGG to obtain KEGG gene identifiers. The KEGG gene identifiers are then used to search for pathways in the KEGG pathway database.
Workflow: pathways_and_gene_annotations_for_qtl_phenotype_28303
Executed with:
chromosome = 17
start_position = 28500000
end_position = 32500000
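The same chain of steps can be written as "your own script". The sketch below mirrors the order of the workflow (region, then genes, then KEGG gene identifiers, then pathways) but replaces BioMart and KEGG with hypothetical in-memory tables. Every gene, identifier and pathway name is a placeholder, and no real service is called.

```python
# Hypothetical stand-ins for BioMart and KEGG; all records are placeholders.
GENES = [
    {"gene": "gene_a", "chromosome": "17", "start": 29_000_000, "entrez": "E1", "uniprot": "U1"},
    {"gene": "gene_b", "chromosome": "17", "start": 40_000_000, "entrez": "E2", "uniprot": "U2"},
    {"gene": "gene_c", "chromosome": "5", "start": 30_000_000, "entrez": "E3", "uniprot": "U3"},
]
KEGG_GENE_IDS = {"E1": "kegg_gene_1", "U1": "kegg_gene_1", "E2": "kegg_gene_2"}
KEGG_PATHWAYS = {"kegg_gene_1": ["pathway_x", "pathway_y"]}


def genes_in_region(chromosome, start_position, end_position):
    """Step 1: find annotated genes inside the QTL region."""
    return [
        gene for gene in GENES
        if gene["chromosome"] == str(chromosome)
        and start_position <= gene["start"] <= end_position
    ]


def to_kegg_gene_ids(genes):
    """Step 2: map Entrez and UniProt identifiers to KEGG gene identifiers."""
    kegg_ids = []
    for gene in genes:
        for identifier in (gene["entrez"], gene["uniprot"]):
            kegg_id = KEGG_GENE_IDS.get(identifier)
            if kegg_id and kegg_id not in kegg_ids:
                kegg_ids.append(kegg_id)
    return kegg_ids


def pathways_for(kegg_ids):
    """Step 3: look up the pathways for each KEGG gene identifier."""
    return {kegg_id: KEGG_PATHWAYS.get(kegg_id, []) for kegg_id in kegg_ids}


def run_workflow(chromosome, start_position, end_position):
    genes = genes_in_region(chromosome, start_position, end_position)
    return pathways_for(to_kegg_gene_ids(genes))


pathways = run_workflow(chromosome=17, start_position=28_500_000, end_position=32_500_000)
print(pathways)
```

Each function stands for one service call, so a workflow manager and a hand-written script differ mainly in who wires the steps together.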
The HUPO Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification.
The PSI was founded at the HUPO meeting in Washington, April 28-29, 2002.
MIAPE (The Minimum Information About a Proteomics Experiment) is a guidance document specifying the data and metadata that should be collected from proteomics experiments: where samples came from and how analyses were performed.
Data are accompanied by context: 'metadata' ('data about the data').
Integration of biological data of various types and development of adapted bioinformatics tools are critical objectives to enable research at the systems level. The European Network of Excellence ENFIN is developing an adapted infrastructure to connect databases and platforms, enabling both the generation of new bioinformatics tools and the experimental validation of computational predictions. Beyond the use of common standards to format individual datasets, there is a need for sophisticated informatics platforms that can mine data across various domains, sources, formats and types. The aim of the EnCORE project is to integrate, across different disciplines, an extensive list of database resources and analysis tools in a computationally accessible and extensible manner, facilitating automated data retrieval and processing with a special focus on systems biology. The EnCORE platform is available as a collection of web services with a common standard format that is easy to integrate into workflow management software such as Taverna. EnCORE services are also accessible through EnVISION, a web graphical user interface providing detailed information such as molecular interactions, biological pathways and computational models of pathways.