1. x-omics Data
Integration Challenges
Dr. Michael Lappe, Ph.D.
Senior Bioinformatics Scientist -
Functional Genomics and Systems Biology
CLCbio, Denmark
Thursday, February 14, 13
3. No more
cargo-cult
http://en.wikipedia.org/wiki/Cargo_cult_science
http://en.wikipedia.org/wiki/Cargo_cult
4. Form follows function
http://www.youtube.com/watch?v=pQHX-SjgQvQ
Do not follow empty ancient rituals that do not serve a useful purpose anymore!
Do NOT confuse the container with its content. Database systems are NOT the DATA!
5. Data Integration
• involves combining data
• residing in different sources and
• providing users with a unified view [...]
(for example, combining research results from different bioinformatics repositories)
http://en.wikipedia.org/wiki/Data_integration
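As a rough illustration of "combining data residing in different sources" into a unified view, the following minimal sketch merges records from two sources that share an identifier. The source names, field names, values, and the `unified_view` helper are all hypothetical and only serve the example; real repositories rarely agree on identifiers this neatly.

```python
# Two hypothetical sources keyed by the same gene identifier.
# All records, field names and values are made up for illustration.
expression_source = {
    "ERBB2": {"expression_level": 8.2},
    "MLH1": {"expression_level": 3.1},
}
annotation_source = {
    "ERBB2": {"role": "oncogene"},
    "MLH1": {"role": "tumor suppressor"},
    "GENE_X": {"role": "unknown"},
}


def unified_view(*sources):
    """Merge per-identifier records from several sources into one dict."""
    merged = {}
    for source in sources:
        for identifier, fields in source.items():
            # Create the record on first sight, then add this source's fields.
            merged.setdefault(identifier, {}).update(fields)
    return merged


view = unified_view(expression_source, annotation_source)
```

Identifiers present in only one source (here `GENE_X`) still appear in the view, just with fewer fields, which makes gaps in coverage visible instead of hiding them.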
6. Different Levels of Resolution
• Ecosystem
• Population
• Organism
• Organ
• Tissue
• Cell
• Organelle
• Complexes
• Assemblies
• Molecule
• Atoms
www.sciencephoto.com
7. Different experimental sources
Kühner et al. “Proteome organization in a genome-reduced bacterium.”
Science (2009) vol. 326 (5957) pp. 1235
9. www.abcam.com/cancer
Henning Stehr*, Seon-Hi J. Jang*, Jose M. Duarte, Christoph Wierling, Hans Lehrach, Michael Lappe, Bodo M.H. Lange
(2011) "The structural impact of cancer-associated mutations in oncogenes and tumor suppressors" Molecular Cancer
10. www.abcam.com/cancer
What are the typical mechanisms at the structural level
that cause the de/activation of cancer genes?
Henning Stehr*, Seon-Hi J. Jang*, Jose M. Duarte, Christoph Wierling, Hans Lehrach, Michael Lappe, Bodo M.H. Lange
(2011) "The structural impact of cancer-associated mutations in oncogenes and tumor suppressors" Molecular Cancer
12. Structural Analysis
surface vs. core - binding site - stability - clustering ...
ERBB2 MLH1
13. A simple yet robust classification
IN-
14. • Oncogenes: activating gain-of-function mutations (surface, near functional/binding sites). Example: ERBB2
• Tumor-suppressor genes: de-activating loss-of-function mutations (in the core, destabilising the structure). Example: MLH1
15. Biological Networks - getting to grips with COMPLEXITY
Complex (biological) Systems as Networks of Interacting Elements.
Life is a graph! G=(V, E)
• A Graph records Nodes (Vertices) and Relationships (Edges)
• Nodes are organized by Relationships
• Nodes and Relationships have Properties
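The G=(V, E) picture with properties on both nodes and relationships can be sketched as a small data structure. This is a minimal illustration and not the API of any particular graph database; the class names, method names, and the example proteins are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A vertex carrying free-form properties."""
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed edge between two node ids, also carrying properties."""
    source: str
    target: str
    kind: str
    properties: dict = field(default_factory=dict)


@dataclass
class Graph:
    """G = (V, E): the graph records nodes and relationships."""
    nodes: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def relate(self, source, target, kind, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.relationships.append(Relationship(source, target, kind, properties))

    def neighbours(self, node_id):
        """Ids reachable from node_id by following one relationship."""
        return [r.target for r in self.relationships if r.source == node_id]


# Hypothetical example data.
g = Graph()
g.add_node("protein_a", type="protein")
g.add_node("protein_b", type="protein")
g.relate("protein_a", "protein_b", "INTERACTS_WITH", evidence="hypothetical")
```

Keeping properties on the relationship itself (here `evidence`) is what distinguishes this from a bare adjacency list: the provenance of an interaction travels with the edge.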
16. The human disease network.
Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL.
Proc Natl Acad Sci U S A.
2007 May 22;104(21):8685-90.
17. Graph Databases
Think of Graphs not as a visualization
but as a DATA STRUCTURE
http://en.wikipedia.org/wiki/Graph_database
http://nosql-database.org/
http://www.neo4j.org/learn#graphs
18. Proteins as Residue Interaction Graphs - capturing dynamics
1a1m (Cα, 8 Å) - Anisotropic Network Model, eigen-mode 3
1a1m (X-ray)
1jnj (NMR, 20 models)
"oGNM: A protein dynamics online calculation engine using the Gaussian Network Model" Yang, L.-W.,
Rader, A.J., Liu, X., Jursa, C.J., Chen S.C., Karimi, H, Bahar, I. Nucleic Acids Res, 34, W24-31, 2006
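A residue interaction graph with a Cα distance cutoff can be sketched in a few lines: two residues are connected when their Cα atoms lie within the cutoff. The sketch below assumes the 8 Å value on the slide is that cutoff, uses made-up coordinates instead of a parsed PDB file, and adds the Kirchhoff (connectivity) matrix that Gaussian Network Model calculations start from. The function names are hypothetical.

```python
import math
from itertools import combinations


def residue_interaction_graph(ca_coords, cutoff=8.0):
    """Return the set of residue index pairs (i, j), i < j, whose C-alpha
    atoms lie within `cutoff` angstroms of each other."""
    edges = set()
    for (i, a), (j, b) in combinations(enumerate(ca_coords), 2):
        if math.dist(a, b) <= cutoff:
            edges.add((i, j))
    return edges


def kirchhoff(n_residues, edges):
    """Connectivity matrix: -1 for each contact, contact count on the diagonal."""
    matrix = [[0] * n_residues for _ in range(n_residues)]
    for i, j in edges:
        matrix[i][j] = matrix[j][i] = -1
        matrix[i][i] += 1
        matrix[j][j] += 1
    return matrix


# Made-up coordinates (in angstroms), not taken from 1a1m or any PDB entry.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (20.0, 0.0, 0.0),
]
edges = residue_interaction_graph(toy_coords)
gamma = kirchhoff(len(toy_coords), edges)
```

The pairwise loop is quadratic in the number of residues, which is fine for a single chain; a spatial grid or k-d tree would be the usual choice for larger assemblies.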
19. Geometry & Structure
PDB: 1KX5 http://vimeo.com/24047115
S.Daujat, T. Weiss, F.Mohn, U.C.Lange, C.Ziegler-Birling, U.Zeissler, M.Lappe, D.Schubeler, M.E.Torres-Padilla, R.Schneider (2009). "H3K64 trimethylation
marks heterochromatin and is dynamically remodeled during developmental reprogramming" Nature Structural and Molecular Biology
20. x-omics = Proteomics, Metabolomics, Regulation [...] + x-Seq Data
x-Seq = ChIP, RNA, BS ...
22. some challenges ...
different experiments, protocols, samples, coverage ...
isolated information silos
different data formats
mapping & identifier chaos
error propagation / annotation bottleneck
statistical criteria for (dis-)similarity
knowledge lock-up, literature access
redundancy / implicit co-ordination
TMI & essential info ?
...
23. "Blind monks examining an elephant" by Itcho Hanabusa
Title: 「衆瞽探象之圖」. A work by Hanabusa Itchō (英一蝶, 1652-1724).
25. http://5stardata.info/
5★ Open Data
Tim Berners-Lee, the inventor of the Web and Linked Data initiator,
suggested a 5 star deployment scheme for Open Data.
26. http://5stardata.info/
5★ Open Data
★ make your stuff available on the Web (whatever format) under an Open License
27. http://5stardata.info/
5★ Open Data
★ ★ make it available as structured data (machine-readable, e.g. Excel*)
* http://dontuseexcel.wordpress.com/2013/02/07/dont-use-excel-for-biological-data/
28. http://5stardata.info/
5★ Open Data
★ ★ ★ use non-proprietary Open Formats (e.g. CSV instead of Excel)
29. http://5stardata.info/
5★ Open Data
★ ★ ★ ★ use URIs to denote things, so that people can point at your stuff
30. http://5stardata.info/
5★ Open Data
★ ★ ★ ★ ★ link your data to other data to provide (networked) context
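The step from three to five stars can be sketched with the standard library alone: write the records as CSV (an open format), give each record a URI, and state links to other data as subject-predicate-object triples. The base URI, the records, and the linked targets below are placeholders invented for this example; only the `owl:sameAs` predicate URI is a real, widely used term.

```python
import csv
import io

# Hypothetical base URI and records; neither is taken from the slides.
BASE = "http://example.org/protein/"
records = [
    {"id": "P1", "name": "protein one", "same_as": "http://example.com/entry/1"},
    {"id": "P2", "name": "protein two", "same_as": "http://example.com/entry/2"},
]

# Three stars: a non-proprietary, machine-readable format (CSV).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "name", "same_as"])
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Four and five stars: name each thing with a URI and link it to other data,
# expressed here as plain (subject, predicate, object) tuples.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
triples = [(BASE + r["id"], SAME_AS, r["same_as"]) for r in records]
```

In practice the triples would be serialized in an RDF format by a dedicated library; plain tuples are used here only to keep the sketch dependency-free.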
32. Giant Global Graph
An important related concept that overlaps with the GGG is the
"Semantic Web", which relates to decentralized information. (≄ Web 3.0)
34. The next Web of open, linked data:
Tim Berners-Lee on TED.com
http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
http://www.ted.com/talks/tim_berners_lee_the_year_open_data_went_worldwide.html
35. Web of biological Data
linked open scientific data grass-roots movement
36. Protein Interaction Networks: scale-free, small-world
Park, J., M. Lappe, et al. (2001). "Mapping protein family interactions: intramolecular and intermolecular
protein family interaction repertoires in the PDB and yeast." Journal of Molecular Biology 307(3): 929-38
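Both labels refer to measurable properties: scale-free networks have a heavy-tailed degree distribution (a few hubs, many low-degree nodes), and small-world networks combine high local clustering with short paths. A minimal sketch of the two simplest measurements, degree counts and the local clustering coefficient, on a made-up toy network (not data from the cited paper):

```python
from collections import Counter
from itertools import combinations

# A made-up undirected interaction network: one hub plus one triangle.
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]

adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

# Degree distribution: how many nodes have each degree.
degree_counts = Counter(len(neighbours) for neighbours in adjacency.values())


def clustering(node):
    """Fraction of a node's neighbour pairs that are themselves connected."""
    neighbours = adjacency[node]
    if len(neighbours) < 2:
        return 0.0
    pairs = list(combinations(neighbours, 2))
    linked = sum(1 for x, y in pairs if y in adjacency[x])
    return linked / len(pairs)
```

A toy network this small cannot show a power law; on a real interaction network one would plot `degree_counts` on log-log axes and compare the average clustering against a random graph of the same size.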
38. modelling information gain:
Tandem-Affinity
Purifications in-silico
Michael Lappe and Liisa Holm
"Unraveling protein interaction networks
with near-optimal efficiency." (2004)
Nature Biotechnology 22(1): 98-103
39. Toward interoperable bioscience data
Susanna-Assunta Sansone et al., Nature Genetics, Feb 2012
“to make full use of research data, the
bioscience community needs to adopt
technologies and reward mechanisms that
support interoperability and promote the
growth of an open ‘data commoning’ culture.”
The open source ISA metadata tracking tools
facilitate standards-compliant collection,
curation, local management and reuse of
datasets in an increasingly diverse set of life
science domains.
http://www.isa-tools.org/
http://www.nature.com/ng/journal/v44/n2/pdf/ng.1054.pdf
40. Free your data ...
Biology and BioInformatics are data-driven sciences
think beyond your own hard drive and the current paper
evaluate and embrace new technologies (LOD, GraphDBs)
rethink current incentive systems: no more cargo-cult
make it useful, reusable
and sustainable
Open Access, Open Source
Open Linked Data Mash-Ups
focus on your science
41. Thank you!
wood engraving by an unknown artist, in “L'atmosphère:
météorologie populaire” (1888) Camille Flammarion
Hubble Space Telescope / NASA