Bioinformatics in the Era of Open Science and Big Data
1. Bioinformatics in the Era of
Open Science and Big Data
Philip E. Bourne
University of California San Diego
pbourne@ucsd.edu
1/28/14
SIB Biel/Bienne
2. My Bias
• RCSB PDB/IEDB Database Developer – Views on
community, quality, sustainability …
• PLOS Journal Co-founder – Open Science Advocate
• Associate Vice Chancellor for Innovation – Business
models, interaction with the private
sector, sustainability
• Professor – Mentoring, reward system, value (or not) of
research
• Associate Director of NIH for Data Science - ??
3. The History of Bioinformatics
According to Bourne
Searls (ed) The Roots in Bioinformatics Series PLOS Comp Biol
Timeline: 1980s; 1990s; 2000s; 2010s; 2020
Discipline: Unknown; Expt. Driven; Emergent; Over-sold; A Service; A Partner; A Driver
The Raw Material: Non-existent; Limited/Poor; More/Ontologies; Big Data/Siloed; Open/Integrated
The People: No name; Technicians; Industry recognition data scientists; Academics
4. We Need to Start By Asking How Are
We Using the Data Now!
Only Then Can We Make Rational
Decisions About Data – Large or Small
5. Web Logs etc. Are
Not Enough
[Figure: Structure Summary page activity for H1N1 influenza-related structures, Jan. 2008 to Jul. 2010]
3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir
1RUZ: 1918 H1 Hemagglutinin
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
[Andreas Prlic]
6. We Need to Learn from Industries
Whose Livelihood Addresses the
Question of Use
7. Next Consider What We Do Every Day
We take actions on digital data
increasingly across boundaries
8. Actions on Data Imply:
• Ensuring data quality and hence trust
• Making data sustainable
• Making data open and accessible
• Making data findable
• Providing suitable metadata and annotation
• Making data queryable
• Making data analyzable
• Presenting data so as to maximize its value
• Rewarding good data practices
10. Boundaries on Data Imply:
• Working across biological scales
• Working across biomedical disciplines
• Working across basic and clinical research and
practice
• Working across institutional boundaries
• Working across public and private sectors
• Working across national and international
borders
• Working across funding agencies
12. These Issues Have Been Around
Almost As Long As Bioinformatics
The Good News is That “Big Data” Has
Brought More Attention to the Problem
13. What Are Big Data?
• Large datasets from high throughput
experiments
• Large numbers of small datasets
• Data which are “ill-formed”
• The why (causality) is replaced by the what
• A signal that a fundamental change is taking
place – a tipping point?
14. That Change is Embodied in:
The Digital Enterprise
• Consists of digital assets
• E.g. datasets, papers, software, lab notes
• Each asset is uniquely identified and has
provenance, including access control
• E.g. publishing simply involves changing the
access control
• Digital assets are interoperable across the
enterprise
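The asset model in the bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `DigitalAsset`, the `Access` levels, and the `record`/`publish` methods are hypothetical and do not describe any existing NIH or institutional system. The sketch shows an asset with a unique identifier, a provenance trail, and an access-control setting, where "publishing" is nothing more than a change of access control.

```python
from dataclasses import dataclass, field
from enum import Enum
from uuid import uuid4


class Access(Enum):
    """Hypothetical access-control levels for an asset."""
    PRIVATE = "private"
    LAB = "lab"
    PUBLIC = "public"


@dataclass
class DigitalAsset:
    """Hypothetical digital asset: dataset, paper, software, lab notes, ..."""
    kind: str
    title: str
    access: Access = Access.PRIVATE
    # Unique identifier assigned when the asset is created.
    asset_id: str = field(default_factory=lambda: uuid4().hex)
    # Provenance kept as an append-only list of (actor, action) pairs.
    provenance: list = field(default_factory=list)

    def record(self, actor: str, action: str) -> None:
        """Append one provenance entry."""
        self.provenance.append((actor, action))

    def publish(self, actor: str) -> None:
        """Publishing is just a change of access control, with provenance."""
        self.access = Access.PUBLIC
        self.record(actor, "access changed to public")


notes = DigitalAsset(kind="lab notes", title="Rotation experiments")
notes.record("jane", "created")
notes.publish("jane")
print(notes.access.value, notes.provenance)
```

Keeping provenance as an append-only list is one simple design choice; a real enterprise would more likely use a shared, queryable store so that assets remain interoperable across groups.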
15. The Enterprise Is Almost Anything ...
Your Lab, your Institution, the SIB,
the NIH ...
16. Consider an Academic Institution As A
Digital Enterprise
• Jane scores extremely well in parts of her graduate on-line neurology class. Neurology professors,
whose research profiles are on-line and well described, are automatically notified of Jane’s
potential based on a computer analysis of her scores against the background interests of the
neuroscience professors. Consequently, professor Smith interviews Jane and offers her a research
rotation. During the rotation she enters details of her experiments related to understanding a
widespread neurodegenerative disease in an on-line laboratory notebook kept in a shared on-line
research space – an institutional resource where stakeholders provide metadata, including access
rights and provenance beyond that available in a commercial offering. According to Jane’s
preferences, the underlying computer system may automatically bring to Jane’s attention Jack, a
graduate student in the chemistry department whose notebook reveals he is working on using
bacteria for purposes of toxic waste cleanup. Why the connection? They reference the same gene a
number of times in their notes, which is of interest to two very different disciplines – neurology and
environmental sciences. In the analog academic health center they would never have discovered
each other, but thanks to the Digital Enterprise, pooled knowledge can lead to a distinct advantage.
The collaboration results in the discovery of a homologous human gene product as a putative target
in treating the neurodegenerative disorder. A new chemical entity is developed and patented.
Accordingly, by automatically matching details of the innovation with biotech companies worldwide
that might have potential interest, a licensee is found. The licensee hires Jack to continue working
on the project. Jane joins Joe’s laboratory, and he hires another student using the revenue from the
license. The research continues and leads to a federal grant award. The students are employed,
further research is supported and in time societal benefit arises from the technology.
From What Big Data Means to Me JAMIA 2014
17. The NIH is Starting to Think About the
Digital Enterprise, Witness…
bd2k.nih.gov
18. What Defines the Digital Enterprise
• Trans-NIH collaboration – change culture
• Long-term NIH strategic planning
• The BD2K Initiative
• A “hub” of data science activities
• International cooperation
• Interagency cooperation
• Data sharing policies
19. Consider One NIH Scenario
• NIH-Drive
– Investigator A from the NCI makes frequent
reference to the over expression of genes x and y.
– Investigator B from the NHLBI makes frequent
reference to the under expression of genes x and y
– Automatic notification of a potential common
interest before publication or database deposition
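A minimal sketch of how such a notification might be computed, assuming gene mentions have already been extracted from each investigator's working documents. The record layout, the function name `common_interests`, and the `min_shared` threshold are all hypothetical; nothing here describes how NIH-Drive is actually implemented.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical extracted mentions: (investigator, institute, gene, direction)
mentions = [
    ("Investigator A", "NCI", "x", "over"),
    ("Investigator A", "NCI", "y", "over"),
    ("Investigator B", "NHLBI", "x", "under"),
    ("Investigator B", "NHLBI", "y", "under"),
    ("Investigator C", "NCI", "z", "over"),
]


def common_interests(mentions, min_shared=2):
    """Pair up investigators who reference at least `min_shared` common genes."""
    genes_by_person = defaultdict(dict)
    for person, _institute, gene, direction in mentions:
        genes_by_person[person][gene] = direction

    matches = []
    for a, b in combinations(sorted(genes_by_person), 2):
        shared = sorted(set(genes_by_person[a]) & set(genes_by_person[b]))
        if len(shared) >= min_shared:
            # Report each shared gene with both investigators' directions.
            detail = [(g, genes_by_person[a][g], genes_by_person[b][g]) for g in shared]
            matches.append((a, b, detail))
    return matches


notifications = common_interests(mentions)
for a, b, detail in notifications:
    print(f"Notify {a} and {b}: shared genes {detail}")
```

The threshold stands in for "frequent reference"; a real system would also need access-control checks before revealing unpublished work to another investigator.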
20. The NIH Process
An external advisory group provided a
valuable blueprint for what should be
done
http://acd.od.nih.gov/Data%20and%20Informatics%20Working%20Group%20Report.pdf
21. Blueprint Recommendations
• Promote central and federated catalogs
– Establish minimal metadata framework
– Tools to facilitate data sharing
– Elaborate on existing data sharing policies
• Support methods and applications
– Fund all phases of software development
– Leverage lessons from National Centers
• Training
– More funding
– Enhance review of training apps
– Quantitative component to all awards
• On campus IT strategic plan
– Catalog of existing tools
– Informatics laboratory
– Ditto big data
• Sustainable funding commitment
acd.od.nih.gov/diwg.htm
22. General Features of NIH Data Science
• Lightweight metadata standards
• Data & software registries
• Expanded policies on data sharing, open
source software
• Training programs & reward systems
• Institutional incentives
• Private sector incentives
• Data centers serving community needs
23. What is Under Way?
• Now:
– Data centers (under review)
– Data science training grants (call Q1 14)
– Pilot data catalog consortium (call out)
– Genomic Research Data Alliance (being finalized)
– Piloting “NIH-Drive”
• What Is Planned:
– Extended public-private programs specifically for data science
activities
– Interagency activities
– International exchange programs
– Cold Spring Harbor-like training facilities – bi-coastal?
– Programs for better data descriptions
– Reward institutions/communities
– Policies to get clinical trial data into the public domain
24. The History of Bioinformatics
According to PEB
The Roots in Bioinformatics Series PLOS Comp Biol
Timeline: 1980s; 1990s; 2000s; 2010s; 2020
Discipline: Unknown; Expt. Driven; Emergent; Over-sold; A Service; A Partner; A Driver
The Raw Material: Non-existent; Limited/Poor; More/Ontologies; Big Data/Siloed; Open/Integrated
The People: No name; Technicians; Industry recognition data scientists; Academics
25. Why Will Science Become More Open?
• The public (and hence the politicians) demand
it
• It's the right thing to do
• It's part of the modern psyche
• The scholarly enterprise is broken and more
stakeholders are acknowledging it
26. Personal Evidence
• I have a paper with 16,000 citations that no
one has ever read
• I have papers in PLOS ONE that have more
citations than ones in PNAS
• I have data sets I am proud of but no place to
put them
• I “can't” reproduce work from my own lab
27. Politicians Demand It:
G8 open data charter
http://opensource.com/government/13/7/open-data-charter-g8
28. What Are Some of the Ramifications of
Open Science?
29. Open Science Has The Potential to
Deinstitutionalize
Daniel Hulshizer/Associated Press
31. An Example of That Potential:
The Story of Meredith
http://fora.tv/2012/04/20/Congress_Unplugged_Phil_Bourne
34. There Still Needs to be a Reward System
The Wikipedia Experiment – Topic Pages
• Identify areas of Wikipedia that relate to the journal and are missing or are stubs
• Develop a Wikipedia page in the sandbox
• Have a Topic Page Editor review the page
• Publish the copy of record with associated rewards
• Release the living version into Wikipedia
35. One Possible End Product of Open
Science
Figure callouts:
0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB, which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
User workflow:
1. User clicks on thumbnail
2. Metadata and a webservices call provide a renderable image that can be annotated
3. Selecting a feature provides a database/literature mashup
4. That leads to new papers
PLoS Comp. Biol. 2005 1(3) e34
36. Change in the Way we Support the
Research Lifecycle
[Diagram]
Lifecycle: IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Tool labels: Authoring Tools; Data Capture; Lab Notebooks; Software Repositories; Analysis Tools; Scholarly Communication; Visualization
Other labels: Commercial & Public Tools; Discipline-Based Metadata Standards; Community Portals; Git-like Resources By Discipline; Data Journals; New Reward Systems; Training; Institutional Repositories; Commercial Repositories
38. automate: workflows, pipeline &
service integrative frameworks
CS / SE
• pool, share & collaborate web systems
• scientific software engineering
• semantics & ontologies
• machine readable documentation
• nanopub
[Carole Goble]
39. Why is This Important to Me
Personally?
• My wife is being treated for stage 1 breast
cancer
• This highlights for me the disparity
between what is happening in the lab and
what is happening in the clinic
– In the lab cancer is a personalized and
treatable condition
– In the clinic we are still equally “poisoning”
patients with drugs first introduced 10-20
years ago
42. Most Laboratories
• We are the long tail
• Goodbye to the student is goodbye to the data
• Very few of us have complied (or will comply) with the data management plans we write into grants
• Too much software is unusable
S. Veretnik, J.L. Fink, and P.E. Bourne 2008 Computational Biology Resources Lack
Persistence and Usability. PLoS Comp. Biol. 4(7): e1000136
43. Today’s Research Lifecycle is Digitally
Fragmented at Best
• Proof:
– I can't immediately reproduce the research in
my own laboratory
• It took an estimated 280 hours for an average user
to approximately reproduce the paper
– Workflows are maturing and becoming helpful
– Data and software versions and accessibility
prevent exact reproducibility
Daniel Garijo et al. 2013 Quantifying Reproducibility in Computational Biology:
The Case of the Tuberculosis Drugome. PLOS ONE 8(11): e80278
44. We Have Some Really Big Problems to
Solve – The Commons Can Help
45. What Really Happens When You Take a
Drug?
• Can we predict drug efficacy and toxicity?
• Can we reuse old drugs?
• Can we design personalized medicines?
46. One Drug, One Gene, One Disease
Bernard M. Nat Rev Drug Disc 8(2009), 959-968
47. Polypharmacology
• Tykerb – Breast cancer
• Gleevec – Leukemia, GI cancers
• Nexavar – Kidney and liver cancer
• Staurosporine – natural product – alkaloid – many uses, e.g., antifungal, antihypertensive
Collins and Workman 2006 Nature Chemical Biology 2 689-700
48. Polypharmacology is Not Rare but Common
• Single gene knockouts only affect phenotype in 10-20% of cases
A.L. Hopkins Nat. Chem. Biol. 2008 4:682-690
• 35% of biologically active compounds bind to two or more targets that do not have
similar sequences or global shapes
Paolini et al. Nat. Biotechnol. 2006 24:805–815
• Predict side effects
• Repurpose drugs
Keiser et al. Nature 462 (2009) 175-81
49. Drug Binding is Dynamic
• Drug effect depends not only on how strongly the
drug binds (binding affinity) but also on how long the
drug is “stuck” in the protein (residence time).
• Molecular Dynamics (MD) simulation is powerful but
computationally intensive.
• ~ns of dynamics: 1 day of simulation
• ~ms to hours of dynamics: >10^6 days of simulation
D. Huang et al. (2011), PLoS Comp Biol 7(2):e1002002
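A short worked sketch of the two quantities on this slide, using the standard kinetic definitions (dissociation constant K_d = k_off / k_on, residence time = 1 / k_off). The rate constants for the two drugs are invented for illustration and are not taken from the cited paper. The simulation estimate assumes roughly 1 ns of dynamics per day of computing, which is how the slide's timescale labels appear to read.

```python
def dissociation_constant(k_on, k_off):
    """K_d in molar, from k_on (1/(M*s)) and k_off (1/s). Lower = tighter binding."""
    return k_off / k_on


def residence_time(k_off):
    """Mean lifetime of the drug-target complex in seconds."""
    return 1.0 / k_off


def days_to_simulate(target_seconds, ns_per_day=1.0):
    """Computing days needed to reach `target_seconds` of dynamics.

    ns_per_day is an assumed simulation throughput, not a measured value.
    """
    return target_seconds / (ns_per_day * 1e-9)


# Two hypothetical drugs with the same affinity but different kinetics.
drug_fast = {"k_on": 1e7, "k_off": 1e-1}
drug_slow = {"k_on": 1e5, "k_off": 1e-3}

for name, drug in (("fast", drug_fast), ("slow", drug_slow)):
    kd = dissociation_constant(drug["k_on"], drug["k_off"])
    tau = residence_time(drug["k_off"])
    print(f"{name}: K_d = {kd:.1e} M, residence time = {tau:.0f} s")

# Reaching millisecond dynamics at about 1 ns per day of computing.
print(f"days for 1 ms: {days_to_simulate(1e-3):.1e}")
```

Both hypothetical drugs have the same K_d, yet the slow one stays bound about 100 times longer, which is why affinity alone does not capture drug effect. The last line reproduces the slide's order-of-magnitude estimate of about 10^6 computing days for millisecond events.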
51. Multiscale Modeling of Drug
Actions
[Diagram, slide from Lei Xie]
• Traditional Approach
• Systems-based Approach
• Understanding of dynamics and kinetics of protein-ligand interactions
• Prediction of molecular interaction network on a genome scale
• Reconstruction, analysis and simulation of biological networks
• Knowledge representation and discovery & model integration
• physiological process
52. More Generally Any Translational-based Research That Involves
Modeling at Multiple Scales
http://sagebase.org/
53. The History of Bioinformatics
According to Bourne
The Roots in Bioinformatics Series PLOS Comp Biol
Timeline: 1980s; 1990s; 2000s; 2010s; 2020
Discipline: Unknown; Expt. Driven; Emergent; Over-sold; A Service; A Partner; A Driver
The Raw Material: Non-existent; Limited/Poor; More/Ontologies; Big Data/Siloed; Open/Integrated
The People: No name; Technicians; Industry recognition data scientists; Academics
54. In Summary:
By the End of the Decade Biomedical
Research will Be a Truly Digital
Enterprise and Computational
Scientists Will Be At the Forefront
You Have Much to Look Forward To