1. Human Genome and Big Data Challenges
Philip E. Bourne Ph.D.
Associate Director for Data Science
National Institutes of Health
http://www.slideshare.net/pebourne
2. The History of Analytics in Biomedical
Research
1980s | 1990s | 2000s | 2010s | 2020
Discipline:
Unknown | Expt. Driven | Emergent | Over-sold | A Service | A Partner | A Driver
The Raw Material:
Non-existent | Limited/Poor | More/Ontologies | Big Data/Siloed | Open/Integrated
The People:
No name | Technicians | Industry recognition | Data scientists | Academics
Searls (ed) The Roots in Bioinformatics Series PLOS Comp Biol
3. Data Science Timeline
6/12
• U54 Centers of Excellence - under review
• U54 BD2K-LINCS - under review
• U24 Data Discovery Index - under review
• R01, R41, R42, R43, R44, U01 software and analysis methods grants - ongoing
• T32, T15, K01, R25 and R26 training awards - under review
2/14 3/14
4. “It was the best of times, it was the
worst of times, it was the age of
wisdom, it was the age of foolishness,
it was the epoch of belief, it was the
epoch of incredulity, it was the season
of Light, it was the season of
Darkness, it was the spring of hope, it
was the winter of despair …”
5. A Tale of Two Numbers
Source: Michael Bell, http://homepages.cs.ncl.ac.uk/m.j.bell1/blog/?p=830
7. Growth May Just be Beginning
Evidence:
– Google car
– 3D printers
– Waze
– Robotics
From: The Second Machine Age: Work, Progress,
and Prosperity in a Time of Brilliant Technologies
by Erik Brynjolfsson & Andrew McAfee
8. There are other drivers of change out
there besides economics and an
increasing emphasis on data and
analytics
9. Politicians Demand It:
G8 open data charter
http://opensource.com/government/13/7/open-data-charter-g8
15. Scholarship is broken
I have a paper with 16,000 citations that no one has
ever read
I have papers in PLOS ONE that have more citations
than ones in PNAS
I have data sets I am proud of but few places to put
them
I edited a journal but it did not count for much
16. It was the age when software
developers are in the greatest demand
for science.
It was the age when the rewards
outside academia are greater than the
rewards inside
17. It was a time when patient data are
becoming more available
It is a time when maintaining
the anonymity of a patient gets harder
and harder
18. To Summarize Thus Far …
A time of great (unprecedented?)
scientific development but limited
funding
A time of upheaval in the way we do
science
19. From a funder's perspective…
A time to squeeze every cent/penny to
maximize the amount of research that
can be done
A time when top down approaches
meet bottom up approaches
20. Top Down vs Bottom Up
Top Down
– Regulations, e.g. US:
Common Rule, FISMA,
HIPAA
– Data sharing policies
• OSTP
• GWAS
• Genome data
• Clinical trials
– Digital enablement
– Moves towards
reproducibility
Bottom Up
– Communities emerge
and crowd source
• Collaboration
• Data shared
• Open source
software
• Common principles
• Standards
22. To start with we are thinking about the
complete research lifecycle
23. The Research Life Cycle
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
24. Tools and Resources Will Continue To
Be Developed
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Authoring Tools
Lab Notebooks
Data Capture
Software
Analysis Tools
Visualization
Scholarly Communication
26. Those Elements of the Research Life Cycle
Need to Become More Interconnected Around a
Common Framework
IDEAS – HYPOTHESES – EXPERIMENTS – DATA – ANALYSIS – COMPREHENSION – DISSEMINATION
Authoring Tools
Lab Notebooks
Data Capture
Software
Analysis Tools
Visualization
Scholarly Communication
Commercial & Public Tools
Git-like Resources By Discipline
Data Journals
Discipline-Based Metadata Standards
Community Portals
Institutional Repositories
New Reward Systems
Commercial Repositories
Training
28. Associate Director for Data Science
The Biomedical Research Digital Enterprise
Programmatic Theme: Sustainability* | Education* | Innovation* | Process
Deliverable: Commons | Training Center | BD2K | Modified Review
Example Features:
• Commons: Cloud (Data & Compute), Search, Security, Reproducibility Standards, App Store
• Training Center: Coordinate, Hands-on, Syllabus, MOOCs, Community
• BD2K: Centers, Training Grants, Catalogs, Standards, Analysis
• Modified Review: Data Resource Support, Metrics, Best Practices, Evaluation, Portfolio Analysis
Communication / Collaboration:
• ICs
• Researchers
• Federal Agencies
• International Partners
• Computer Scientists
Scientific Data Council | External Advisory Board
* Hires made
29. BD2K: Four Programmatic Areas
I. Facilitating Broad Use of Biomedical Big Data
II. Developing and Disseminating Analysis Methods and Software for Biomedical Big Data
III. Enhancing Training for Biomedical Big Data
IV. Establishing Centers of Excellence for Biomedical Big Data
30. What are we proposing as that
common framework?
31. The Commons
Is …
A public/private partnership
An agile development starting with the evaluation of a
few pilots
An example: porting dbGaP to the cloud
An experiment with new funding strategies
32. What The Commons Is and Is Not
Is Not:
– A database
– Confined to one physical
location
– A new large
infrastructure
– Owned by any one group
Is:
– A conceptual framework
– Analogous to the Internet
– A collaboratory
– A few shared rules
• All research objects
have unique
identifiers
• All research objects
have limited
provenance
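The two shared rules above can be illustrated with a minimal sketch, assuming a research object can be modeled as a small record. The class name `ResearchObject`, its fields, and the use of UUIDs as identifiers are illustrative assumptions, not part of the Commons proposal.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ResearchObject:
    """Hypothetical research object: a unique identifier plus limited provenance."""
    title: str
    kind: str  # e.g. "dataset", "software", "paper" (illustrative categories)
    # Assumption: a random UUID stands in for whatever identifier scheme is adopted.
    identifier: str = field(default_factory=lambda: str(uuid.uuid4()))
    provenance: list = field(default_factory=list)

    def record(self, agent: str, action: str) -> None:
        """Append one provenance entry: who did what, and when (UTC)."""
        self.provenance.append({
            "agent": agent,
            "action": action,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })


# Hypothetical usage: one data set touched by two groups.
dataset = ResearchObject(title="Example expression data", kind="dataset")
dataset.record(agent="lab-a", action="created")
dataset.record(agent="lab-b", action="reanalyzed")
```

Keeping provenance as a short append-only list is one reading of "limited provenance": enough to say who touched an object and when, without a full workflow history.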
33. Sustainability and Sharing: The Commons
Data: The Long Tail; Core Facilities/HS Centers; Clinical/Patient
The Why: Data Sharing Plans
The How: Data Discovery Index; Sustainable Storage; Quality; Scientific Discovery; Usability; Security/Privacy
The Commons: Government; NIH Awardees; Private Sector; Rest of Academia
Metrics/Standards; Software Standards Index; BD2K Centers; Cloud, Research Objects,
Commons == Extramural NCBI == Research Object Sandbox == Collaborative Environment
The End Game: Knowledge
34. What Does the Commons Enable?
Dropbox-like storage
The opportunity to apply quality metrics
Bring compute to the data
A place to collaborate
A place to discover
http://100plus.com/wp-content/uploads/Data-Commons-3-1024x825.png
35. One Possible Commons Business Model
[Adapted from George Komatsoulis]
HPC, Institution …
36. Commons Pilots
Define a set of use cases emphasizing:
– Openness of the system
– Support for basic statistical analysis
– Embedding of existing applications
– API support into existing resources
Evaluate against the use cases
Review results & business model with NIH leadership
Design a pilot phase with various groups
Conduct pilot for 6-12 months
Evaluate outcomes and determine whether a wider
deployment makes sense
Report to NIH leadership summer 2015
37. One Possible End Product
1. User clicks on thumbnail
2. Metadata and a webservices call provide a renderable image that can be annotated
3. Selecting a feature provides a database/literature mashup
4. That leads to new papers

0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
PLoS Comp. Biol. 2005 1(3) e34
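The paper-to-database workflow on this slide (steps 0 to 4) can be sketched roughly as follows. This is a toy illustration under stated assumptions: plain dictionaries stand in for the full-text store and the structure database, the identifier `XXXX` is a placeholder rather than a real PDB entry, and the names `Figure`, `papers`, `structures`, and `composite_view` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Figure:
    figure_id: str
    caption: str
    structure_ids: list  # identifiers of database entries the figure depicts


# Step 0: full text stored in a database (a plain dict stands in for it here).
papers = {
    "paper-1": {
        "paragraphs": [
            "Background on the protein family.",
            "The active site of structure XXXX binds the ligand.",
        ],
        "figures": [Figure("fig-1", "Active site", ["XXXX"])],
    }
}

# Stand-in for a structure database; a real system would call a web service.
structures = {"XXXX": {"name": "placeholder structure"}}


def composite_view(paper_id: str, figure_id: str) -> dict:
    """Steps 1-4: resolve a figure to database entries and related text blocks."""
    paper = papers[paper_id]
    # Steps 1-2: find the selected figure and fetch the entries it refers to.
    figure = next(f for f in paper["figures"] if f.figure_id == figure_id)
    entries = {sid: structures[sid] for sid in figure.structure_ids}
    # Steps 3-4: link back to the paragraphs that mention those entries.
    related_text = [
        p for p in paper["paragraphs"]
        if any(sid in p for sid in figure.structure_ids)
    ]
    return {"figure": figure.caption, "entries": entries, "text": related_text}


view = composite_view("paper-1", "fig-1")
```

The point of the sketch is only that shared identifiers are what let journal content and database content be joined into one view.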
38. Mission Statement
To foster an ecosystem that enables
biomedical research to be conducted
as a digital enterprise that enhances
health, lengthens life and reduces
illness and disability
39. Some Acknowledgements
Eric Green & Mark Guyer (NHGRI)
Jennie Larkin (NHLBI)
Leigh Finnegan (NHGRI)
Vivien Bonazzi (NHGRI)
Michelle Dunn (NCI)
Mike Huerta (NLM)
David Lipman (NLM)
Jim Ostell (NLM)
Andrea Norris (CIT)
Peter Lyster (NIGMS)
All the over 100 folks on the BD2K team
Editor's Notes
1 hr
Within biomedical research, many data types
Victims of our own success
Data production outstrips data handling and analysis
Major long-term changes are needed
Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8): e124. doi:10.1371/journal.pmed.0020124
http://www.reuters.com/article/2012/03/28/us-science-cancer-idUSBRE82R12P20120328
Federal Information Security Management Act of 2002
The Health Insurance Portability and Accountability Act of 1996