1. Big Data in Practice
Creation of an Agile and Scalable Data Science Platform
to Increase Information Find-ability and Accessibility in
Research and Development
John Koch
Merck & Co.
3. 3
R&D decisions rely on high-quality information to steer programs and the pipeline
R&D Information Landscape: >27,000 entities and 70,000 relationships defined
Knowledge Assets: “Target validation plan”
Business Groups: “Early Development team”
People: “John Smith”
Information Types: “Clinical Trial Name”
Organization Units: “Analytical Chemistry”
Sources: “Electronic Lab Notebook”
Business Processes: “Integrative assessment of liver toxicity”
Activities: “Refine model”
Roles: “Statistician”
Decisions / Gateways: “Determine Patient Stratification Biomarkers”
Capabilities: “Biomarker Validation”
Feedback: Surveys, VoC
The volume and sophistication of internal information, and of information available through external sources, continue to grow at a rapid and accelerating rate
The ability to readily find, access, and use information is absolutely critical
6. 6
Improving R&D Decision Making
As knowledge workers understand and embrace improved information management practices, better decision making can be enabled by better access to information
Better information management enables better decision making: better analysis, more transparency and collaboration, better workflow management, faster decisions
Timeline: Today; Next 2-3 Years; Beyond
Progression: Culture of Single Use; “Find & Access”; Organization-Wide Information Re-Use
Roadmap elements: IM Challenges Characterized; Fragmented tools, processes; Systematic categorization of data; Information Flows Modeled; Vocabulary Management; Embedded Stewardship; Effective Search; Integrated Information Architecture; Decision Making Quality
Axes: Information Management Maturity; Decision Quality; Adoption, Maturity
7. 7
Engaging the business: Focus Area Key Questions
(1) What are the critical business processes? What major decisions are associated with those processes?
(2) What information is required to make those decisions? Who needs that information? How do they use that information to make those decisions?
(3) How is that information created? Who creates it? Where is that information stored?
(4) How is that information accessed (searched for, found, displayed)?
(5) What challenges are associated with accessing and using that information?
(6) How can access to and use of that information be improved? What value will those improvements deliver to the business?
Search model elements (diagram): Users, Query, Interface, Results, Engine, Content, Creators
Morville & Callender. 2010. Search Patterns
8. 8
Information Management Capabilities: Search, Access, Architecture
Improving Information Management requires capabilities to enhance information search, access, and architecture
IM Capabilities and Descriptions
Scientific Search: Search tools that enable users to locate scientific information across various sources, both structured and unstructured, in various formats and across functional groups
Expertise Location: Capability for users to identify colleagues with specific skills, expertise, or tacit knowledge through a search tool and / or standardized profiles or tagging
Access: System of access policies that prudently permits access to information and has clear procedures for granting or restricting access
Data Stewardship: Shared practices for creating, storing, sharing, and maintaining explicit and tacit information
Data Structuring: Organization of critical data sources to make them more conducive to search, retrieval, analysis, and re-use through techniques including tagging and indexing
Key Data Assets: Well-maintained record of critical information and data sources across the organization, including how the information is used or linked to other sources
9. 9
ILLUSTRATIVE
Leaders in Search & Information Management
BioPharma and other industry players have demonstrated innovative, peer-leading Search, Access, and Architecture capabilities
• Indexing of complex hierarchical relationships from relational database tables
• Multi-faceted, interactive filtering of search results based on document metadata
• Implementing solutions for searching non-text information (e.g., enterprise video search)
• Advanced search analytics
• Integration with social media
• Highly scalable / extensible Service-Oriented Architecture
• Seamless information flow between departments / sites
• Includes a data services and exchange layer
• Reusable and configurable code modules
• Closed-loop data flow via integrated data sources across the product life cycle
• Consistent, personalized, real-time access for internal and external users
• Enterprise-wide technology to capture, create, and share knowledge / best practices
• Data stewardship standards and processes that ensure consistency of data quality, storage, and exchange
Capability Maturity Stages: 1 Basic, 2 Developing, 3 Functional, 4 Advanced, 5 World-class
Capabilities rated (Search, Access, Architecture): Open Access, Data Stewardship, Data Structuring, Key Data Assets, Scientific Search, Expertise Location
13. 13
Merck
Analysts need a way to collaborate on mapping information flows
from different domains without explicit coordination
http://www.dwalls.com/Nature/Nature-World-Travel/Aerial+View+of+Downtown+Boston
14. 14
Semantic Information Flow Modeling (sIFM) is a method of documenting and modeling the flow of information through an enterprise that allows both targeted and holistic analysis across the information continuum.
Domains shown within Merck (diagram): R&D, Regulatory, Manufacturing, Sales & Marketing, MCC
16. 16
Collaboration without Coordination
The use of an information modeling ontology allows multiple informatics and
business analysts to collaborate on the same model without explicit coordination
Analyst 1: Compound Structure → ELN; Program Biologist uses ELN; Toxicologist uses ELN
Analyst 2: Medicinal Chemist uses ChemCart; Compound Structure → ChemCart; Medicinal Chemist member-of Lead optimization team
Analyst 3: Pharm Sci uses ELN; Active Pharmaceutical Ingredient → ELN; Compound Structure → ELN
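The idea of collaboration without coordination can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the sIFM implementation: each analyst's statements are held as (subject, predicate, object) triples, the predicate name `stored-in` is a hypothetical stand-in for the links the slide leaves unlabeled, and the helper names `merge_models` and `neighbors` are invented for this example.

```python
from collections import defaultdict

# Each analyst records (subject, predicate, object) statements independently.
# "stored-in" is a hypothetical predicate for links the slide leaves unlabeled.
analyst_1 = {
    ("Compound Structure", "stored-in", "ELN"),
    ("Program Biologist", "uses", "ELN"),
    ("Toxicologist", "uses", "ELN"),
}
analyst_2 = {
    ("Medicinal Chemist", "uses", "ChemCart"),
    ("Compound Structure", "stored-in", "ChemCart"),
    ("Medicinal Chemist", "member-of", "Lead optimization team"),
}
analyst_3 = {
    ("Pharm Sci", "uses", "ELN"),
    ("Active Pharmaceutical Ingredient", "stored-in", "ELN"),
    ("Compound Structure", "stored-in", "ELN"),
}


def merge_models(*models):
    """Union the triple sets; shared terms connect and duplicates collapse."""
    merged = set()
    for model in models:
        merged |= model
    return merged


def neighbors(model, term):
    """Return every (predicate, other term) pair that touches `term`."""
    links = defaultdict(set)
    for s, p, o in model:
        links[s].add((p, o))
        links[o].add((p, s))
    return links[term]


merged = merge_models(analyst_1, analyst_2, analyst_3)
print(len(merged))  # 8 distinct statements; one duplicate collapsed
print(sorted(neighbors(merged, "ELN")))
```

Because all three analysts use the same vocabulary ("ELN", "Compound Structure"), their separate models join on those shared terms when merged, with no hand-off between the analysts.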
17. 17
Leveraging Information Flow Modeling to enable Search and Analytics
By encoding this knowledge in a searchable semantic knowledgebase, we can discover, on the fly, details about Merck’s information landscape that were previously difficult to uncover.
Example question: What are the types of information and data sources associated with Translational PK/PD Modeling?
Query pattern (diagram): Project (Translational PK/PD Modeling), ?Information Types, ?KM Artifacts, ?Data Sources, linked by “includes” and “flow” relationships
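A question like the one above can be answered by matching patterns with variables against stored triples. The sketch below is a hedged illustration in plain Python rather than the knowledgebase's actual query language: the predicates `includes` and `flow` come from the diagram, while the specific information types, the "Study report repository" source, and the functions `match` and `query` are hypothetical.

```python
# Hypothetical triples; only the predicates "includes" and "flow" come from
# the diagram. The information types and sources are invented for illustration.
TRIPLES = [
    ("Translational PK/PD Modeling", "includes", "PK parameters"),
    ("Translational PK/PD Modeling", "includes", "Exposure-response data"),
    ("PK parameters", "flow", "Electronic Lab Notebook"),
    ("Exposure-response data", "flow", "Study report repository"),
    ("Biomarker Validation", "includes", "Assay results"),
]


def match(pattern, triples):
    """Yield variable bindings for one pattern; terms starting with '?' are variables."""
    for triple in triples:
        binding = {}
        for want, got in zip(pattern, triple):
            if want.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(want, got) != got:
                    break
            elif want != got:
                break
        else:
            yield binding


def query(patterns, triples):
    """Join several patterns, substituting the bindings found so far."""
    results = [{}]
    for pattern in patterns:
        next_results = []
        for bound in results:
            concrete = tuple(bound.get(term, term) for term in pattern)
            for new in match(concrete, triples):
                next_results.append({**bound, **new})
        results = next_results
    return results


answers = query(
    [
        ("Translational PK/PD Modeling", "includes", "?info_type"),
        ("?info_type", "flow", "?data_source"),
    ],
    TRIPLES,
)
for row in answers:
    print(row["?info_type"], "->", row["?data_source"])
```

The second pattern reuses `?info_type`, so each information type found for the project is followed along its `flow` link to a data source.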
18. 18
Within Life Sciences, there has been a lot of discussion about
the potential of Big Data
Personalized Medicine
Genomics
Evidence Based Medicine
Health Outcomes
Customer Insights
Patient Enrollment
Supply Chain Management
Predictive Modeling
Clinical Trial Monitoring
New Drug Discovery
Collaboration
Connected Health
Volume
Variety
Velocity
Revenue
Cost
19. 19
Not a volume or velocity problem… yet; it’s about variety
Inputs: Internal, External
Pipeline stages: Target ID/Val, Lead ID, Lead Opt, Preclinical, Phase 1, Phase 2, Phase 3, Reg / Market
Phases: Discovery, Early Development, Late Dev, Market
Outputs: Innovative research & breakthrough therapeutics
“culture of single use”
20. 20
Big Data tools are well suited to address this classic data integration problem
Scalable
• NoSQL can handle significant increases in data
• Many best-of-breed technologies are open source products
• Strong compatibility with cloud infrastructures allows for rapid scale up/scale down
Agile
• Not Only SQL (NoSQL) enables agile access to data with lightweight, use-case-driven models
• Structured and unstructured data can be readily integrated
• Design and implementation do not depend on up-front, comprehensive knowledge of the data
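The "agile" points above can be illustrated with a small schema-on-read sketch. It assumes nothing about the platform's actual technology: records are plain Python dictionaries standing in for documents in a NoSQL store, and every field name, record value, and the "Assay DB" source are hypothetical.

```python
# Hypothetical records from three sources; field names are illustrative only.
# No shared schema is defined up front: each record keeps its own shape.
records = [
    {"source": "ELN", "compound_id": "C-001",
     "notes": "Solubility improved after salt screen"},
    {"source": "Chemical Registration", "compound_id": "C-001",
     "mol_weight": 312.4},
    {"source": "Documents", "title": "Lead optimization summary",
     "text": "C-001 shows improved solubility"},
]


def find(records, term):
    """Return records where any string field mentions `term` (case-insensitive)."""
    term = term.lower()
    hits = []
    for record in records:
        if any(isinstance(v, str) and term in v.lower() for v in record.values()):
            hits.append(record)
    return hits


def add_source(records, new_records):
    """Append records from a new source as-is; no schema migration is needed."""
    return records + list(new_records)


# A fourth, differently shaped source is integrated without redesign.
records = add_source(
    records,
    [{"source": "Assay DB", "compound_id": "C-001", "ic50_nM": 12.5}],
)
for hit in find(records, "c-001"):
    print(hit["source"])
```

Structured fields (`mol_weight`) and free text (`notes`, `text`) sit side by side, and the use case (finding everything about one compound) drives the query rather than a comprehensive model designed in advance.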
21. 21
Merck is using these tools and design patterns to create a
Data Science platform for research and development
R&D Information Landscape Big data platform
22. 22
New design patterns allow rapid data integration into a
scalable platform
Current Design Patterns: 1 year (increased cost, limited flexibility)
• Requirements, 1-2 months: Data Types, Data Structure, Use Cases
• Design, 2-3 months: Data Model, Data Mapping
• Development, 4-6 months: Data Migration, Data Validation, Data Cleanup
New Design Patterns: 4-5 months (agile, iterative)
• Requirements, Design, Development
23. 23
Enabling scientists by shifting the paradigm with Big Data tools
and design patterns
Manual Assembly of Data: time spent finding 75%, time spent analyzing 25%
Integrated, Cross Data Set Analysis: time spent finding 25%, time spent analyzing 75%
Insights
24. 24
We began with a current problem facing Merck scientists…
53% unable to find and access information they need (Can Find & Access: 47%; Cannot Find or Cannot Access: 53%)
$131M productivity gap
… but kept the larger goal of changing the information management and access paradigm in mind.
Today: -like Search Application
Tomorrow (…the future): … a platform capable of handling new and additional data (Volume, Velocity) … and capable of providing more complex analytics
25. 25
First “analytic” built on the data science platform = Search
• 2014: focus on 3,000 discovery and preclinical scientists
• 100s of use cases prioritized
• 30+ features developed
• Ingested 8 of the top 10 content sources
• Agile design and development pattern
82% of feature-by-feature feedback has been positive
Electronic Lab Notebooks
Documents
Compound Information
Chemical Registration
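As a rough picture of what search across ingested sources involves, the sketch below builds a small inverted index and filters results by source. It is an assumption-laden illustration, not the platform's search engine: the source names follow the slide, but the document contents and the functions `tokenize`, `build_index`, and `search` are invented for this example.

```python
import re
from collections import defaultdict

# Hypothetical documents; source names follow the slide, contents are made up.
DOCS = {
    1: {"source": "Electronic Lab Notebooks",
        "text": "Kinase inhibitor synthesis, batch 7"},
    2: {"source": "Documents",
        "text": "Kinase program review and liver toxicity assessment"},
    3: {"source": "Chemical Registration",
        "text": "Registered salt form of kinase inhibitor"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in tokenize(doc["text"]):
            index[token].add(doc_id)
    return index


def search(index, docs, query, source=None):
    """Return ids of documents containing every query token, optionally filtered by source."""
    tokens = tokenize(query)
    if not tokens:
        return []
    ids = set.intersection(*(index.get(t, set()) for t in tokens))
    if source is not None:
        ids = {i for i in ids if docs[i]["source"] == source}
    return sorted(ids)


index = build_index(DOCS)
print(search(index, DOCS, "kinase inhibitor"))
print(search(index, DOCS, "kinase", source="Documents"))
```

Indexing every source into one structure is what lets a single query span notebooks, documents, and registration records, while the `source` filter narrows results back down to one of them.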
28. 28
Building on the data and platform, we rapidly built the next analytic: QUICK (Quantitative Pharmacology Knowledgebase)
QUICK = a single, authoritative portal for access to definitive, integrated data sets of clinical and pre-clinical metabolism and in vivo pharmacology experimental results
Data capture + Stewardship + Analytics + Common platform
• Reduce time to collate definitive data sets by ~95%
• 50-75% increase in efficiency of analysis
• Improved cross-departmental collaboration through stewardship
29. 29
Future = include more data and more analytics within Merck
Research Labs
… in addition, other business units are already leveraging this data
sciences platform… and we are poised for volume and velocity
30. The key to our current and future success has been the
development of a flexible & scalable technology platform…
Problem Statement, Use case, User Story, Question, Pain Point…
(1) Identify a Problem Statement: User Story
(2) Estimate Impact & Benefits: Prioritize
(3) Develop New Capability: Use Case(s), Feature(s)
(4) Verify Impact & Benefits: Feedback, Refine
(5) Extend target user group: Referrals
User feedback helps prioritize features
2 weeks
31. 31
…combined with integrated, multi-disciplinary teams
People: engaged scientists, integrated agile team
Process: maturing process for data source access, ingestion and
feature development
Technology: extensible & flexible platform
Software Engineers
Math & Stats
Analysts
Domain Experts
32. 32
Acknowledgments
Merck IT - Scientific Information Architecture & Search
Merck IT – Informatics, Scientific Computing, Cloud COE
Merck Research Laboratories
Booz Allen Hamilton