RWE & Patient Analytics Leveraging Databricks - A Use Case
Harini Gopalakrishnan & Martin Longpre from Sanofi present on leveraging real-world data and evidence generation using Databricks. They discuss defining real-world data and evidence, using advanced analytics for indication searching, and implementing a conceptual architecture in Databricks for privacy-preserved analysis. Their system offers secure data management, self-service analytics tools, access control, and auditing. Databricks is customized for their needs with cluster policies, GitLab integration, and IAM roles. They demonstrate their workflow and discuss future improvements to further enhance insights from real-world data.
1. RWE & Patient Analytics
Leveraging Databricks
A Use Case
Harini Gopalakrishnan & Martin Longpre
Sanofi
2. Disclaimer
• The views and opinions expressed in this presentation are those of
the individual presenters and should not be attributed to any
organization with which the presenters are employed or affiliated.
• All registered trademarks cited are the property of their respective
owners.
3. Agenda
Harini Gopalakrishnan – 20 minutes
▪ What are real-world evidence and real-world data
▪ Advanced analytics in RWE generation
▪ Security and privacy of our data
▪ Our journey – a conceptual view of the architecture
and what we have achieved
Martin Longpre – 20 minutes
▪ Databricks implementation – our customization
▪ Demo
▪ Looking forward: where we want to partner for
improvements
Q&A – 20 minutes
5. Context: How do we define RWE & RWD?
Real-world data (RWD) is a term used to describe health-care-related data that are collected outside the context of randomized clinical trials (RCTs).
Real-world evidence (RWE) is defined as the insight or knowledge derived from the analysis of real-world data, conducted to respond to a specific research question.
RWE leverages analytics on RWD to discover, develop, deliver and provide new insights on healthcare interventions.
Examples of Real-world data sources
~130 TB (EHR/claims)
~2,000 TB per month in versions and transformations
6. Analysis in RWE: Advanced analytics methodology
Traditional analytics
• Traditional RWE statistics, meta-analysis, data modelling, propensity-score matching
Advanced analytics
• Predictive modelling, unsupervised clustering, rule extraction, model bootstrapping, natural language processing, machine learning
Machine learning: a computer program is said to learn from experience (partially captured within data) when its performance increases with experience
Supervised techniques examples
• Logistic regression
• Markov chain
• Bayesian network
• K-nearest neighbour
Unsupervised techniques examples
• K-means clustering
• Hierarchical ascendant classification (agglomerative hierarchical clustering)
• Factorial analyses
• Non-negative matrix factorization
Innovation in evidence generation
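As an illustration of the unsupervised techniques listed on this slide, the following is a minimal sketch of K-means clustering (Lloyd's algorithm) in plain Python. The function name `kmeans`, the toy feature vectors and the choice of two clusters are assumptions made for illustration only; the slides do not describe the actual implementation used.

```python
import random
from math import dist


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the input rows
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical, already de-identified feature vectors (age, visits per year).
patients = [(34, 2), (36, 3), (35, 2), (71, 11), (68, 12), (73, 10)]
labels, centroids = kmeans(patients, k=2)
```

On this toy input the first three rows end up in one cluster and the last three in the other. Real RWD would need feature scaling and a principled choice of `k`, which this sketch leaves out.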
7. Uses of RWE – why is it valuable?
https://www.healthcatalyst.com/insights/real-world-data-chief-driver-drug-development
The driving reasons for leveraging RWD and RWE more in recent years include:
• Ready availability of compute resources for big data
• Availability of curated, high-quality data sources, both internally and externally
Real-world evidence influences all aspects of the pharma value chain:
• Regulatory decision making
• Reimbursement decisions
• Clinical guidelines
8. Transforming RWD to Evidence: Use case in action
AI-based indication searching approach that relies on real-world data, thus bringing higher confidence and reducing biases
Data is always privacy preserved and de-identified
Sanofi: Novel Indications via AI. Finding new treatment indications for an approved therapy is of immense value to pharma for drug re-purposing efforts, R&D candidate prioritization, and overall productivity. Sanofi wanted to develop an AI-based indication searching approach that relies on real-world data, thus bringing a higher confidence and reducing biases. Sanofi applied unsupervised machine learning to create a phenotypic cluster of patients in order to identify relevant indications that worked across clusters. The pipeline crunched nearly 17 million patients with 2,700 characteristics derived from electronic health records (EHRs). The initial results of the novel approach recovered 90% of known indications and identified many more deemed credible by development teams, producing a higher level of confidence in results and a reduction in cost and time to market, with fewer, faster and more targeted trials, while minimizing attrition and risk.
https://www.gartner.com/en/newsroom/press-releases/2020-11-17-gartner-announces-winners-of-the-2020-gartner-healthcare-and-life-sciences-eye-on-innovation-award
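The "recovered 90% of known indications" figure can be read as a recall-style measure: the share of already known indications that the pipeline rediscovers. A minimal sketch of that arithmetic follows. The function name and the placeholder labels are hypothetical, and the source does not state how the figure was actually computed.

```python
def recovered_fraction(known_indications, proposed_indications):
    """Share of known indications that also appear among the proposals."""
    known = set(known_indications)
    if not known:
        raise ValueError("at least one known indication is required")
    return len(known & set(proposed_indications)) / len(known)


# Placeholder labels only; these are not real indications or results.
known = ["indication_a", "indication_b", "indication_c", "indication_d"]
proposed = ["indication_a", "indication_b", "indication_d", "indication_x"]
fraction = recovered_fraction(known, proposed)
```

Proposals that are not in the known set (here `indication_x`) do not lower this measure; they correspond to the additional candidates that development teams would review.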
9. Winner of the Gartner Award 2020 for Innovation in Healthcare and Life Sciences
https://www.gartner.com/en/newsroom/press-releases/2020-11-17-gartner-announces-winners-of-the-2020-gartner-healthcare-and-life-sciences-eye-on-innovation-award
10. Trust in the data and the analysis being performed is a MUST
“Patients and consumers have a significant role to play in the collection of real-world data and generation of real-world evidence, but to be effective, patient and consumer engagement approaches would include considering them partners and capturing outcomes that are important to them”
▪ Patient consent is a must
▪ Privacy-preserved linkage must be performed; encryption is a key aspect
▪ Establish a trusted patient relationship to explain the usage of data and consent (e.g., secondary use of primary data)
▪ Data should not be used beyond the intended purpose; governance around the usage is a must
12. Key aspects of an RWE Ecosystem
Data Management
▪ Secure data storage – triple encrypted with audited access control
▪ Full data lineage – complete history of every data transformation
▪ Data pipeline – designed for high performance handling of big data
Analytics
▪ Self-service tools – filtering and querying tools for feasibility and descriptive information
▪ Interactive tools – dashboards and applications for study execution
▪ Low-level tools – R, Python and SQL for comparative analysis and advanced analytics
Access Control
▪ Multi-tenant configuration – provide each organization with their own namespace
▪ User provisioning – role-based access controlled by each organization
▪ Inherited data permissions – transformed data retains access control
Auditing and Monitoring
▪ Full auditing of user actions – log each action and generate reports
▪ Comprehensive monitoring – performance, usage, and custom actions
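Two of the aspects above, full data lineage and inherited data permissions, can be sketched together in a few lines. In this illustration a derived dataset records its parents and keeps only the groups that are allowed on every source. The class, the function names, the group names and the intersection rule are all assumptions; the slides do not describe how the platform implements these features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    allowed_groups: frozenset
    parents: tuple = ()        # lineage: the datasets this one was derived from
    transform: str = "ingest"  # description of the step that produced it


def derive(name, transform, *sources):
    """Create a derived dataset that keeps the access control of its sources."""
    if not sources:
        raise ValueError("a derived dataset needs at least one source")
    # Assumed rule: a group must be allowed on every source to see the result.
    allowed = frozenset.intersection(*(s.allowed_groups for s in sources))
    return Dataset(name, allowed, parents=sources, transform=transform)


def lineage(dataset):
    """Return every transformation step that led to `dataset`, oldest first."""
    steps = []
    for parent in dataset.parents:
        steps.extend(lineage(parent))
    steps.append(f"{dataset.transform} -> {dataset.name}")
    return steps


# Hypothetical datasets and group names.
ehr = Dataset("ehr_raw", frozenset({"rwe-ehr", "rwe-admin"}))
claims = Dataset("claims_raw", frozenset({"rwe-claims", "rwe-admin"}))
cohort = derive("cohort_v1", "join on linkage token", ehr, claims)
```

Here `cohort.allowed_groups` contains only `rwe-admin`, and `lineage(cohort)` lists the two ingests followed by the join.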
13. What does our system offer?
▪ Powerful compute resources to handle billions of rows of data
▪ Complete history of all data updates, with ability to bind to specific versions
▪ Complete data traceability – every transform and resulting data set is captured
▪ Robust data security and access control for all data and projects
▪ Ability to manage metadata, reference data and master data
▪ Built on a scalable data lake
14. The conceptual architecture
Data is always privacy preserved and de-identified. We do not own the key for re-identification within this ecosystem.
Disclaimer: for example purposes only
[Architecture diagram. Labels: Internal Sources; External Sources; Data lake (Sanofi AWS), secured and access controlled at the data level; Secured and traceable Sanofi-controlled environment; Self Service Analysis; Advanced Analytics; Data Augmentation; Visualization / Dashboards; Artificial Intelligence/ML; NLP; Standardized analytical workflows; Cohort Definitions and Data Modelling; Conventional Studies; Clinical Bioinformatics; Insights; Data and Analysis Collaboration*; Societies and Consortia; Academic Institutions; Regulatory Agencies; External Collaboration; Other Internal Platforms]
https://aws.amazon.com/blogs/industries/sanofi-webinar-performing-end-to-end-real-world-evidence-generation-with-traceability-and-transparency-on-aws/
15. When do we use Databricks?
▪ Exploratory use cases – projects where we need to run AI/ML workflows that require GPUs, custom libraries, or NLP/sentiment analysis
▪ Cross-functional teams working on a specific project – both internal and external stakeholders
▪ Flexibility – ability for users to manage their own cluster profiles and size up and down based on policy
▪ Data ingestion pipelines migrating away from AWS Glue and Batch for cost and performance reasons – 30% improvement in costs & productivity
▪ Delta Lake: under analysis; today data is managed directly in Parquet/S3
▪ SQL analytics: under evaluation
16. Databricks Customization (1/2)
Passthrough for Security
▪ Usage of our Azure AD configuration
▪ One AD group per data type
▪ Deactivation of the DBFS file system for end users (DBFS does not align with our data restriction policies)
▪ All data access is predefined and available through /mnt
GitLab integration
▪ Integration of the Databricks Repos feature, connected directly to our enterprise GitLab services
▪ Usage of CI/CD pipelines for deploying scripts and tasks
Cluster Policies
▪ Cluster names suffixed with the policy names for audit and monitoring
▪ Limit the types of worker and driver for better budget management
▪ Enforce cluster termination with default values based on projects/use cases (managed by cluster policies)
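The cluster policy points on this slide (name suffix, restricted worker and driver types, enforced termination) could be expressed roughly as follows. This is a sketch that builds a policy definition as a Python dictionary in the attribute/limitation style used by Databricks cluster policies (`fixed`, `allowlist`, `regex`). The helper name, the policy name, the instance types and the timeout are hypothetical, and the exact attribute names should be checked against the current Databricks documentation.

```python
import json
import re


def build_cluster_policy(policy_name, worker_types, driver_types, idle_minutes):
    """Assemble a cluster policy definition reflecting the three slide bullets."""
    return {
        # Cluster names must end with the policy name, for audit and monitoring.
        "cluster_name": {"type": "regex", "pattern": f".*-{re.escape(policy_name)}$"},
        # Only the listed instance types may be chosen, for budget management.
        "node_type_id": {
            "type": "allowlist",
            "values": worker_types,
            "defaultValue": worker_types[0],
        },
        "driver_node_type_id": {
            "type": "allowlist",
            "values": driver_types,
            "defaultValue": driver_types[0],
        },
        # Idle clusters are always terminated after a fixed delay.
        "autotermination_minutes": {"type": "fixed", "value": idle_minutes, "hidden": True},
    }


# Hypothetical project policy for GPU-based NLP work.
policy = build_cluster_policy(
    "rwe-nlp-gpu", ["g4dn.xlarge", "g4dn.2xlarge"], ["m5.xlarge"], idle_minutes=60
)
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON string is what would be pasted into, or sent to, the policy definition. Keeping one such function per project makes the defaults reviewable in GitLab alongside the other deployment scripts.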
17. Databricks Customization (2/2)
Instance Profiling
▪ Only used for specific use cases, mostly for RStudio
IAM roles and policies
▪ Fully integrated with our AWS stack
▪ IAM roles set up for S3 bucket access
▪ One home folder per user created by default (internal process)
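The combination of IAM roles for S3 access and one home folder per user could look like the following policy document, generated here with a small Python helper. The bucket name, the `home/<user>/` prefix layout and the helper name are assumptions for illustration; the slides do not show the actual policies.

```python
import json


def home_folder_policy(bucket, user):
    """Build an IAM policy document limiting a role to one user's home prefix."""
    prefix = f"home/{user}/"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is allowed only below the user's own prefix.
                "Sid": "ListOwnHomeFolder",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Object reads and writes are allowed only inside that prefix.
                "Sid": "ReadWriteOwnHomeFolder",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


# Hypothetical bucket and user identifiers.
document = home_folder_policy("example-rwe-workspace", "user123")
document_json = json.dumps(document, indent=2)
```

A role carrying this policy would then be attached to a cluster through an instance profile, so that code running on the cluster can reach only that user's folder.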
19. Improvements
▪ Support for RStudio
▪ Data access control and policy propagation to restrict unauthorized use of data; no lineage on data
20. Summary – Our journey and benefits
▪ Started from a traditional warehouse 3 years ago to create an end-to-end ecosystem for evidence generation and insights
▪ Helped move away from conventional to more advanced analytical approaches, leveraging the power of big data and cloud
▪ Delivered several evidence-generating studies, i.e., studies at scale that have impacted all aspects of the pharma value chain with demonstrable ROI
https://www.dovepress.com/cr_data/article_fulltext/s160000/160029/img/jmdh-160029_F003.jpg