All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) - A Real World Case Study
2. All In: Migrating a Genomics Pipeline from BASH/Hive to Spark and Azure Databricks - A Real World Case Study
Victoria Morris
Unicorn Health Bridge Consulting, working for Atrium Health
3. Agenda
Victoria Morris
▪ Overview: LInK
▪ Issues – why change?
▪ Next Moves
▪ Migration: starting small with the Pharmacogenomics Pipeline
▪ Clinical Trials Matching Pipeline
▪ The Great Migration: Hive -> Databricks
▪ Things We Learned
▪ Business Impact
6. Original Problem Statement(s)
▪ Genomic reports are hard to find in the Electronic Medical Record (EMR)
▪ The reports are difficult to read (many pages long), differ from lab to lab, may not include relevant recommendations, and require manual effort to summarize
▪ Presenting relevant clinical trials to providers when they make treatment decisions will increase clinical trial participation
▪ As a Center of Excellence (COE) for the American Society of Clinical Oncology (ASCO)'s Targeted Agent and Profiling Utilization Registry (TAPUR) clinical trial, clinical outcomes and treatment data must be reported back to the COE for patients enrolled in the studies
▪ The current process is complicated, time-consuming, and manual
7. Overview
▪ The objective of LInK (LCI Integrated Knowledgebase) is to provide interoperability of data between different LCI data sources
▪ Specifically, to address the multiple data silos that contain related data – a consistent challenge across the System
▪ Data meaning must be transferred, not just values
▪ Apple: fruit vs. computer
▪ Originally we had 4 people, and we all had day jobs
8. [Architecture diagram: LInK as the central integration hub. Data in/out: specialized external testing (testing results, PDFs, and raw sequence data in; clinical decision support out; external, via sftp/Data Factory); specialized internal testing (testing results and raw sequence data in PDF out; internal); Clinical Trials Management Software (on-premise, soon to be cloud); EMR clinical data (Cerner reporting database/EDW); LCI encounter data (EDW); unstructured notes (e.g. Cerner reporting database); EAPathways database (on-premise DB); Office 365 (external, API). Outputs: EAPathways embedded in Cerner via SMART/FHIR; genomic results and PDF reports via Tier 1 SharePoint for molecular tumor board review; conversion of raw reads to genotype -> phenotype with report generation for providers. Workloads: POC clinical decision support, clinical trials matching, pharmacogenomics.]
9. [Diagram: LInK data connections – high level. On-premise databases: EDW (Enterprise Data Warehouse), EAPathways (clinical decision support), Oncore (clinical trials management). External labs: Caris, Inivata, FMI. Azure Storage: Cerner, EPIC, CRSTAR. On-premise lab: Genomics Lab. Other sources: ARIA (radiation treatments), CoPath (pathology), Netezza, Frd1Storage. Genomic pipelines auto-generated by MS Web Apps and MS SharePoint Designer.]
12. Issues
▪ We run 365 days a year
▪ The data is used in real time by providers to make clinical decisions about cancer treatment; any breakdown in the pipeline is a Priority 1 issue that needs to be fixed as soon as possible
▪ We were early adopters of HDInsight (HDI) – this cluster has been up since 2016 – it is old technology, and HDI was not built for clusters to live this long
13. Issues cont'd
▪ Randomly, the cluster would freeze and go into SAFE mode with no warning; this happened on a weekly basis, often several days in a row, during the overnight batch
▪ We were past the default allocation of 10,000 Tez counters by around 3,000 lines of Hive code, and had to configure every run with additional counters
▪ Although we tried using matrix manipulation in Hive, at some point you just need a loop
14. Issues cont'd
▪ Keeping the HDI cluster up 24x365 was very expensive, so we scaled it up and down to help reduce costs
▪ The cluster was not stable because we were scaling up and down every day; at one point the daily scaling generated so many logs that it took the entire HDI cluster down
15. Issues cont'd
▪ Twice the cluster went down so badly that MS Support's response was to destroy it and start again – which we did, the first time…
▪ The HDI server choice tied us to Hive v2 and forced us to run without vectorized execution – we had to constantly set hive.vectorized.execution.enabled=false throughout the script because the setting would be "forgotten", which slowed down processing
17. Search
▪ We wanted something that was cheaper
▪ We wanted to keep our old WASB storage – not have to migrate the data lake
▪ We wanted flexibility in language options for ongoing operations and continuity of care – we did not want to get boxed into just one
▪ We wanted something less agnostic, more fully integrated into the Microsoft ecosystem
18. Search cont'd
▪ We needed it to be HIPAA compliant because we were working with patient data
▪ We needed something that was self-sufficient in cluster management, so we could concentrate on the programming instead of the infrastructure
▪ We really liked the notebook concept – we had started experimenting with Jupyter notebooks inside HDI
21. Migration – starting small
▪ There is a steep learning curve to get into Databricks
▪ We had a new project – a second pipeline that had to be built – and it seemed easier to start with something smaller than the 8,000 lines of Hive code we would have to transition if we started with the original pipeline
30. Clinical Trial Match Criteria
▪ Age (today's)
▪ Gender
▪ First-line eligible (no previous anti-neoplastics ordered)
▪ Genomic results (over 1,290 genes)
▪ Diagnosis
▪ Tumor site
▪ Secondary gene results
▪ Must have/not have a specific protein change/mutation
▪ Previous lab results
▪ Previous medications
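As a rough illustration of how criteria like these turn into Spark operations, here is a minimal sketch. It is purely hypothetical – the table names (patients, gene_results, trial_criteria) and every column are invented, not the actual LInK schema; spark is the session Databricks predefines in notebooks.
%python
# Hypothetical sketch: expressing trial-match criteria as joins/filters.
# All table and column names are invented for illustration.
import pyspark.sql.functions as F

patients = spark.table("patients")            # patient_id, age, gender, diagnosis, ...
gene_results = spark.table("gene_results")    # patient_id, gene, protein_change
trials = spark.table("trial_criteria")        # trial_id, gene, required_change, min_age

matches = (
    gene_results
    .join(trials, "gene")                     # genomic-result criterion
    .join(patients, "patient_id")
    .where(F.col("age") >= F.col("min_age"))  # age criterion (today's age)
    .where(                                   # protein change must match, if the trial requires one
        F.col("required_change").isNull()
        | (F.col("protein_change") == F.col("required_change"))
    )
    .select("patient_id", "trial_id")
    .distinct()
)
matches.write.mode("overwrite").saveAsTable("trial_matches")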
39. [Pipeline flow diagram:
1. Preprocess each lab's files (Tempus, Caris, FMI, Inivata) into a similar data format
2. Main Match – create clinical matches
3. Create Summary – create genomic summary, combine with matches, and save to database]
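Step 1 of this flow maps naturally onto one preprocessing function per lab, each normalizing to a shared schema, followed by a union. A minimal sketch under invented assumptions – the paths, file formats, and column names are all hypothetical:
%python
# Hypothetical sketch of step 1: normalize each lab's feed to one schema.
# Paths, formats, and column names are invented for illustration.
import pyspark.sql.functions as F

COMMON_COLS = ["patient_id", "gene", "protein_change", "report_date", "source_lab"]

def preprocess_caris(path):
    # Assume Caris delivers JSON whose fields we rename into the common schema.
    return (spark.read.option("multiLine", True).json(path)
            .selectExpr("patientId as patient_id", "gene",
                        "proteinChange as protein_change", "reportDate as report_date")
            .withColumn("source_lab", F.lit("Caris"))
            .select(*COMMON_COLS))

def preprocess_inivata(path):
    # Assume Inivata delivers CSV; same idea, different parsing.
    return (spark.read.option("header", True).csv(path)
            .selectExpr("pid as patient_id", "gene_symbol as gene",
                        "p_change as protein_change", "dt as report_date")
            .withColumn("source_lab", F.lit("Inivata"))
            .select(*COMMON_COLS))

# ...analogous functions for Tempus and FMI...

normalized = preprocess_caris("/mnt/labs/caris/").unionByName(
    preprocess_inivata("/mnt/labs/inivata/"))
normalized.write.mode("overwrite").saveAsTable("lab_results_normalized")  # input to step 2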
42. Reading the file
▪ Hive: not a separate step – reading was part of the next step
▪ Databricks: an explicit first step (sketch below)
[Original slide: side-by-side Databricks vs. Hive code comparison]
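For flavor, a hedged sketch of what the explicit Databricks read might look like; the path and options are invented:
%python
# Hypothetical sketch: in Databricks, reading the vendor file is its own step.
raw = (spark.read
       .option("multiLine", True)   # vendor JSON reports often span multiple lines
       .json("/mnt/labs/vendor/incoming/"))
raw.cache()  # the cleanup step that follows reuses this DataFrame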
43. Creating a clean view of the data
[Original slide: side-by-side Databricks vs. Hive code comparison; a sketch of the Databricks side follows]
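A hedged sketch of what the Databricks side of this step might look like, continuing from the raw DataFrame above; every field name is invented:
%python
# Hypothetical sketch: flatten nested fields, cast types, and expose to SQL.
import pyspark.sql.functions as F

clean = (raw
         .select(F.col("patient.id").alias("patient_id"),               # pull fields out of structs
                 F.col("report.date").cast("date").alias("report_date"),
                 F.explode_outer("results").alias("result"))            # one row per gene result
         .select("patient_id", "report_date",
                 F.col("result.gene").alias("gene"),
                 F.col("result.proteinChange").alias("protein_change")))

clean.createOrReplaceTempView("lab_results_clean")  # now queryable from %sql cells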
45. Databricks by the numbers
▪ We work in a Premium workspace, using our internal IP addresses inside a secured subnet inside the Atrium Health Azure subscription
▪ Databricks is fully HIPAA compliant
▪ Clusters are created with predefined tags, so the costs associated with each tagged cluster's runs can be separated out
▪ Our data lake is ~110 terabytes
▪ We have 2.3+ million gene results x 240+ CTC to match against 10 criteria
▪ Yes, even during COVID-19 we are still seeing an average of 1 new report a day – we still run 365 days a year
47. Azure Key Vaults and Back-up
▪ Azure Key Vaults are tricky to implement, and you only need to do the connection setup on a new workspace – so save those instructions! (see the sketch below)
▪ But they are a very secure way to store all your connection info without having it in plain text in the notebook itself
▪ Do not forget to save a copy of everything periodically offline – if your workspace goes, you lose all the notebooks and any manually uploaded data tables…
▪ Yes, we have had to replace the workspace twice in this project
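For reference, once a Key Vault-backed secret scope is attached to the workspace, notebooks fetch secrets through dbutils (predefined in Databricks notebooks). The scope and key names below are placeholders:
%python
# Read connection info from an Azure Key Vault-backed secret scope.
# "kv-scope" and the key names are placeholders for your own setup.
jdbc_user = dbutils.secrets.get(scope="kv-scope", key="sql-username")
jdbc_pass = dbutils.secrets.get(scope="kv-scope", key="sql-password")
# The values never sit in plain text in the notebook, and Databricks
# redacts them if they are accidentally printed.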
48. Working with complex nested JSON and XML sucks
▪ It sounds so simple and works great in the one-level examples – in the real world, when something is nested and duplicated, or missing entirely from a record several levels deep, and usually inside structs, it sucks
▪ Structs versus arrays – we ended up having to convert structs to arrays all the time
▪ Use the cardinality function a lot to determine whether there is anything in an array
▪ The concat_ws trick helps when you are not sure whether your SQL ended up with an array or a string (see the sketch below)
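A few of those tricks in one place – a minimal sketch with invented field names and path:
%python
# Hypothetical sketch of the struct/array tricks above; field names invented.
import pyspark.sql.functions as F

df = spark.read.option("multiLine", True).json("/mnt/labs/vendor/sample.json")

# Struct -> array: wrap a lone struct in an array so records with one
# result and records with many can be processed the same way.
df = df.withColumn("variants", F.array(F.col("variant")))

# cardinality(): a quick way to check whether an array has anything in it.
df = df.withColumn("has_variants", F.expr("cardinality(variants) > 0"))

# The concat_ws trick: it accepts either a string or an array of strings,
# so it is safe when you are not sure which one the schema produced.
df = df.withColumn("gene_list", F.concat_ws(",", F.col("genes")))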
49. Tips and tricks
▪ Databricks only reads a blob type of Block blob. Any other type means Databricks does not even see the directory – that took a fair bit to uncover when one of our vendors uploaded a new set of files with the wrong blob type without realizing it (a sketch for checking blob types follows)
▪ We ended up using Data Factory a lot less than we thought – ODBC connections worked well, except for Oracle, which we never could get to work; it is the only thing still sqooped nightly
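If you need to verify what a vendor actually uploaded, the azure-storage-blob SDK exposes the blob type. A hedged sketch – the connection string, container, and blob names are placeholders:
%python
# Check whether a vendor upload is really a Block blob.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="lab-drops", blob="vendor/report_001.json")

props = blob.get_blob_properties()
print(props.blob_type)  # only BlockBlob is visible to Databricks; Append/Page blobs are not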
50. Code Snips I used all the time
▪ Passing a table from a Python cell to a Scala cell:
▪ %python pythonDF.write.mode("overwrite").saveAsTable("pythonTable")
▪ %scala val ScalaDF = spark.table("pythonTable")
▪ If you need a table from a JDBC source to use in SQL:
▪ %scala val JDBCTableDF = spark.read.jdbc(jdbcUrl, "JDBCTableName", connectionProperties)
▪ JDBCTableDF.write.mode("overwrite").saveAsTable("JDBCTableNameTbl")
▪ If you suddenly cannot write out a table, remove its leftover warehouse directory first:
▪ dbutils.fs.rm("dbfs:/user/hive/warehouse/JDBCTableNameTbl/", true)
I am no expert – but I ended up using these all the time
51. Code Snips I used all the time
▪ To share tables between notebooks, use REFRESH TABLE at the start of the consuming notebook to grab the latest version (see the sketch below)
▪ The null problem – use the cast function to save yourself from Parquet, which cannot store all-null, untyped columns
I am no expert – but I ended up using these all the time
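Concretely, those two tips might look like this in a notebook; the table and column names are placeholders:
%python
# Tip 1: pick up the latest version of a table written by another notebook.
spark.sql("REFRESH TABLE shared_results")
results = spark.table("shared_results")

# Tip 2: the null problem - a column that is all null is inferred as an
# untyped (void) column, which Parquet refuses to store; cast it explicitly.
import pyspark.sql.functions as F
fixed = results.withColumn("optional_note", F.col("optional_note").cast("string"))
fixed.write.mode("overwrite").saveAsTable("shared_results_fixed")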
52. Business Impact
▪ More stable infrastructure
▪ Lower costs
▪ Results come in faster
▪ Easier to add additional labs
▪ Easier to troubleshoot when there are issues
▪ Increase in volume handled easily
▪ Self-service for end-users means no IAS intervention
53. Thanks!
Dr Derek Ragavan,
Carol Farhangfar, Nury Steuerwald, Jai Patel
Chris Danzi, Lance Richey, Scott Blevins
Andrea Bouronich, Stephanie King, Melanie Bamberg,
Stacy Harris
Kelly Jones and his team
All the data and system owners who let us access their data
All the Microsoft support folks who helped us push to the edge
And of course Databricks