All In: Migrating a Genomics
Pipeline from BASH/Hive to Spark
and Azure Databricks—A Real
World Case Study
Victoria Morris
Unicorn Health Bridge Consulting working for Atrium Health
Agenda
▪ Overview LInK
▪ Issues – why change?
▪ Next Moves
▪ Migration: Starting Small with the Pharmacogenomics Pipeline
▪ Clinical Trials Matching Pipeline
▪ The Great Migration: Hive -> Databricks
▪ Things we Learned
▪ Business Impact
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Overview LInK
Original Problem Statement(s)
▪ Genomic reports are hard to find in the Electronic Medical Record (EMR)
▪ The reports are difficult to read (++ pages), differ from lab to lab, may not
have relevant recommendations and require manual effort to summarize
▪ Presenting relevant Clinical Trials to providers when making treatment decisions
will increase Clinical Trial participation
▪ As a Center of Excellence (COE) for the American Society of Clinical Oncology
(ASCO)’s Targeted Agent and Profiling Utilization Registry (TAPUR) clinical trial,
clinical outcomes and treatment data must be reported back to the COE for
patients enrolled in the studies
▪ Current process is complicated, time-consuming and manual
Overview
▪ The objective of LInK (LCI Integrated Knowledgebase) is to provide
interoperability of data between different LCI data sources
▪ Specifically, to address the multiple data silos that contain related data, which is a
consistent challenge across the System
▪ Data meaning must be transferred, not just values
▪ Apple: Fruit vs. Computer
▪ Originally we had 4 people, and we all had day jobs
[Diagram: LInK architecture]
▪ Data sources: Specialized External testing (PDFs, results and raw sequence data in; PDF and Clinical Decision Support out; external, via sftp/Data Factory); Specialized Internal testing (testing results and raw sequence data in; PDF out; internal); Clinical Trials Management Software (on-premise, soon to be cloud); EMR Clinical Data (Cerner reporting database/EDW); LCI Encounter Data (EDW); Unstructured Notes (e.g. Cerner reporting database); EAPathways Database (on-premise DB); Office 365 integration (external API)
▪ LInK functions: POC Clinical Decision Support; Clinical Trials Matching; Pharmacogenomics (converting raw reads to genotype -> phenotype and generating a report for the provider)
▪ Outputs: EAPathways embedded in Cerner via SMART/FHIR; genomic results and PDF reports via Tier 1 SharePoint for molecular tumor board review
LInK Data connections – High Level
[Diagram, shown in two versions]
▪ Storage and platforms: Frd1Storage, Netezza, Azure cloud, Azure Storage
▪ On-Premise Databases: EDW, EaPathways, Oncore
▪ External Labs: Caris, Inivata, FMI (Tempus and external vendors’ containers in Azure Storage are added in the second version)
▪ Cerner, EPIC, CRSTAR
▪ On-Premise Lab: Genomics Lab
▪ Connected systems: Clinical Trials Management, Clinical Decision Support, Enterprise Data Warehouse, ARIA (Radiation Treatments), CoPath (Pathology)
▪ Genomic Pipelines (auto-generated by WebApps), MS Web Apps, MS SharePoint Designer (PharmacoGenomics is added in the second version)
Issues
Issues
▪ We run 365 days a year
▪ The data is used in real time by providers to make clinical decisions for
cancer patient treatment. Any breakdown in the pipeline is a Priority 1
issue that must be fixed as soon as possible
▪ We were early adopters of HDI (HDInsight). This server has been up since 2016.
It is old technology, and HDI was not built for servers to live this long.
Issues cont’d
▪ The cluster would randomly freeze and go into safe mode with no
warning. This happened on a weekly basis, often several days in a row,
during the overnight batch.
▪ Back at around 3,000 lines of Hive code we exceeded the default
allocation of 10,000 Tez counters, and from then on every run had to be
configured with additional counters.
▪ Although we tried using matrix manipulation in Hive, at some point you
just need a loop.
Issues cont’d
▪ Keeping the HDI cluster up 24x365 was very expensive, so we
scaled it up and down to help reduce costs.
▪ The cluster was not stable because we were scaling up and scaling
down every day. At one point there were so many logs from the daily scaling
that they took the entire HDI cluster down.
Issues cont’d
▪ Twice the cluster went down so badly and so hard that MS Support’s
response was to destroy it and start again, which we did the first time…
▪ The HDI server choice tied us to Hive v2 and forced us to disable
vectorized execution. We had to set
hive.vectorized.execution.enabled=false; repeatedly throughout the script
because it would “forget” the setting, which slowed down processing.
Next moves
Search
▪ We wanted something that was cheaper
▪ We wanted to keep our old WASB storage and not have to migrate the
data lake
▪ We wanted flexibility in language options for ongoing operations and
continuity of care; we did not want to get boxed into just one
▪ We wanted something less agnostic, more fully integrated into the
Microsoft ecosystem
Search cont’d
▪ We needed it to be HIPAA compliant because we were working with
patient data.
▪ We needed something self-sufficient in cluster
management so we could concentrate on the programming
instead of infrastructure.
▪ We really liked the notebook concept and had started experimenting
with Jupyter notebooks inside HDI
[Diagram repeated: LInK Data connections – High Level]
Migration
Migration – starting small
▪ There is a steep learning curve to get into Databricks
▪ We had a new project, the second pipeline, that had to be built, and it
seemed easier to start with something smaller than the 8,000 lines of
Hive code that we would have to convert if we started by transitioning the
original pipeline.
Pharmacogenomics In progress
Pharmacogenomics
We receive raw Genomic test
results from our internal lab
Pharmacogenomics
Single Notebook
Overview Genomic Clinical Trials Pipeline
--------------------
Clinical Trial Match Criteria
▪ Age (today’s)
▪ Gender
▪ First-line eligible (no previous anti-neoplastics ordered)
▪ Genomic results (over 1290 genes)
▪ Diagnosis
▪ Tumor site
▪ Secondary gene results
▪ Must have/not have a specific protein change/mutation
▪ Previous lab results
▪ Previous medications
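The match criteria above can be pictured as a set of checks applied to each patient record. The following is a minimal sketch in plain Python, not the pipeline's actual code: the Patient and Trial classes, their field names, and the sample trial identifiers are all hypothetical, and only a few of the criteria are modeled.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Patient:
    # All fields are hypothetical stand-ins for the criteria on the slide.
    age: int
    gender: str
    diagnosis: str
    mutated_genes: Set[str] = field(default_factory=set)
    prior_antineoplastics: bool = False


@dataclass
class Trial:
    trial_id: str
    min_age: int = 0
    max_age: int = 200
    gender: Optional[str] = None       # None means any gender
    diagnosis: Optional[str] = None    # None means any diagnosis
    required_genes: Set[str] = field(default_factory=set)
    excluded_genes: Set[str] = field(default_factory=set)
    first_line_only: bool = False


def matches(patient: Patient, trial: Trial) -> bool:
    """Return True only if the patient passes every modeled criterion."""
    if not (trial.min_age <= patient.age <= trial.max_age):
        return False
    if trial.gender is not None and trial.gender != patient.gender:
        return False
    if trial.diagnosis is not None and trial.diagnosis != patient.diagnosis:
        return False
    # "Must have" mutations: every required gene must be present.
    if not trial.required_genes <= patient.mutated_genes:
        return False
    # "Must not have" mutations: no excluded gene may be present.
    if trial.excluded_genes & patient.mutated_genes:
        return False
    # First-line eligible means no previous anti-neoplastics were ordered.
    if trial.first_line_only and patient.prior_antineoplastics:
        return False
    return True


patient = Patient(age=61, gender="F", diagnosis="lung",
                  mutated_genes={"EGFR", "TP53"})
trials = [
    Trial("T-001", min_age=18, diagnosis="lung", required_genes={"EGFR"}),
    Trial("T-002", min_age=18, diagnosis="lung", excluded_genes={"TP53"}),
    Trial("T-003", min_age=18, gender="M"),
]
matched = [t.trial_id for t in trials if matches(patient, t)]
print(matched)
```

Writing each criterion as its own early-return check keeps the rules easy to read and extend, which matters when the list of criteria grows.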
Opening Screen
[Diagram repeated: LInK Data connections – High Level]
The Great Migration
1. Preprocess each lab into a similar data format
▪ Process Tempus files
▪ Process Caris files
▪ Process FMI files
▪ Process Inivata files
2. Main Match: create clinical matches
3. Create Summary: create genomic summary, combine with matches and
save to database
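The first and last of the three steps above can be sketched in plain Python as one small converter per lab feeding a shared summary step. This is an illustration only: the lab names lab_a and lab_b, the raw field layouts, and the common record shape are all hypothetical and do not reflect any vendor's real file format.

```python
from typing import Callable, Dict, List

# Common record shape used downstream (hypothetical field names).
CommonRecord = Dict[str, str]


def from_lab_a(raw: dict) -> CommonRecord:
    # Hypothetical layout: flat keys with lab-specific names.
    return {"patient_id": raw["mrn"],
            "gene": raw["geneSymbol"].upper(),
            "variant": raw["proteinChange"],
            "lab": "lab_a"}


def from_lab_b(raw: dict) -> CommonRecord:
    # Hypothetical layout: values nested one level down.
    return {"patient_id": raw["patient"]["id"],
            "gene": raw["result"]["gene"].upper(),
            "variant": raw["result"]["alteration"],
            "lab": "lab_b"}


# Registry of per-lab converters; a new lab means one new entry here.
PREPROCESSORS: Dict[str, Callable[[dict], CommonRecord]] = {
    "lab_a": from_lab_a,
    "lab_b": from_lab_b,
}


def preprocess(lab: str, raw_records: List[dict]) -> List[CommonRecord]:
    """Step 1: convert one lab's raw records into the common format."""
    convert = PREPROCESSORS[lab]
    return [convert(r) for r in raw_records]


def summarize(records: List[CommonRecord]) -> Dict[str, List[str]]:
    """Step 3 (simplified): group gene results per patient."""
    summary: Dict[str, List[str]] = {}
    for rec in records:
        summary.setdefault(rec["patient_id"], []).append(
            f'{rec["gene"]} {rec["variant"]}')
    return summary


combined = (
    preprocess("lab_a", [{"mrn": "P1", "geneSymbol": "egfr",
                          "proteinChange": "L858R"}])
    + preprocess("lab_b", [{"patient": {"id": "P1"},
                            "result": {"gene": "KRAS", "alteration": "G12C"}}])
)
summary = summarize(combined)
print(summary)
```

Because everything after step 1 sees only the common record shape, the matching and summary code does not need to know which lab a result came from.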
Hive Conversion
Initial Definitions
[Side-by-side comparison: Databricks | Hive]
Reading the file
▪ Not a separate step in Hive; it is part of the next step
[Side-by-side comparison: Databricks | Hive]
Creating a clean view of the data
[Side-by-side comparison: Databricks | Hive]
Databricks by the numbers
▪ We work in a Premium Workspace, using our internal IP addresses
inside a secured subnet inside the Atrium Health Azure subscription
▪ Databricks is fully HIPAA compliant
▪ Clusters are created with predefined tags, and the costs associated with
each tagged cluster’s runs can be separated out
▪ Our data lake is ~110 terabytes
▪ We have 2.3+ million gene results x 240+ CTC to match against 10
criteria
▪ Yes, even during COVID-19 we are still seeing an average of 1 new
report a day
▪ We still run 365 days a year
Things we learned
Azure Key Vaults and Back-up
▪ Azure Key Vaults are tricky to implement, and you only need to set up the
connection on a new workspace, so save those instructions!
▪ But they are a very secure way to save all your connection info
without having it in plain text in the notebook itself.
▪ Do not forget to periodically save a copy of everything offline. If your
workspace goes, you lose all the notebooks and any manually uploaded
data tables…
▪ Yes, we have had to replace the workspace twice in this project
Working with complex nested JSON and XML sucks
▪ It sounds so simple and works great in the simple one-level
examples. In the real world, when something is nested and duplicated, or
missing entirely from a record several levels deep, and usually in
structs, it sucks
▪ Structs versus arrays: we ended up having to convert structs to arrays all
the time
▪ We used the cardinality function a lot to determine whether there was anything in an
array
▪ The concat_ws trick helps if you are not sure whether a SQL column ended up as an
array or a string in your data
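The bullets above refer to Spark SQL functions (cardinality, concat_ws). As an illustration of the same defensive pattern outside Spark, here is a minimal plain-Python sketch. The helper names as_list, has_items, join_values and dig, and the sample records, are hypothetical and are not taken from the pipeline.

```python
import json
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Normalize a field that may be missing, a single struct, or an array."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]  # a lone struct (dict) or scalar becomes a one-item list


def has_items(value: Any) -> bool:
    """Rough analogue of checking cardinality(...) > 0 on an array column."""
    return len(as_list(value)) > 0


def join_values(value: Any, sep: str = ",") -> str:
    """Rough analogue of the concat_ws trick: accepts a string or an array."""
    return sep.join(str(v) for v in as_list(value))


def dig(record: Any, *keys: str) -> Any:
    """Walk nested keys, returning None if any level is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


raw = json.loads("""
[
  {"report": {"genes": [{"name": "EGFR"}, {"name": "TP53"}]}},
  {"report": {"genes": {"name": "KRAS"}}},
  {"report": {}}
]
""")

# The same field is an array, a single struct, and missing in turn.
names = [
    join_values([g["name"] for g in as_list(dig(rec, "report", "genes"))])
    for rec in raw
]
flags = [has_items(dig(rec, "report", "genes")) for rec in raw]
print(names, flags)
```

Normalizing every nested field to a list first means the rest of the code has one shape to handle, which mirrors the struct-to-array conversion described above.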
Tips and tricks?
▪ Databricks only reads blobs of type Block blob. With any other type,
Databricks does not even see the directory. That took a fair bit to
uncover when one of our vendors uploaded a new set of files with the
wrong blob type without realizing it.
▪ We ended up using Data Factory a lot less than we thought. ODBC
connections worked well except for Oracle, which we never could get to
work. It is the only thing still sqooped nightly
Code Snips I used all the time
▪ %python pythonDF.write.mode("overwrite").saveAsTable("pythonTable")
▪ %scala val ScalaDF = spark.read.table("pythonTable")
▪ If you need a table from a JDBC source to use in SQL:
▪ %scala val JDBCTableDF = spark.read.jdbc(jdbcUrl, "JDBCTableName", connectionProperties)
▪ JDBCTableDF.write.mode("overwrite").saveAsTable("JDBCTableNameTbl")
▪ If you suddenly cannot write out a table:
▪ dbutils.fs.rm("dbfs:/user/hive/warehouse/JDBCTableNameTbl/", true)
I am no expert – but I ended up using these all the time
Code Snips I used all the time
▪ Save tables between notebooks, and use REFRESH TABLE at the start of
the new notebook to grab the latest version
▪ The null problem: use the cast function to save yourself from
Parquet
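One reading of the null tip, offered as an assumption rather than the author's stated method: a column that holds only untyped NULLs has no concrete type, and Spark refuses to write such a column to Parquet, so each placeholder column gets an explicit CAST. The sketch below only builds the SQL text in plain Python. The function typed_select, the table name lab_results, and the column names are hypothetical.

```python
from typing import Dict, Iterable


def typed_select(source: str, schema: Dict[str, str],
                 available: Iterable[str]) -> str:
    """Build a SELECT that casts every column to its declared type.

    Columns missing from the source are emitted as CAST(NULL AS <type>)
    so that they carry a concrete type instead of an untyped NULL.
    """
    have = set(available)
    parts = []
    for name, sql_type in schema.items():
        expr = name if name in have else "NULL"
        parts.append(f"CAST({expr} AS {sql_type}) AS {name}")
    return f"SELECT {', '.join(parts)} FROM {source}"


query = typed_select(
    "lab_results",                      # hypothetical table name
    {"patient_id": "STRING", "gene": "STRING", "vaf": "DOUBLE"},
    available=["patient_id", "gene"],   # "vaf" is absent for this lab
)
print(query)
```

In a notebook the resulting string could be passed to spark.sql(...) before saving the table, so that every lab's output has the same typed columns.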
Business Impact
▪ More stable infrastructure
▪ Lower costs
▪ Results come in faster
▪ Easier to add additional labs
▪ Easier to troubleshoot when there are issues
▪ Increase in volume handled easily
▪ Self-service for end-users means no IAS intervention
Thanks!
Dr Derek Ragavan,
Carol Farhangfar, Nury Steuerwald, Jai Patel
Chris Danzi, Lance Richey, Scott Blevins
Andrea Bouronich, Stephanie King, Melanie Bamberg,
Stacy Harris
Kelly Jones and his team
All the data and system owners who let us access their data
All the Microsoft support folks who helped us push to the edge
And of course Databricks
Questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSpark Summit
 
Testing the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTesting the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTechWell
 
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDriving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDataWorks Summit
 
Strata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationStrata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationZaloni
 
ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend Edureka!
 
The Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInThe Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInOSCON Byrum
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoopMaulik Thaker
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsStavros Kontopoulos
 
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...Rehgan Avon
 
Building Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field ExperienceBuilding Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field ExperienceDatabricks
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineeringNovita Sari
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardParis Data Engineers !
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHumza Naseer
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondDataWorks Summit/Hadoop Summit
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 

Was ist angesagt? (20)

Big data pipelines
Big data pipelinesBig data pipelines
Big data pipelines
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit AgarwalSuccinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
Succinct Spark: Fast Interactive Queries on Compressed RDDs by Rachit Agarwal
 
Testing the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big ProblemsTesting the Data Warehouse―Big Data, Big Problems
Testing the Data Warehouse―Big Data, Big Problems
 
Big data rmoug
Big data rmougBig data rmoug
Big data rmoug
 
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXTDriving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
Driving Enterprise Adoption: Tragedies, Triumphs and Our NEXT
 
Strata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma PresentationStrata San Jose 2017 - Ben Sharma Presentation
Strata San Jose 2017 - Ben Sharma Presentation
 
ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend
 
The Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedInThe Big Data Ecosystem at LinkedIn
The Big Data Ecosystem at LinkedIn
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoop
 
Machine learning at scale challenges and solutions
Machine learning at scale challenges and solutionsMachine learning at scale challenges and solutions
Machine learning at scale challenges and solutions
 
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
Kelly O'Briant - DataOps in the Cloud: How To Supercharge Data Science with a...
 
Building Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field ExperienceBuilding Data Science into Organizations: Field Experience
Building Data Science into Organizations: Field Experience
 
Summary introduction to data engineering
Summary introduction to data engineeringSummary introduction to data engineering
Summary introduction to data engineering
 
Flexible Design
Flexible DesignFlexible Design
Flexible Design
 
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin AmbardDelta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard
 
Hadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing ArchitecturesHadoop Integration into Data Warehousing Architectures
Hadoop Integration into Data Warehousing Architectures
 
Big Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyondBig Data for Managers: From hadoop to streaming and beyond
Big Data for Managers: From hadoop to streaming and beyond
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 

Ähnlich wie All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) - A Real World Case Study

iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017AWS Chicago
 
Big Linked Data ETL Benchmark on Cloud Commodity Hardware
Big Linked Data ETL Benchmark on Cloud Commodity HardwareBig Linked Data ETL Benchmark on Cloud Commodity Hardware
Big Linked Data ETL Benchmark on Cloud Commodity HardwareLaurens De Vocht
 
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...confluent
 
Ceph - High Performance Without High Costs
Ceph - High Performance Without High CostsCeph - High Performance Without High Costs
Ceph - High Performance Without High CostsJonathan Long
 
Medidata AMUG Meeting / Presentation 2013
Medidata AMUG Meeting / Presentation 2013Medidata AMUG Meeting / Presentation 2013
Medidata AMUG Meeting / Presentation 2013Brock Heinz
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryData Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryRTTS
 
Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Ola Spjuth
 
Leaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real WorldLeaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real WorldArmonDadgar
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionChris Dwan
 
DockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability WorkshopDockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability WorkshopKevin Crawley
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataPhilip Cheung
 
Qiagram
QiagramQiagram
Qiagramjwppz
 
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...OSTHUS
 
The Architecture of Continuous Innovation - OSCON 2015
The Architecture of Continuous Innovation - OSCON 2015The Architecture of Continuous Innovation - OSCON 2015
The Architecture of Continuous Innovation - OSCON 2015Chip Childers
 
From allotrope to reference master data management
From allotrope to reference master data management From allotrope to reference master data management
From allotrope to reference master data management OSTHUS
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...DATAVERSITY
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Yahoo Developer Network
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Precisely
 
CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217lyarmey
 

Ähnlich wie All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) - A Real World Case Study (20)

iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
Jeremy Engle's slides from Redshift / Big Data meetup on July 13, 2017
 
Big Linked Data ETL Benchmark on Cloud Commodity Hardware
Big Linked Data ETL Benchmark on Cloud Commodity HardwareBig Linked Data ETL Benchmark on Cloud Commodity Hardware
Big Linked Data ETL Benchmark on Cloud Commodity Hardware
 
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...
 
Ceph - High Performance Without High Costs
Ceph - High Performance Without High CostsCeph - High Performance Without High Costs
Ceph - High Performance Without High Costs
 
Medidata AMUG Meeting / Presentation 2013
Medidata AMUG Meeting / Presentation 2013Medidata AMUG Meeting / Presentation 2013
Medidata AMUG Meeting / Presentation 2013
 
Data Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical IndustryData Warehouse Testing in the Pharmaceutical Industry
Data Warehouse Testing in the Pharmaceutical Industry
 
Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...Automating the process of continuously prioritising data, updating and deploy...
Automating the process of continuously prioritising data, updating and deploy...
 
Leaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real WorldLeaving the Ivory Tower: Research in the Real World
Leaving the Ivory Tower: Research in the Real World
 
Production Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on ProductionProduction Bioinformatics, emphasis on Production
Production Bioinformatics, emphasis on Production
 
DockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability WorkshopDockerCon SF 2019 - Observability Workshop
DockerCon SF 2019 - Observability Workshop
 
BioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadataBioAssay Express: Creating and exploiting assay metadata
BioAssay Express: Creating and exploiting assay metadata
 
Qiagram
QiagramQiagram
Qiagram
 
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
Allotrope Foundation & OSTHUS at SmartLab Exchange 2015: Update on the Allotr...
 
The Architecture of Continuous Innovation - OSCON 2015
The Architecture of Continuous Innovation - OSCON 2015The Architecture of Continuous Innovation - OSCON 2015
The Architecture of Continuous Innovation - OSCON 2015
 
From allotrope to reference master data management
From allotrope to reference master data management From allotrope to reference master data management
From allotrope to reference master data management
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
 
Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?Which Change Data Capture Strategy is Right for You?
Which Change Data Capture Strategy is Right for You?
 
CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217CSU-ACADIS_dataManagement101-20120217
CSU-ACADIS_dataManagement101-20120217
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databricks) - A Real World Case Study

  • 1.
  • 2. All In: Migrating a Genomics Pipeline from BASH/Hive to Spark and Azure Databricks—A Real World Case Study Victoria Morris Unicorn Health Bridge Consulting working for Atrium Health
  • 3. Agenda Victoria Morris ▪ Overview: LInK ▪ Issues – why change? ▪ Next Moves ▪ Migration: Starting Small, Pharmacogenomics Pipeline ▪ Clinical Trials Matching Pipeline ▪ The Great Migration: Hive -> Databricks ▪ Things We Learned ▪ Business Impact
  • 4. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
  • 6. Original Problem Statement(s) ▪ Genomic reports are hard to find in the Electronic Medical Record (EMR) ▪ The reports are difficult to read (++ pages), differ from lab to lab, may not have relevant recommendations, and require manual effort to summarize ▪ Presenting relevant Clinical Trials to providers when they make treatment decisions will increase Clinical Trial participation ▪ As a Center of Excellence (COE) for the American Society of Clinical Oncology (ASCO)’s Targeted Agent and Profiling Utilization Registry (TAPUR) clinical trial, clinical outcomes and treatment data must be reported back to the COE for patients enrolled in the studies ▪ The current process is complicated, time-consuming and manual
  • 7. Overview ▪ The objective of LInK (LCI Integrated Knowledgebase) is to provide interoperability of data between different LCI data sources ▪ Specifically, to address the multiple data silos that contain related data, which is a consistent challenge across the System ▪ Data meaning must be transferred, not just values ▪ Apple: Fruit vs. Computer ▪ Originally we had 4 people, and we all had day jobs
  • 8. Architecture diagram (labels as extracted): Specialized External Testing: testing results PDFs, results and raw sequence data in; PDF, Clinical Decision Support out (external, sftp/Data Factory); Clinical Trials Management Software (on-premise, soon to be cloud); EMR Clinical Data (Cerner reporting database/EDW); EAPathways embedded in Cerner via SMART/FHIR; Genomic results and PDF reports via Tier 1 SharePoint for molecular tumor board review; Converting raw reads to Genotype -> Phenotype and generating report for Provider; LCI Encounter Data (EDW); LInK; Unstructured Notes (e.g. Cerner reporting database); EAPathways Database (on-premise DB); Integration; Office 365 (external, API); POC Clinical Decision Support; Clinical Trials Matching; Pharmacogenomics; Specialized Internal Testing: testing results and raw sequence data in, PDF out (internal)
  • 9. LInK Data connections – High Level (diagram labels as extracted): Frd1Storage; Netezza; Cloud; Azure; On-Premise Databases; EDW; EaPathways; Oncore; External Labs: Caris, Inivata, FMI; Azure Storage; Cerner, EPIC, CRSTAR; On-Premise Lab; Genomics Lab; Clinical Trials Management; Clinical Decision Support; Enterprise Data Warehouse; ARIA; Genomic Pipelines, auto-generated by WebApps; Radiation Treatments; CoPath; Pathology; MS Web Apps; MS SharePoint Designer
  • 10. LInK Data connections – High Level (diagram labels as extracted): Frd1Storage; Netezza; Cloud; Azure; On-Premise Databases; EDW; EaPathways; Oncore; External Labs: Tempus, Caris, Inivata, FMI; External Vendor's Containers; Azure Storage; Azure Storage; Cerner, EPIC, CRSTAR; On-Premise Lab; Genomics Lab; Clinical Trials Management; Clinical Decision Support; Enterprise Data Warehouse; ARIA; Genomic Pipelines; PharmacoGenomics; Radiation Treatments; CoPath; Pathology
  • 12. Issues ▪ We run 365 days a year ▪ The data is used in real time by providers to make clinical decisions for cancer patient treatment, so any breakdown in the pipeline is a Priority 1 issue that needs to be fixed as soon as possible ▪ We were early adopters of HDI (Azure HDInsight). This server has been up since 2016; it is old technology, and HDI was not built for servers to live this long.
  • 13. Issues cont’d ▪ The cluster would randomly freeze and go into SAFE mode with no warning. This happened on a weekly basis, often several days in a row, during the overnight batch. ▪ We exceeded the default allocation of 10,000 Tez counters back when the Hive code was only around 3,000 lines, and had to configure every run with additional counters. ▪ Although we tried using matrix manipulation in Hive, at some point you just need a loop.
  • 14. Issues cont’d ▪ Keeping the HDI cluster up 24x365 was very expensive, so we scaled it up and down to help reduce costs. ▪ The cluster was not stable because we were scaling up and down every day. At one point there were so many logs from the daily scaling that they took the entire HDI cluster down.
  • 15. Issues cont’d ▪ Twice the cluster went down so badly that MS Support’s response was to destroy it and start again, which we did the first time… ▪ The HDI server choice tied us to Hive V2, which forced us to disable vectorized execution. We had to keep repeating set hive.vectorized.execution.enabled=false; throughout the script because it would “forget” the setting, and this slowed down processing.
  • 17. Search ▪ We wanted something that was cheaper ▪ We wanted to keep our old WASB storage and not have to migrate the data lake ▪ We wanted flexibility in language options for ongoing operations and continuity of care; we did not want to get boxed into just one ▪ We wanted something less agnostic, more fully integrated into the Microsoft ecosystem
  • 18. Search cont’d ▪ We needed it to be HIPAA compliant because we were working with patient data. ▪ We needed something that was self-sufficient with cluster management so we could concentrate on the programming instead of the infrastructure. ▪ We really liked the notebook concept, and had started experimenting with Jupyter notebooks inside HDI
  • 19. LInK Data connections – High Level (same diagram as slide 10)
  • 21. Migration – starting small ▪ There is a steep learning curve to get into Databricks ▪ We had a new project, the second pipeline, that had to be built, and it seemed easier to start with something smaller than the 8,000 lines of Hive code we would have faced if we had started by transitioning the original pipeline.
  • 23. Pharmacogenomics We receive raw Genomic test results from our internal lab
  • 24.
  • 25.
  • 26.
  • 28. Overview Genomic Clinical Trials Pipeline
  • 30. Clinical Trial Match Criteria ▪ Age (today’s) ▪ Gender ▪ First-line eligible (no previous anti-neoplastics ordered) ▪ Genomic Results (over 1290 genes) ▪ Diagnosis ▪ Tumor Site ▪ Secondary Gene results ▪ Must have/not have a specific protein change/mutation ▪ Previous Lab results ▪ Previous Medications
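The criteria on slide 30 can be read as a set of filters that a patient must pass for a trial to be shown. The following is only a minimal sketch of that idea in plain Python, not the pipeline's actual logic: the `Patient` and `TrialCriteria` classes, their field names, and the subset of criteria covered are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Patient:
    # Hypothetical patient record; field names are illustrative only.
    age: int
    gender: str
    diagnosis: str
    mutated_genes: Set[str] = field(default_factory=set)
    prior_antineoplastics: Set[str] = field(default_factory=set)


@dataclass
class TrialCriteria:
    # Hypothetical trial definition covering a subset of the slide's criteria.
    trial_id: str
    min_age: int = 0
    max_age: int = 200
    gender: Optional[str] = None                      # None means any gender
    diagnoses: Set[str] = field(default_factory=set)  # empty means any diagnosis
    required_genes: Set[str] = field(default_factory=set)
    excluded_genes: Set[str] = field(default_factory=set)
    first_line_only: bool = False


def matches(patient: Patient, trial: TrialCriteria) -> bool:
    """Return True only if the patient passes every criterion of the trial."""
    if not (trial.min_age <= patient.age <= trial.max_age):
        return False
    if trial.gender is not None and trial.gender != patient.gender:
        return False
    if trial.diagnoses and patient.diagnosis not in trial.diagnoses:
        return False
    # "Must have" genes: every required gene must be among the patient's results.
    if not trial.required_genes <= patient.mutated_genes:
        return False
    # "Must not have" genes: any overlap disqualifies the patient.
    if trial.excluded_genes & patient.mutated_genes:
        return False
    # First-line eligible: no previous anti-neoplastics ordered.
    if trial.first_line_only and patient.prior_antineoplastics:
        return False
    return True
```

Writing each criterion as an early-exit check keeps the rules independent, so a new criterion (tumor site, previous lab results) could be added as one more check without touching the others.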
  • 32.
  • 33.
  • 34.
  • 35.
  • 36.
  • 37. LInK Data connections – High Level (same diagram as slide 10)
  • 39. Pipeline stages ▪ 1. Preprocess each lab into a similar data format: Process Tempus files, Process Caris files, Process FMI files, Process Inivata files ▪ 2. Main Match: create clinical matches ▪ 3. Create Summary: create the genomic summary, combine it with the matches and save to the database
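Stage 1 and stage 3 of the flow on slide 39 can be sketched as follows. This is an illustration under stated assumptions, not the real pipeline: the two lab formats (`lab_a`, `lab_b`), their field names, and the common schema are hypothetical, since the deck does not show the vendors' actual file layouts.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

CommonResult = Dict[str, str]


def normalize_lab_a(rec: dict) -> CommonResult:
    # Hypothetical layout for one lab's records.
    return {"patient_id": rec["mrn"], "gene": rec["gene"].upper(),
            "variant": rec["alteration"], "lab": "lab_a"}


def normalize_lab_b(rec: dict) -> CommonResult:
    # Hypothetical layout for a second lab that names the same facts differently.
    return {"patient_id": rec["patientId"], "gene": rec["geneSymbol"].upper(),
            "variant": rec["proteinChange"], "lab": "lab_b"}


# One normalizer per lab; adding a lab means adding one entry here.
NORMALIZERS: Dict[str, Callable[[dict], CommonResult]] = {
    "lab_a": normalize_lab_a,
    "lab_b": normalize_lab_b,
}


def preprocess(lab: str, records: Iterable[dict]) -> List[CommonResult]:
    """Stage 1: bring one lab's records into the common format."""
    normalize = NORMALIZERS[lab]
    return [normalize(r) for r in records]


def summarize(results: Iterable[CommonResult]) -> Dict[str, List[str]]:
    """Stage 3 (simplified): list the distinct genes reported per patient."""
    genes = defaultdict(set)
    for row in results:
        genes[row["patient_id"]].add(row["gene"])
    return {pid: sorted(g) for pid, g in genes.items()}
```

Once every lab is reduced to the same shape, the matching and summary stages only need to be written once, which is consistent with the deck's later claim that additional labs became easier to add.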
  • 41. Initial Definitions (Hive vs. Databricks)
  • 42. Reading the file (Hive vs. Databricks) ▪ Not a separate step in Hive; it is part of the next step
  • 43. Creating a clean view of the data (Hive vs. Databricks)
  • 44.
  • 45. Databricks by the numbers ▪ We work in a Premium Workspace, using our internal IP addresses inside a secured subnet inside the Atrium Health Azure subscription ▪ Databricks is fully HIPAA compliant ▪ Clusters are created with predefined tags, and the costs associated with each tagged cluster’s runs can be separated out ▪ Our data lake is ~110 terabytes ▪ We have 2.3+ million gene results x 240+ CTC to match against 10 criteria ▪ Yes, even during COVID-19 we are still seeing an average of 1 new report a day. We still run 365 days a year
  • 47. Azure Key Vaults and Back-up ▪ Azure Key Vaults are tricky to implement, and you only need to set up the connection on a new workspace, so save those instructions! ▪ But they are a very secure way to save all your connection info without having it in plain text in the notebook itself. ▪ Do not forget to periodically save a copy of everything offline. If your workspace goes, you lose all the notebooks and any manually uploaded data tables… ▪ Yes, we have had to replace the workspace twice in this project
  • 48. Working with complex nested JSON and XML sucks ▪ It sounds so simple and works great in the one-level examples. In the real world, when something is nested and duplicated, or missing entirely from a record several levels deep, and usually in structs, it sucks ▪ Structs versus arrays: we ended up having to convert structs to arrays all the time ▪ We used the cardinality function a lot to determine whether there was anything in an array ▪ The concat_ws trick helps in SQL when you are not sure whether you ended up with an array or a string in your data
  • 49. Tips and tricks? ▪ Databricks only reads a Blob Type of Block blob. Any other type means that Databricks does not even see the directory. That took a fair bit to uncover when one of our vendors uploaded a new set of files with the wrong blob type without realizing it. ▪ We ended up using Data Factory a lot less than we thought. ODBC connections worked well except for Oracle, which we never could get to work; it is the only thing still sqooped nightly
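The point on slide 47 about keeping connection info out of the notebook can be sketched like this. It is a minimal illustration, not the team's setup: the secret key names (`sql-host`, `sql-database`, `sql-user`, `sql-password`), the scope name, and the SQL Server style URL are assumptions. The secret lookup is passed in as a function so the sketch runs anywhere; inside a Databricks notebook one would typically pass a wrapper around `dbutils.secrets.get(scope=..., key=...)`.

```python
from typing import Callable, Dict, Tuple

# A lookup takes (scope, key) and returns the secret value.
SecretGetter = Callable[[str, str], str]


def build_jdbc_settings(get_secret: SecretGetter, scope: str) -> Tuple[str, Dict[str, str]]:
    """Assemble a JDBC URL and connection properties from secrets only.

    Nothing sensitive is hard-coded; the key names below are hypothetical.
    """
    host = get_secret(scope, "sql-host")
    database = get_secret(scope, "sql-database")
    url = f"jdbc:sqlserver://{host}:1433;databaseName={database}"
    properties = {
        "user": get_secret(scope, "sql-user"),
        "password": get_secret(scope, "sql-password"),
    }
    return url, properties


# In a Databricks notebook (assumed usage, not executed here):
#   url, props = build_jdbc_settings(
#       lambda scope, key: dbutils.secrets.get(scope=scope, key=key),
#       "my-keyvault-scope",
#   )
#   df = spark.read.jdbc(url, "SomeTable", properties=props)
```

Injecting the lookup also makes the connection logic testable outside the workspace, which helps when the workspace itself has to be rebuilt.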
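The struct-versus-array problem on slide 48 is easiest to see outside Spark. The sketch below is a plain-Python analogue, not Spark code: `as_list` mirrors "convert structs to arrays all the time", the emptiness check plays the role the slide gives to `cardinality`, and `join_values` mimics the way `concat_ws` accepts either a string or an array. The sample report layout is hypothetical.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Treat a missing value, a single struct, or an array uniformly as a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def join_values(value: Any, sep: str = ",") -> str:
    """Join whether the input turned out to be one string or a list of strings."""
    return sep.join(str(v) for v in as_list(value))


def dig(record: dict, path: List[str]) -> List[Any]:
    """Follow a key path several levels deep, fanning out over arrays and
    tolerating keys that are missing from some records."""
    current = [record]
    for key in path:
        found = []
        for node in current:
            if isinstance(node, dict) and key in node:
                found.extend(as_list(node[key]))
        current = found
    return current
```

For example, `dig(report, ["results", "gene", "name"])` returns a list of names whether `gene` arrived as one struct or as an array of structs, and an empty list when the branch is absent, so downstream code needs only one code path.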
  • 50. Code Snips I used all the time ▪ %python pythonDF.write.mode("overwrite").saveAsTable("pythonTable") ▪ %scala val ScalaDF = spark.table("pythonTable") ▪ If you need a table from a JDBC source to use in SQL: ▪ %scala val JDBCTableDF = spark.read.jdbc(jdbcUrl, "JDBCTableName", connectionProperties) ▪ JDBCTableDF.write.mode("overwrite").saveAsTable("JDBCTableNameTbl") ▪ If you suddenly cannot write out a table: ▪ dbutils.fs.rm("dbfs:/user/hive/warehouse/JDBCTableNameTbl/", true) I am no expert, but I ended up using these all the time
  • 51. Code Snips I used all the time ▪ Save tables between notebooks: use REFRESH TABLE at the start of the new notebook to grab the latest version ▪ The null problem: use the cast function to save yourself from Parquet
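The slide does not spell out what "the null problem" was. One common reading, assumed here, is that a column containing only nulls ends up with no concrete type, which a Parquet-backed table cannot store, and that an explicit cast gives it one. The helper below is a hedged sketch that only builds Spark SQL text; the function names are hypothetical, and the untyped marker is taken to be `void` or `null` depending on the Spark version.

```python
from typing import Dict

# Type names assumed to mark a column with no inferred type.
UNTYPED = {"void", "null"}


def null_safe_select(columns: Dict[str, str], table: str) -> str:
    """Build a SELECT that casts untyped (all-null) columns to STRING.

    `columns` maps column name -> type name, e.g. dict(df.dtypes) in PySpark.
    """
    parts = []
    for name, type_name in columns.items():
        if type_name.lower() in UNTYPED:
            parts.append(f"CAST({name} AS STRING) AS {name}")
        else:
            parts.append(name)
    return f"SELECT {', '.join(parts)} FROM {table}"


def refresh_statement(table: str) -> str:
    """Statement to run at the start of a notebook that reads a table
    written by a previous notebook."""
    return f"REFRESH TABLE {table}"
```

In a notebook the generated strings would be passed to `spark.sql(...)`; casting to STRING is just one choice, and a more specific type is better when the intended type is known.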
  • 52. Business Impact ▪ More stable infrastructure ▪ Lower costs ▪ Results come in faster ▪ Easier to add additional labs ▪ Easier to troubleshoot when there are issues ▪ Increase in volume handled easily ▪ Self-service for end-users means no IAS intervention
  • 53. Thanks! ▪ Dr Derek Ragavan, Carol Farhangfar, Nury Steuerwald, Jai Patel, Chris Danzi, Lance Richey, Scott Blevins, Andrea Bouronich, Stephanie King, Melanie Bamberg, Stacy Harris ▪ Kelly Jones and his team ▪ All the data and system owners who let us access their data ▪ All the Microsoft support folks who helped us push to the edge ▪ And of course Databricks