SlideShare ist ein Scribd-Unternehmen logo
1 von 22
Processing Genetic Data at
Scale
Mark Schroering
Software Architect
@MarkSchroering
#IndyCloudConf @MarkSchroering
About Me
Mark Schroering
Twitter: @MarkSchroering
Github: mschroering
Life sciences company headquartered in
Indianapolis with offices in Salt Lake City
and Research Triangle Park (RTP)
#IndyCloudConf @MarkSchroering
Sequencing costs have dropped significantly
https://www.researchgate.net/figure/The-change-over-time-for-cost-per-raw-megabase-of-
DNA-sequencing-Source_fig1_261879801
#IndyCloudConf @MarkSchroering
Precision Medicine is on the rise
Source: Frost & Sullivan
#IndyCloudConf @MarkSchroering
A lot of genetic data is being
generated
A human genome has ~3 billion base pairs
AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCA
GTCAGATTTACCCTGGCTCACCTTGGCGTCGCGTCCGGCGGCAAACTA
AGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCT
About 0.1% of the genome is different among
individuals
~3 million germline variants (mutations) per person
#IndyCloudConf @MarkSchroering
Traditionally variant data has been transferred
and shared as files - Variant Call Format (VCF)
#CHROM POS ID REF ALT
QUAL
chr1 120056534 . G A .
chr3 178936091 . G A .
chr11 108198392 . T TA .
chr12 69233096 . C T .
chr13 32913764 . A G .
CLI tools exist that can search and transform the data
>> ./vcftools --vcf input_data.vcf --chr1 --from-bp 1000000 --to-bp 2000000
#IndyCloudConf @MarkSchroering
Variants can be “annotated” to include additional
information from other sources
##INFO=<ID=CLNSRCID,Number=.,Type=String,Description="Variant Clinical Channel IDs">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - unknown, 1 - untested,
2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7
- histocompatibility, 255 - other">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 985955 rs199476396 G C . .
CLNSRCID=103320.0001;CLNSIG=5;
1 1199489 rs207460006 G A . .
CLNSRCID=.;CLNSIG=1;
ClinVar - Database of variants that pertain to human health
COSMIC - Database of Cancer variants
#IndyCloudConf @MarkSchroering
The Problem
Create a cloud based solution that
can store and efficiently analyze
large genomic data sets to
provide meaningful insights to
clinicians and researchers in a
responsive manner
#IndyCloudConf @MarkSchroering
Solution
Data Lake
Indexed Variants
Variant Annotations
MicroservicesAPI Gateway
Spark
Transformation
Jobs
Batch Annotation
Jobs
Application Analytics Storage Ingestion
#IndyCloudConf @MarkSchroering
Ingestion - Convert from legacy file formats
Variants File Data lake in S3Apache Parquet
Apache Parquet provides efficient columnar storage and integrates with technologies like Spark,
Redshift Spectrum, and Athena
https://github.com/lifeomic/spark-vcf - Natively load variant files into a Spark Dataframe/Dataset
#IndyCloudConf @MarkSchroering
Ingestion - Variant Annotation
Variants File(s) Data lake in S3
Legacy variant annotation tools are CLI based which made AWS Batch a good candidate for the
annotation process.
ClinVar
COSMIC
Dockerized
annotations tools
#IndyCloudConf @MarkSchroering
Ingestion - Lessons Learned
Utilize DynamoDB on-demand provisioning for
tables that have unpredictable spikes in
read/write capacity (released Nov’ 18)
● DynamoDB capacity auto-scaling is slow to
react to spikes in throughput
● With on-demand provisioning, you pay per
request
● Request rates are still capped by max table
throughput and account limits
#IndyCloudConf @MarkSchroering
Ingestion - Lessons Learned
Be aware of limits (hard and soft) put in place by your cloud provider on
compute resources. Large spikes of ingestion requests can result in failures.
Solutions:
● Add rate limiting to your API and force clients to slow down
● Add a queue to capture requests and process them when resources are
available
#IndyCloudConf @MarkSchroering
Ingestion - Lessons Learned
Utilize Spot (Preemptible) compute to save cost for big data ingestion tasks
Solutions:
● Use a Batch Spot Compute Environment and set the retry strategy for jobs
to >= 1 to allow jobs to be retried should an instance get terminated
● Use Spot Instance Fleets in EMR
● Have monitoring in place to get notified when jobs fail
#IndyCloudConf @MarkSchroering
Analytics - Index Variants and Annotations
Data Lake
Variant attributes needed for analytics are stored in PostgreSQL. Full annotation records are stored in
DynamoDB.
Indexed variants
in PostgreSQL
Annotations
#IndyCloudConf @MarkSchroering
Analytics - Lessons Learned
Partition large tables for better query performance
Solutions:
● PostgreSQL offers table partitioning by range (defined by a key column) or
explicit listings. After partitioning a large table we saw dramatic
improvements in query performance. Updates to a partition also do not
impact query performance of other partitions.
#IndyCloudConf @MarkSchroering
Analytics - Lessons Learned
Use a data lake for storing raw data
Solutions:
● Store raw data in S3 in a big data friendly format like Apache Parquet
○ Do not throw any data away. You may need it later
○ Can rebuild indexed data stores or create new ones as needed using
the raw data
#IndyCloudConf @MarkSchroering
Application - Provide query results to client
The Lambda function executes a query against the PostgreSQL database and joins annotation records
from DynamoDB.
Indexed variants
in PostgreSQL
Annotations
API Gateway Microservice
#IndyCloudConf @MarkSchroering
Application - Lessons Learned
Use reader endpoints to get better performance
for AWS Aurora databases
● High write load caused by large ingestion
jobs will not impact query performance for
clients needing read only access
● Reader endpoint load balances for query
intensive applications
#IndyCloudConf @MarkSchroering
Current Storage and Performance Stats
● ~101,706,611 processed unique variants
● S3 Storage: ~289 GB
● DynamoDB Storage: ~94 GB
● PostgreSQL Snapshot: ~3.3 TB
● Query Response Times
○ Min: ~214ms
○ Avg: ~1s
○ Max: ~3.5s
#IndyCloudConf @MarkSchroering
“A system is never finished being developed until it
ceases to be used.”
@jerryweinberg
Questions?
Mark Schroering
Software Architect
@MarkSchroering

Weitere ähnliche Inhalte

Was ist angesagt?

Stream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobStream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobDatabricks
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataDataWorks Summit
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Vectorized R Execution in Apache Spark
Vectorized R Execution in Apache SparkVectorized R Execution in Apache Spark
Vectorized R Execution in Apache SparkDatabricks
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Databricks
 
Big Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al EssaBig Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al EssaSpark Summit
 
Dataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSDataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSAlasdair Gray
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...AWS Chicago
 
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...Uri Savelchev
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Databricks
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeDatabricks
 
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken SanfordMigrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken SanfordSri Ambati
 
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike LimcacoExtending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike LimcacoData Con LA
 
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkReal-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkDatabricks
 
GraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandGraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandOntotext
 
Real-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesReal-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesDatabricks
 

Was ist angesagt? (20)

Stream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the JobStream Processing: Choosing the Right Tool for the Job
Stream Processing: Choosing the Right Tool for the Job
 
Obfuscating LinkedIn Member Data
Obfuscating LinkedIn Member DataObfuscating LinkedIn Member Data
Obfuscating LinkedIn Member Data
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Vectorized R Execution in Apache Spark
Vectorized R Execution in Apache SparkVectorized R Execution in Apache Spark
Vectorized R Execution in Apache Spark
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
Big Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al EssaBig Data Meets Learning Science: Keynote by Al Essa
Big Data Meets Learning Science: Keynote by Al Essa
 
Dataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSDataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLS
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat Detection
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...
Piyali Kamra - Analytics and Data Visualization pipeline backed by AWS Glue &...
 
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
AWS Finland September Meetup - Using Amazon Neptune to build Fashion Knowledg...
 
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
Transitioning from Traditional DW to Apache® Spark™ in Operating Room Predict...
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken SanfordMigrating from Closed to Open Source - Fonda Ingram & Ken Sanford
Migrating from Closed to Open Source - Fonda Ingram & Ken Sanford
 
Extending Analytic Reach
Extending Analytic ReachExtending Analytic Reach
Extending Analytic Reach
 
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike LimcacoExtending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
Extending Analytic Reach - From The Warehouse to The Data Lake by Mike Limcaco
 
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkReal-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
 
Query O
Query OQuery O
Query O
 
GraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on DemandGraphDB Cloud: Enterprise Ready RDF Database on Demand
GraphDB Cloud: Enterprise Ready RDF Database on Demand
 
Real-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on KubernetesReal-Time Health Score Application using Apache Spark on Kubernetes
Real-Time Health Score Application using Apache Spark on Kubernetes
 

Ähnlich wie Processing genetic data at scale

Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCMark Smith
 
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Ahsan Javed Awan
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriChetan Khatri
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...Chetan Khatri
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 Databricks
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileRoy Kim
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudRick Bilodeau
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudStreamsets Inc.
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceeRic Choo
 
Big data at AWS Chicago User Group - 2014
Big data at AWS Chicago User Group - 2014Big data at AWS Chicago User Group - 2014
Big data at AWS Chicago User Group - 2014AWS Chicago
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationInside Analysis
 
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...Andre Essing
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...Amazon Web Services
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015Iulia Emanuela Iancuta
 

Ähnlich wie Processing genetic data at scale (20)

Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Data Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKCData Pipeline for The Big Data/Data Science OKC
Data Pipeline for The Big Data/Data Science OKC
 
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
Micro-architectural Characterization of Apache Spark on Batch and Stream Proc...
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
Big data at AWS Chicago User Group - 2014
Big data at AWS Chicago User Group - 2014Big data at AWS Chicago User Group - 2014
Big data at AWS Chicago User Group - 2014
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
First in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter IntegrationFirst in Class: Optimizing the Data Lake for Tighter Integration
First in Class: Optimizing the Data Lake for Tighter Integration
 
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
Azure Cosmos DB - NoSQL Strikes Back (An introduction to the dark side of you...
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
 
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
In Memory Data Pipeline And Warehouse At Scale - BerlinBuzzwords 2015
 

Kürzlich hochgeladen

IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...RajaP95
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Christo Ananth
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSRajkumarAkumalla
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 

Kürzlich hochgeladen (20)

IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
IMPLICATIONS OF THE ABOVE HOLISTIC UNDERSTANDING OF HARMONY ON PROFESSIONAL E...
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICSHARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
HARDNESS, FRACTURE TOUGHNESS AND STRENGTH OF CERAMICS
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur EscortsCall Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
Call Girls Service Nagpur Tanvi Call 7001035870 Meet With Nagpur Escorts
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 

Processing genetic data at scale

  • 1. Processing Genetic Data at Scale Mark Schroering Software Architect @MarkSchroering
  • 2. #IndyCloudConf @MarkSchroering About Me Mark Schroering Twitter: @MarkSchroering Github: mschroering Life sciences company headquartered in Indianapolis with offices in Salt Lake City and Research Triangle Park (RTP)
  • 3. #IndyCloudConf @MarkSchroering Sequencing costs have dropped significantly https://www.researchgate.net/figure/The-change-over-time-for-cost-per-raw-megabase-of- DNA-sequencing-Source_fig1_261879801
  • 4. #IndyCloudConf @MarkSchroering Precision Medicine is on the rise Source: Frost & Sullivan
  • 5. #IndyCloudConf @MarkSchroering A lot of genetic data is being generated A human genome has ~3 billion base pairs AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCA GTCAGATTTACCCTGGCTCACCTTGGCGTCGCGTCCGGCGGCAAACTA AGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCT About 0.1% of the genome is different among individuals ~3 million germline variants (mutations) per person
  • 6. #IndyCloudConf @MarkSchroering Traditionally variant data has been transferred and shared as files - Variant Call Format (VCF) #CHROM POS ID REF ALT QUAL chr1 120056534 . G A . chr3 178936091 . G A . chr11 108198392 . T TA . chr12 69233096 . C T . chr13 32913764 . A G . CLI tools exist that can search and transform the data >> ./vcftools --vcf input_data.vcf --chr1 --from-bp 1000000 --to-bp 2000000
  • 7. #IndyCloudConf @MarkSchroering Variants can be “annotated” to include additional information from other sources ##INFO=<ID=CLNSRCID,Number=.,Type=String,Description="Variant Clinical Channel IDs"> ##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - unknown, 1 - untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other"> #CHROM POS ID REF ALT QUAL FILTER INFO 1 985955 rs199476396 G C . . CLNSRCID=103320.0001;CLNSIG=5; 1 1199489 rs207460006 G A . . CLNSRCID=.;CLNSIG=1; ClinVar - Database of variants that pertain to human health COSMIC - Database of Cancer variants
  • 8. #IndyCloudConf @MarkSchroering The Problem Create a cloud based solution that can store and efficiently analyze large genomic data sets to provide meaningful insights to clinicians and researchers in a responsive manner
  • 9. #IndyCloudConf @MarkSchroering Solution Data Lake Indexed Variants Variant Annotations MicroservicesAPI Gateway Spark Transformation Jobs Batch Annotation Jobs Application Analytics Storage Ingestion
  • 10. #IndyCloudConf @MarkSchroering Ingestion - Convert from legacy file formats Variants File Data lake in S3Apache Parquet Apache Parquet provides efficient columnar storage and integrates with technologies like Spark, Redshift Spectrum, and Athena https://github.com/lifeomic/spark-vcf - Natively load variant files into a Spark Dataframe/Dataset
  • 11. #IndyCloudConf @MarkSchroering Ingestion - Variant Annotation Variants File(s) Data lake in S3 Legacy variant annotation tools are CLI based which made AWS Batch a good candidate for the annotation process. ClinVar COSMIC Dockerized annotations tools
  • 12. #IndyCloudConf @MarkSchroering Ingestion - Lessons Learned Utilize DynamoDB on-demand provisioning for tables that have unpredictable spikes in read/write capacity (released Nov’ 18) ● DynamoDB capacity auto-scaling is slow to react to spikes in throughput ● With on-demand provisioning, you pay per request ● Request rates are still capped by max table throughput and account limits
  • 13. #IndyCloudConf @MarkSchroering Ingestion - Lessons Learned Be aware of limits (hard and soft) put in place by your cloud provider on compute resources. Large spikes of ingestion requests can result in failures. Solutions: ● Add rate limiting to your API and force clients to slow down ● Add a queue to capture requests and process them when resources are available
  • 14. #IndyCloudConf @MarkSchroering Ingestion - Lessons Learned Utilize Spot (Preemptible) compute to save cost for big data ingestion tasks Solutions: ● Use a Batch Spot Compute Environment and set the retry strategy for jobs to >= 1 to allow jobs to be retried should an instance get terminated ● Use Spot Instance Fleets in EMR ● Have monitoring in place to get notified when jobs fail
  • 15. #IndyCloudConf @MarkSchroering Analytics - Index Variants and Annotations Data Lake Variant attributes needed for analytics are stored in PostgreSQL. Full annotation records are stored in DynamoDB. Indexed variants in PostgreSQL Annotations
  • 16. #IndyCloudConf @MarkSchroering Analytics - Lessons Learned Partition large tables for better query performance Solutions: ● PostgreSQL offers table partitioning by range (defined by a key column) or explicit listings. After partitioning a large table we saw dramatic improvements in query performance. Updates to a partition also do not impact query performance of other partitions.
  • 17. #IndyCloudConf @MarkSchroering Analytics - Lessons Learned Use a data lake for storing raw data Solutions: ● Store raw data in S3 in a big data friendly format like Apache Parquet ○ Do not throw any data away. You may need it later ○ Can rebuild indexed data stores or create new ones as needed using the raw data
  • 18. #IndyCloudConf @MarkSchroering Application - Provide query results to client The Lambda function executes a query against the PostgreSQL database and joins annotation records from DynamoDB. Indexed variants in PostgreSQL Annotations API Gateway Microservice
  • 19. #IndyCloudConf @MarkSchroering Application - Lessons Learned Use reader endpoints to get better performance for AWS Aurora databases ● High write load caused by large ingestion jobs will not impact query performance for clients needing read only access ● Reader endpoint load balances for query intensive applications
  • 20. #IndyCloudConf @MarkSchroering Current Storage and Performance Stats ● ~101,706,611 processed unique variants ● S3 Storage: ~289 GB ● DynamoDB Storage: ~94 GB ● PostgreSQL Snapshot: ~3.3 TB ● Query Response Times ○ Min: ~214ms ○ Avg: ~1s ○ Max: ~3.5s
  • 21. #IndyCloudConf @MarkSchroering “A system is never finished being developed until it ceases to be used.” @jerryweinberg