2. #IndyCloudConf @MarkSchroering
About Me
Mark Schroering
Twitter: @MarkSchroering
Github: mschroering
Life sciences company headquartered in
Indianapolis with offices in Salt Lake City
and Research Triangle Park (RTP)
3. #IndyCloudConf @MarkSchroering
Sequencing costs have dropped significantly
https://www.researchgate.net/figure/The-change-over-time-for-cost-per-raw-megabase-of-DNA-sequencing-Source_fig1_261879801
5. #IndyCloudConf @MarkSchroering
A lot of genetic data is being
generated
A human genome has ~3 billion base pairs
AGCCCCTCAGGAGTCCGGCCACATGGAAACTCCTCATTCCGGAGGTCA
GTCAGATTTACCCTGGCTCACCTTGGCGTCGCGTCCGGCGGCAAACTA
AGAACACGTCGTCTAAATGACTTCTTAAAGTAGAATAGCGTGTTCTCT
About 0.1% of the genome is different among
individuals
~3 million germline variants (mutations) per person
6. #IndyCloudConf @MarkSchroering
Traditionally variant data has been transferred
and shared as files - Variant Call Format (VCF)
#CHROM  POS        ID  REF  ALT  QUAL
chr1    120056534  .   G    A    .
chr3    178936091  .   G    A    .
chr11   108198392  .   T    TA   .
chr12   69233096   .   C    T    .
chr13   32913764   .   A    G    .
CLI tools exist that can search and transform the data
>> ./vcftools --vcf input_data.vcf --chr 1 --from-bp 1000000 --to-bp 2000000
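To make the region search concrete, here is a minimal Python sketch of the same kind of filter applied to the sample records above. It is an illustration only, not how vcftools works internally; the function name `filter_region` and the position window are assumptions chosen for the example.

```python
from typing import Iterable, Iterator, List

def filter_region(lines: Iterable[str], chrom: str, start: int, end: int) -> Iterator[List[str]]:
    """Yield VCF records on `chrom` whose POS lies in [start, end] (inclusive)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        # Real VCF files are tab-separated; split() also accepts the spaces used on the slide.
        fields = line.split()
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield fields

# Sample records copied from the slide.
SAMPLE_VCF = """\
#CHROM POS ID REF ALT QUAL
chr1 120056534 . G A .
chr3 178936091 . G A .
chr11 108198392 . T TA .
chr12 69233096 . C T .
chr13 32913764 . A G .
"""

hits = list(filter_region(SAMPLE_VCF.splitlines(), "chr1", 120_000_000, 121_000_000))
```

A scan like this touches every record, which is one reason file-based workflows become slow as data sets grow.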
7. #IndyCloudConf @MarkSchroering
Variants can be “annotated” to include additional
information from other sources
##INFO=<ID=CLNSRCID,Number=.,Type=String,Description="Variant Clinical Channel IDs">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Variant Clinical Significance, 0 - unknown, 1 - untested, 2 - non-pathogenic, 3 - probable-non-pathogenic, 4 - probable-pathogenic, 5 - pathogenic, 6 - drug-response, 7 - histocompatibility, 255 - other">
#CHROM POS ID REF ALT QUAL FILTER INFO
1 985955 rs199476396 G C . . CLNSRCID=103320.0001;CLNSIG=5;
1 1199489 rs207460006 G A . . CLNSRCID=.;CLNSIG=1;
ClinVar - Database of variants that pertain to human health
COSMIC - Database of cancer variants
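As a sketch of how an annotated record might be read, the Python below splits the INFO column into key/value pairs and translates the CLNSIG code using the table from the header line above. The function names are assumptions for illustration, and only single numeric codes are handled.

```python
from typing import Dict, Optional

# Code table taken from the CLNSIG header description on the slide.
CLNSIG_LABELS = {
    0: "unknown", 1: "untested", 2: "non-pathogenic",
    3: "probable-non-pathogenic", 4: "probable-pathogenic",
    5: "pathogenic", 6: "drug-response", 7: "histocompatibility",
    255: "other",
}

def parse_info(info: str) -> Dict[str, str]:
    """Split an INFO string such as 'A=1;B=2;' into a dict of strings."""
    pairs = {}
    for item in info.strip(";").split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs

def clinical_significance(info: str) -> Optional[str]:
    """Return the CLNSIG label, or None if the code is missing or not a known single integer."""
    code = parse_info(info).get("CLNSIG", "")
    if not code.isdigit():
        return None  # multi-valued or missing codes are out of scope for this sketch
    return CLNSIG_LABELS.get(int(code))

label = clinical_significance("CLNSRCID=103320.0001;CLNSIG=5;")
```

For the two records shown, this yields "pathogenic" for rs199476396 and "untested" for rs207460006.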
8. #IndyCloudConf @MarkSchroering
The Problem
Create a cloud based solution that
can store and efficiently analyze
large genomic data sets to
provide meaningful insights to
clinicians and researchers in a
responsive manner
10. #IndyCloudConf @MarkSchroering
Ingestion - Convert from legacy file formats
Diagram: Variants File -> Apache Parquet -> Data lake in S3
Apache Parquet provides efficient columnar storage and integrates with technologies like Spark,
Redshift Spectrum, and Athena
https://github.com/lifeomic/spark-vcf - Natively load variant files into a Spark Dataframe/Dataset
11. #IndyCloudConf @MarkSchroering
Ingestion - Variant Annotation
Diagram: Variants File(s) -> Data lake in S3
Legacy variant annotation tools are CLI-based, which made AWS Batch a good candidate for the
annotation process.
Diagram labels: ClinVar, COSMIC, Dockerized annotation tools
12. #IndyCloudConf @MarkSchroering
Ingestion - Lessons Learned
Utilize DynamoDB on-demand provisioning for
tables that have unpredictable spikes in
read/write traffic (released November 2018)
● DynamoDB capacity auto-scaling is slow to
react to spikes in throughput
● With on-demand provisioning, you pay per
request
● Request rates are still capped by max table
throughput and account limits
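A minimal sketch of what opting into on-demand mode could look like when creating a table. The helper below only builds the request parameters; the table name `variant-annotations` and key `variant_id` are hypothetical and not taken from the talk.

```python
def on_demand_table_params(table_name: str, hash_key: str) -> dict:
    """Build create_table parameters for a DynamoDB table billed per request."""
    return {
        "TableName": table_name,
        "AttributeDefinitions": [{"AttributeName": hash_key, "AttributeType": "S"}],
        "KeySchema": [{"AttributeName": hash_key, "KeyType": "HASH"}],
        # PAY_PER_REQUEST selects on-demand mode, so no ProvisionedThroughput block is given.
        "BillingMode": "PAY_PER_REQUEST",
    }

# Hypothetical table and key names, for illustration only.
params = on_demand_table_params("variant-annotations", "variant_id")
```

With boto3 installed and credentials configured, these parameters would be passed as `boto3.client("dynamodb").create_table(**params)`.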
13. #IndyCloudConf @MarkSchroering
Ingestion - Lessons Learned
Be aware of limits (hard and soft) put in place by your cloud provider on
compute resources. Large spikes of ingestion requests can result in failures.
Solutions:
● Add rate limiting to your API and force clients to slow down
● Add a queue to capture requests and process them when resources are
available
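The two solutions above can be combined. The sketch below, which assumes a token-bucket limiter and an in-memory queue, is one possible shape and not the implementation described in the talk; the class names and the injected clock are illustrative. In a real deployment the queue would be a durable service rather than a Python deque.

```python
import time
from collections import deque

class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class IngestionFrontDoor:
    """Rate-limit incoming requests, then queue accepted ones until compute is free."""
    def __init__(self, bucket: TokenBucket):
        self.bucket = bucket
        self.pending = deque()

    def submit(self, request) -> bool:
        if not self.bucket.allow():
            return False  # an API would typically answer HTTP 429 here
        self.pending.append(request)
        return True

    def drain(self, free_slots: int) -> list:
        """Hand out at most `free_slots` queued requests, oldest first."""
        batch = []
        while self.pending and len(batch) < free_slots:
            batch.append(self.pending.popleft())
        return batch
```

Rejected clients are expected to back off and retry, while accepted requests wait in the queue until `drain` is called with the number of free workers.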
14. #IndyCloudConf @MarkSchroering
Ingestion - Lessons Learned
Utilize Spot (Preemptible) compute to save cost for big data ingestion tasks
Solutions:
● Use a Batch Spot Compute Environment and set the job retry strategy to
more than one attempt (attempts >= 2) so that a job is retried if its instance gets terminated
● Use Spot Instance Fleets in EMR
● Have monitoring in place to get notified when jobs fail
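As a rough illustration of the Batch retry setting, the helper below builds the parameters for a job submission. In AWS Batch, `attempts` counts total runs (1 to 10), so a value of 2 or more is needed for a job to be retried. The job, queue, and definition names are hypothetical.

```python
def spot_job_request(job_name: str, job_queue: str, job_definition: str, attempts: int = 3) -> dict:
    """Build submit_job parameters with a retry strategy suited to Spot interruptions."""
    if not 1 <= attempts <= 10:
        raise ValueError("AWS Batch accepts between 1 and 10 attempts")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # attempts is the total number of runs; 3 means up to two retries.
        "retryStrategy": {"attempts": attempts},
    }

# Hypothetical names, for illustration only.
request = spot_job_request("annotate-sample-001", "spot-annotation-queue", "vcf-annotator")
```

With boto3 available, the request would be sent as `boto3.client("batch").submit_job(**request)`.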
15. #IndyCloudConf @MarkSchroering
Analytics - Index Variants and Annotations
Data Lake
Variant attributes needed for analytics are stored in PostgreSQL. Full annotation records are stored in
DynamoDB.
Diagram labels: Indexed variants in PostgreSQL, Annotations
16. #IndyCloudConf @MarkSchroering
Analytics - Lessons Learned
Partition large tables for better query performance
Solutions:
● PostgreSQL offers table partitioning by range (defined by a key column) or
by list (explicit key values). After partitioning a large table we saw dramatic
improvements in query performance. Updates to one partition also do not
impact query performance on other partitions.
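To show what list partitioning looks like, the sketch below generates declarative partitioning DDL (available in PostgreSQL 10 and later) for a variants table partitioned by chromosome. The table layout and column names are assumptions for illustration; the talk does not describe the actual schema or partition key.

```python
from typing import Iterable, List

def list_partition_ddl(parent: str, values: Iterable[str]) -> List[str]:
    """Return DDL for a parent table partitioned by LIST (chromosome), plus one child per value.

    Identifiers are interpolated directly, so they must come from trusted code, not user input.
    """
    statements = [
        f"CREATE TABLE {parent} ("
        "chromosome text NOT NULL, pos bigint NOT NULL, "
        "ref_allele text, alt_allele text"
        ") PARTITION BY LIST (chromosome);"
    ]
    for value in values:
        statements.append(
            f"CREATE TABLE {parent}_{value} PARTITION OF {parent} "
            f"FOR VALUES IN ('{value}');"
        )
    return statements

ddl = list_partition_ddl("variants", ["chr1", "chr2", "chr3"])
```

A query that filters on `chromosome = 'chr1'` can then be answered from the `variants_chr1` partition alone, which is where the performance gain comes from.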
17. #IndyCloudConf @MarkSchroering
Analytics - Lessons Learned
Use a data lake for storing raw data
Solutions:
● Store raw data in S3 in a big-data-friendly format like Apache Parquet
○ Do not throw any data away. You may need it later.
○ You can rebuild indexed data stores, or create new ones as needed, using
the raw data
18. #IndyCloudConf @MarkSchroering
Application - Provide query results to client
The Lambda function executes a query against the PostgreSQL database and joins annotation records
from DynamoDB.
Diagram labels: API Gateway, Microservice, Indexed variants in PostgreSQL, Annotations
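The join step might look roughly like the sketch below. It uses in-memory stand-ins for both stores so the logic is visible on its own; the `variant_id` key and the record fields are assumptions, and a real function would query PostgreSQL and fetch annotations from DynamoDB instead.

```python
from typing import Callable, Iterable, List, Optional

def join_annotations(
    variant_rows: Iterable[dict],
    fetch_annotation: Callable[[str], Optional[dict]],
) -> List[dict]:
    """Attach the full annotation record to each indexed variant row."""
    results = []
    for row in variant_rows:
        # Variants without an annotation record get an empty dict.
        annotation = fetch_annotation(row["variant_id"]) or {}
        results.append({**row, "annotation": annotation})
    return results

# Stand-in for rows returned by the PostgreSQL query (hypothetical fields).
rows = [
    {"variant_id": "1:985955:G:C", "chromosome": "1", "pos": 985955},
    {"variant_id": "1:1199489:G:A", "chromosome": "1", "pos": 1199489},
]
# Stand-in for the DynamoDB annotations table, keyed by variant_id.
annotation_store = {"1:985955:G:C": {"CLNSIG": "5", "CLNSRCID": "103320.0001"}}

joined = join_annotations(rows, annotation_store.get)
```

Keeping only the attributes needed for filtering in PostgreSQL and fetching the larger annotation records by key keeps the indexed table small.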
19. #IndyCloudConf @MarkSchroering
Application - Lessons Learned
Use reader endpoints to get better performance
for Amazon Aurora databases
● High write load caused by large ingestion
jobs will not impact query performance for
clients needing read only access
● The reader endpoint load-balances connections across
read replicas for query-intensive applications
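One simple way to apply this in application code is to choose the endpoint by workload type, as in the sketch below. The hostnames are placeholders, not real endpoints, and the class is an illustrative assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AuroraEndpoints:
    """Holds the writer (cluster) and reader endpoints of an Aurora cluster."""
    writer: str
    reader: str

    def for_workload(self, read_only: bool) -> str:
        # Read-only queries go to the reader endpoint; everything else to the writer.
        return self.reader if read_only else self.writer

# Placeholder hostnames, for illustration only.
endpoints = AuroraEndpoints(
    writer="example.cluster-abc123.us-east-1.rds.amazonaws.com",
    reader="example.cluster-ro-abc123.us-east-1.rds.amazonaws.com",
)

query_host = endpoints.for_workload(read_only=True)
ingest_host = endpoints.for_workload(read_only=False)
```

Ingestion jobs would connect to `ingest_host`, while the client-facing query path would connect to `query_host`.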