2. About the speaker
• Currently “Data Scientist” at Cloudera
• PhD in Biomedical Engineering at MIT/Harvard (2005-2012)
• Focused on next-generation DNA sequencing technology in George Church’s lab
• Co-founded Good Start Genetics (2007-)
• First application of next-gen sequencing to genetic carrier screening
• laserson@cloudera.com
3. Agenda
• Historical context
• Introduction to Hadoop ecosystem
• Genomics on Hadoop
• Other use cases in life sciences
6. Indexing the Web
• The web is huge
• Hundreds of millions of pages in 1999
• How do you index it?
• Crawl all the pages
• Rank pages based on relevance metrics
• Build search index of keywords to pages
• Do it in real time!
8. Databases in 1999
1. Buy a really big machine
2. Install expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another big machine as backup
12. Google does something different
• Designed their own storage and processing infrastructure
• Google File System (GFS) and MapReduce (MR)
• Goals: KISS
• Cheap
• Scalable
• Reliable
13. Google does something different
• It worked!
• Powered Google Search for many years
• General framework for large-scale batch computation tasks
• Still used internally at Google to this day
15. Birth of Hadoop at Yahoo!
• 2004-2006: Doug Cutting and Mike Cafarella implement open-source versions of GFS/MR
• 2006: Spun out as Apache Hadoop
• Named after Doug’s son’s yellow stuffed elephant
16. Industry strategy: Copy Google

Google            Open-source       Function
GFS               HDFS              Distributed file system
MapReduce         MapReduce         Batch distributed data processing
Bigtable          HBase             Distributed DB/key-value store
Protobuf/Stubby   Thrift or Avro    Data serialization/RPC
Pregel            Giraph            Distributed graph processing
Dremel/F1         Cloudera Impala   Scalable interactive SQL (MPP)
FlumeJava         Crunch            Abstracted data pipelines on Hadoop
18. HDFS design assumptions
• Based on Google File System
• Files are large (GBs to TBs)
• Failures are common
• Massive scale means failures very likely
• Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only
19. HDFS properties
• Fault-tolerant
• Gracefully responds to node/disk/network failures
• Horizontally scalable
• Low marginal cost
• High-bandwidth
20. HDFS storage distribution
[Figure: an input file is split into blocks 1-5; each block is stored on three of the five nodes (Node A through Node E), so any single node can fail without data loss. A toy placement sketch follows.]
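To make the replication scheme concrete, here is a toy Python sketch of the placement the figure shows: replication factor 3 over 5 nodes, with simple round-robin placement. Everything here is illustrative only; real HDFS placement is rack-aware and dynamic, and the block size shown is today's default (64 MB was common in 2013).

import itertools

NODES = ["A", "B", "C", "D", "E"]
REPLICATION = 3
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default today; 64 MB was common in 2013

def place_blocks(file_size_bytes):
    # Assign each block to REPLICATION distinct nodes (simple round-robin).
    n_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    starts = itertools.cycle(range(len(NODES)))
    placement = {}
    for block_id in range(n_blocks):
        start = next(starts)
        placement[block_id] = [NODES[(start + i) % len(NODES)]
                               for i in range(REPLICATION)]
    return placement

for block_id, replicas in place_blocks(5 * BLOCK_SIZE).items():
    print(f"block {block_id + 1} -> nodes {replicas}")

Losing any single node leaves at least two replicas of every block, which is what makes the fault tolerance on the previous slide cheap to provide.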
21. MapReduce
• Structured as (sketched below):
1. Embarrassingly parallel “map” stage
2. Cluster-wide distributed sort (“shuffle”)
3. Aggregating “reduce” stage
• Data locality: process the data where it is stored
• Fault tolerance: failed tasks are automatically detected and restarted
• Schema-on-read: data need not be stored conforming to a rigid schema
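As a concrete illustration of the two user-supplied functions, here is a minimal word-count sketch in Python that simulates the three stages locally. This is a pedagogical sketch, not Hadoop API code: in real MapReduce the framework distributes each stage across the cluster, and the user writes only map_fn and reduce_fn.

from itertools import groupby

def map_fn(line):
    # Map: emit (key, value) pairs independently for each input record.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: aggregate all values that share a key.
    yield key, sum(values)

lines = ["hadoop scales out", "hadoop tolerates failures"]
pairs = [kv for line in lines for kv in map_fn(line)]
pairs.sort()  # stands in for the shuffle: a cluster-wide sort grouping by key
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    for word, total in reduce_fn(key, (v for _, v in group)):
        print(word, total)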
23. HPC separates compute from storage
[Diagram: a compute cluster (high-performance hardware, low failure rate, expensive) connected to separate storage infrastructure (proprietary distributed file system, expensive) by a big network pipe ($$$)]
• User typically works by manually submitting jobs to a scheduler, e.g., LSF, Grid Engine, etc.
• HPC is about compute. Hadoop is about data.
24. Hadoop colocates compute and storage
[Diagram: compute and storage on the same cluster]
• Commodity hardware
• Data-locality
• Reduced networking needs
• HPC is about compute. Hadoop is about data.
25. HPC is lower-level than Hadoop
• HPC only exposes job scheduling
• Parallelization typically occurs through MPI
• Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually
• Hadoop has fault-tolerance, data locality, and horizontal scalability built in
27. Flume
• A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.
[Diagram: many Clients sending events through a tier of Agents]
29. Cloudera Search
• Interactive search queries on top of HDFS
• Built on Solr and SolrCloud
• Near-real-time indexing of new documents
30. Benefits of Hadoop ecosystem
• Inexpensive commodity compute/storage
• Tolerates random hardware failure
• Decreased need for high-bandwidth network pipes
• Co-locate compute and storage
• Exploit data locality
• Simple horizontal scalability by adding nodes
• MapReduce jobs effectively guaranteed to scale
• Fault-tolerance/replication built-in. Data is durable
• Large ecosystem of tools
• Flexible data storage. Schema-on-read. Unstructured data.
36. Genomics ETL
.fastq → [short read alignment] → .bam → [genotype calling] → .vcf
• Short read alignment is embarrassingly parallel
• Pileup/variant calling requires a distributed sort (see the sketch after this list)
• GATK reimplements the MapReduce programming model; it could run on Hadoop
• Hadoop tools already available:
• Crossbow: short read alignment/variant calling
• Hadoop-BAM: distributed bamtools
• BioPig: manipulating large fasta/q
• SEAL: Hadoop-enabled BWA
• Contrail: de novo assembly
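A minimal sketch of how this pipeline maps onto MapReduce, in the style of a Hadoop Streaming job. Everything here is illustrative: align() is a stand-in for a real aligner such as BWA, and the read-per-line TSV input format is invented.

# Usage: python pipeline.py map < reads.tsv | sort | python pipeline.py reduce
import sys
from itertools import groupby

def align(seq):
    # Stand-in for a real aligner (e.g., BWA): return (chromosome, position).
    return ("chr1", hash(seq) % 1_000_000)

def mapper(stream):
    # Map stage: align each read independently -- embarrassingly parallel.
    for line in stream:
        read_id, seq = line.rstrip("\n").split("\t")
        chrom, pos = align(seq)
        # Keying on (chrom, pos) makes the shuffle sort reads by coordinate.
        print(f"{chrom}:{pos:09d}\t{read_id}\t{seq}")

def reducer(stream):
    # Reduce stage: reads arrive coordinate-sorted, so per-position pileup
    # depth (the input to genotype calling) falls out of one sequential pass.
    records = (line.rstrip("\n").split("\t") for line in stream)
    for locus, group in groupby(records, key=lambda rec: rec[0]):
        print(f"{locus}\tdepth={sum(1 for _ in group)}")

if __name__ == "__main__":
    mapper(sys.stdin) if sys.argv[1:] == ["map"] else reducer(sys.stdin)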
37. Use case 1: Scaling a genome center pipeline
• Currently at 5k genomes (150 TB incl. raw); looking to scale to 25k now (1 PB) and eventually 100k (requiring 4 PB)
• Current throughput
• >1,300 samples per month
• >12 TB raw data per month
• Data ultimately served from a MySQL database
• 750 GB of processed variant data
• 25k genomes requires >3.5 TB in MySQL
• Complex 4-tier storage system, including tape, filer, and RDBMS
38. Use case 1: Scaling a genome center pipeline
• Database serves population genetics applications and case/control studies
• Unify all data processing into HDFS
• Replace MySQL with Impala on Hadoop for increased scalability
• Possibly move raw data processing into MapReduce
39. Use case 2: Querying large, integrated data sets
• Biotech client has thousands of genomes
• Wants to expose ad hoc querying functionality at large scale
• e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
• Integrating data with public data sets (e.g., ENCODE, UCSC browser)
• Terabyte-scale annotation sets
• Currently, these capabilities (e.g., data joins) are often manually implemented
40. Use case 2: Querying large, integrated data sets
• Hadoop allows all data to be centrally stored and accessible
• Impala exposes a SQL query interface to data sets in Hadoop
41. Variant-filtering example
• “Give me all SNPs that:
• are on chromosome 5
• are absent from dbSNP
• are present in COSMIC
• are observed in breast cancer samples
• are absent from prostate cancer samples
• overlap a DNase hypersensitivity site
• overlap a ChIP-seq site for a particular TF”
• On the full 1000 Genomes data set (~37 billion variants), the query finishes in a couple of seconds (a hypothetical rendering follows)
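For illustration, the query might be expressed through Impala's SQL interface roughly as follows, submitted here via the impyla Python client. Every table name, column name, and host below is a hypothetical stand-in for the actual schema, and FOXA1 is just an example TF.

from impala.dbapi import connect  # impyla package

# All table/column names below are hypothetical stand-ins for the real schema.
SQL = """
SELECT v.chrom, v.pos, v.ref, v.alt
FROM variants v
LEFT JOIN dbsnp d     ON v.chrom = d.chrom AND v.pos = d.pos
JOIN cosmic c         ON v.chrom = c.chrom AND v.pos = c.pos
JOIN dnase_sites ds   ON v.chrom = ds.chrom
                     AND v.pos BETWEEN ds.start_pos AND ds.end_pos
JOIN chipseq_sites cs ON v.chrom = cs.chrom
                     AND v.pos BETWEEN cs.start_pos AND cs.end_pos
WHERE v.chrom = 'chr5'
  AND d.pos IS NULL                      -- absent from dbSNP (anti-join)
  AND v.seen_in_breast_cancer = TRUE     -- observed in breast cancer samples
  AND v.seen_in_prostate_cancer = FALSE  -- absent from prostate cancer samples
  AND cs.tf_name = 'FOXA1'               -- a particular TF (illustrative)
"""

conn = connect(host="impalad.example.com", port=21050)  # Impala daemon
cur = conn.cursor()
cur.execute(SQL)
for row in cur.fetchall():
    print(row)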
42. All-vs-all eQTL
• Possible to generate trillions of hypothesis tests
• 10^7 loci × 10^4 phenotypes × 10s of tissues = 10^12 p-values
• Tested below on 120 billion associations
• Example queries:
• “Given 5 genes of interest, find the top 20 most significant eQTLs (cis and/or trans)”
• Finishes in several seconds (a sketch follows)
• “Find all cis-eQTLs across the entire genome”
• Finishes in a couple of minutes
• Limited by disk throughput
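The first example query could take a shape like this against a hypothetical table of precomputed associations; all names are illustrative, and the query would be submitted through the same impyla cursor as in the previous sketch.

# Hypothetical Impala table of precomputed associations:
#   eqtls(gene STRING, snp STRING, tissue STRING, p_value DOUBLE, is_cis BOOLEAN)
genes = "('GENE_A', 'GENE_B', 'GENE_C', 'GENE_D', 'GENE_E')"  # 5 genes of interest
sql = f"""
SELECT gene, snp, tissue, p_value
FROM eqtls
WHERE gene IN {genes}
ORDER BY p_value
LIMIT 20
"""
# cur.execute(sql)  # submitted via the impyla cursor from the previous sketch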
43. All-vs-all eQTL
• “Find all SNPs that:
• are in LD with some lead SNP or eQTL of interest
• align with some functional annotation of interest”
• Still in testing, but likely finishes in seconds
(Schaub et al., Genome Research, 2012)
44. Genomics summary
• ETL (raw data to analysis-ready data)
• Data integration
• e.g., interactively queryable UCSC genome browser
• De novo assembly
• NLP on scientific literature
46. Use case 3: Clinical document queries for EHR company
• EHR company wants to expose query functionality to clinicians
• >16 million clinical documents with free text, processed through an NLP pipeline
• >500 million lab results
• Perform subject expansion on search queries via ontologies (see the sketch after this list)
• e.g., “myocardial infarction” will match “heart disease”
• Search functionality implemented with Lucene (serving) on top of HBase (processing/storage/indexing)
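A minimal sketch of what ontology-driven subject expansion looks like; the toy ONTOLOGY mapping is a stand-in for a real vocabulary such as SNOMED CT or UMLS.

# Hypothetical ontology mapping: concept -> related concepts
# (synonyms/hypernyms); a real system would use SNOMED CT, UMLS, etc.
ONTOLOGY = {
    "myocardial infarction": {"heart disease", "acute coronary syndrome"},
    "diabetes": {"diabetes mellitus", "hyperglycemia"},
}

def expand_query(terms):
    # Add each term's ontology neighbors to the query before searching.
    expanded = set(terms)
    for term in terms:
        expanded |= ONTOLOGY.get(term, set())
    return expanded

# The expanded terms would then be OR-ed together in the Lucene/Solr query.
print(expand_query(["myocardial infarction"]))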
47. Use case 3: Clinical document queries for EHR company
• Interested in recommendation engine-enabled queries, like:
• Clinician searches “diabetes” and has relevant lab results already highlighted when opening a patient’s record
• Clinician wants to know what other conditions might be correlated with a finding of interest
48. Use case 3: Clinical document queries for EHR company
• “Find other patients similar to mine”
• The Stanford system is limited to search
• Recommendation engines allow a “find similar” button (a toy sketch follows)
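One simple way to back a “find similar” button is nearest-neighbor search over per-patient feature vectors. A toy sketch with invented patients, using cosine similarity:

from math import sqrt

# Invented patient profiles: indicator vectors over conditions/findings.
patients = {
    "patient_1": {"diabetes": 1, "hypertension": 1, "neuropathy": 1},
    "patient_2": {"diabetes": 1, "hypertension": 1},
    "patient_3": {"asthma": 1},
}

def cosine(a, b):
    # Cosine similarity between two sparse indicator vectors.
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = patients["patient_1"]
ranked = sorted(((cosine(query, vec), name)
                 for name, vec in patients.items() if name != "patient_1"),
                reverse=True)
print(ranked)  # most similar patients first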
49. Use case 4: Insurance company
• Data from 30 different EHRs across multiple business units
• High variance in ICD9 coding between locales
• Use NLP and machine learning to improve ICD9 coding and reduce variance in diagnosis (a sketch follows)
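One plausible shape for the NLP/ML step, sketched with scikit-learn. The notes and ICD9 codes below are invented placeholders; a production system would train on large volumes of already-coded documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented placeholder training data: note text -> ICD9 code.
notes = [
    "patient presents with chest pain and shortness of breath",
    "type 2 diabetes mellitus without complications",
    "acute myocardial infarction of anterior wall",
]
codes = ["786.50", "250.00", "410.11"]

# TF-IDF features over the free text, plus a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(notes, codes)
print(model.predict(["follow-up visit for diabetes management"]))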
50. Use case 5: Pharma company variance in yields
• Pharma company performs large batch fermentations of its product
• Finds high levels of variance in yield
• Fermentations are automated and highly instrumented
• e.g., dissolved oxygen, nutrients, COAs, temperature, etc.
• Perform time series analysis on fermentation runs to predict yields and determine which variables control variance (a sketch follows)
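A hedged sketch of the modeling step: fit a regression model on per-run summaries of the instrumented time series, then read off feature importances to see which variables drive yield variance. All data and feature names below are invented.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Invented per-run features summarizing each fermentation's time series.
features = ["mean_dissolved_o2", "min_ph", "peak_temperature", "nutrient_rate"]
X = rng.normal(size=(200, len(features)))            # 200 simulated runs
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)  # toy yield

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, importance in sorted(zip(features, model.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.2f}")  # which variables drive yield variance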
51. Use case 6: AgTech company integrating data sources
• Multiple reference genome sequences
• Genotyping on thousands of samples
• Weather data
• Soil data
• Microbiome data
• Yield data
• Geo data
• All integrated in HBase (see the sketch after this list)
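A small sketch of the integration pattern using the happybase HBase client (which talks to the HBase Thrift server). The table name, column families, and genomic-position row-key scheme are assumptions for illustration, not the company's actual design.

import happybase  # talks to HBase via its Thrift server

# Row-key scheme (an assumption): chromosome + zero-padded position,
# so genomic intervals become simple range scans.
conn = happybase.Connection("hbase-thrift.example.com")
table = conn.table("integrated_genomics")  # hypothetical table name
table.put(b"chr1:000012345", {
    b"genotype:sample_42": b"A/T",
    b"soil:ph": b"6.8",
    b"weather:rain_mm": b"12.4",
})
# All data sources keyed by position come back together in one scan:
for key, data in table.scan(row_start=b"chr1:000012000",
                            row_stop=b"chr1:000013000"):
    print(key, data)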
52. Use case 6: AgTech company integrating data sources
• Can increase crop yields ~15% by “printing” seeds onto a field
• Supports search queries by name, ontology concepts, protein families, creation dates, assembly/chromosome positions, SNPs
• Imports any annotation data in CSV/GFF
• Integration with cloning tools
• Supports a web front-end for easy access
Speaker notes
• Already mature technologies at this point. The DB community thought it was silly. Non-Google companies were not yet at this scale. Google is not in the business of releasing infrastructure software; they sell ads.
• Mostly through the Apache Software Foundation.
• Talk HDFS and MapReduce, then some other tools.
• Large blocks; blocks replicated around.
• Two functions required; the user only needs to supply these two functions.
• Need to be careful because you can DDoS your database.
• Log scale.
• Define ETL.
• Volume, Variety, Velocity.
• Cloudera Enterprise – The Platform for Big Data: a complete data management solution powered by Apache Hadoop. A collection of open source projects forms the foundation of the platform; Cloudera wraps the open source core with additional software for system and data management, as well as technical support. Five attributes:
• Scalable: storage and compute in a single system brings computation to the data (rather than the other way around); scale capacity and performance linearly by just adding nodes; proven at massive scale (tens of PB of data, millions of users).
• Flexible: store any type of data (structured, semi-structured, unstructured) in its native format, with no conversion required and no loss of data fidelity due to ETL. Fluid structuring: no single model or schema the data must conform to; determine how you want to look at the data at the time you ask the question; alter structure to optimize query performance as desired (not required), via multiple open source file formats like Avro and Parquet. Multiple forms of computation, depending on your skillset: batch processing (MapReduce, Hive, Pig, Java), interactive SQL (Impala, BI tools), interactive search for non-technical users, machine learning on large datasets (e.g., Apache Mahout), and math tools like SAS and R for data scientists and statisticians, with more to come.
• Cost-effective: scale out on inexpensive, industry-standard hardware (vs. highly tuned, specialized hardware); fault tolerance built in; leverage cost structures with existing vendors; reduced data movement; fewer redundant copies of data; less time spent migrating/managing; open source software is easy to acquire and prove value/ROI.
• Open: rapid innovation from large development communities; free to download and deploy, so you can demonstrate the value of the technology before making a large-scale investment; no vendor lock-in. Cloudera's open source strategy: if it stores or processes data, it's open source; leading contributor to the Apache Hadoop ecosystem.
• Integrated: works with existing investments in databases and data warehouses, analytics and BI solutions, ETL tools, platforms and operating systems, and hardware and networking equipment; over 700 partners, including the leaders in those market segments.