2. About the speaker
• Currently “Data Scientist” at Cloudera
• PhD in Biomedical Engineering at MIT/Harvard (2005-2012)
• Focused on next-generation DNA sequencing technology in George Church’s lab
• Co-founded Good Start Genetics (2007-)
• First application of next-gen sequencing to genetic carrier screening
• laserson@cloudera.com
3. Agenda
• Historical context
• Introduction to Hadoop ecosystem
• Genomics on Hadoop
• Other use cases in life sciences
6. Indexing the Web
• The web is huge
• Hundreds of millions of pages in 1999
• How do you index it?
• Crawl all the pages
• Rank pages based on relevance metrics
• Build search index of keywords to pages
• Do it in real time!
8. Databases in 1999
1. Buy a really big machine
2. Install expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another big machine as backup
12. Google does something different
• Designed their own storage and processing infrastructure
• Google File System (GFS) and MapReduce (MR)
• Goals: KISS
• Cheap
• Scalable
• Reliable
13. Google does something different
• It worked!
• Powered Google Search for many years
• General framework for large-scale batch computation tasks
• Still used internally at Google to this day
15. Birth of Hadoop at Yahoo!
• 2004-2006: Doug Cutting and Mike Cafarella implement open-source versions of GFS/MR
• 2006: Spun out as Apache Hadoop
• Named after Doug’s son’s yellow stuffed elephant
16. Industry strategy: Copy Google

Google            Open-source       Function
GFS               HDFS              Distributed file system
MapReduce         MapReduce         Batch distributed data processing
Bigtable          HBase             Distributed DB/key-value store
Protobuf/Stubby   Thrift or Avro    Data serialization/RPC
Pregel            Giraph            Distributed graph processing
Dremel/F1         Cloudera Impala   Scalable interactive SQL (MPP)
FlumeJava         Crunch            Abstracted data pipelines on Hadoop
18. HDFS design assumptions
• Based on Google File System
• Files are large (GBs to TBs)
• Failures are common
• Massive scale means failures very likely
• Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only
19. HDFS properties
• Fault-tolerant
• Gracefully responds to node/disk/network failures
• Horizontally scalable
• Low marginal cost
• High-bandwidth
20. HDFS storage distribution
[Figure: an input file is split into blocks 1-5; each block is stored on three of the five nodes (Node A through Node E), so any single node can fail without data loss. A toy placement sketch follows.]
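To make the replication scheme concrete, here is a toy Python sketch of the placement the figure shows: replication factor 3 over 5 nodes, with simple round-robin placement. Everything here is illustrative only; real HDFS placement is rack-aware and dynamic, and the block size shown is today's default (64 MB was common in 2013).

import itertools

NODES = ["A", "B", "C", "D", "E"]
REPLICATION = 3
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default today; 64 MB was common in 2013

def place_blocks(file_size_bytes):
    # Assign each block to REPLICATION distinct nodes (simple round-robin).
    n_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    starts = itertools.cycle(range(len(NODES)))
    placement = {}
    for block_id in range(n_blocks):
        start = next(starts)
        placement[block_id] = [NODES[(start + i) % len(NODES)]
                               for i in range(REPLICATION)]
    return placement

for block_id, replicas in place_blocks(5 * BLOCK_SIZE).items():
    print(f"block {block_id + 1} -> nodes {replicas}")

Losing any single node leaves at least two replicas of every block, which is what makes the fault tolerance on the previous slide cheap to provide.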
21. MapReduce
• Structured as (sketched below):
1. Embarrassingly parallel “map” stage
2. Cluster-wide distributed sort (“shuffle”)
3. Aggregating “reduce” stage
• Data locality: process the data where it is stored
• Fault tolerance: failed tasks are automatically detected and restarted
• Schema-on-read: data need not be stored conforming to a rigid schema
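As a concrete illustration of the two user-supplied functions, here is a minimal word-count sketch in Python that simulates the three stages locally. This is a pedagogical sketch, not Hadoop API code: in real MapReduce the framework distributes each stage across the cluster, and the user writes only map_fn and reduce_fn.

from itertools import groupby

def map_fn(line):
    # Map: emit (key, value) pairs independently for each input record.
    for word in line.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce: aggregate all values that share a key.
    yield key, sum(values)

lines = ["hadoop scales out", "hadoop tolerates failures"]
pairs = [kv for line in lines for kv in map_fn(line)]
pairs.sort()  # stands in for the shuffle: a cluster-wide sort grouping by key
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    for word, total in reduce_fn(key, (v for _, v in group)):
        print(word, total)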
23. HPC separates compute from storage
[Diagram: a compute cluster (high-performance hardware, low failure rate, expensive) connected to separate storage infrastructure (proprietary distributed file system, expensive) by a big network pipe ($$$)]
• User typically works by manually submitting jobs to a scheduler, e.g., LSF, Grid Engine, etc.
• HPC is about compute. Hadoop is about data.
24. Hadoop colocates compute and storage
[Diagram: compute and storage on the same cluster]
• Commodity hardware
• Data-locality
• Reduced networking needs
• HPC is about compute. Hadoop is about data.
25. HPC is lower-level than Hadoop
• HPC only exposes job scheduling
• Parallelization typically occurs through MPI
• Very low-level communication primitives
• Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually
• Hadoop has fault-tolerance, data locality, and horizontal scalability built in
27. Flume
• A streaming data collection and aggregation system for massive volumes of data, such as RPC services, Log4J, Syslog, etc.
[Diagram: many Clients sending events through a tier of Agents]
29. Cloudera Search
• Interactive search queries on top of HDFS
• Built on Solr and SolrCloud
• Near-real-time indexing of new documents
30. Benefits of Hadoop ecosystem
• Inexpensive commodity compute/storage
• Tolerates random hardware failure
• Decreased need for high-bandwidth network pipes
• Co-locate compute and storage
• Exploit data locality
• Simple horizontal scalability by adding nodes
• MapReduce jobs effectively guaranteed to scale
• Fault-tolerance/replication built-in. Data is durable
• Large ecosystem of tools
• Flexible data storage. Schema-on-read. Unstructured data.
36. Genomics ETL
.fastq → [short read alignment] → .bam → [genotype calling] → .vcf
• Short read alignment is embarrassingly parallel
• Pileup/variant calling requires a distributed sort (see the sketch after this list)
• GATK reimplements the MapReduce programming model; it could run on Hadoop
• Hadoop tools already available:
• Crossbow: short read alignment/variant calling
• Hadoop-BAM: distributed bamtools
• BioPig: manipulating large fasta/q
• SEAL: Hadoop-enabled BWA
• Contrail: de novo assembly
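A minimal sketch of how this pipeline maps onto MapReduce, in the style of a Hadoop Streaming job. Everything here is illustrative: align() is a stand-in for a real aligner such as BWA, and the read-per-line TSV input format is invented.

# Usage: python pipeline.py map < reads.tsv | sort | python pipeline.py reduce
import sys
from itertools import groupby

def align(seq):
    # Stand-in for a real aligner (e.g., BWA): return (chromosome, position).
    return ("chr1", hash(seq) % 1_000_000)

def mapper(stream):
    # Map stage: align each read independently -- embarrassingly parallel.
    for line in stream:
        read_id, seq = line.rstrip("\n").split("\t")
        chrom, pos = align(seq)
        # Keying on (chrom, pos) makes the shuffle sort reads by coordinate.
        print(f"{chrom}:{pos:09d}\t{read_id}\t{seq}")

def reducer(stream):
    # Reduce stage: reads arrive coordinate-sorted, so per-position pileup
    # depth (the input to genotype calling) falls out of one sequential pass.
    records = (line.rstrip("\n").split("\t") for line in stream)
    for locus, group in groupby(records, key=lambda rec: rec[0]):
        print(f"{locus}\tdepth={sum(1 for _ in group)}")

if __name__ == "__main__":
    mapper(sys.stdin) if sys.argv[1:] == ["map"] else reducer(sys.stdin)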
37. Use case 1: Scaling a genome center pipeline
• Currently at 5k genomes (150 TB incl. raw); looking to scale to 25k now (1 PB) and eventually 100k (requiring 4 PB)
• Current throughput
• >1,300 samples per month
• >12 TB raw data per month
• Data ultimately served from a MySQL database
• 750 GB of processed variant data
• 25k genomes requires >3.5 TB in MySQL
• Complex 4-tier storage system, including tape, filer, and RDBMS
38. Use case 1: Scaling a genome center pipeline
• Database serves population genetics applications and case/control studies
• Unify all data processing into HDFS
• Replace MySQL with Impala on Hadoop for increased scalability
• Possibly move raw data processing into MapReduce
39. Use case 2: Querying large, integrated data sets
• Biotech client has thousands of genomes
• Wants to expose ad hoc querying functionality at large scale
• e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
• Integrating data with public data sets (e.g., ENCODE, UCSC browser)
• Terabyte-scale annotation sets
• Currently, these capabilities (e.g., data joins) are often manually implemented
40. Use case 2: Querying large, integrated data sets
• Hadoop allows all data to be centrally stored and accessible
• Impala exposes a SQL query interface to data sets in Hadoop
41. Variant-filtering example
• “Give me all SNPs that:
• are on chromosome 5
• are absent from dbSNP
• are present in COSMIC
• are observed in breast cancer samples
• are absent from prostate cancer samples
• overlap a DNase hypersensitivity site
• overlap a ChIP-seq site for a particular TF”
• On the full 1000 Genomes data set (~37 billion variants), the query finishes in a couple of seconds (a hypothetical rendering follows)
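For illustration, the query might be expressed through Impala's SQL interface roughly as follows, submitted here via the impyla Python client. Every table name, column name, and host below is a hypothetical stand-in for the actual schema, and FOXA1 is just an example TF.

from impala.dbapi import connect  # impyla package

# All table/column names below are hypothetical stand-ins for the real schema.
SQL = """
SELECT v.chrom, v.pos, v.ref, v.alt
FROM variants v
LEFT JOIN dbsnp d     ON v.chrom = d.chrom AND v.pos = d.pos
JOIN cosmic c         ON v.chrom = c.chrom AND v.pos = c.pos
JOIN dnase_sites ds   ON v.chrom = ds.chrom
                     AND v.pos BETWEEN ds.start_pos AND ds.end_pos
JOIN chipseq_sites cs ON v.chrom = cs.chrom
                     AND v.pos BETWEEN cs.start_pos AND cs.end_pos
WHERE v.chrom = 'chr5'
  AND d.pos IS NULL                      -- absent from dbSNP (anti-join)
  AND v.seen_in_breast_cancer = TRUE     -- observed in breast cancer samples
  AND v.seen_in_prostate_cancer = FALSE  -- absent from prostate cancer samples
  AND cs.tf_name = 'FOXA1'               -- a particular TF (illustrative)
"""

conn = connect(host="impalad.example.com", port=21050)  # Impala daemon
cur = conn.cursor()
cur.execute(SQL)
for row in cur.fetchall():
    print(row)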
42. All-vs-all eQTL
• Possible to generate trillions of hypothesis tests
• 10^7 loci × 10^4 phenotypes × 10s of tissues = 10^12 p-values
• Tested below on 120 billion associations
• Example queries:
• “Given 5 genes of interest, find the top 20 most significant eQTLs (cis and/or trans)”
• Finishes in several seconds (a sketch follows)
• “Find all cis-eQTLs across the entire genome”
• Finishes in a couple of minutes
• Limited by disk throughput
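The first example query could take a shape like this against a hypothetical table of precomputed associations; all names are illustrative, and the query would be submitted through the same impyla cursor as in the previous sketch.

# Hypothetical Impala table of precomputed associations:
#   eqtls(gene STRING, snp STRING, tissue STRING, p_value DOUBLE, is_cis BOOLEAN)
genes = "('GENE_A', 'GENE_B', 'GENE_C', 'GENE_D', 'GENE_E')"  # 5 genes of interest
sql = f"""
SELECT gene, snp, tissue, p_value
FROM eqtls
WHERE gene IN {genes}
ORDER BY p_value
LIMIT 20
"""
# cur.execute(sql)  # submitted via the impyla cursor from the previous sketch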
43. All-vs-all eQTL
• “Find all SNPs that:
• are in LD with some lead SNP or eQTL of interest
• align with some functional annotation of interest”
• Still in testing, but likely finishes in seconds
(Schaub et al., Genome Research, 2012)
44. Genomics summary
• ETL (raw data to analysis-ready data)
• Data integration
• e.g., interactively queryable UCSC genome browser
• De novo assembly
• NLP on scientific literature
46. Use case 3: Clinical document queries for EHR company
• EHR company wants to expose query functionality to clinicians
• >16 million clinical documents with free text, processed through an NLP pipeline
• >500 million lab results
• Perform subject expansion on search queries via ontologies (see the sketch after this list)
• e.g., “myocardial infarction” will match “heart disease”
• Search functionality implemented with Lucene (serving) on top of HBase (processing/storage/indexing)
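A minimal sketch of what ontology-driven subject expansion looks like; the toy ONTOLOGY mapping is a stand-in for a real vocabulary such as SNOMED CT or UMLS.

# Hypothetical ontology mapping: concept -> related concepts
# (synonyms/hypernyms); a real system would use SNOMED CT, UMLS, etc.
ONTOLOGY = {
    "myocardial infarction": {"heart disease", "acute coronary syndrome"},
    "diabetes": {"diabetes mellitus", "hyperglycemia"},
}

def expand_query(terms):
    # Add each term's ontology neighbors to the query before searching.
    expanded = set(terms)
    for term in terms:
        expanded |= ONTOLOGY.get(term, set())
    return expanded

# The expanded terms would then be OR-ed together in the Lucene/Solr query.
print(expand_query(["myocardial infarction"]))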
47. Use case 3: Clinical document queries for EHR company
• Interested in recommendation engine-enabled queries, like:
• Clinician searches “diabetes” and has relevant lab results already highlighted when opening a patient’s record
• Clinician wants to know what other conditions might be correlated with a finding of interest
48. Use case 3: Clinical document queries for EHR company
• “Find other patients similar to mine”
• The Stanford system is limited to search
• Recommendation engines allow a “find similar” button (a toy sketch follows)
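One simple way to back a “find similar” button is nearest-neighbor search over per-patient feature vectors. A toy sketch with invented patients, using cosine similarity:

from math import sqrt

# Invented patient profiles: indicator vectors over conditions/findings.
patients = {
    "patient_1": {"diabetes": 1, "hypertension": 1, "neuropathy": 1},
    "patient_2": {"diabetes": 1, "hypertension": 1},
    "patient_3": {"asthma": 1},
}

def cosine(a, b):
    # Cosine similarity between two sparse indicator vectors.
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = patients["patient_1"]
ranked = sorted(((cosine(query, vec), name)
                 for name, vec in patients.items() if name != "patient_1"),
                reverse=True)
print(ranked)  # most similar patients first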
49. Use case 4: Insurance company
• Data from 30 different EHRs across multiple business units
• High variance in ICD9 coding between locales
• Use NLP and machine learning to improve ICD9 coding and reduce variance in diagnosis (a sketch follows)
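One plausible shape for the NLP/ML step, sketched with scikit-learn. The notes and ICD9 codes below are invented placeholders; a production system would train on large volumes of already-coded documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented placeholder training data: note text -> ICD9 code.
notes = [
    "patient presents with chest pain and shortness of breath",
    "type 2 diabetes mellitus without complications",
    "acute myocardial infarction of anterior wall",
]
codes = ["786.50", "250.00", "410.11"]

# TF-IDF features over the free text, plus a linear classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(notes, codes)
print(model.predict(["follow-up visit for diabetes management"]))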
50. Use case 5: Pharma company variance in yields
• Pharma company performs large batch fermentations of its product
• Finds high levels of variance in yield
• Fermentations are automated and highly instrumented
• e.g., dissolved oxygen, nutrients, COAs, temperature, etc.
• Perform time series analysis on fermentation runs to predict yields and determine which variables control variance (a sketch follows)
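A hedged sketch of the modeling step: fit a regression model on per-run summaries of the instrumented time series, then read off feature importances to see which variables drive yield variance. All data and feature names below are invented.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Invented per-run features summarizing each fermentation's time series.
features = ["mean_dissolved_o2", "min_ph", "peak_temperature", "nutrient_rate"]
X = rng.normal(size=(200, len(features)))            # 200 simulated runs
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=200)  # toy yield

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for name, importance in sorted(zip(features, model.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.2f}")  # which variables drive yield variance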
51. Use case 6: AgTech company integrating data sources
• Multiple reference genome sequences
• Genotyping on thousands of samples
• Weather data
• Soil data
• Microbiome data
• Yield data
• Geo data
• All integrated in HBase (see the sketch after this list)
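A small sketch of the integration pattern using the happybase HBase client (which talks to the HBase Thrift server). The table name, column families, and genomic-position row-key scheme are assumptions for illustration, not the company's actual design.

import happybase  # talks to HBase via its Thrift server

# Row-key scheme (an assumption): chromosome + zero-padded position,
# so genomic intervals become simple range scans.
conn = happybase.Connection("hbase-thrift.example.com")
table = conn.table("integrated_genomics")  # hypothetical table name
table.put(b"chr1:000012345", {
    b"genotype:sample_42": b"A/T",
    b"soil:ph": b"6.8",
    b"weather:rain_mm": b"12.4",
})
# All data sources keyed by position come back together in one scan:
for key, data in table.scan(row_start=b"chr1:000012000",
                            row_stop=b"chr1:000013000"):
    print(key, data)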
52. Use case 6: AgTech company integrating data sources
• Can increase crop yields ~15% by “printing” seeds onto a field
• Supports search queries by name, ontology concepts, protein families, creation dates, assembly/chromosome positions, SNPs
• Imports any annotation data in CSV/GFF
• Integration with cloning tools
• Supports a web front-end for easy access
Speaker notes
• Already mature technologies at this point. The DB community thought it was silly. Non-Google companies were not yet at this scale. Google is not in the business of releasing infrastructure software; they sell ads.
• Mostly through the Apache Software Foundation.
• Talk HDFS and MapReduce, then some other tools.
• Large blocks; blocks replicated around.
• Two functions required; the user only needs to supply these two functions.
• Need to be careful because you can DDoS your database.
• Log scale.
• Define ETL.
• Volume, Variety, Velocity.
• Cloudera Enterprise – The Platform for Big Data: a complete data management solution powered by Apache Hadoop. A collection of open source projects forms the foundation of the platform; Cloudera wraps the open source core with additional software for system and data management, as well as technical support. Five attributes:
• Scalable: storage and compute in a single system brings computation to the data (rather than the other way around); scale capacity and performance linearly by just adding nodes; proven at massive scale (tens of PB of data, millions of users).
• Flexible: store any type of data (structured, semi-structured, unstructured) in its native format, with no conversion required and no loss of data fidelity due to ETL. Fluid structuring: no single model or schema the data must conform to; determine how you want to look at the data at the time you ask the question; alter structure to optimize query performance as desired (not required), via multiple open source file formats like Avro and Parquet. Multiple forms of computation, depending on your skillset: batch processing (MapReduce, Hive, Pig, Java), interactive SQL (Impala, BI tools), interactive search for non-technical users, machine learning on large datasets (e.g., Apache Mahout), and math tools like SAS and R for data scientists and statisticians, with more to come.
• Cost-effective: scale out on inexpensive, industry-standard hardware (vs. highly tuned, specialized hardware); fault tolerance built in; leverage cost structures with existing vendors; reduced data movement; fewer redundant copies of data; less time spent migrating/managing; open source software is easy to acquire and prove value/ROI.
• Open: rapid innovation from large development communities; free to download and deploy, so you can demonstrate the value of the technology before making a large-scale investment; no vendor lock-in. Cloudera's open source strategy: if it stores or processes data, it's open source; leading contributor to the Apache Hadoop ecosystem.
• Integrated: works with existing investments in databases and data warehouses, analytics and BI solutions, ETL tools, platforms and operating systems, and hardware and networking equipment; over 700 partners, including the leaders in those market segments.