1. Evolution of the Big Data Stack
Jonathan Hsieh | Tech Lead / Software Engineer @ Cloudera
BigDataCamp LA '14
June 14, 2014
2. Who Am I?
• Cloudera since 2009
• Tech Lead HBase Team
• Software Engineer
• Apache HBase committer / PMC
• Apache Flume founder / PMC
• U of Washington:
• Research in Distributed Systems
6/14/14 BigDataCamp LA '14 - Hsieh2
3. Big Data Stack Evolution
•Inspiration
•Imitation
•Innovation
5. Emergence of Big Data
Inspiration
8. The brute force solution
1. Collect all the data
2. Analyze all the data
3. Serve the results
9. End of free MHz coincides with Rise of Big Data
http://cacm.acm.org/magazines/2012/4/147359-cpu-db-recording-microprocessor-history/abstract
10. A Move towards Distributed Systems
• Scaling Horizontally instead of Vertically
• Challenges:
• Reliability
• Fault tolerance
• Atomicity / Consistency / Isolation / Durability
• High-Availability
• Latency Predictability
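A minimal sketch of what scaling horizontally can look like: keys are spread across several machines by hashing, so capacity grows by adding nodes rather than buying a bigger one. The node names and the helpers `node_for` and `partition` are hypothetical, chosen only for illustration, and the sketch ignores the challenges listed above (a node failure or a change in the node list would move most keys).

```python
import hashlib
from typing import Dict, List


def node_for(key: str, nodes: List[str]) -> str:
    """Pick the node that owns a key by hashing the key (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap it
    # around the number of available nodes.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def partition(keys: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Group keys by the node that would store them."""
    placement: Dict[str, List[str]] = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


# Hypothetical three-node cluster and 100 hypothetical user keys.
nodes = ["node-a", "node-b", "node-c"]
keys = [f"user-{i}" for i in range(100)]
placement = partition(keys, nodes)
```

Every key lands on exactly one node, and the same key always maps to the same node as long as the node list is unchanged.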
11. Google built a Big Data Stack
Sawzall
MapReduce
GFS
12. Google built a Big Data Stack
Sawzall
MapReduce
MySQL Gateway
BigTable
GFS
Chubby
Evenflow
Protobufs
13. The core of a Big Data Stack
Query
Processing
Data Integration
Fast Read / Write access
File System
Distributed Coordination
Workflow and Scheduling
Metadata
14. Big Data for the rest of us
Imitation
16. The core of a Hadoop stack
Query
Processing
Data Integration
Fast Read / Write access
File System
Distributed Coordination
Workflow and Scheduling
Metadata
17. Yahoo! built a Big Data stack
• Donated Hadoop + Friends to the Apache Software Foundation
Pig / Hive
Hadoop
Data Highway*
HBase
HDFS
ZooKeeper
Oozie
Hive
18. Parallel Components
Function | Google | Yahoo! | Facebook | The Rest of Us
File system | GFS => Colossus | HDFS | HDFS | HDFS
Low latency data store (NoSQL) | BigTable => Megastore => Spanner | PNUTS => HBase | HBase | HBase
Batch processing | Google MapReduce | Hadoop MapReduce | Hadoop MapReduce | Hadoop MapReduce, Spark
Batch query | Sawzall, Tenzing, FlumeJava | Pig | Hive | Pig, Hive, Impala, Drill, Crunch
Resource management | Borg => Omega | => YARN | => Corona | YARN, Mesos
Ingest | EvenFlow, custom MySQL proxy | Custom | Scribe / Calligraphus, custom proxy | Sqoop, Flume, Kafka
Coordination | Chubby | ZooKeeper | ZooKeeper | ZooKeeper
Graph processing | Pregel | Giraph | - | Giraph, GoldenOrb, Hama, Titan
Stream processing | MillWheel | S4 => Storm | Puma / PTail | Storm, Spark
19. Simplify and remove features to enable scaling
• Scalable and simple first
• Focus only on needed features. Exclude others.
• Re-add them later.
• Ex: NoSQL
• No transactions
• No schema
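A minimal sketch of the "remove features" idea, assuming a toy in-memory store: rows are addressed by key, each row holds whatever columns it was given (no schema), and every write touches a single cell (no multi-row transactions). The class `RowStore` and its methods are hypothetical and do not mirror the API of HBase or any other real system.

```python
class RowStore:
    """Toy key-value row store: no schema, no transactions (illustrative only)."""

    def __init__(self):
        # row key -> {column name -> value}; columns are not declared up front.
        self._rows = {}

    def put(self, row_key, column, value):
        """Write one cell. Each put stands alone; there is no way to group puts."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Return a copy of the row's cells, or an empty dict if the row is absent."""
        return dict(self._rows.get(row_key, {}))


store = RowStore()
# Two rows with entirely different columns: nothing enforces a common shape.
store.put("user1", "name", "Ada")
store.put("user1", "email", "ada@example.com")
store.put("user2", "last_login", "2014-06-14")
```

Leaving out schema checks and cross-row coordination keeps each operation local to one row, which is what makes it straightforward to spread rows across many machines.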
20. Big Data industry steps up
Innovation
22. Big Data Stack Timeline
• Oct '03: Google GFS (SOSP '03)
• Mar '04: Google MapReduce (OSDI '04)
• Nov '06: Google BigTable, Chubby (OSDI '06)
• 2008: Cloudera founded
• 2008: Google Tenzing (published at VLDB '11)
• 2008: Google Spanner (OSDI '12)
• 2008: Facebook Hive
• SIGMOD '08: Pig Latin
• 2009: CDH1 GA (first Hadoop distro)
• Mar '10: CDH2 GA with CM (manager)
• 2010: Google Percolator (OSDI '10)
• 2011: Google Megastore (CIDR '11)
• Apr '11: CDH3 GA with HBase, Flume, Sqoop, Oozie
• Feb '12: CDH4 GA with HDFS NN HA and YARN preview
• Apr '14: CDH5 GA with Impala, Spark, Solr, Navigator
• 2014: Facebook discusses HydraBase