SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Headline Goes Here
Speaker Name or Subhead Goes Here
DO NOT USE PUBLICLY
PRIOR TO 10/23/12
Evolution of the Big Data Stack
Jonathan Hsieh| Tech Lead / Software engineer @ Cloudera
BigDataCamp LA ‘14
June 14, 2014
Who Am I?
• Cloudera since 2009
• Tech Lead HBase Team
• Software Engineer
• Apache HBase committer / PMC
• Apache Flume founder / PMC
• U of Washington:
• Research in Distributed Systems
6/14/14 BigDataCamp LA '14 - Hsieh2
Big Data Stack Evolution
•Inspiration
•Imitation
•Innovation
6/14/14 BigDataCamp LA '14 - Hsieh3
Big Data Stack Evolution
•Inspiration
•Imitation
•Innovation
6/14/14 BigDataCamp LA '14 - Hsieh4
Emergence of Big Data
Inspiration
6/14/14 BigDataCamp LA '14 - Hsieh5
6/14/14 BigDataCamp LA '14 - Hsieh6
6/14/14 BigDataCamp LA '14 - Hsieh7
The brute force solution
1. Collect all the data
2. Analyze all the data
3. Serve the results
6/14/14 BigDataCamp LA '14 - Hsieh8
End of free MHz coincides with Rise of Big Data
6/14/14 BigDataCamp LA '14 - Hsieh
http://cacm.acm.org/magazines/2012/4/147359-cpu-db-recording-microprocessor-history/abstract
9
A Move towards Distributed Systems
• Scaling Horizontally instead of Vertically
• Challenges:
• Reliability
• Fault tolerance
• Atomicity / Consistency / Isolation / Durability
• High-Availability
• Latency Predictability
6/14/14 BigDataCamp LA '14 - Hsieh10
Google built a Big Data Stack
Sawzall
MapReduce
GFS
6/14/14 BigDataCamp LA '14 - Hsieh11
Google built a Big Data Stack
Sawzall
MapReduce
MySql
Gateway
Big Table
GFS
Chubby
Evenflow Protobufs
6/14/14 BigDataCamp LA '14 - Hsieh12
The core of a Big Data Stack
• .
Query
Processing
Data
Integration
Fast Read /
Write access
File System
Distributed Coordination
Workflow and Scheduling Metadata
6/14/14 BigDataCamp LA '14 - Hsieh13
Big Data for the rest of us
Imitation
6/14/14 BigDataCamp LA '14 - Hsieh14
6/14/14 BigDataCamp LA '14 - Hsieh15
The core of a Hadoop stack
Query
Processing
Data
Integration
Fast Read /
Write access
File System
Distributed Coordination
Workflow and Scheduling Metadata
6/14/14 BigDataCamp LA '14 - Hsieh16
built a Big Data stack
• Donated Hadoop + Friends to the Apache Software Foundation
Pig / Hive
HadoopData Highway* HBase
HDFS
ZooKeeper
Oozie Hive
6/14/14 BigDataCamp LA '14 - Hsieh17
Parallel Components
6/14/14 BigDataCamp LA '14 - Hsieh
Function Google Yahoo! Facebook The Rest of Us
File system GFS => Colossus HDFS HDFS HDFS
Low latency Data store
(NoSQL)
BigTable => Megastore
=> Spanner
PNUTS => Hbase HBase Hbase
Batch processing Google MapReduce Hadoop MapReduce Hadoop MapReduce Hadoop MapReduce
Spark
Batch query Sawzall, Tenzing,
FlumeJava
Pig Hive Pig, Hive, Impala,
Drill, Crunch
Resource Management Borg => Omega => YARN => Corona YARN
Mesos
Ingest EvenFlow
Custom MySQL Proxy
Custom Scribe / Calligraphus
Custom proxy
Sqoop
Flume
Kafka
Coordination Chubby Zookeeper Zookeeper Zookeeper
Graph Processing Pregel Giraph Giraph, Golden orb
Hama, Titan
Stream processing MillWheel S3 => Storm Puma/PTail Storm, Spark
18
Simplify and remove features to enable scaling
• Scalable and simple
first
• Focus only on
needed features.
Exclude others.
• Re-add them later.
• Ex: NoSQL
• No transactions
• No Schema
6/14/14 BigDataCamp LA '14 - Hsieh19
Big Data industry steps up
Innovation
6/14/14 BigDataCamp LA '14 - Hsieh20
Nov ’06:
Google
BigTable, Chubby OSDI ‘06
Mar’10: Cloudera
Founded
Big Data Stack Timeline
6/14/14 BigDataCamp LA '14 - Hsieh
20142006 2007 2008 2009 2010 2011 20132012
Apr’11: CDH3 GA
with HBase,
Flume, Sqoop,
Oozie
Feb’12: CDH4 GA
with HDFS NN
HA, and YARN
preview
Mar’10: CDH2 GA
with CM
(manager)
2009: CDH1 GA
(first hadoop
distro)
Mar ’04:
Google MapReduce
OSDI ‘04
Oct ’03:
Google GFS
SOSP ‘03
2008:
Google Tenzing
Pub (VLDB’11)
2008:
Facebook
Hive
ICMD ‘08:
Pig Latin
21
Nov ’06:
Google
BigTable, Chubby OSDI ‘06
Mar’10: Cloudera
Founded
Big Data Stack Timeline
6/14/14 BigDataCamp LA '14 - Hsieh
20142006 2007 2008 2009 2010 2011 20132012
Apr’11: CDH3 GA
with HBase,
Flume, Sqoop,
Oozie
Feb’12: CDH4 GA
with HDFS NN
HA, and YARN
preview
Mar’10: CDH2 GA
with CM
(manager)
2009: CDH1 GA
(first hadoop
distro)
Apr’14: CDH5 GA
with Impala,
Spark, Solr,
Navigator
Mar ’04:
Google MapReduce
OSDI ‘04
Oct ’03:
Google GFS
SOSP ‘03
2008:
Google Tenzing
Pub (VLDB’11)
2008:
Google Spanner
OSDI ‘12
2008:
Facebook
Hive
2014:
Facebook
discusses HydraBase
ICMD ‘08:
Pig Latin
2011:
Google Megastore
CIDR ‘11
2010:
Google Percolator
OSDI’10
22
Usability
6/14/14 BigDataCamp LA '14 - Hsieh23
Security + Integration
6/14/14 BigDataCamp LA '14 - Hsieh24
New directions
6/14/14 BigDataCamp LA '14 - Hsieh
oryx
25
6/14/14 BigDataCamp LA '14 - Hsieh26
Thanks!
@jmhsieh
6/14/14 BigDataCamp LA '14 - Hsieh27

Weitere ähnliche Inhalte

Was ist angesagt?

A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
DataWorks Summit
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Hadoop User Group
 
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Hadoop User Group
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21
Hadoop User Group
 
Splunk's Hunk: A Powerful Way to Visualize Your Data Stored in MongoDB
Splunk's Hunk: A Powerful Way to Visualize Your Data Stored in MongoDBSplunk's Hunk: A Powerful Way to Visualize Your Data Stored in MongoDB
Splunk's Hunk: A Powerful Way to Visualize Your Data Stored in MongoDB
MongoDB
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!
Cloudera, Inc.
 

Was ist angesagt? (20)

Hunk - Unlocking the Power of Big Data
Hunk - Unlocking the Power of Big DataHunk - Unlocking the Power of Big Data
Hunk - Unlocking the Power of Big Data
 
Hadoop and HBase @eBay
Hadoop and HBase @eBayHadoop and HBase @eBay
Hadoop and HBase @eBay
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and AnalyticsA Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
A Non-Standard use Case of Hadoop: High Scale Image Processing and Analytics
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Sparkhbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark
 
Hadoop at Ebay
Hadoop at EbayHadoop at Ebay
Hadoop at Ebay
 
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
Yahoo! Hadoop User Group - May 2010 Meetup - Apache Hadoop Release Plans for ...
 
Searching At Scale
Searching At ScaleSearching At Scale
Searching At Scale
 
The Bixo Web Mining Toolkit
The Bixo Web Mining ToolkitThe Bixo Web Mining Toolkit
The Bixo Web Mining Toolkit
 
Pig on spark
Pig on sparkPig on spark
Pig on spark
 
Review of Calculation Paradigm and its Components
Review of Calculation Paradigm and its ComponentsReview of Calculation Paradigm and its Components
Review of Calculation Paradigm and its Components
 
Pig on Spark
Pig on SparkPig on Spark
Pig on Spark
 
Hadoop-2 @ eBay
Hadoop-2 @ eBayHadoop-2 @ eBay
Hadoop-2 @ eBay
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21
 
Splunk's Hunk: A Powerful Way to Visualize Your Data Stored in MongoDB
Splunk's Hunk: A Powerful Way to Visualize Your Data Stored in MongoDBSplunk's Hunk: A Powerful Way to Visualize Your Data Stored in MongoDB
Splunk's Hunk: A Powerful Way to Visualize Your Data Stored in MongoDB
 
Hw09 Hadoop Applications At Yahoo!
Hw09   Hadoop Applications At Yahoo!Hw09   Hadoop Applications At Yahoo!
Hw09 Hadoop Applications At Yahoo!
 
HBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraphHBaseCon2017 Community-Driven Graphs with JanusGraph
HBaseCon2017 Community-Driven Graphs with JanusGraph
 
Distributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop ClustersDistributed Deep Learning on Hadoop Clusters
Distributed Deep Learning on Hadoop Clusters
 

Andere mochten auch

Andere mochten auch (20)

Summit v4 dave wolcott
Summit v4 dave wolcottSummit v4 dave wolcott
Summit v4 dave wolcott
 
Aziksa hadoop for buisness users2 santosh jha
Aziksa hadoop for buisness users2 santosh jhaAziksa hadoop for buisness users2 santosh jha
Aziksa hadoop for buisness users2 santosh jha
 
20140614 introduction to spark-ben white
20140614 introduction to spark-ben white20140614 introduction to spark-ben white
20140614 introduction to spark-ben white
 
Ag big datacampla-06-14-2014-ajay_gopal
Ag big datacampla-06-14-2014-ajay_gopalAg big datacampla-06-14-2014-ajay_gopal
Ag big datacampla-06-14-2014-ajay_gopal
 
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
Big Data Day LA 2015 - HBase at Factual: Real time and Batch Uses by Molly O'...
 
2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky
 
Yarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-tingYarn cloudera-kathleenting061414 kate-ting
Yarn cloudera-kathleenting061414 kate-ting
 
Kiji cassandra la june 2014 - v02 clint-kelly
Kiji cassandra la   june 2014 - v02 clint-kellyKiji cassandra la   june 2014 - v02 clint-kelly
Kiji cassandra la june 2014 - v02 clint-kelly
 
La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixit
 
Big datacamp june14_alex_liu
Big datacamp june14_alex_liuBig datacamp june14_alex_liu
Big datacamp june14_alex_liu
 
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
Big Data Day LA 2015 - Solr Search with Spark for Big Data Analytics in Actio...
 
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
Big Data Day LA 2015 - NoSQL: Doing it wrong before getting it right by Lawre...
 
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
Big Data Day LA 2015 - Lessons Learned from Designing Data Ingest Systems by ...
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapR
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
Big Data Day LA 2015 - Introducing N1QL: SQL for Documents by Jeff Morris of ...
 
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
Big Data Day LA 2015 - Deep Learning Human Vocalized Animal Sounds by Sabri S...
 
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
Big Data Day LA 2016/ Data Science Track - Decision Making and Lambda Archite...
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Introduction to Kafka - Je...
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
 

Ähnlich wie 140614 bigdatacamp-la-keynote-jon hsieh

10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
Donald Miner
 
SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases  SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases
DataWorks Summit
 
Beauty and Big Data
Beauty and Big DataBeauty and Big Data
Beauty and Big Data
Sri Ambati
 
Apache HBase Application Archetypes
Apache HBase Application ArchetypesApache HBase Application Archetypes
Apache HBase Application Archetypes
Cloudera, Inc.
 
Big Data Conference April 2015
Big Data Conference April 2015Big Data Conference April 2015
Big Data Conference April 2015
Aaron Benz
 

Ähnlich wie 140614 bigdatacamp-la-keynote-jon hsieh (20)

Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases  SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases
 
Beauty and Big Data
Beauty and Big DataBeauty and Big Data
Beauty and Big Data
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQ
 
First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache HadoopFirst NL-HUG: Large-scale data processing at SARA with Apache Hadoop
First NL-HUG: Large-scale data processing at SARA with Apache Hadoop
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 
Zero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationZero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter Migration
 
Apache HBase Application Archetypes
Apache HBase Application ArchetypesApache HBase Application Archetypes
Apache HBase Application Archetypes
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 
Big Data Conference April 2015
Big Data Conference April 2015Big Data Conference April 2015
Big Data Conference April 2015
 
How To Achieve Real-Time Analytics On A Data Lake Using GPUs
How To Achieve Real-Time Analytics On A Data Lake Using GPUsHow To Achieve Real-Time Analytics On A Data Lake Using GPUs
How To Achieve Real-Time Analytics On A Data Lake Using GPUs
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Architecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud DetectionArchitecting applications with Hadoop - Fraud Detection
Architecting applications with Hadoop - Fraud Detection
 
Apache hadoop
Apache hadoopApache hadoop
Apache hadoop
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
 
Real Time and Big Data – It’s About Time
Real Time and Big Data – It’s About TimeReal Time and Big Data – It’s About Time
Real Time and Big Data – It’s About Time
 
9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School9/2017 STL HUG - Back to School
9/2017 STL HUG - Back to School
 

Mehr von Data Con LA

Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA
 

Mehr von Data Con LA (20)

Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynotes
Data Con LA 2022 KeynotesData Con LA 2022 Keynotes
Data Con LA 2022 Keynotes
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup ShowcaseData Con LA 2022 - Startup Showcase
Data Con LA 2022 - Startup Showcase
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Data Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendationsData Con LA 2022 - Using Google trends data to build product recommendations
Data Con LA 2022 - Using Google trends data to build product recommendations
 
Data Con LA 2022 - AI Ethics
Data Con LA 2022 - AI EthicsData Con LA 2022 - AI Ethics
Data Con LA 2022 - AI Ethics
 
Data Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learningData Con LA 2022 - Improving disaster response with machine learning
Data Con LA 2022 - Improving disaster response with machine learning
 
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and AtlasData Con LA 2022 - What's new with MongoDB 6.0 and Atlas
Data Con LA 2022 - What's new with MongoDB 6.0 and Atlas
 
Data Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentationData Con LA 2022 - Real world consumer segmentation
Data Con LA 2022 - Real world consumer segmentation
 
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
Data Con LA 2022 - Modernizing Analytics & AI for today's needs: Intuit Turbo...
 
Data Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWSData Con LA 2022 - Moving Data at Scale to AWS
Data Con LA 2022 - Moving Data at Scale to AWS
 
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AIData Con LA 2022 - Collaborative Data Exploration using Conversational AI
Data Con LA 2022 - Collaborative Data Exploration using Conversational AI
 
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
Data Con LA 2022 - Why Database Modernization Makes Your Data Decisions More ...
 
Data Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data ScienceData Con LA 2022 - Intro to Data Science
Data Con LA 2022 - Intro to Data Science
 
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing EntertainmentData Con LA 2022 - How are NFTs and DeFi Changing Entertainment
Data Con LA 2022 - How are NFTs and DeFi Changing Entertainment
 
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
Data Con LA 2022 - Why Data Quality vigilance requires an End-to-End, Automat...
 
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
Data Con LA 2022-Perfect Viral Ad prediction of Superbowl 2022 using Tease, T...
 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with Kafka
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Top 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development CompaniesTop 10 CodeIgniter Development Companies
Top 10 CodeIgniter Development Companies
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
Observability Concepts EVERY Developer Should Know (DevOpsDays Seattle)
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 

140614 bigdatacamp-la-keynote-jon hsieh

  • 1. Headline Goes Here Speaker Name or Subhead Goes Here DO NOT USE PUBLICLY PRIOR TO 10/23/12 Evolution of the Big Data Stack Jonathan Hsieh| Tech Lead / Software engineer @ Cloudera BigDataCamp LA ‘14 June 14, 2014
  • 2. Who Am I? • Cloudera since 2009 • Tech Lead HBase Team • Software Engineer • Apache HBase committer / PMC • Apache Flume founder / PMC • U of Washington: • Research in Distributed Systems 6/14/14 BigDataCamp LA '14 - Hsieh2
  • 3. Big Data Stack Evolution •Inspiration •Imitation •Innovation 6/14/14 BigDataCamp LA '14 - Hsieh3
  • 4. Big Data Stack Evolution •Inspiration •Imitation •Innovation 6/14/14 BigDataCamp LA '14 - Hsieh4
  • 5. Emergence of Big Data Inspiration 6/14/14 BigDataCamp LA '14 - Hsieh5
  • 6. 6/14/14 BigDataCamp LA '14 - Hsieh6
  • 7. 6/14/14 BigDataCamp LA '14 - Hsieh7
  • 8. The brute force solution 1. Collect all the data 2. Analyze all the data 3. Serve the results 6/14/14 BigDataCamp LA '14 - Hsieh8
  • 9. End of free MHz coincides with Rise of Big Data 6/14/14 BigDataCamp LA '14 - Hsieh http://cacm.acm.org/magazines/2012/4/147359-cpu-db-recording-microprocessor-history/abstract 9
  • 10. A Move towards Distributed Systems • Scaling Horizontally instead of Vertically • Challenges: • Reliability • Fault tolerance • Atomicity / Consistency / Isolation / Durability • High-Availability • Latency Predictability 6/14/14 BigDataCamp LA '14 - Hsieh10
  • 11. Google built a Big Data Stack Sawzall MapReduce GFS 6/14/14 BigDataCamp LA '14 - Hsieh11
  • 12. Google built a Big Data Stack Sawzall MapReduce MySql Gateway Big Table GFS Chubby Evenflow Protobufs 6/14/14 BigDataCamp LA '14 - Hsieh12
  • 13. The core of a Big Data Stack • . Query Processing Data Integration Fast Read / Write access File System Distributed Coordination Workflow and Scheduling Metadata 6/14/14 BigDataCamp LA '14 - Hsieh13
  • 14. Big Data for the rest of us Imitation 6/14/14 BigDataCamp LA '14 - Hsieh14
  • 15. 6/14/14 BigDataCamp LA '14 - Hsieh15
  • 16. The core of a Hadoop stack Query Processing Data Integration Fast Read / Write access File System Distributed Coordination Workflow and Scheduling Metadata 6/14/14 BigDataCamp LA '14 - Hsieh16
  • 17. built a Big Data stack • Donated Hadoop + Friends to the Apache Software Foundation Pig / Hive HadoopData Highway* HBase HDFS ZooKeeper Oozie Hive 6/14/14 BigDataCamp LA '14 - Hsieh17
  • 18. Parallel Components 6/14/14 BigDataCamp LA '14 - Hsieh Function Google Yahoo! Facebook The Rest of Us File system GFS => Colossus HDFS HDFS HDFS Low latency Data store (NoSQL) BigTable => Megastore => Spanner PNUTS => Hbase HBase Hbase Batch processing Google MapReduce Hadoop MapReduce Hadoop MapReduce Hadoop MapReduce Spark Batch query Sawzall, Tenzing, FlumeJava Pig Hive Pig, Hive, Impala, Drill, Crunch Resource Management Borg => Omega => YARN => Corona YARN Mesos Ingest EvenFlow Custom MySQL Proxy Custom Scribe / Calligraphus Custom proxy Sqoop Flume Kafka Coordination Chubby Zookeeper Zookeeper Zookeeper Graph Processing Pregel Giraph Giraph, Golden orb Hama, Titan Stream processing MillWheel S3 => Storm Puma/PTail Storm, Spark 18
  • 19. Simplify and remove features to enable scaling • Scalable and simple first • Focus only on needed features. Exclude others. • Re-add them later. • Ex: NoSQL • No transactions • No Schema 6/14/14 BigDataCamp LA '14 - Hsieh19
  • 20. Big Data industry steps up Innovation 6/14/14 BigDataCamp LA '14 - Hsieh20
  • 21. Nov ’06: Google BigTable, Chubby OSDI ‘06 Mar’10: Cloudera Founded Big Data Stack Timeline 6/14/14 BigDataCamp LA '14 - Hsieh 20142006 2007 2008 2009 2010 2011 20132012 Apr’11: CDH3 GA with HBase, Flume, Sqoop, Oozie Feb’12: CDH4 GA with HDFS NN HA, and YARN preview Mar’10: CDH2 GA with CM (manager) 2009: CDH1 GA (first hadoop distro) Mar ’04: Google MapReduce OSDI ‘04 Oct ’03: Google GFS SOSP ‘03 2008: Google Tenzing Pub (VLDB’11) 2008: Facebook Hive ICMD ‘08: Pig Latin 21
  • 22. Nov ’06: Google BigTable, Chubby OSDI ‘06 Mar’10: Cloudera Founded Big Data Stack Timeline 6/14/14 BigDataCamp LA '14 - Hsieh 20142006 2007 2008 2009 2010 2011 20132012 Apr’11: CDH3 GA with HBase, Flume, Sqoop, Oozie Feb’12: CDH4 GA with HDFS NN HA, and YARN preview Mar’10: CDH2 GA with CM (manager) 2009: CDH1 GA (first hadoop distro) Apr’14: CDH5 GA with Impala, Spark, Solr, Navigator Mar ’04: Google MapReduce OSDI ‘04 Oct ’03: Google GFS SOSP ‘03 2008: Google Tenzing Pub (VLDB’11) 2008: Google Spanner OSDI ‘12 2008: Facebook Hive 2014: Facebook discusses HydraBase ICMD ‘08: Pig Latin 2011: Google Megastore CIDR ‘11 2010: Google Percolator OSDI’10 22
  • 24. Security + Integration 6/14/14 BigDataCamp LA '14 - Hsieh24
  • 25. New directions 6/14/14 BigDataCamp LA '14 - Hsieh oryx 25
  • 26. 6/14/14 BigDataCamp LA '14 - Hsieh26