Thursday, May 13, 2010
Evolving a New Analytical Platform
         What Works and What’s Missing


         Jeff Hammerbacher
         Chief Scientist and Vice President of Products, Cloudera
         May 13, 2010



My Background
         Thanks for Asking
         ▪   hammer@cloudera.com
         ▪   Studied Mathematics at Harvard
         ▪   Worked as a Quant on Wall Street
         ▪   Conceived, built, and led Data team at Facebook
             ▪   Nearly 30 amazing engineers and data scientists
             ▪   Several open source projects and research papers
         ▪   Founder of Cloudera
             ▪   Vice President of Products and Chief Scientist
             ▪   Also, check out the book “Beautiful Data”

Presentation Outline
         ▪   Architectures for large scale data analysis
             ▪   Reference architecture: ETL, DW, BI, Analytics
             ▪   New foundations: HDFS and MapReduce
         ▪   SQL Server 2008 R2
             ▪   The new platform emerges
         ▪   Building a new platform
             ▪   Motivations
             ▪   Implementation
         ▪   Questions and Discussion



Summary of the Presentation
         (I have a short attention span, too)
         ▪   The abstractions provided by a relational database are no longer
             useful on their own for analytical data management.


         ▪   The abstraction layer needs to be redrawn to include the
             functionality provided by ETL, MDM, stream management,
             reporting, OLAP, and search tools, with a unified user interface
             for collaboration on investigation and results.


         ▪   I don’t think the cloud has much to do with the above, except to
             kill “scale up” once and for all.


Experiences at Facebook
         Early 2006: The First Research Scientist
         ▪   Source data living on horizontally partitioned MySQL tier
         ▪   Intensive historical analysis difficult
         ▪   No way to assess impact of changes to the site


         ▪   First try: Python scripts pull data into MySQL
         ▪   Second try: Python scripts pull data into Oracle


         ▪   ...and then we turned on impression logging



Facebook Data Infrastructure
         2007
         [Diagram: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]

         ▪   “Data Warehousing”
         ▪   Began with Oracle database
         ▪   Schedule data collection via cron
         ▪   Collect data every 24 hours
         ▪   “ETL” scripts: hand-coded Python
         ▪   Data volumes quickly grew
             ▪   Started at tens of GB in early 2006
             ▪   Up to about 1 TB per day by mid-2007
             ▪   Log files largest source of data growth
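
The daily collection pattern described on this slide can be sketched roughly as follows. This is a minimal illustration, not the scripts Facebook actually ran: in-memory SQLite databases stand in for both the partitioned source tier and the warehouse, and every table, column, and function name (`page_views`, `daily_views`, `make_shard`, `run_daily_job`) is a hypothetical chosen for the example.

```python
import sqlite3


def make_shard(rows):
    """Create an in-memory stand-in for one source partition (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (user_id INTEGER, day TEXT, views INTEGER)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)
    return conn


def extract(shard, day):
    """Extract: pull one day's rows from a single partition."""
    cur = shard.execute(
        "SELECT user_id, views FROM page_views WHERE day = ?", (day,)
    )
    return cur.fetchall()


def transform(rows):
    """Transform: sum views per user across the rows gathered from all partitions."""
    totals = {}
    for user_id, views in rows:
        totals[user_id] = totals.get(user_id, 0) + views
    return totals


def load(warehouse, day, totals):
    """Load: replace the day's rows so that re-running the job is idempotent."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS daily_views (day TEXT, user_id INTEGER, views INTEGER)"
    )
    warehouse.execute("DELETE FROM daily_views WHERE day = ?", (day,))
    warehouse.executemany(
        "INSERT INTO daily_views VALUES (?, ?, ?)",
        [(day, user_id, views) for user_id, views in sorted(totals.items())],
    )
    warehouse.commit()


def run_daily_job(shards, warehouse, day):
    """One scheduled run: visit every partition, aggregate, write to the warehouse."""
    rows = []
    for shard in shards:
        rows.extend(extract(shard, day))
    load(warehouse, day, transform(rows))
```

A scheduler such as cron would call `run_daily_job` once per day with the previous day's date. Note that the job touches every partition and holds the whole day's rows in memory before loading, which is one way to see why this approach strains as daily volumes grow.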


Facebook Data Infrastructure
         2008
         [Diagram: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]




SQL Server 2008 R2
         Old Features
         ▪   ETL: SQL Server Integration Services
         ▪   DW: SQL Server
         ▪   Reporting: SQL Server Reporting Services
         ▪   Analytics: SQL Server Analysis Services
         ▪   Search: Full-Text Search




SQL Server 2008 R2
         New Features
         ▪   Stream management: StreamInsight
         ▪   OLAP: PowerPivot
         ▪   Collaboration: SharePoint
         ▪   MDM: Master Data Services
         ▪   Scale-out: Parallel Data Warehouse




A New Foundation
         Motivations and Implementation
         ▪   Orders of magnitude growth in data volumes and complexity
             ▪   Often from machine-generated logs
             ▪   Complex data is vast majority of data
         ▪   Built by consumer web teams and not enterprise software firms
             ▪   Open source
             ▪   Modular collection of tools, not an opaque abstraction
             ▪   Applications, not just analysis
             ▪   Solve user needs, don’t implement a spec



(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.




