2. Evolving a New Analytical Platform
What Works and What’s Missing
Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
May 13, 2010
Thursday, May 13, 2010
3. My Background
Thanks for Asking
▪ hammer@cloudera.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers
▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist
▪ Also, check out the book “Beautiful Data”
4. Presentation Outline
▪ Architectures for large scale data analysis
▪ Reference architecture: ETL, DW, BI, Analytics
▪ New foundations: HDFS and MapReduce
▪ SQL Server 2008 R2
▪ The new platform emerges
▪ Building a new platform
▪ Motivations
▪ Implementation
▪ Questions and Discussion
5. Summary of the Presentation
(I have a short attention span, too)
▪ The abstractions provided by a relational database are no longer
useful on their own for analytical data management.
▪ The abstraction layer needs to be redrawn to include the
functionality provided by ETL, MDM, stream management,
reporting, OLAP, and search tools, with a unified user interface
for collaboration on investigation and results.
▪ I don’t think the cloud has much to do with the above, except to
kill “scale up” once and for all.
6. Experiences at Facebook
Early 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging
7. Facebook Data Infrastructure
2007
▪ “Data Warehousing”
▪ Began with Oracle database
▪ Schedule data collection via cron
▪ Collect data every 24 hours
▪ “ETL” scripts: hand-coded Python
▪ Data volumes quickly grew
▪ Started at tens of GB in early 2006
▪ Up to about 1 TB per day by mid-2007
▪ Log files largest source of data growth
[Diagram labels: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
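
The slide above describes hand-coded Python "ETL" scripts run from cron every 24 hours. A minimal sketch of what such a script could look like follows. It is not the original Facebook code: the table names (`events`, `daily_actions`), the helper functions, and the use of in-memory SQLite in place of the MySQL tier and the Oracle warehouse are all assumptions made so the example runs on its own.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical stand-ins: in-memory SQLite databases play the role of the
# partitioned source tier and of the warehouse database.


def make_shard(rows):
    """Create one source shard holding a toy `events` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT, day TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


def make_warehouse():
    """Create the warehouse with one summary table keyed by (day, action)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_actions ("
        "day TEXT, action TEXT, n INTEGER, PRIMARY KEY (day, action))"
    )
    return conn


def collect_day(shards, warehouse, day):
    """Extract one day's rows from every shard, aggregate, and load them."""
    counts = {}
    for shard in shards:
        cursor = shard.execute(
            "SELECT action, COUNT(*) FROM events WHERE day = ? GROUP BY action",
            (day,),
        )
        # Merge the per-shard counts into one total per action.
        for action, n in cursor:
            counts[action] = counts.get(action, 0) + n
    # INSERT OR REPLACE means rerunning the same day overwrites its rows
    # instead of duplicating them.
    with warehouse:
        warehouse.executemany(
            "INSERT OR REPLACE INTO daily_actions VALUES (?, ?, ?)",
            [(day, action, n) for action, n in counts.items()],
        )
    return counts


if __name__ == "__main__":
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    shards = [
        make_shard([(1, "view", yesterday), (2, "click", yesterday)]),
        make_shard([(3, "view", yesterday)]),
    ]
    warehouse = make_warehouse()
    print(collect_day(shards, warehouse, yesterday))
```

A script like this would typically be run once a day from a crontab entry. Because every run rescans each shard and funnels the results through a single collection process, the approach becomes hard to sustain as daily volumes grow.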
9. SQL Server 2008 R2
Old Features
▪ ETL: SQL Server Integration Services
▪ DW: SQL Server
▪ Reporting: SQL Server Reporting Services
▪ Analytics: SQL Server Analysis Services
▪ Search: Full-Text Search
10. SQL Server 2008 R2
New Features
▪ Stream management: StreamInsight
▪ OLAP: PowerPivot
▪ Collaboration: SharePoint
▪ MDM: Master Data Services
▪ Scale-out: Parallel Data Warehouse
11. A New Foundation
Motivations and Implementation
▪ Orders of magnitude growth in data volumes and complexity
▪ Often from machine-generated logs
▪ Complex data is vast majority of data
▪ Built by consumer web teams and not enterprise software firms
▪ Open source
▪ Modular collection of tools, not an opaque abstraction
▪ Applications, not just analysis
▪ Solve user needs, don’t implement a spec
12. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0