  1. Thursday, May 13, 2010
  2. Evolving a New Analytical Platform
     What Works and What’s Missing
     Jeff Hammerbacher, Chief Scientist and Vice President of Products, Cloudera
     May 13, 2010
  3. My Background
     Thanks for Asking
     ▪ hammer@cloudera.com
     ▪ Studied Mathematics at Harvard
     ▪ Worked as a Quant on Wall Street
     ▪ Conceived, built, and led Data team at Facebook
     ▪ Nearly 30 amazing engineers and data scientists
     ▪ Several open source projects and research papers
     ▪ Founder of Cloudera
     ▪ Vice President of Products and Chief Scientist
     ▪ Also, check out the book “Beautiful Data”
  4. Presentation Outline
     ▪ Architectures for large scale data analysis
     ▪ Reference architecture: ETL, DW, BI, Analytics
     ▪ New foundations: HDFS and MapReduce
     ▪ SQL Server 2008 R2
     ▪ The new platform emerges
     ▪ Building a new platform
     ▪ Motivations
     ▪ Implementation
     ▪ Questions and Discussion
  5. Summary of the Presentation
     (I have a short attention span, too)
     ▪ The abstractions provided by a relational database are no longer useful on their own for analytical data management.
     ▪ The abstraction layer needs to be redrawn to include the functionality provided by ETL, MDM, stream management, reporting, OLAP, and search tools, with a unified user interface for collaboration on investigation and results.
     ▪ I don’t think the cloud has much to do with the above, except to kill “scale up” once and for all.
  6. Experiences at Facebook
     Early 2006: The First Research Scientist
     ▪ Source data living on horizontally partitioned MySQL tier
     ▪ Intensive historical analysis difficult
     ▪ No way to assess impact of changes to the site
     ▪ First try: Python scripts pull data into MySQL
     ▪ Second try: Python scripts pull data into Oracle
     ▪ ...and then we turned on impression logging
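To illustrate why analysis over a horizontally partitioned tier is awkward, here is a minimal sketch of a scatter-gather aggregate: the same query runs on every shard and the partial results are merged by hand. It is not the scripts described on the slide. The in-memory SQLite databases stand in for MySQL shards, and the `page_views` table, its columns, and the function names are all assumptions made for illustration.

```python
import sqlite3


def make_shard(rows):
    """Create an in-memory stand-in for one shard (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE page_views (user_id INTEGER, day TEXT, views INTEGER)"
    )
    conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def views_per_day(shards):
    """Run the same aggregate on every shard, then merge the partial sums."""
    totals = {}
    query = "SELECT day, SUM(views) FROM page_views GROUP BY day"
    for conn in shards:
        for day, partial in conn.execute(query):
            totals[day] = totals.get(day, 0) + partial
    return totals


# Two shards, each holding a different subset of users (made-up sample data).
shards = [
    make_shard([(1, "2006-03-01", 5), (3, "2006-03-02", 2)]),
    make_shard([(2, "2006-03-01", 7), (4, "2006-03-02", 1)]),
]
print(views_per_day(shards))
```

Every new question needs its own merge logic like this, which is one plausible reading of why the slide describes pulling the data into a single database instead.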
  7. Facebook Data Infrastructure
     2007
     [Diagram labels: Scribe Tier, MySQL Tier, Data Collection Server, Oracle Database Server]
     ▪ “Data Warehousing”
     ▪ Began with Oracle database
     ▪ Schedule data collection via cron
     ▪ Collect data every 24 hours
     ▪ “ETL” scripts: hand-coded Python
     ▪ Data volumes quickly grew
     ▪ Started at tens of GB in early 2006
     ▪ Up to about 1 TB per day by mid-2007
     ▪ Log files largest source of data growth
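The cron-driven, hand-coded Python "ETL" mentioned on this slide might look roughly like the sketch below. This is an illustration under stated assumptions, not Facebook's actual scripts: the tab-separated log format, the `daily_events` table, the function names, and the `etl.py` script name are hypothetical, and SQLite stands in for the Oracle warehouse.

```python
import sqlite3
from collections import Counter


def extract(log_lines):
    """Parse 'timestamp<TAB>user_id<TAB>event' lines (hypothetical format)."""
    for line in log_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:  # skip malformed lines
            yield parts


def transform(records, day):
    """Count events per type for one day (timestamps start with the date)."""
    counts = Counter()
    for timestamp, _user_id, event in records:
        if timestamp.startswith(day):
            counts[event] += 1
    return counts


def load(conn, day, counts):
    """Write the daily counts, replacing any earlier run for the same day."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_events "
        "(day TEXT, event TEXT, n INTEGER, PRIMARY KEY (day, event))"
    )
    conn.execute("DELETE FROM daily_events WHERE day = ?", (day,))
    conn.executemany(
        "INSERT INTO daily_events VALUES (?, ?, ?)",
        [(day, event, n) for event, n in counts.items()],
    )
    conn.commit()


def run_daily(conn, log_lines, day):
    """One full extract-transform-load pass for a single day."""
    load(conn, day, transform(extract(log_lines), day))


# A crontab entry such as "0 2 * * * python etl.py" (hypothetical script name)
# would trigger one pass every 24 hours.
warehouse = sqlite3.connect(":memory:")  # stand-in for the warehouse database
sample_log = [
    "2007-06-01T10:00:00\t1\timpression\n",
    "2007-06-01T10:00:05\t2\timpression\n",
    "2007-06-01T10:01:00\t1\tclick\n",
    "garbage line\n",
    "2007-06-02T00:00:01\t3\timpression\n",
]
run_daily(warehouse, sample_log, "2007-06-01")
rows = warehouse.execute(
    "SELECT event, n FROM daily_events ORDER BY event"
).fetchall()
print(rows)
```

Deleting the day's rows before inserting keeps a rerun from double counting, which matters when a scheduled job is retried after a failure.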
  8. Facebook Data Infrastructure
     2008
     [Diagram labels: Scribe Tier, MySQL Tier, Hadoop Tier, Oracle RAC Servers]
  9. SQL Server 2008 R2
     Old Features
     ▪ ETL: SQL Server Integration Services
     ▪ DW: SQL Server
     ▪ Reporting: SQL Server Reporting Services
     ▪ Analytics: SQL Server Analysis Services
     ▪ Search: Full-Text Search
  10. SQL Server 2008 R2
      New Features
      ▪ Stream management: StreamInsight
      ▪ OLAP: PowerPivot
      ▪ Collaboration: SharePoint
      ▪ MDM: Master Data Services
      ▪ Scale-out: Parallel Data Warehouse
  11. A New Foundation
      Motivations and Implementation
      ▪ Orders of magnitude growth in data volumes and complexity
      ▪ Often from machine-generated logs
      ▪ Complex data is vast majority of data
      ▪ Built by consumer web teams and not enterprise software firms
      ▪ Open source
      ▪ Modular collection of tools, not an opaque abstraction
      ▪ Applications, not just analysis
      ▪ Solve user needs, don’t implement a spec
  12. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.