Monday, March 1, 2010
Open Questions for Building
         An Enterprise Data Platform
         On the Cloud

         Jeff Hammerbacher
         Chief Scientist and Vice President of Products, Cloudera
         March 1, 2010



Presentation Outline
         ▪   Who am I and what am I talking about?
             ▪   My Background
             ▪   Open Questions
             ▪   Data Platforms
             ▪   The Cloud
         ▪   Research Challenges
             ▪   Infrastructure
             ▪   Interface
             ▪   Migration
             ▪   Build something!


My Background
         Thanks for Asking
         ▪   hammer@cloudera.com
         ▪   Studied Mathematics at Harvard
         ▪   Worked as a Quant on Wall Street
         ▪   Conceived, built, and led Data team at Facebook
             ▪   Nearly 30 amazing engineers and data scientists
             ▪   Several open source projects and research papers
         ▪   Founder of Cloudera
             ▪   Vice President of Products and Chief Scientist
             ▪   Also, check out the book “Beautiful Data”

Open Questions
         Some Context
         ▪   I don’t have a PhD
         ▪   In fact, I don’t have a publication history
             ▪   But I read a lot?
         ▪   Have deployed (and sometimes built) several distributed systems
             ▪   Oracle RAC
             ▪   Hadoop + Hive
             ▪   Cassandra
             ▪   New things at Cloudera
         ▪   Sort of like the Cubs GM asking a Cubs fan for advice

Data Platforms
         Circumscribing our Focus
         ▪   Primarily concerned with infrastructure for analytics
         ▪   To borrow a phrase from Ralph Kimball
             ▪   Operational systems “turn the wheels”
             ▪   Analytical systems “watch the wheels turn”
         ▪   Reference architecture
             ▪   ETL/Data Integration
             ▪   DW
             ▪   BI
             ▪   Complex Analytics

Data Platforms
         Another Perspective
         ▪   Analytical infrastructure as a platform
             ▪   Infrastructure providers
                 ▪   Hardware and systems software
             ▪   Platform providers
                 ▪   Suite of software tools to collect, store, manage, and analyze data
             ▪   Content providers
             ▪   Application developers
             ▪   End users




The Cloud
         Some Terminology
         ▪   Layers of providers (looks familiar)
             ▪   Infrastructure as a Service (IaaS)
             ▪   Platform as a Service (PaaS)
             ▪   Software as a Service (SaaS)
         ▪   Where is it deployed?
             ▪   Public cloud
             ▪   Private cloud
             ▪   Hybrid cloud



The Cloud
         Current State
         ▪   Many infrastructure and software providers
             ▪   Rackspace, Terremark, SoftLayer, and friends in infrastructure
             ▪   Salesforce and Workday in traditional enterprise applications
             ▪   SnapLogic, Cast Iron Systems in ETL
             ▪   Kognitio in DW
             ▪   LucidEra, PivotLink, Quantivo, and friends in BI
         ▪   Less developed PaaS market for analytics
             ▪   RightScale + Talend + Vertica + Jaspersoft partnership



Research Challenges
         Problem Statement
         What are the research challenges we’ll encounter moving from today’s architectures for enterprise analytics to an integrated platform-as-a-service model built on public, private, or hybrid cloud infrastructure?




Research Challenges
         Infrastructure
         ▪   Server and data center design
             ▪   Servers for WSCs project at Michigan
             ▪   FAWN at CMU: low-power CPU and SSD for storage
             ▪   Making use of multi-core and GPUs
             ▪   Power management projects all over
             ▪   Data center design projects
                 ▪   Evolution of containers
                 ▪   Yahoo!’s “chicken coop”
         ▪   OpenFlow, Vyatta, Arista, and Nicira in networking


Research Challenges
         Infrastructure
         ▪   How to achieve isolation while maintaining performance?
             ▪   Failure isolation
             ▪   Performance isolation
             ▪   Security isolation
         ▪   Many interesting projects
             ▪   Process Groups/Containers: Solaris Zones, LXC, Job Objects
             ▪   Lowered VM startup time via cloning: SnowFlock
             ▪   Data locality for VM scheduling: Tashi
             ▪   Resource management for grids: Nexus


Research Challenges
         Infrastructure
         ▪   Configuration Management
             ▪   Lots of work in industry: cfengine, bcfg2, Puppet, Chef
             ▪   Not a lot of research on the topic!
         ▪   Scheduling
             ▪   Benchmarks for concurrent queries and almost-full systems
             ▪   Hybrid cloud (“cloudbursting”) scheduling
             ▪   Scheduling in the presence of variable performance
                 ▪   Continuous version of fault tolerance?




Research Challenges
         Infrastructure
         ▪   Bulk data transfer
             ▪   Moving data over the WAN is scary
             ▪   Aspera, FastSoft, WAM!NET built companies out of this research
             ▪   UDT proposed as a protocol from Chicago
             ▪   Incremental progress indicators and restart would be nice
         ▪   Latency-sensitive requests
             ▪   Lower variability: better DNS?
             ▪   Lower latency: SPDY?
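
The "incremental progress indicators and restart" wish above can be sketched in a few lines. This is a minimal illustration on local files, not how Aspera, FastSoft, or UDT work: the name `resumable_copy`, the `report` callback, and the 1 MiB chunk size are all assumptions made for this example. It also trusts that any bytes already at the destination are a correct prefix of the source; a real transfer tool would presumably verify that with checksums.

```python
import os

CHUNK_SIZE = 1 << 20  # 1 MiB; an arbitrary choice for this sketch


def resumable_copy(src, dst, chunk_size=CHUNK_SIZE, report=None):
    """Copy src to dst, resuming from however many bytes dst already holds.

    Hypothetical helper: assumes the existing bytes in dst are a valid
    prefix of src (no checksum verification is done here).
    """
    total = os.path.getsize(src)
    done = os.path.getsize(dst) if os.path.exists(dst) else 0
    if done > total:
        raise ValueError("destination is larger than source")
    # Append mode keeps what was already transferred and adds to the end.
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(done)
        while done < total:
            chunk = fin.read(min(chunk_size, total - done))
            if not chunk:
                break  # source shrank underneath us; stop rather than loop
            fout.write(chunk)
            done += len(chunk)
            if report is not None:
                report(done, total)  # incremental progress indicator
    return done
```

Calling it again after an interruption picks up at the last byte written, and `report` can drive a progress bar or a log line.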



Research Challenges
         Interface
         ▪   Application Developers
             ▪   Incremental query progress visualization
             ▪   Run time simulation and prediction
             ▪   ILLUSTRATE command for sample tuple generation
             ▪   Compile-time rather than run-time checking
             ▪   Libraries of basic operations which present higher-order APIs
             ▪   Performance optimization suggestions
             ▪   Distributed debugging utilities
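
To make the ILLUSTRATE idea above concrete, here is a naive sketch: run each step of a pipeline over a handful of sampled records and keep the intermediate tuples so a developer can inspect them before launching a full job. The function name `illustrate`, the `(name, step)` pair format, and the use of plain random sampling are assumptions for this example. A real implementation would likely need to choose or synthesize records more carefully so that filters and joins do not come up empty.

```python
import random


def illustrate(records, steps, sample_size=3, seed=0):
    """Run each named step over a small sample and keep the intermediate tuples.

    Hypothetical helper: `steps` is a list of (name, function) pairs, where
    each function takes a list of records and returns an iterable of records.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs show the same sample
    records = list(records)
    sample = rng.sample(records, min(sample_size, len(records)))
    trace = [("input", sample)]
    for name, step in steps:
        sample = list(step(sample))
        trace.append((name, sample))
    return trace


if __name__ == "__main__":
    rows = [("a", 1), ("b", 5), ("c", 9), ("d", 2)]
    pipeline = [
        ("filter value > 2", lambda rs: [r for r in rs if r[1] > 2]),
        ("project name", lambda rs: [r[0] for r in rs]),
    ]
    for name, tuples in illustrate(rows, pipeline):
        print(name, tuples)
```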



Research Challenges
         Interface
         ▪   New data models: when to use them and how do they interact?
             ▪   Multi-dimensional hash maps with locality groups: BigTable, HBase
             ▪   Documents: CouchDB, MongoDB, Riak (MarkLogic?)
             ▪   Arrays: SciDB
             ▪   Graphs: SHS
             ▪   Trajectories: TrajStore
         ▪   Cross-language serialization and RPC frameworks
             ▪   ASN.1, XDR, CORBA, ICE, Thrift, Etch, PBs, DataSeries, Avro


Research Challenges
         Interface
         ▪   Query languages
             ▪   Programmer time-to-learn and productivity analysis for:
                 ▪   Various MapReduce implementations
                 ▪   Sawzall, PigLatin, SCOPE, Hive, DryadLINQ, ScalaQL
                 ▪   Existing stuff: PL/SQL, TSQL, SQL*Loader, XQuery, XPath, etc.?
                 ▪   Languages for analytics: R, S, SAS, SPSS, Matlab
             ▪   Can these all target a single execution layer?
             ▪   Should we be embedding our queries in a host language?
                 ▪   LINQ, ScalaQL, Ferry



Research Challenges
         Interface
         ▪   Collaborative analytics
             ▪   User profiles, news feed, message inboxes, recommendations
         ▪   Improve the browser
             ▪   Interactive visualization libraries in JavaScript
             ▪   What does HTML5 mean for the data analyst?
         ▪   How can we leverage multi-touch interfaces?
         ▪   What do new mobile devices mean for data analysts?
             ▪   Netbooks, iPhone, Android phones, Kindle, Nook, etc.



Research Challenges
         Migration
         ▪   How do we get there from here?
             ▪   Workload analysis to identify what can be moved to PaaS first
             ▪   Ethnographic studies of what’s hard for data analysts today
             ▪   Privacy and security considerations
                 ▪   Integration with third-party data sources
                 ▪   Retention policies
             ▪   Cloud interoperability!
             ▪   Tools to prototype locally and deploy to platform later
             ▪   New university courses to build these skills


Research Challenges
         Build Something!
         ▪   “A man who carries a cat by the tail...”
         ▪   Participate in an open source community
         ▪   Build a website and make the data available (e.g. MovieLens)
         ▪   Experience the joys of
             ▪   installation
             ▪   configuration
             ▪   deployment
             ▪   monitoring
             ▪   performance tuning, debugging, upgrades, and more!

(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0



