Jeff Hammerbacher outlines several open research challenges around building an enterprise data analytics platform in the cloud. Some key challenges include improving infrastructure for isolation, performance, and bulk data transfer in cloud environments. Additional challenges involve developing better interfaces for application developers, query languages, and tools to aid in migrating existing workloads to cloud platforms. Hammerbacher calls for researchers and engineers to help address these challenges through open source contributions and building prototype systems.
2. Open Questions for Building an Enterprise Data Platform on the Cloud
Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
March 1, 2010
Monday, March 1, 2010
3. Presentation Outline
▪ Who am I and what am I talking about?
▪ My Background
▪ Open Questions
▪ Data Platforms
▪ The Cloud
▪ Research Challenges
▪ Infrastructure
▪ Interface
▪ Migration
▪ Build something!
4. My Background
Thanks for Asking
▪ hammer@cloudera.com
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
▪ Nearly 30 amazing engineers and data scientists
▪ Several open source projects and research papers
▪ Founder of Cloudera
▪ Vice President of Products and Chief Scientist
▪ Also, check out the book “Beautiful Data”
5. Open Questions
Some Context
▪ I don’t have a PhD
▪ In fact, I don’t have a publication history
▪ But I read a lot?
▪ Have deployed (and sometimes built) several distributed systems
▪ Oracle RAC
▪ Hadoop + Hive
▪ Cassandra
▪ New things at Cloudera
▪ Sort of like the Cubs GM asking a Cubs fan for advice
6. Data Platforms
Circumscribing our Focus
▪ Primarily concerned with infrastructure for analytics
▪ To borrow a phrase from Ralph Kimball
▪ Operational systems “turn the wheels”
▪ Analytical systems “watch the wheels turn”
▪ Reference architecture
▪ ETL/Data Integration
▪ DW
▪ BI
▪ Complex Analytics
7. Data Platforms
Another Perspective
▪ Analytical infrastructure as a platform
▪ Infrastructure providers
▪ Hardware and systems software
▪ Platform providers
▪ Suite of software tools to collect, store, manage, and analyze data
▪ Content providers
▪ Application developers
▪ End users
8. The Cloud
Some Terminology
▪ Layers of providers (looks familiar)
▪ Infrastructure as a Service (IaaS)
▪ Platform as a Service (PaaS)
▪ Software as a Service (SaaS)
▪ Where is it deployed?
▪ Public cloud
▪ Private cloud
▪ Hybrid cloud
9. The Cloud
Current State
▪ Many infrastructure and software providers
▪ Rackspace, Terremark, SoftLayer, and friends in infrastructure
▪ Salesforce and Workday in traditional enterprise applications
▪ SnapLogic, Cast Iron Systems in ETL
▪ Kognitio in DW
▪ LucidEra, PivotLink, Quantivo, and friends in BI
▪ Less developed PaaS market for analytics
▪ RightScale + Talend + Vertica + Jaspersoft partnership
10. Research Challenges
Problem Statement
What are the research challenges we’ll encounter moving from today’s architectures for enterprise analytics to an integrated platform-as-a-service model built on public, private, or hybrid cloud infrastructure?
11. Research Challenges
Infrastructure
▪ Server and data center design
▪ Servers for WSCs project at Michigan
▪ FAWN at CMU: low-power CPU and SSD for storage
▪ Making use of multi-core and GPUs
▪ Power management projects all over
▪ Data center design projects
▪ Evolution of containers
▪ Yahoo!’s “chicken coop”
▪ OpenFlow, Vyatta, Arista, and Nicira in networking
12. Research Challenges
Infrastructure
▪ How to achieve isolation while maintaining performance?
▪ Failure isolation
▪ Performance isolation
▪ Security isolation
▪ Many interesting projects
▪ Process Groups/Containers: Solaris Zones, LXC, Job Objects
▪ Lowered VM startup time via cloning: SnowFlock
▪ Data locality for VM scheduling: Tashi
▪ Resource management for grids: Nexus
13. Research Challenges
Infrastructure
▪ Configuration Management
▪ Lots of work in industry: cfengine, bcfg2, Puppet, Chef
▪ Not a lot of research on the topic!
▪ Scheduling
▪ Benchmarks for concurrent queries and almost-full systems
▪ Hybrid cloud (“cloudbursting”) scheduling
▪ Scheduling in the presence of variable performance
▪ Continuous version of fault tolerance?
14. Research Challenges
Infrastructure
▪ Bulk data transfer
▪ Moving data over the WAN is scary
▪ Aspera, FastSoft, WAM!NET built companies out of this research
▪ UDT, a UDP-based transfer protocol proposed at the University of Illinois at Chicago
▪ Incremental progress indicators and restart would be nice
▪ Latency-sensitive requests
▪ Lower variability: better DNS?
▪ Lower latency: SPDY?
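To make the "incremental progress indicators and restart" wish from the bulk data transfer bullets concrete, here is a minimal sketch in Python. It copies between two local files rather than over a WAN, and the names `resumable_copy`, `max_chunks`, and `on_progress` are hypothetical, introduced only for illustration. It assumes the destination already holds a valid prefix of the source; a real tool would verify that with checksums.

```python
import os

CHUNK_SIZE = 64 * 1024  # bytes per chunk; an arbitrary illustrative value


def resumable_copy(src, dst, chunk_size=CHUNK_SIZE, max_chunks=None, on_progress=None):
    """Copy src to dst, resuming from however many bytes dst already holds.

    max_chunks simulates an interruption: the call stops after that many chunks.
    on_progress, if given, is called as on_progress(bytes_done, bytes_total).
    Returns the number of bytes present in dst when the call ends.
    """
    total = os.path.getsize(src)
    done = os.path.getsize(dst) if os.path.exists(dst) else 0
    if done > total:
        raise ValueError("destination is larger than source; cannot resume")

    chunks_sent = 0
    # Append mode means every write lands after the bytes already transferred.
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(done)  # skip the part that a previous attempt already copied
        while done < total:
            if max_chunks is not None and chunks_sent >= max_chunks:
                break  # simulated interruption
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            done += len(chunk)
            chunks_sent += 1
            if on_progress is not None:
                on_progress(done, total)
    return done
```

Calling `resumable_copy` a second time with the same arguments picks up where the first call stopped, and the callback gives a caller enough information to draw a progress bar.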
15. Research Challenges
Interface
▪ Application Developers
▪ Incremental query progress visualization
▪ Run time simulation and prediction
▪ ILLUSTRATE command for sample tuple generation
▪ Compile-time rather than run-time checking
▪ Libraries of basic operations which present higher-order APIs
▪ Performance optimization suggestions
▪ Distributed debugging utilities
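The idea behind the ILLUSTRATE bullet, showing a developer example tuples at each step of a data flow before running it at scale, can be sketched roughly as follows. This is not how Pig implements ILLUSTRATE, which also works to make sure every operator has example data flowing through it. The `illustrate` function and its parameters are hypothetical, and the sketch simply pushes a small random sample through each step.

```python
import random


def illustrate(records, steps, sample_size=3, seed=0):
    """Run each named step on a small sample and return the tuples seen at every stage.

    steps is a list of (name, function) pairs; each function maps a list of
    tuples to a list of tuples. The result is a list of (stage_name, tuples).
    """
    rng = random.Random(seed)  # fixed seed so repeated runs show the same sample
    records = list(records)
    sample = rng.sample(records, min(sample_size, len(records)))

    trace = [("input", sample)]
    current = sample
    for name, fn in steps:
        current = list(fn(current))
        trace.append((name, current))
    return trace


if __name__ == "__main__":
    clicks = [("alice", 3), ("bob", 7), ("carol", 5), ("dave", 1)]
    pipeline = [
        ("filter clicks > 4", lambda rows: [r for r in rows if r[1] > 4]),
        ("project user", lambda rows: [(r[0],) for r in rows]),
    ]
    for stage, tuples in illustrate(clicks, pipeline):
        print(stage, tuples)
```

A plain random sample can leave a selective filter with no surviving tuples, which is exactly the gap that smarter sample generation is meant to close.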
16. Research Challenges
Interface
▪ New data models: when to use them and how do they interact?
▪ Multi-dimensional hash maps with locality groups: BigTable, HBase
▪ Documents: CouchDB, MongoDB, Riak (MarkLogic?)
▪ Arrays: SciDB
▪ Graphs: SHS
▪ Trajectories: TrajStore
▪ Cross-language serialization and RPC frameworks
▪ ASN.1, XDR, CORBA, Ice, Thrift, Etch, Protocol Buffers, DataSeries, Avro
17. Research Challenges
Interface
▪ Query languages
▪ Programmer time-to-learn and productivity analysis for:
▪ Various MapReduce implementations
▪ Sawzall, Pig Latin, SCOPE, Hive, DryadLINQ, ScalaQL
▪ Existing stuff: PL/SQL, T-SQL, SQL*Loader, XQuery, XPath, etc.?
▪ Languages for analytics: R, S, SAS, SPSS, MATLAB
▪ Can these all target a single execution layer?
▪ Should we be embedding our queries in a host language?
▪ LINQ, ScalaQL, Ferry
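As a rough illustration of what "embedding queries in a host language" means, the sketch below builds a query from ordinary Python method calls and lambdas instead of a query string. It is loosely in the spirit of LINQ-style operators but does not reproduce the API of LINQ, ScalaQL, or Ferry. The `Query` class is hypothetical, and it evaluates in memory, whereas those systems translate the query for an external engine.

```python
class Query:
    """A lazily evaluated query over any iterable of rows."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)

    def where(self, predicate):
        # Return a new Query; the original is left unchanged.
        return Query(self._source, self._ops + (("where", predicate),))

    def select(self, projection):
        return Query(self._source, self._ops + (("select", projection),))

    def __iter__(self):
        # Nothing runs until the query is iterated.
        rows = iter(self._source)
        for kind, fn in self._ops:
            rows = filter(fn, rows) if kind == "where" else map(fn, rows)
        return rows


if __name__ == "__main__":
    events = [
        {"user": "a", "clicks": 3},
        {"user": "b", "clicks": 12},
        {"user": "c", "clicks": 40},
    ]
    heavy_users = Query(events).where(lambda r: r["clicks"] > 10).select(lambda r: r["user"])
    print(list(heavy_users))
```

Because the query is a host-language value, the host compiler or interpreter can check names and syntax, and the recorded list of operations is what a real system would hand to an optimizer or execution layer.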
18. Research Challenges
Interface
▪ Collaborative analytics
▪ User profiles, news feed, message inboxes, recommendations
▪ Improve the browser
▪ Interactive visualization libraries in JavaScript
▪ What does HTML5 mean for the data analyst?
▪ How can we leverage multi-touch interfaces?
▪ What do new mobile devices mean for data analysts?
▪ Netbooks, iPhone, Android phones, Kindle, Nook, etc.
19. Research Challenges
Migration
▪ How do we get there from here?
▪ Workload analysis to identify what can be moved to PaaS first
▪ Ethnographic studies of what’s hard for data analysts today
▪ Privacy and security considerations
▪ Integration with third-party data sources
▪ Retention policies
▪ Cloud interoperability!
▪ Tools to prototype locally and deploy to platform later
▪ New university courses to build these skills
20. Research Challenges
Build Something!
▪ “A man who carries a cat by the tail...”
▪ Participate in an open source community
▪ Build a website and make the data available (e.g. MovieLens)
▪ Experience the joys of
▪ installation
▪ configuration
▪ deployment
▪ monitoring
▪ performance tuning, debugging, upgrades, and more!
21. (c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.