2. WHO THE HELL IS THIS GUY?
PhD in Large-scale scientific data management
Parallel query processing,
Greenplum Parallel Database
Hadoop and Hive at Qubole
Niche.
Scientific simulation apps
Fortune Companies
Small and medium enterprises
3. SO YOU WANT TO DO SOME BIG DATA ANALYTICS…
Want to do targeted marketing campaigns
Want to minimize churn (attrition in customer base)
Want to build a product recommendation engine
Use data to improve your business
4. TYPICAL BIG DATA PROJECT
Buy lots of hardware
Buy / install software
Hire admins who can keep everything running
Hire analysts/data scientists to come up with interesting questions
Productionize questions into reports
5. PROBLEM 1
Most organizations struggle to achieve > 40% utilization of their cluster
Workloads are exploratory and iterative
Actionable reports are produced at best a few times a day
Since you have to plan 2-3 years ahead, chances are you will overprovision
Chen et al., VLDB 2012
Provision for peak workload
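The cost of low utilization can be made concrete with a little arithmetic. The sketch below is only an illustration: the price of 1.00 per node-hour is a hypothetical figure, and the 40% utilization is the number quoted on this slide.

```python
def cost_per_useful_node_hour(cost_per_node_hour: float, utilization: float) -> float:
    """Effective cost of one node-hour that actually does work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return cost_per_node_hour / utilization


# Hypothetical price of 1.00 per node-hour; 40% utilization as on the slide.
effective = cost_per_useful_node_hour(1.00, 0.40)
print(f"{effective:.2f}")  # 2.50
```

In other words, at 40% utilization every useful node-hour costs 2.5 times its list price.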
7. RESULT
Big Data projects have traditionally been done at companies that
Can afford to overprovision
Can hire the right talent
8. LANDSCAPE IS CHANGING
Advent of clouds
Provision tens to hundreds of machines in minutes
Pay as you go, grow as you please
Free / cheap big-data software
Hadoop
Hive
R
Sqoop
(many more)
9. PUBLIC CLOUDS ARE GROWING
[Chart: I/O requests over time]
More people are doing critical stuff in the cloud!
10. CLOUD PRIMITIVES
Persistent object/file store, e.g. Amazon’s S3
Ability to provision cluster with pre-built images
Ability to add or remove nodes from the cluster
Hosted operational store like MySQL
Ways to bid for excess capacity (Amazon’s spot instances)
Can get up to 90% discount
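To see what bidding for excess capacity can mean for a bill, here is a rough sketch of a blended hourly cost. All inputs are hypothetical (10 nodes, a price of 1.00 per node-hour, half the cluster on spot); the 90% discount is the best case quoted on this slide, not a typical one.

```python
def blended_hourly_cost(nodes: int, on_demand_price: float,
                        spot_fraction: float, spot_discount: float) -> float:
    """Hourly cost of a cluster that mixes on-demand and spot nodes."""
    spot_nodes = nodes * spot_fraction
    on_demand_nodes = nodes - spot_nodes
    spot_price = on_demand_price * (1 - spot_discount)
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price


# Hypothetical cluster: 10 nodes, half of them spot at a 90% discount.
mixed = blended_hourly_cost(10, 1.00, spot_fraction=0.5, spot_discount=0.9)
all_on_demand = blended_hourly_cost(10, 1.00, spot_fraction=0.0, spot_discount=0.9)
print(f"{mixed:.2f} vs {all_on_demand:.2f}")
```

Spot nodes can be reclaimed by the provider, so the saving comes with a reliability tradeoff that this sketch ignores.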
11. ENTER HADOOP
Open-source implementation of MapReduce, the model Google used to index trillions of web pages
Allows programmers to write distributed programs using map and reduce abstractions
Primarily Java, but supports other languages too
Ability to run these programs on large amounts of data
Runs on a bunch of cheap hardware and can tolerate failures
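The map and reduce abstractions are easiest to see in a word count. The sketch below runs in a single Python process and only mimics the programming model; it does not use Hadoop's API, and the function names are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(w, c) for w, c in shuffle(mapped).items())


counts = word_count(["big data in the cloud", "big clusters in the cloud"])
print(counts["big"], counts["data"])  # 2 1
```

On a real cluster the map calls run in parallel on different chunks of the input and the framework performs the shuffle across machines; the programmer still writes only the map and reduce functions.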
13. ENTER HIVE
Facebook had a multi-petabyte warehouse
Had 80+ engineers writing Hadoop jobs
Quickly realized that files are insufficient abstractions
Need SQL concepts like tables, schemas, partitions, indices
Many, many, many more people know SQL than Hadoop
So, implemented SQL on top of Hadoop
Made data more accessible
Finally, FB open sourced it
14. HIVE
SQL* interface on top of unstructured data
Handles a variety of open data formats
JSON, Text, Binary, Avro, ProtoBuf, Thrift
Extreme pluggability
Some things aren’t meant to be done in SQL
Custom Python, PHP, Ruby, Bash code
Production ready
Processes 25PB of data in FB
Hive project started by Qubole founders!
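The pluggability point can be illustrated with a custom Python script. Hive's TRANSFORM clause streams rows to an external script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The sketch below assumes a hypothetical table with user_id and email columns and a hypothetical script name, extract_domain.py; none of these come from the talk.

```python
import sys
from typing import Iterable, Iterator


def transform(lines: Iterable[str]) -> Iterator[str]:
    """Turn 'user_id<TAB>email' rows into 'user_id<TAB>domain' rows."""
    for line in lines:
        user_id, email = line.rstrip("\n").split("\t")
        yield f"{user_id}\t{email.split('@')[-1].lower()}"


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Entry point when Hive runs the script: stream stdin to stdout."""
    for row in transform(stdin):
        stdout.write(row + "\n")


# Local demonstration with two hypothetical rows.
sample = ["1\talice@Example.com\n", "2\tbob@qubole.com\n"]
result = list(transform(sample))
print(result)
```

Saved as a script that calls main(), this could be invoked from a query along the lines of SELECT TRANSFORM(user_id, email) USING 'python extract_domain.py' AS (user_id, domain) FROM users, after shipping the file to the cluster with ADD FILE.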
16. RECAP: LANDSCAPE IS CHANGING
Advent of clouds
Free / cheap big-data software
17. THE BIG OPPORTUNITY
Hadoop++ is great for analytics, but designed for data centers
Cloud offers very different tradeoffs and opportunities
Big Data Analytics in the Cloud!
18. ENTER QUBOLE
[Diagram labels: Spreadsheets*, BI tools, Custom Apps, Browser]
Other players:
• Amazon’s EMR
• Treasure Data
• Mortar Data
19. QUBOLE FEATURES
Simple query interface
Automated cluster management
Cloud performance enhancements
Integration with data sources / sinks
Workflows
Scheduler
Programmability
21. CLUSTER MANAGEMENT
Automatically launches clusters and shuts them down at hour boundaries
Recycles bad clusters (it happens, sometimes)
Save logs for debugging
Spot instances to save costs
Sophisticated auto-scaling algorithm adjusts to usage
Actual user quote: “I've basically not had to learn *anything* to get my data feed working”
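Shutting down at hour boundaries makes sense when nodes are billed per started hour: an idle node costs nothing extra until its current hour runs out. The sketch below is an assumption about how such a rule could look, not Qubole's actual algorithm; the hourly billing period and the 5-minute window are both hypothetical parameters.

```python
from datetime import datetime, timedelta

BILLING_PERIOD = timedelta(hours=1)  # assumed: nodes are billed per started hour


def should_terminate(launched_at: datetime, now: datetime, is_idle: bool,
                     window: timedelta = timedelta(minutes=5)) -> bool:
    """Terminate an idle node only when its paid hour is about to end."""
    if not is_idle:
        return False
    into_period = (now - launched_at) % BILLING_PERIOD
    return BILLING_PERIOD - into_period <= window


launched = datetime(2013, 1, 1, 12, 0)
print(should_terminate(launched, datetime(2013, 1, 1, 12, 30), is_idle=True))  # False
print(should_terminate(launched, datetime(2013, 1, 1, 12, 57), is_idle=True))  # True
```

Keeping an idle node until the end of its paid hour means it is still available at no extra cost if new queries arrive in the meantime.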
23. INTEGRATION
ODBC Driver
Tableau
Excel
Database connectors
MySQL
Vertica
MongoDB
Other Sources
Google Analytics
Omniture *
AppNexus
24. WORKFLOWS AND SCHEDULER
Example workflow:
Extract data from operational MySQL DB about customer transactions
Extract FB data on your company or product page
Run report that joins FB data with DB data to see how many people who have had failed transactions have commented on the FB page
Push results to reporting DB so that customer support can access them on the internal site
Scheduler allows you to run this workflow every night
Dealing with late-arriving data
Notifications
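The example workflow above can be sketched as a chain of small steps. Everything here is a stand-in: in-memory lists replace the MySQL and FB extracts, a dict replaces the reporting DB, and the sketch assumes both sources share a customer identifier. It shows the shape of the workflow, not Qubole's workflow API.

```python
from typing import Dict, List, Set


def extract_failed_transactions(transactions: List[dict]) -> Set[str]:
    """Step 1: customers with at least one failed transaction."""
    return {t["customer"] for t in transactions if t["status"] == "failed"}


def extract_commenters(comments: List[dict]) -> Set[str]:
    """Step 2: customers who commented on the company page."""
    return {c["customer"] for c in comments}


def join_report(failed: Set[str], commenters: Set[str]) -> List[str]:
    """Step 3: customers who appear in both extracts."""
    return sorted(failed & commenters)


def run_workflow(transactions: List[dict], comments: List[dict],
                 reporting_db: Dict[str, List[str]]) -> List[str]:
    """Run the steps in order and push the result to the reporting store."""
    report = join_report(extract_failed_transactions(transactions),
                         extract_commenters(comments))
    reporting_db["failed_and_commented"] = report  # Step 4: push results
    return report


# Hypothetical data for one nightly run.
transactions = [{"customer": "ann", "status": "failed"},
                {"customer": "bob", "status": "ok"},
                {"customer": "cat", "status": "failed"}]
comments = [{"customer": "ann"}, {"customer": "bob"}]
reporting_db: Dict[str, List[str]] = {}
print(run_workflow(transactions, comments, reporting_db))  # ['ann']
```

A scheduler would then call run_workflow once per night with that day's extracts.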
26. USE CASE
Current Customer
Most popular Q&A site
Use cases:
A/B testing on new product features and the resulting analysis
Path analysis on application usage
Operational metrics
Within one month, went from 4 to 16 users!
28. CONCLUSION
Big Data Analytics in the Cloud done right
Provision 2-node or 500-node clusters with the same ease
Pay as you go, grow as you please
Integrate a variety of data sources
Optimized for the cloud
Reduces business risk and time to insight
29. THANK YOU!
QUESTIONS?
Go to http://www.qubole.com to sign up for a free trial!
We are hiring! jobs@qubole.com
snarayanan@qubole.com
@k2_181
30. PERFORMANCE
Columnar cache – 3x speedup
Prefetch files to hide latency – 30% improvement
Optimize split computation – 8x improvement
Multi-part upload of large files
Moving files is expensive, so write output directly
Qubole Hive server – 8x speedup for DDL statements
Order-by-limit query optimization – 5x improvement
Cloud optimized: 5x faster than Amazon’s Elastic MapReduce
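One common way to speed up an ORDER BY ... LIMIT k query is to avoid a full sort: each partition keeps only its own top k rows, and a final step merges those small lists. The slides do not say how Qubole implements its optimization, so the sketch below shows only this general top-k technique, using Python's heapq on in-memory partitions with made-up data.

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def top_k_distributed(partitions: Iterable[Iterable[T]],
                      key: Callable[[T], float], k: int) -> List[T]:
    """ORDER BY key DESC LIMIT k without sorting the full data set."""
    # Each partition keeps only its k largest rows (the "map" side).
    local_tops = [heapq.nlargest(k, part, key=key) for part in partitions]
    # A single merge step picks the global top k from the survivors.
    survivors = (row for part in local_tops for row in part)
    return heapq.nlargest(k, survivors, key=key)


# Hypothetical data: three partitions of scores.
partitions = [[5, 1, 9], [7, 3], [8, 2, 6]]
top3 = top_k_distributed(partitions, key=lambda x: x, k=3)
print(top3)  # [9, 8, 7]
```

The final step handles at most k rows per partition rather than the whole table, which is where the saving comes from.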