Big Data Analytics in the Cloud

BIG DATA ANALYTICS IN THE CLOUD
Siva Narayanan
Qubole
snarayanan@qubole.com
@k2_181

WHO THE HELL IS THIS GUY?
 PhD in large-scale scientific data management
 Parallel query processing, Greenplum Parallel Database
 Hadoop and Hive at Qubole
Niche scientific simulation apps, Fortune companies, small and medium enterprises

SO YOU WANT TO DO SOME BIG DATA ANALYTICS…
 You want to run targeted marketing campaigns
 You want to minimize churn (attrition in customer base)
 You want to build a product recommendation engine
Use data to improve your business

TYPICAL BIG DATA PROJECT
 Buy lots of hardware
 Buy / install software
 Hire admins who can keep everything running
 Hire analysts/data scientists to come up with interesting questions
 Productionalize questions into reports

PROBLEM 1
 Most organizations struggle to achieve more than 40% utilization of their cluster
 Workloads are exploratory and iterative
 Actionable reports are produced at best a few times a day
 Since you have to plan 2-3 years ahead, chances are you will overprovision
Chen et al., VLDB 2012
Provision for peak workload
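To make the cost of provisioning for peak concrete, here is a minimal arithmetic sketch. The hourly demand numbers are invented for illustration, and `utilization` is a hypothetical helper, not something from the talk.

```python
def utilization(hourly_demand, provisioned_nodes):
    """Average fraction of provisioned nodes that are busy."""
    if provisioned_nodes <= 0:
        raise ValueError("provisioned_nodes must be positive")
    # A fixed cluster can never use more nodes than it owns.
    used = sum(min(d, provisioned_nodes) for d in hourly_demand)
    return used / (provisioned_nodes * len(hourly_demand))

# Hypothetical day: 20 quiet hours needing 10 nodes, 4 peak hours needing 100.
demand = [10] * 20 + [100] * 4
peak = max(demand)

fixed_utilization = utilization(demand, peak)   # cluster sized for peak
fixed_node_hours = peak * len(demand)           # what you pay for on-premise
elastic_node_hours = sum(demand)                # what pay-as-you-go would bill

print(f"utilization when sized for peak: {fixed_utilization:.0%}")
print(f"node-hours: fixed={fixed_node_hours}, elastic={elastic_node_hours}")
```

With this made-up workload, a cluster sized for the peak is busy only a quarter of the time, which is the kind of gap the utilization figure above describes.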

PROBLEM 2
Heterogeneous data
End users (product managers, user ops, etc.)
BOTTLENECK: ops engineers and data scientists

RESULT
 Big Data projects are traditionally done at companies that:
 Can afford to overprovision
 Can hire the right talent

LANDSCAPE IS CHANGING
 Advent of clouds
 Provision tens to hundreds of machines in minutes
 Pay as you go, grow as you please
 Free / cheap big-data software
 Hadoop
 Hive
 R
 Sqoop
 (many more)

PUBLIC CLOUDS ARE GROWING
Chart: I/O requests over time
More people are doing critical stuff in the cloud!

CLOUD PRIMITIVES
 Persistent object/file store e.g. Amazon’s S3
 Ability to provision cluster with pre-built images
 Ability to add or remove nodes from the cluster
 Hosted operational store like MySQL
 Ways to bid for excess capacity (Amazon’s spot instances)
 Can get up to 90% discount

ENTER HADOOP
 Open-source implementation of MapReduce, the programming model Google used to index trillions of web pages
 Allows programmers to write distributed programs using map and reduce abstractions
 Primarily Java, but supports other languages too
 Ability to run these programs on large amounts of data
 Runs on a bunch of cheap hardware and can tolerate failures
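The map and reduce abstractions can be illustrated with the classic word count. This is a single-process Python sketch of the programming model only; it does not use the Hadoop API, and the function names are made up for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data in the cloud", "big clusters in the cloud"])
print(counts)
```

In a real cluster the map calls run in parallel on different machines and the shuffle moves data over the network; the programmer still only writes the two small functions.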

ENTER HIVE
 Facebook had a multi-petabyte warehouse
 Had 80+ engineers writing Hadoop jobs
 Quickly realized that files are insufficient abstractions
 Need SQL concepts like tables, schemas, partitions, indices
 Many, many, many more people know SQL than Hadoop
 So, implemented SQL on top of Hadoop
 Made data more accessible
 Finally, FB open sourced it

HIVE
 SQL* interface on top of unstructured data
 Handles variety of open data formats
 JSON, Text, Binary, Avro, ProtoBuf, Thrift
 Extreme pluggability
 Some things aren’t meant to be done in SQL
 Custom Python, PHP, Ruby, Bash code
 Production ready
 Processes 25PB of data in FB
Hive project started by Qubole founders!
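As an illustration of plugging custom Python into a query, here is a sketch of a script for Hive's TRANSFORM clause, which streams rows to a script as tab-separated lines and reads tab-separated lines back. The table, column names, and script name are hypothetical.

```python
import io

# Hypothetical HiveQL that would call this script:
#   ADD FILE email_domain.py;
#   SELECT TRANSFORM (user_id, email)
#          USING 'python email_domain.py' AS (user_id, domain)
#   FROM users;

def transform(line):
    """Turn 'user_id<TAB>email' into 'user_id<TAB>domain'."""
    user_id, email = line.rstrip("\n").split("\t")
    domain = email.rsplit("@", 1)[-1].lower()
    return f"{user_id}\t{domain}"

def run(stdin, stdout):
    """Apply transform to every non-empty input row."""
    for line in stdin:
        if line.strip():
            stdout.write(transform(line) + "\n")

# Local check with in-memory streams. As a Hive script you would instead
# import sys and call run(sys.stdin, sys.stdout).
out = io.StringIO()
run(io.StringIO("1\talice@Example.com\n2\tbob@qubole.com\n"), out)
result = out.getvalue()
print(result, end="")
```

Keeping the row logic in `transform` makes it easy to test locally before running it against real data.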

RECAP: LANDSCAPE IS CHANGING
 Advent of clouds
 Free / cheap big-data software

THE BIG OPPORTUNITY
 Hadoop++ is great for analytics, but designed for data centers
 Cloud offers very different tradeoffs and opportunities
Big Data Analytics in the Cloud!

ENTER QUBOLE
Browser, Spreadsheets*, BI tools, Custom Apps
Other players:
• Amazon’s EMR
• Treasure Data
• Mortar Data

QUBOLE FEATURES
 Simple query interface
 Automated cluster management
 Cloud performance enhancements
 Integration with data sources / sinks
 Workflows
 Scheduler
 Programmability

CLUSTER MANAGEMENT
 Automatic launching and shutting down of clusters at hour boundaries
 Recycle bad clusters (it happens, sometimes)
 Save logs for debugging
 Spot instances to save costs
 Sophisticated auto-scaling algorithm adjusts to usage
Actual user quote: “I've basically not had to learn *anything* to get my data feed working”
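The slide does not describe the auto-scaling algorithm, so the following is only a toy sketch of the general idea, not Qubole's implementation. It assumes nodes are billed per started hour (my reading of "hour boundaries"), and every name and threshold here is hypothetical.

```python
import math

def desired_nodes(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Size the cluster to the backlog, clamped to the allowed range."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))

def nodes_to_release(node_uptimes_min, target, window_min=5):
    """Pick surplus nodes that are about to finish a billed hour.

    Releasing a node earlier would waste time that is already paid for,
    so only nodes within window_min minutes of an hour boundary qualify.
    """
    surplus = len(node_uptimes_min) - target
    if surplus <= 0:
        return []
    near_boundary = [
        i for i, uptime in enumerate(node_uptimes_min)
        if uptime % 60 >= 60 - window_min
    ]
    return near_boundary[:surplus]

# Hypothetical cluster state: uptime of each node in minutes.
uptimes = [57, 130, 118, 45]
target = desired_nodes(pending_tasks=40, tasks_per_node=20,
                       min_nodes=1, max_nodes=10)
release = nodes_to_release(uptimes, target)
print(f"target={target}, release node indexes={release}")
```

Scaling up can happen immediately when the backlog grows, while scaling down waits for the boundary; that asymmetry is the design choice this sketch is meant to show.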

PERFORMANCE
Cloud optimized: 5x faster than Amazon’s Elastic MapReduce

INTEGRATION
 ODBC Driver
 Tableau
 Excel
 Database connectors
 MySQL
 Vertica
 MongoDB
 Other Sources
 Google Analytics
 Omniture *
 AppNexus

WORKFLOWS AND SCHEDULER
 Example workflow:
 Extract data from operational MySQL DB about customer transactions
 Extract FB data on your company or product page
 Run report that joins FB data with DB data to see how many people who have had failed transactions have commented on the FB page
 Push results to reporting DB so that customer support can access them on an internal site
 Scheduler allows you to run this workflow every night
 Deals with late-arriving data
 Notifications
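The example workflow can be sketched as a chain of plain functions. Everything here is a stand-in: the rows are invented, the extract steps return in-memory lists instead of querying MySQL or FB, and the reporting DB is just a Python list.

```python
def extract_transactions():
    # Stand-in for the extract from the operational MySQL DB.
    return [
        {"customer": "ann", "status": "failed"},
        {"customer": "bob", "status": "ok"},
        {"customer": "cy", "status": "failed"},
    ]

def extract_page_comments():
    # Stand-in for the extract of comments on the company's FB page.
    return [
        {"customer": "ann", "text": "payment did not go through"},
        {"customer": "bob", "text": "love it"},
    ]

def failed_and_commented(transactions, comments):
    """Join step: customers with a failed transaction who also commented."""
    failed = {t["customer"] for t in transactions if t["status"] == "failed"}
    commenters = {c["customer"] for c in comments}
    return sorted(failed & commenters)

def run_workflow(reporting_db):
    """Run extract, join, and push in order; a scheduler would call this nightly."""
    transactions = extract_transactions()
    comments = extract_page_comments()
    report = failed_and_commented(transactions, comments)
    reporting_db.append({"customers": report, "count": len(report)})
    return report

reporting_db = []  # stand-in for the reporting DB
report = run_workflow(reporting_db)
print(report, reporting_db)
```

Each step depends only on the output of the previous ones, which is what lets a workflow engine rerun or reschedule the chain as a unit.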

PROGRAMMABILITY: REST API
Python SDK to talk to Qubole

USE CASE
 Current Customer
 Most popular Q&A site
 Use cases:
 A/B testing on new product features and the resulting analysis
 Path analysis on application usage
 Operational metrics
Within one month, went from 4 to 16 users!

ABOUT QUBOLE
Ashish Thusoo, CEO/Cofounder
Joydeep Sen Sarma, CTO/Cofounder
Sadiq Shaik, Director Prod Mgmt
Shrikanth Shankar, Head of Engineering
Processed more than 2 Petabytes in August!

CONCLUSION
 Big Data Analytics in the Cloud done right
 Provision 2-node clusters or 500-node clusters with the same ease
 Pay as you go, grow as you please
 Integrate variety of data sources
 Optimized for the cloud
 Reduces business risk and time to insight

THANK YOU!
QUESTIONS?
Go to http://www.qubole.com to sign up for a free trial!
We are hiring! jobs@qubole.com
 snarayanan@qubole.com
 @k2_181

PERFORMANCE
 Columnar cache – 3x speedup
 Prefetch files to hide latency – 30% improvement
 Optimize split computation – 8x improvement
 Multi-part upload of large files
 Moving files is expensive, write output directly
 Qubole Hive server – 8x speedup for DDL statements
 Order-by-limit query optimization – 5x improvement
Cloud optimized: 5x faster than Amazon’s Elastic MapReduce
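The slide does not say how the order-by-limit optimization works. One common way to speed up such queries, sketched here as an assumption rather than as Qubole's method, is to have each task keep only its own top k rows so that the final step sorts a handful of candidates instead of the whole dataset.

```python
import heapq

def local_top_k(partition, k):
    """Each task keeps only its k smallest rows instead of sorting everything."""
    return heapq.nsmallest(k, partition)

def order_by_limit(partitions, k):
    """Merge the per-partition candidates and take the global k smallest."""
    candidates = [row for part in partitions for row in local_top_k(part, k)]
    return heapq.nsmallest(k, candidates)

# Hypothetical data split across three tasks; equivalent to
# SELECT x FROM t ORDER BY x LIMIT 3.
partitions = [[9, 3, 7], [4, 8, 1], [6, 2, 5]]
top3 = order_by_limit(partitions, 3)
print(top3)
```

The result matches a full sort followed by a limit, because any row in the global top k must also be in the top k of its own partition.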
