BIG DATA ANALYTICS IN THE CLOUD
Siva Narayanan
Qubole
snarayanan@qubole.com
@k2_181
WHO THE HELL IS THIS GUY?
 PhD in Large-scale scientific data management
 Parallel query processing, Greenplum Parallel Database
 Hadoop and Hive at Qubole
Niche.
Scientific simulation apps
Fortune Companies
Small and medium enterprises
SO YOU WANT TO DO SOME BIG DATA ANALYTICS…
 You want to run targeted marketing campaigns
 You want to minimize churn (attrition in your customer base)
 You want to build a product recommendation engine
Use data to improve your business
TYPICAL BIG DATA PROJECT
 Buy lots of hardware
 Buy / install software
 Hire admins who can keep everything running
 Hire analysts/data scientists to come up with interesting questions
 Productionize questions into reports
PROBLEM 1
 Most organizations struggle to achieve more than 40% utilization of their cluster
 Workloads are exploratory and iterative
 Actionable reports are produced a few times a day at best
 Since you have to plan 2-3 years ahead, chances are you will overprovision
Chen et al., VLDB 2012
Provision for peak workload
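To make the overprovisioning argument concrete, here is a minimal sketch in Python. The hourly demand figures are invented for illustration; they do not come from the talk or from Chen et al. The sketch compares a fixed cluster sized for the busiest hour with one that only pays for the node-hours it needs.

```python
# Hypothetical demand, in nodes needed per hour, over one day (illustrative only).
hourly_demand = [10, 10, 10, 10, 10, 10, 20, 40, 80, 100, 100, 80,
                 60, 60, 80, 100, 80, 40, 20, 10, 10, 10, 10, 10]


def utilization(demand, capacity):
    """Average fraction of the provisioned capacity that is actually used."""
    used = sum(min(d, capacity) for d in demand)
    return used / (capacity * len(demand))


# A fixed cluster has to be sized for the busiest hour.
peak_capacity = max(hourly_demand)
fixed_utilization = utilization(hourly_demand, peak_capacity)

# Node-hours paid for: fixed cluster versus one that follows demand.
node_hours_fixed = peak_capacity * len(hourly_demand)
node_hours_elastic = sum(hourly_demand)

print(f"utilization when sized for peak: {fixed_utilization:.0%}")
print(f"node-hours: fixed={node_hours_fixed}, elastic={node_hours_elastic}")
```

With a spiky daily pattern like this one, most of the peak-sized capacity sits idle outside the busy hours.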
PROBLEM 2
Heterogeneous Data
End Users (Product Mgrs, User Ops etc.)
BOTTLENECK
Ops Engineers
Data Scientists
RESULT
 Big Data projects have traditionally been done at companies that
 Can afford to overprovision
 Can hire the right talent
LANDSCAPE IS CHANGING
 Advent of clouds
 Provision tens to hundreds of machines in minutes
 Pay as you go, grow as you please
 Free / cheap big-data software
 Hadoop
 Hive
 R
 Sqoop
 (many more)
PUBLIC CLOUDS ARE GROWING
[Chart: I/O requests over time]
More people are doing critical stuff in the cloud!
CLOUD PRIMITIVES
 Persistent object/file store e.g. Amazon’s S3
 Ability to provision cluster with pre-built images
 Ability to add or remove nodes from the cluster
 Hosted operational store like MySQL
 Ways to bid for excess capacity (Amazon’s spot instances)
 Can get up to 90% discount
ENTER HADOOP
 Open-source implementation of MapReduce, the programming model Google used to index trillions of web pages
 Lets programmers write distributed programs using the map and reduce abstractions
 Primarily Java, but supports other languages too
 Runs these programs on large amounts of data
 Uses a cluster of cheap hardware and can tolerate failures
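The map and reduce abstractions can be illustrated with the classic word count. The sketch below is plain single-process Python, not Hadoop's Java API. The `shuffle` step stands in for the grouping by key that the framework does between the two phases, and all function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over a small in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data in the cloud", "Big data analytics"])
print(counts)
```

On a real cluster the map calls run in parallel on different chunks of the input and the reduce calls run in parallel on different keys, which is why the programmer only has to supply these two functions.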
HADOOP SCALES!
ENTER HIVE
 Facebook had a multi-petabyte warehouse
 Had 80+ engineers writing Hadoop jobs
 Quickly realized that files are an insufficient abstraction
 Need SQL concepts like tables, schemas, partitions, indices
 Many, many, many more people know SQL than Hadoop
 So, implemented SQL on top of Hadoop
 Made data more accessible
 Finally, FB open-sourced it
HIVE
 SQL* interface on top of unstructured data
 Handles a variety of open data formats
 JSON, Text, Binary, Avro, ProtoBuf, Thrift
 Extreme pluggability
 Some things aren’t meant to be done in SQL
 Custom Python, PHP, Ruby, Bash code
 Production ready
 Processes 25PB of data at FB
Hive project started by Qubole founders!
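As an illustration of the "custom Python code" point: Hive can stream rows through an external script, passing each row as a tab-separated line on standard input and reading tab-separated lines back from standard output. The sketch below assumes a hypothetical input of (user_id, url) rows and emits (user_id, domain). The column names and the script's purpose are invented for this example.

```python
import sys
from urllib.parse import urlparse


def transform_row(line):
    """Turn one tab-separated (user_id, url) row into (user_id, domain)."""
    user_id, url = line.rstrip("\n").split("\t")
    domain = urlparse(url).netloc
    return f"{user_id}\t{domain}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows the way Hive streams them and write transformed rows back."""
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(transform_row(line) + "\n")
```

Saved as a script that ends with a call to `main()` and registered with `ADD FILE`, this could be used from a query along the lines of `SELECT TRANSFORM(user_id, url) USING 'python extract_domain.py' AS (user_id, domain) FROM page_views`, where the script and table names are again hypothetical.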
HIVE SCALES!
RECAP: LANDSCAPE IS CHANGING
 Advent of clouds
 Free / cheap big-data software
THE BIG OPPORTUNITY
 Hadoop++ is great for analytics, but designed for data centers
 Cloud offers very different tradeoffs and opportunities
Big Data Analytics in the Cloud!
ENTER QUBOLE
Spreadsheets*, BI tools, Custom Apps, Browser
Other players:
• Amazon’s EMR
• Treasure Data
• Mortar Data
QUBOLE FEATURES
 Simple query interface
 Automated cluster management
 Cloud performance enhancements
 Integration with data sources / sinks
 Workflows
 Scheduler
 Programmability
QUERY INTERFACE
CLUSTER MANAGEMENT
 Automatically launches clusters and shuts them down at hour boundaries
 Recycles bad clusters (it happens, sometimes)
 Saves logs for debugging
 Uses spot instances to save costs
 Sophisticated auto-scaling algorithm adjusts to usage
Actual user quote: “I've basically not had to learn *anything* to get my data feed working”
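The slides do not describe the auto-scaling algorithm, so the following is only a sketch of why hour boundaries matter, assuming instances are billed per started hour. Under that assumption an idle node's current hour is already paid for, so it costs nothing extra to keep it until just before the next hour starts. The function name and the five-minute window are hypothetical.

```python
def should_terminate(uptime_minutes, is_idle, window_minutes=5):
    """Decide whether to release a node, assuming per-started-hour billing.

    An idle node is kept until shortly before its next billed hour begins,
    because new work may still arrive during the hour already paid for.
    """
    if not is_idle:
        return False
    minutes_into_hour = uptime_minutes % 60
    return minutes_into_hour >= 60 - window_minutes


# An idle node 30 minutes into its hour is kept; at 57 minutes it is released.
print(should_terminate(30, is_idle=True))
print(should_terminate(57, is_idle=True))
```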
PERFORMANCE
Cloud optimized: 5x faster than Amazon’s Elastic MapReduce
INTEGRATION
 ODBC Driver
 Tableau
 Excel
 Database connectors
 MySQL
 Vertica
 MongoDB
 Other Sources
 Google Analytics
 Omniture *
 AppNexus
WORKFLOWS AND SCHEDULER
 Example workflow:
 Extract data from operational MySQL DB about customer transactions
 Extract FB data on your company or product page
 Run a report that joins FB data with DB data to see how many people who have had failed transactions have commented on the FB page
 Push results to the reporting DB so that customer support can access them on an internal site
 Scheduler allows you to run this workflow every night
 Deals with late-arriving data
 Notifications
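The example workflow above is a small dependency graph: the two extracts can run independently, the report needs both, and the push needs the report. A minimal sketch of that ordering in Python follows. It uses the standard library's `graphlib` and is not Qubole's workflow engine. The step names are hypothetical labels and the actions are stand-ins that only record that they ran.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
workflow = {
    "extract_mysql_transactions": set(),
    "extract_fb_page_data": set(),
    "join_report": {"extract_mysql_transactions", "extract_fb_page_data"},
    "push_to_reporting_db": {"join_report"},
}


def run_workflow(steps, actions):
    """Run every step after its dependencies and return the execution order."""
    order = list(TopologicalSorter(steps).static_order())
    for name in order:
        actions[name]()  # an exception here stops the remaining steps
    return order


# Stand-in actions that only log their own name.
log = []
actions = {name: (lambda n=name: log.append(n)) for name in workflow}
order = run_workflow(workflow, actions)
print(order)
```

A scheduler would then trigger `run_workflow` once per night. Handling late-arriving data and notifications would need extra logic that this sketch leaves out.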
PROGRAMMABILITY: REST API
Python SDK to talk to Qubole
USE CASE
 Current Customer
 Most popular Q&A site
 Use cases:
 A/B testing on new product features and the resulting analysis
 Path analysis on application usage
 Operational metrics
Within one month, went from 4 to 16 users!
ABOUT QUBOLE
Ashish Thusoo
CEO/Cofounder
Joydeep Sen Sarma
CTO/Cofounder
Sadiq Shaik
Director Prod Mgmt
Shrikanth Shankar
Head of Engineering
Processed more than 2 Petabytes in August!
CONCLUSION
 Big Data Analytics in the Cloud done right
 Provision 2-node clusters or 500-node clusters with the same ease
 Pay as you go, grow as you please
 Integrate variety of data sources
 Optimized for the cloud
 Reduces business risk and time to insight
THANK YOU!
QUESTIONS?
Go to http://www.qubole.com to sign up for a free trial!
We are hiring! jobs@qubole.com
 snarayanan@qubole.com
 @k2_181
PERFORMANCE
 Columnar cache – 3x speedup
 Prefetch files to hide latency – 30% improvement
 Optimize split computation – 8x improvement
 Multi-part upload of large files
 Moving files is expensive, so write output directly
 Qubole Hive server – 8x speedup for DDL statements
 Order-by-limit query optimization – 5x improvement
Cloud optimized: 5x faster than Amazon’s Elastic MapReduce
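The slides do not say how the order-by-limit optimization is implemented. One common way to speed up such queries is to keep only the top few rows in a bounded heap instead of sorting the whole input, and the sketch below shows that general idea in Python. The row fields are invented for illustration.

```python
import heapq


def order_by_limit(rows, key, limit):
    """Return the `limit` smallest rows by `key` without sorting all rows."""
    return heapq.nsmallest(limit, rows, key=key)


rows = [
    {"user": "a", "latency_ms": 120},
    {"user": "b", "latency_ms": 35},
    {"user": "c", "latency_ms": 300},
    {"user": "d", "latency_ms": 80},
]

# Roughly: SELECT * FROM rows ORDER BY latency_ms LIMIT 2
top2 = order_by_limit(rows, key=lambda r: r["latency_ms"], limit=2)
print(top2)
```

In a distributed setting each task could apply the same trick to its own slice of the data, so that only a few rows per task have to be merged at the end.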

Big Data Analytics in the Cloud

  • 1. BIG DATA ANALYTICS IN THE CLOUD
      Siva Narayanan, Qubole
      snarayanan@qubole.com, @k2_181
  • 2. WHO THE HELL IS THIS GUY?
      • PhD in large-scale scientific data management
      • Parallel query processing, Greenplum Parallel Database
      • Hadoop and Hive at Qubole
      (Diagram labels: Niche; Scientific simulation apps; Fortune companies; Small and medium enterprises)
  • 3. SO YOU WANT TO DO SOME BIG DATA ANALYTICS…
      • Want to do targeted marketing campaigns
      • Want to minimize churn (attrition in customer base)
      • Want to build a product recommendation engine
      Use data to improve your business
  • 4. TYPICAL BIG DATA PROJECT
      • Buy lots of hardware
      • Buy / install software
      • Hire admins who can keep everything running
      • Hire analysts/data scientists to come up with interesting questions
      • Productionalize questions into reports
  • 5. PROBLEM 1
      • Most organizations struggle to achieve > 40% utilization of their cluster
      • Exploratory and iterative
      • Actionable reports produced at best a few times a day
      • Since you have to plan 2-3 years ahead, chances are you will overprovision
      Chen et al., VLDB 2012
      Provision for peak workload
  • 6. PROBLEM 2
      (Diagram labels: Heterogeneous Data; End Users (Product Mgrs, User Ops, etc.); BOTTLENECK; Ops Engineers; Data Scientists)
  • 7. RESULT
      • Big Data projects traditionally done at companies
          • Who can afford to overprovision
          • Who can hire the right talent
  • 8. LANDSCAPE IS CHANGING
      • Advent of clouds
          • Provision 10-100s of machines in minutes
          • Pay as you go, grow as you please
      • Free / cheap big-data software
          • Hadoop
          • Hive
          • R
          • Sqoop
          • (many more)
  • 9. PUBLIC CLOUDS ARE GROWING
      (Chart axes: Time; I/O Requests)
      More people are doing critical stuff in the cloud!
  • 10. CLOUD PRIMITIVES
      • Persistent object/file store, e.g. Amazon’s S3
      • Ability to provision cluster with pre-built images
      • Ability to add or remove nodes from the cluster
      • Hosted operational store like MySQL
      • Ways to bid for excess capacity (Amazon’s spot instances)
          • Can get up to 90% discount
  • 11. ENTER HADOOP
      • Open-source implementation of MapReduce, the model Google used to index trillions of web pages
      • Allows programmers to write distributed programs using map and reduce abstractions
      • Primarily Java, but supports other languages too
      • Ability to run these programs on large amounts of data
      • Uses a bunch of cheap hardware, can tolerate failures
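
The map and reduce abstractions mentioned on this slide can be illustrated with a small single-process sketch. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the grouping step that Hadoop performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all counts emitted for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data in the cloud", "big clusters in the cloud"])
```

Because each call to `map_phase` looks at one line only and each call to `reduce_phase` looks at one key only, a framework is free to spread those calls over many machines, which is the property the slide relies on.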
  • 13. ENTER HIVE
      • Facebook had a multi-petabyte warehouse
      • Had 80+ engineers writing Hadoop jobs
      • Quickly realized that files are insufficient abstractions
      • Need SQL concepts like tables, schemas, partitions, indices
      • Many, many, many more people know SQL than Hadoop
      • So, implemented SQL on top of Hadoop
      • Made data more accessible
      • Finally, FB open sourced it
  • 14. HIVE
      • SQL* interface on top of unstructured data
      • Handles variety of open data formats
          • JSON, Text, Binary, Avro, ProtoBuf, Thrift
      • Extreme pluggability
          • Some things aren’t meant to be done in SQL
          • Custom Python, PHP, Ruby, Bash code
      • Production ready
          • Processes 25PB of data in FB
      Hive project started by Qubole founders!
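
As a rough illustration of the "custom Python code" point: Hive's TRANSFORM clause streams rows to an external script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The sketch below assumes a hypothetical two-column input (`user_id`, `url`) and a hypothetical script name `extract_domain.py`; neither appears in the slides. A query along the lines of `SELECT TRANSFORM(user_id, url) USING 'python extract_domain.py' AS (user_id, domain) FROM page_views` would call it, with `page_views` also a made-up table.

```python
import sys


def transform(lines):
    """Turn 'user_id<TAB>url' rows into 'user_id<TAB>domain' rows."""
    for line in lines:
        user_id, url = line.rstrip("\n").split("\t")
        # Drop the scheme ("http://") and everything after the host name.
        domain = url.split("//")[-1].split("/")[0]
        yield f"{user_id}\t{domain}"


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    """Entry point a Hive job would use: read rows from stdin, write to stdout."""
    for row in transform(stream_in):
        stream_out.write(row + "\n")


# Local check with two sample rows instead of a real Hive stream.
sample = ["42\thttp://www.qubole.com/pricing\n", "7\thttps://example.org/\n"]
result = list(transform(sample))
```

Keeping the row logic in `transform` and the I/O in `main` lets the same function be tested locally on a list of strings before it is shipped with a query.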
  • 16. RECAP: LANDSCAPE IS CHANGING
      • Advent of clouds
      • Free / cheap big-data software
  • 17. THE BIG OPPORTUNITY
      • Hadoop++ is great for analytics, but designed for data centers
      • Cloud offers very different tradeoffs and opportunities
      Big Data Analytics in the Cloud!
  • 18. ENTER QUBOLE
      (Diagram labels: Spreadsheets*; BI tools; Custom Apps; Browser)
      Other players:
          • Amazon’s EMR
          • Treasure Data
          • Mortar Data
  • 19. QUBOLE FEATURES
      • Simple query interface
      • Automated cluster management
      • Cloud performance enhancements
      • Integration with data sources / sinks
      • Workflows
      • Scheduler
      • Programmability
  • 21. CLUSTER MANAGEMENT
      • Automatic launching and shutting down of clusters at hour boundaries
      • Recycle bad clusters (it happens, sometimes)
      • Save logs for debugging
      • Spot instances to save costs
      • Sophisticated auto-scaling algorithm adjusts to usage
      Actual user quote: “I've basically not had to learn *anything* to get my data feed working”
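
The slides do not describe Qubole's actual auto-scaling algorithm, so the following is only a toy sketch of the two ideas named above: sizing the cluster to usage, and releasing nodes at hour boundaries. It assumes instances are billed by the hour, which is why an idle node is kept until shortly before its paid hour ends. All names and parameters (`desired_nodes`, `release_now`, `window_minutes`) are hypothetical.

```python
import math


def desired_nodes(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Nodes needed to run all pending tasks at once, clamped to cluster limits."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


def release_now(uptime_minutes, is_idle, window_minutes=5):
    """Release an idle node only in the last few minutes of its billed hour."""
    minutes_into_hour = uptime_minutes % 60
    return is_idle and minutes_into_hour >= 60 - window_minutes


target = desired_nodes(pending_tasks=130, tasks_per_node=8, min_nodes=2, max_nodes=20)
```

Under hourly billing, an idle node that is 30 minutes into its hour has already been paid for, so keeping it costs nothing extra and it may pick up new work; the `release_now` check encodes that trade-off.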
  • 22. PERFORMANCE
      Cloud optimized: 5x faster than Amazon’s Elastic MapReduce
  • 23. INTEGRATION
      • ODBC Driver
          • Tableau
          • Excel
      • Database connectors
          • MySQL
          • Vertica
          • MongoDB
      • Other Sources
          • Google Analytics
          • Omniture *
          • AppNexus
  • 24. WORKFLOWS AND SCHEDULER
      • Example workflow:
          • Extract data from operational MySQL DB about customer transactions
          • Extract FB data on your company or product page
          • Run report that joins FB data with DB data to see how many people who have had failed transactions have commented on the FB page
          • Push results to reporting DB so that customer support can access them on an internal site
      • Scheduler allows you to run this workflow every night
          • Dealing with late arrival data
          • Notifications
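
The join step of the example workflow can be sketched in a few lines. This is not Qubole's workflow API: an in-memory SQLite database stands in for both the operational MySQL DB and the Hive report, and the table names, column names, and sample rows are invented for illustration.

```python
import sqlite3


def run_nightly_report(transactions, fb_comments):
    """Count customers with a failed transaction who also commented on the page."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE transactions (customer TEXT, status TEXT)")
    db.execute("CREATE TABLE fb_comments (customer TEXT, comment TEXT)")
    db.executemany("INSERT INTO transactions VALUES (?, ?)", transactions)
    db.executemany("INSERT INTO fb_comments VALUES (?, ?)", fb_comments)
    # Join the two extracts; DISTINCT avoids counting a customer twice
    # when they have several failed transactions or several comments.
    (count,) = db.execute(
        """
        SELECT COUNT(DISTINCT t.customer)
        FROM transactions t
        JOIN fb_comments c ON c.customer = t.customer
        WHERE t.status = 'failed'
        """
    ).fetchone()
    db.close()
    return count


transactions = [("ann", "failed"), ("bob", "ok"), ("cy", "failed"), ("ann", "failed")]
fb_comments = [("ann", "payment did not go through"), ("bob", "love it"), ("ann", "still broken")]
failed_and_commented = run_nightly_report(transactions, fb_comments)
```

In a scheduled workflow, the two input lists would come from the extract steps and the returned count would be written to the reporting DB instead of being held in a variable.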
  • 25. PROGRAMMABILITY: REST API
      Python SDK to talk to Qubole
  • 26. USE CASE
      • Current Customer
          • Most popular Q&A site
      • Use cases:
          • A/B testing on new product features and the resulting analysis
          • Path analysis on application usage
          • Operational metrics
      Within one month, went from 4 to 16 users!
  • 27. ABOUT QUBOLE
      • Ashish Thusoo, CEO/Cofounder
      • Joydeep Sen Sarma, CTO/Cofounder
      • Sadiq Shaik, Director Prod Mgmt
      • Shrikanth Shankar, Head of Engineering
      Processed more than 2 Petabytes in August!
  • 28. CONCLUSION
      • Big Data Analytics in the Cloud done right
          • Provision 2 node clusters or 500 node clusters with same ease
          • Pay as you go, grow as you please
          • Integrate variety of data sources
          • Optimized for the cloud
      • Reduces business risk and time to insight
  • 29. THANK YOU! QUESTIONS?
      Go to http://www.qubole.com to sign up for a free trial!
      We are hiring! jobs@qubole.com
      snarayanan@qubole.com, @k2_181
  • 30. PERFORMANCE
      • Columnar cache – 3x speedup
      • Prefetch files to hide latency – 30% improvement
      • Optimize split computation – 8x improvement
      • Multi-part upload of large files
      • Moving files is expensive, write output directly
      • Qubole Hive server – 8x speedup for DDL statements
      • Order-by-limit query optimization – 5x improvement
      Cloud optimized: 5x faster than Amazon’s Elastic MapReduce
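
The slides do not say how the order-by-limit optimization was implemented. One common technique for `ORDER BY ... LIMIT k` queries, sketched here as an assumption, is to have each worker keep only its local top k rows and then merge those small lists, instead of sorting the full dataset. The function name and the sample partitions are hypothetical.

```python
import heapq


def order_by_limit(partitions, key, limit):
    """Return the `limit` smallest rows across all partitions, in sorted order."""
    # Each partition (think: one map task) keeps only its own top `limit` rows.
    local_tops = [heapq.nsmallest(limit, part, key=key) for part in partitions]
    # A single final step merges the small per-partition results.
    candidates = (row for top in local_tops for row in top)
    return heapq.nsmallest(limit, candidates, key=key)


partitions = [[5, 1, 9], [3, 8], [2, 7, 4]]
top3 = order_by_limit(partitions, key=lambda row: row, limit=3)
```

The final merge sees at most `limit` rows per partition, so the amount of data moved to the last step depends on the limit and the number of partitions, not on the table size.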

Editor's Notes

  1. Browser-based interface
      • Table explorer
      • Query history
      • Syntax highlighting
      • Test mode execution
      • Expression evaluation
      • Killing unwanted queries