Cloud Friendly Hadoop & Hive

Joydeep Sen Sarma
Qubole
Agenda

• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud

Qubole Data Service

[Diagram, built up over slides 3-7. Layers from top to bottom:]
  SDK, ODBC
  Explore – Integrate – Analyze – Schedule
  API
  Oozie, Hive, Pig, Sqoop
  Hadoop
  AWS EC2
  AWS S3 (S3://adco/logs)
[Shown alongside: Vertica, Mysql]

Agenda

• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud

Step 1 (Optional): Set Up Hadoop

Step 2: Fire Away

Workloads submitted to the AdCo Hadoop cluster (slides 10-17):

  select t.county, count(1) from (select
  transform(a.zip) using 'geo.py' as
  county from SMALL_TABLE a) t
  group by t.county;

  hadoop jar myapp.jar -Dmapred.min.split.size=32000000
  -partitioner .org.apache…

  insert overwrite table dest
  select a.id, a.zip, count(distinct b.uid)
  from ads a join LARGE_TABLE b on (a.id=b.ad_id)
  group by a.id, a.zip;

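The first query above streams each zip through a script using Hive's TRANSFORM clause. The slides do not show geo.py itself, so the following is only a minimal sketch of what such a script might look like. It assumes rows arrive tab-separated on stdin, one per line, and uses a small hypothetical ZIP_TO_COUNTY table in place of a real lookup.

```python
import sys

# Hypothetical lookup table; a real script would load a full zip-to-county mapping.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "94103": "San Francisco",
    "10001": "New York",
}


def zip_to_county(zip_code):
    """Return the county for a zip code, or 'UNKNOWN' if it is not in the table."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")


def transform(lines):
    """Map each input row (first tab-separated field is the zip) to a county."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield zip_to_county(fields[0])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write one county per line to stdout."""
    for county in transform(stdin):
        stdout.write(county + "\n")
```

Saved as geo.py with a final call to main(), a script of this shape can be exercised locally by piping a few zip codes into it before running the query on the cluster.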
Come back anytime

Hadoop as a Service
1. Detect when a cluster is required
  – Not all Hive statements require a cluster (EXPLAIN, SHOW, ...)

2. Atomically create the cluster
  – Creation is a long-running process; concurrency control uses MySQL

3. Shut down when not in use
  – Do it on an hour boundary (whose?)
  – Not if user sessions are active!

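Steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not Qubole's implementation: the list of metadata-only statement prefixes is a guess beyond EXPLAIN and SHOW, the table name cluster_start is hypothetical, and sqlite3 stands in for MySQL so the example is self-contained. The idea is that a primary-key constraint lets exactly one session claim the job of starting the cluster.

```python
import sqlite3

# EXPLAIN and SHOW come from the slide; DESCRIBE is an added assumption.
METADATA_ONLY = ("explain", "show", "describe")


def needs_cluster(statement):
    """Return False for statements assumed to be answerable without a cluster."""
    words = statement.strip().lower().split(None, 1)
    first_word = words[0] if words else ""
    return bool(first_word) and first_word not in METADATA_ONLY


def make_db():
    """Create an in-memory stand-in for the MySQL coordination table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cluster_start (account TEXT PRIMARY KEY, state TEXT)"
    )
    return conn


def claim_cluster_start(conn, account):
    """Try to become the single session that starts this account's cluster.

    Only one INSERT per account can succeed because of the primary key,
    so concurrent callers get False and should wait instead of launching.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO cluster_start (account, state) VALUES (?, 'starting')",
                (account,),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A session would call needs_cluster first, and only if it returns True attempt claim_cluster_start; losers of the race wait for the winner's cluster to come up.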
Hadoop as a Service
• Archive job history and logs to S3
  – Transparent access to old jobs

• Auto-configure different node types
  – Use ALL ephemeral drives for HDFS/MR
  – Use the right number of slots per machine

• Scrub, scrub, scrub
  – Bad nodes, bad clusters, AWS timeouts

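The auto-configuration bullet can be illustrated with a small sketch. The slot heuristic (one slot per vCPU, capped by an assumed 2 GB of memory per slot, with half as many reduce slots) and the /hdfs and /mapred subdirectory names are hypothetical choices, not values from the talk. The property names are the Hadoop 1.x ones.

```python
def slots_for_node(vcpus, memory_gb, gb_per_slot=2.0, reduce_fraction=0.5):
    """Return (map_slots, reduce_slots) for one node under assumed heuristics."""
    # Cap slots by both CPU count and available memory, with at least one slot.
    map_slots = max(1, min(vcpus, int(memory_gb // gb_per_slot)))
    reduce_slots = max(1, int(map_slots * reduce_fraction))
    return map_slots, reduce_slots


def node_config(vcpus, memory_gb, ephemeral_mounts):
    """Build per-node Hadoop 1.x settings that spread I/O over every drive."""
    map_slots, reduce_slots = slots_for_node(vcpus, memory_gb)
    return {
        # One directory per ephemeral drive, for both HDFS and MapReduce.
        "dfs.data.dir": ",".join(m + "/hdfs" for m in ephemeral_mounts),
        "mapred.local.dir": ",".join(m + "/mapred" for m in ephemeral_mounts),
        "mapred.tasktracker.map.tasks.maximum": str(map_slots),
        "mapred.tasktracker.reduce.tasks.maximum": str(reduce_slots),
    }
```

For example, node_config(4, 15, ["/mnt", "/mnt2"]) lists both mounts in each directory setting and yields 4 map slots and 2 reduce slots under these assumptions.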
Scaling Up

  insert overwrite table dest
  select … from ads join campaigns on … group by …;

[Diagram, built up over slides 21-29. Labels: Master, Job Tracker, Map Tasks, Reduce Tasks, Progress, Supply, Demand, StarCluster, Slaves, AWS]

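The diagram's Progress, Supply and Demand labels suggest a scale-up decision driven by how much work is queued versus how many slots exist. The slides do not spell out the policy, so the following is only one plausible sketch. All parameter names, and the idea of sizing the cluster so pending work fits in a target time window, are assumptions.

```python
import math


def nodes_to_add(pending_tasks, avg_task_secs, current_nodes,
                 slots_per_node, target_secs, max_nodes):
    """Return how many nodes to request so pending work fits in target_secs."""
    if pending_tasks <= 0:
        return 0
    # Demand: total task-seconds of work still queued.
    demand_secs = pending_tasks * avg_task_secs
    # Supply needed: slots required to finish that work inside the window.
    needed_slots = math.ceil(demand_secs / target_secs)
    needed_nodes = math.ceil(needed_slots / slots_per_node)
    # Never ask for more than the configured maximum cluster size.
    wanted = min(needed_nodes, max_nodes)
    return max(0, wanted - current_nodes)
```

With 1000 pending one-minute tasks, 5 nodes of 4 slots each and a 10-minute target, this asks for 20 more nodes; with only 10 pending tasks it asks for none.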
Scaling Down
1. On the hour boundary, check whether the node is required:
   – Can’t remove nodes with map outputs (today)
   – Don’t go below the minimum cluster size

2. Remove the node from the Map-Reduce cluster

3. Request HDFS decommissioning – fast!
   – Delete affected cache files instead of re-replicating
   – One surviving replica and we are done.

4. Delete the instance

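The check in step 1 can be sketched as a filter over the cluster's nodes. The Node fields, the five-minute window before the hour boundary, and treating "no running tasks" as "not required" are all illustrative assumptions.

```python
from collections import namedtuple

# Hypothetical per-node state; field names are illustrative only.
Node = namedtuple(
    "Node", ["name", "minutes_into_hour", "has_map_outputs", "running_tasks"]
)


def removable_nodes(nodes, min_cluster_size, boundary_window=5):
    """Pick nodes that may be removed now without shrinking below the minimum."""
    candidates = [
        n for n in nodes
        if n.minutes_into_hour >= 60 - boundary_window  # close to its hour boundary
        and not n.has_map_outputs                       # map outputs still needed
        and n.running_tasks == 0                        # assumed meaning of "not required"
    ]
    # Never remove so many nodes that the cluster drops below its minimum size.
    allowed = max(0, len(nodes) - min_cluster_size)
    return candidates[:allowed]
```

Each returned node would then go through steps 2 to 4: removal from the Map-Reduce cluster, HDFS decommissioning, and instance deletion.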
Spot Instances

On average 50-60% cheaper

Spot Instance: Challenges
• Can lose Spot nodes at any time
  – Disastrous for HDFS
  – Hybrid Mode: Use a mix of On-Demand and Spot
  – Hybrid Mode: Keep one replica on On-Demand nodes

• Spot Instances may not be available
  – Time out and use On-Demand nodes as fallback

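Both mitigations can be sketched in a few lines. This is an illustration, not the actual HDFS placement policy or AWS API: the node lists are plain names, and request_spot and request_on_demand are hypothetical callables supplied by the caller.

```python
def place_replicas(on_demand_nodes, spot_nodes, replication=3):
    """Choose target nodes for one block, keeping one replica on On-Demand."""
    if not on_demand_nodes:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    # The first replica always lands on an On-Demand node so it survives Spot loss.
    targets = [on_demand_nodes[0]]
    # Remaining replicas prefer Spot nodes, then any other On-Demand nodes.
    others = list(spot_nodes) + list(on_demand_nodes[1:])
    targets.extend(others[:replication - 1])
    return targets


def acquire_nodes(count, request_spot, request_on_demand, timeout_secs):
    """Ask for Spot nodes first and cover any shortfall with On-Demand nodes.

    request_spot(count, timeout_secs) is assumed to return the nodes granted
    before the timeout; request_on_demand(count) returns exactly count nodes.
    """
    spot = request_spot(count, timeout_secs)
    shortfall = count - len(spot)
    on_demand = request_on_demand(shortfall) if shortfall > 0 else []
    return spot + on_demand
```

If every Spot node disappears at once, each block placed this way still has its On-Demand replica.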
Agenda

• What is Qubole Data Service
• Hadoop as a Service in Cloud
• Hive as a Service in Cloud

Query History/Results

Cheap to Test
• Evaluate expressions on sample data
• Run query on sample

Fastest Hive SaaS
• Works with small files!
  – Faster split computation (8x)
  – Prefetching S3 files (30%)

• Stable JVM reuse!
  – Fix re-entrancy issues
  – 1.2-2x speedup

• Direct writes to S3
  – HIVE-1620

• Columnar cache
  – Use HDFS as cache for S3
  – Up to 5x faster for JSON data

• NEW – Multi-tenant Hive server

Questions?

@Qubole
Free Trial: www.qubole.com

Qubole hadoop-summit-2013-europe

  • 1. Cloud Friendly Hadoop & Hive Joydeep Sen Sarma Qubole
  • 2. Agenda  What is Qubole Data Service  Hadoop as a Service in Cloud  Hive as a Service in Cloud 2
  • 3. Qubole Data Service AWS EC2 3 AWS S3
  • 4. Qubole Data Service API Oozie Hive Pig Sqoop Hadoop AWS EC2 AWS S3
  • 5. Qubole Data Service API Vertica Oozie Hive Pig Sqoop Mysql Hadoop AWS EC2 5 S3://adco/logs AWS S3
  • 6. Qubole Data Service SDK ODBC Explore – Integrate – Analyze – Schedule API Vertica Oozie Hive Pig Sqoop Mysql Hadoop AWS EC2 6 6 AWS S3 S3://adco/logs
  • 7. Qubole Data Service SDK ODBC Explore – Integrate – Analyze – Schedule API Vertica Oozie Hive Pig Sqoop Mysql Hadoop AWS EC2 7 7 AWS S3 S3://adco/logs
  • 8. Agenda • What is Qubole Data Service • Hadoop as a Service in Cloud • Hive as a Service in Cloud 8
  • 10. Step 2: Fire Away (diagram labels: AdCo, Hadoop)
  • 11-12. Step 2: Fire Away (diagram labels: AdCo, Hadoop), with the query:
      select t.county, count(1) from (select transform(a.zip) using 'geo.py' as a.county from SMALL_TABLE a) t group by t.county;
  • 13-14. Step 2: Fire Away (diagram labels: AdCo, Hadoop), with the command and queries:
      hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
      select t.county, count(1) from (select transform(a.zip) using 'geo.py' as a.county from SMALL_TABLE a) t group by t.county;
      insert overwrite table dest select a.id, a.zip, count(distinct b.uid) from ads a join LARGE_TABLE b on (a.id=b.ad_id) group by a.id, a.zip;
  • 15-16. Step 2: Fire Away (diagram labels: AdCo, Hadoop), with the command:
      hadoop jar -Dmapred.min.split.size=32000000 myapp.jar -partitioner .org.apache…
  • 17. Step 2: Fire Away (diagram labels: AdCo, Hadoop)
  • 19. Hadoop as Service
      1. Detect when cluster is required
         – Not all Hive statements require cluster (EXPLAIN/SHOW/..)
      2. Atomically create cluster
         – Long running process, concurrency control using Mysql
      3. Shutdown when not in use
         – Do on hour boundary (whose?)
         – Not if User Sessions are active!
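
The three steps on slide 19 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Qubole's implementation: the statement list contains just the two types the slide names, `sqlite3` stands in for MySQL so the sketch is self-contained, the `cluster_lock` table and all function names are hypothetical, and the "hour boundary" is assumed to be measured from the cluster's launch time (the slide leaves "whose?" open).

```python
import sqlite3
from datetime import datetime, timedelta

# Assumption: only the two statement types named on the slide; the real list
# is longer (the slide ends it with "..").
METADATA_ONLY = ("EXPLAIN", "SHOW")


def needs_cluster(statement: str) -> bool:
    """Step 1: return False for statements that can run without a cluster."""
    words = statement.strip().split(None, 1)
    return bool(words) and words[0].upper() not in METADATA_ONLY


def claim_cluster_start(db: sqlite3.Connection, account: str) -> bool:
    """Step 2: return True for exactly one caller per account.

    The primary key on `account` makes the INSERT fail for every caller but
    the first, so only the winner goes on to the long-running launch.
    """
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO cluster_lock (account, state) VALUES (?, 'starting')",
                (account,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def should_shut_down(now: datetime, launched_at: datetime,
                     active_sessions: int, running_jobs: int,
                     window: timedelta = timedelta(minutes=5)) -> bool:
    """Step 3: shut down only when idle and close to the end of a paid hour."""
    if active_sessions > 0 or running_jobs > 0:
        return False
    into_hour = (now - launched_at) % timedelta(hours=1)
    return timedelta(hours=1) - into_hour <= window
```

A caller would run `needs_cluster` on each incoming statement, and only on `True` try `claim_cluster_start`; losers of that race wait for the winner's cluster instead of launching a second one.
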
  • 20. Hadoop as Service
      • Archive Job History/Logs to S3
         – Transparent access to Old jobs
      • Auto-Config different node types
         – Use ALL ephemeral drives for HDFS/MR
         – Use right number of slots per machine
      • Scrub, Scrub, Scrub
         – Bad Nodes, Bad Clusters, AWS timeouts
  • 21. Scaling Up (diagram labels: Slaves, Map Tasks, ReduceTasks, Job Tracker, Master, StarCluster, AWS)
  • 22-24. Scaling Up (same diagram), with the query:
      insert overwrite table dest select … from ads join campaigns on … group by …;
  • 25. Scaling Up (same diagram and query; added label: Progress)
  • 26-27. Scaling Up (same diagram and query; added labels: Progress, Supply, Demand)
  • 28-29. Scaling Up (same diagram and query; added label: Progress)
  • 30. Scaling Down
      1. On hour boundary – check if node is required:
         – Can't remove nodes with map-outputs (today)
         – Don't go below minimum cluster size
      2. Remove node from Map-Reduce Cluster
      3. Request HDFS Decommissioning – fast!
         – Delete affected cache files instead of re-replicating
         – One surviving replica and we are done.
      4. Delete Instance
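
One way to read the scale-down checks on slide 30 is sketched below in Python. All names (`Node`, `removable`, `plan_decommission`) are hypothetical. The sketch assumes the hour boundary is counted from each node's own launch time, and it interprets the fast decommissioning as: cache files on the node are deleted, blocks with another surviving replica need no copy, and only blocks with no other replica are copied. Real HDFS decommissioning is driven by the NameNode and is not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Node:
    name: str
    launched_at: datetime
    has_map_outputs: bool = False


def removable(node: Node, now: datetime, cluster_size: int, min_size: int,
              window: timedelta = timedelta(minutes=5)) -> bool:
    """Step 1: a node may go only near its hour boundary and if it is not needed."""
    if cluster_size <= min_size:
        return False  # don't go below minimum cluster size
    if node.has_map_outputs:
        return False  # reducers may still need to fetch these outputs
    into_hour = (now - node.launched_at) % timedelta(hours=1)
    return timedelta(hours=1) - into_hour <= window


def plan_decommission(node_name: str, blocks: dict) -> dict:
    """Step 3: decide what to do with each block held by the departing node.

    `blocks` maps a block id to (is_cache_file, set of node names holding it).
    """
    plan = {"delete": [], "copy": [], "no_action": []}
    for block_id, (is_cache_file, holders) in sorted(blocks.items()):
        if node_name not in holders:
            continue
        survivors = holders - {node_name}
        if is_cache_file:
            plan["delete"].append(block_id)     # cache can be rebuilt later
        elif survivors:
            plan["no_action"].append(block_id)  # one surviving replica is enough
        else:
            plan["copy"].append(block_id)       # last copy must move first
    return plan
```

Steps 2 and 4 (removing the node from the Map-Reduce cluster and deleting the instance) are calls into Hadoop and the cloud provider and are left out.
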
  • 31. Spot Instances: on average 50-60% cheaper
  • 32. Spot Instance: Challenges
      • Can lose Spot nodes anytime
         – Disastrous for HDFS
         – Hybrid Mode: Use mix of On-Demand and Spot
         – Hybrid Mode: Keep one replica in On-Demand nodes
      • Spot Instances may not be available
         – Timeout and use On-Demand nodes as fallback
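
The two mitigations on slide 32 can be illustrated with a small Python sketch. Everything here is hypothetical: `request_spot` and `request_on_demand` are placeholder callbacks rather than AWS API calls, and `place_replicas` only models the stated policy of keeping one replica on an On-Demand node; it is not how HDFS block placement is actually configured.

```python
import random


def acquire_nodes(count, request_spot, request_on_demand, timeout_s):
    """Ask for Spot nodes first; cover any shortfall with On-Demand nodes.

    `request_spot(count, timeout_s)` is assumed to return the list of node ids
    granted before the timeout (possibly fewer than requested), and
    `request_on_demand(count)` the list of On-Demand node ids.
    """
    spot = request_spot(count, timeout_s)
    missing = count - len(spot)
    on_demand = request_on_demand(missing) if missing > 0 else []
    return spot, on_demand


def place_replicas(on_demand, spot, replication=3):
    """Hybrid mode: pin the first replica to an On-Demand node.

    The remaining replicas may land on any other node, so losing every Spot
    node at once still leaves one copy of the block.
    """
    if not on_demand:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    first = random.choice(on_demand)
    others = [n for n in on_demand + spot if n != first]
    rest = random.sample(others, min(replication - 1, len(others)))
    return [first] + rest
```
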
  • 33. Agenda • What is Qubole Data Service • Hadoop as a Service in Cloud • Hive as a Service in Cloud
  • 35. Cheap to Test • Evaluate expressions on sample data
  • 36. Cheap to Test • Run Query on Sample
  • 37. Fastest Hive SaaS
      • Works with Small Files!
         – Faster Split Computation (8x)
         – Prefetching S3 files (30%)
  • 38. Fastest Hive SaaS (slide 37, adding:)
      • Stable JVM Reuse!
         – Fix re-entrancy issues
         – 1.2-2x speedup
  • 39. Fastest Hive SaaS (slide 38, adding:)
      • Direct writes to S3
         – HIVE-1620
  • 40. Fastest Hive SaaS (slide 39, adding:)
      • Columnar Cache
         – Use HDFS as cache for S3
         – Up to 5x faster for JSON data
  • 41. Fastest Hive SaaS (slide 40, adding:)
      • NEW: Multi-Tenant Hive Server
  • 42. Questions? @Qubole Free Trial: www.qubole.com