Qubole hadoop-summit-2013-europe

Cloud Friendly Hadoop & Hive

Joydeep Sen Sarma

Qubole

Agenda

• What is Qubole Data Service

• Hadoop as a Service in Cloud

• Hive as a Service in Cloud

Qubole Data Service

[Architecture diagram, built up over several slides. Labels: SDK; ODBC; Explore – Integrate – Analyze – Schedule; API; Oozie, Hive, Pig, Sqoop; Hadoop; Vertica; MySQL; AWS EC2; AWS S3 (S3://adco/logs)]

Agenda

• What is Qubole Data Service

• Hadoop as a Service in Cloud

• Hive as a Service in Cloud


Step 1 (Optional): Set Up Hadoop

Step 2: Fire Away

AdCo Hadoop

select t.county, count(1) from (select
transform(a.zip) using 'geo.py' as
county from SMALL_TABLE a) t
group by t.county;

hadoop jar myapp.jar -Dmapred.min.split.size=32000000
-partitioner .org.apache…

insert overwrite table dest
select a.id, a.zip, count(distinct b.uid)
from ads a join LARGE_TABLE b on (a.id=b.ad_id)
group by a.id, a.zip;
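The first query on the Step 2 slides streams each zip value through a script named geo.py. The slides do not show that script, so the following is only a minimal sketch of what such a Hive TRANSFORM script could look like. The lookup table `ZIP_TO_COUNTY` and the function names are hypothetical and the three entries are illustrative; the only part taken from Hive itself is the contract that rows arrive on stdin as tab-separated lines and results are written to stdout.

```python
import sys

# Hypothetical, tiny zip-to-county table used only for illustration.
# A real script would load a complete dataset instead.
ZIP_TO_COUNTY = {
    "94301": "Santa Clara",
    "94105": "San Francisco",
    "10001": "New York",
}


def zip_to_county(zip_code):
    """Return the county for a zip code, or 'UNKNOWN' if it is not in the table."""
    return ZIP_TO_COUNTY.get(zip_code.strip(), "UNKNOWN")


def transform(lines, out):
    """Read one tab-separated row per line and write one county per row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        out.write(zip_to_county(fields[0]) + "\n")


# As a standalone script invoked by Hive, the entry point would be:
#     transform(sys.stdin, sys.stdout)
```

Keeping the lookup in a plain function makes the script easy to test on a handful of rows before it is run against a full table.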

Come back anytime


Hadoop as a Service
1. Detect when cluster is required
– Not all Hive statements require a cluster (EXPLAIN/SHOW/..)

2. Atomically create cluster
– Long-running process, concurrency control using MySQL

3. Shut down when not in use
– Do this on an hour boundary (whose?)
– Not if user sessions are active!
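The three steps above can be sketched in a few functions. This is an illustration under stated assumptions, not Qubole's implementation: the statement list beyond EXPLAIN and SHOW is a guess, SQLite stands in for MySQL so the example runs with the standard library alone, the table name `cluster_claims` is hypothetical, and the hour is assumed to be measured from cluster launch (the slide leaves "whose?" open).

```python
import re
import sqlite3

# EXPLAIN and SHOW come from the slide; DESCRIBE is an added assumption.
METADATA_ONLY = re.compile(r"^\s*(explain|show|describe)\b", re.IGNORECASE)


def needs_cluster(statement):
    """Return True when a Hive statement is expected to launch Hadoop jobs."""
    return METADATA_ONLY.match(statement) is None


def open_claims_db(path=":memory:"):
    """Open the (hypothetical) claims table; the primary key enforces one claim per account."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cluster_claims (account_id TEXT PRIMARY KEY)"
    )
    return conn


def try_claim_cluster(conn, account_id):
    """Return True if this caller won the right to create the account's cluster."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO cluster_claims (account_id) VALUES (?)", (account_id,)
            )
        return True
    except sqlite3.IntegrityError:
        return False  # another caller is already creating the cluster


def should_shut_down(uptime_seconds, active_sessions, running_jobs, window_seconds=300):
    """Shut down only when idle and within window_seconds of the next hour boundary."""
    if active_sessions > 0 or running_jobs > 0:
        return False
    seconds_into_hour = uptime_seconds % 3600
    return seconds_into_hour >= 3600 - window_seconds
```

The unique-key insert is one common way to get an atomic "create once" decision from a relational database: of several concurrent callers, only the one whose insert succeeds goes on to start the long-running cluster launch.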


Hadoop as a Service
• Archive Job History/Logs to S3
– Transparent access to old jobs

• Auto-Config different node types
– Use ALL ephemeral drives for HDFS/MR
– Use right number of slots per machine

• Scrub, Scrub, Scrub
– Bad Nodes, Bad Clusters, AWS timeouts
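As a rough illustration of the auto-configuration bullet, the sketch below derives per-node settings from a node's cores, memory, and ephemeral mount points. The sizing rule (slots limited by cores and by memory, split roughly 2:1 between map and reduce) and the parameter names are assumptions made for this example; the slides do not state the rule actually used. The property names are the standard Hadoop 1.x ones.

```python
def node_config(cores, memory_mb, ephemeral_mounts, task_heap_mb=1024, reserved_mb=2048):
    """Return Hadoop 1.x settings for one node type (assumed sizing rule)."""
    # Slots are capped by both CPU and the memory left after daemons.
    by_memory = (memory_mb - reserved_mb) // task_heap_mb
    total_slots = max(2, min(cores, by_memory))
    map_slots = max(1, total_slots * 2 // 3)
    reduce_slots = max(1, total_slots - map_slots)
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
        # Spread HDFS and MapReduce scratch space over every ephemeral drive.
        "dfs.data.dir": ",".join(m + "/hdfs" for m in ephemeral_mounts),
        "mapred.local.dir": ",".join(m + "/mapred" for m in ephemeral_mounts),
    }
```

For example, `node_config(4, 15360, ["/mnt", "/mnt1"])` yields two map slots, two reduce slots, and data directories on both mounts.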


Scaling Up

[Diagram, built up over several slides. Labels: Slaves; Map Tasks; Reduce Tasks; Progress; Job Tracker; Master; StarCluster; Supply; Demand; AWS. Example query shown: insert overwrite table dest select … from ads join campaigns on … group by …;]
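The Scaling Up slides carry only diagram labels (Progress, Supply, Demand), so the decision rule itself is not recoverable from the text. One plausible reading is that the cluster grows when the task slots needed to finish pending work in a target time exceed the slots currently available. The sketch below shows that reading; the function name, parameters, and the ten-minute target are all hypothetical.

```python
import math


def nodes_to_add(pending_tasks, avg_task_seconds, slots_per_node,
                 current_nodes, max_nodes, target_seconds=600):
    """Return how many nodes to request so pending work finishes near the target time."""
    supply = current_nodes * slots_per_node  # task slots available now
    # Slots needed to run all pending task-seconds within the target window.
    demand = math.ceil(pending_tasks * avg_task_seconds / target_seconds)
    if demand <= supply:
        return 0
    wanted_nodes = math.ceil(demand / slots_per_node)
    return max(0, min(wanted_nodes, max_nodes) - current_nodes)
```

With 1000 pending tasks of 60 seconds each, 4 slots per node, 5 current nodes, and a cap of 20 nodes, demand is 100 slots against a supply of 20, so the function asks for 15 more nodes (the cap).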

Scaling Down
1. On hour boundary – check if node is required:
– Can’t remove nodes with map-outputs (today)
– Don’t go below minimum cluster size

2. Remove node from Map-Reduce Cluster

3. Request HDFS Decommissioning – fast!
– Delete affected cache files instead of re-replicating
– One surviving replica and we are done.

4. Delete Instance
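Step 1 of the list above can be sketched as a selection function that applies the two stated constraints: never remove a node that still holds map outputs, and never shrink below the minimum cluster size. The `Node` fields, the five-minute boundary window, and the extra "no running tasks" check are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    minutes_into_hour: int   # minutes since the node's current hour began
    has_map_outputs: bool    # map outputs still needed by running reducers
    running_tasks: int


def removable_nodes(nodes, min_cluster_size, boundary_window=5):
    """Return names of nodes that may be removed at this hour boundary."""
    budget = len(nodes) - min_cluster_size  # how many nodes we may drop at most
    chosen = []
    for node in nodes:
        if budget <= 0:
            break
        near_boundary = node.minutes_into_hour >= 60 - boundary_window
        idle = node.running_tasks == 0 and not node.has_map_outputs
        if near_boundary and idle:
            chosen.append(node.name)
            budget -= 1
    return chosen
```

Nodes selected this way would then go through steps 2 to 4: removal from the Map-Reduce cluster, HDFS decommissioning, and instance deletion.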

Spot Instances

On average, 50-60% cheaper

Spot Instance: Challenges
• Can lose Spot nodes anytime
– Disastrous for HDFS
– Hybrid Mode: Use mix of On-Demand and Spot
– Hybrid Mode: Keep one replica in On-Demand nodes

• Spot Instances may not be available
– Timeout and use On-Demand nodes as fallback
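Both mitigations above can be sketched briefly. The first function places block replicas so that at least one always lands on an On-Demand node; the second asks for Spot capacity and falls back to On-Demand for whatever is still missing after a timeout. The function names and the `request_spot` / `request_on_demand` callables are hypothetical stand-ins, not AWS or HDFS APIs.

```python
import random


def place_replicas(on_demand_nodes, spot_nodes, replication=3, rng=random):
    """Choose target nodes for one block, keeping one replica on On-Demand."""
    if not on_demand_nodes:
        raise ValueError("hybrid mode needs at least one On-Demand node")
    first = rng.choice(on_demand_nodes)  # the replica that survives Spot loss
    others = [n for n in on_demand_nodes + spot_nodes if n != first]
    rest = rng.sample(others, min(replication - 1, len(others)))
    return [first] + rest


def acquire_nodes(count, request_spot, request_on_demand, timeout_seconds=600):
    """Try Spot first; fill any shortfall with On-Demand nodes."""
    # request_spot is assumed to return the nodes granted within the timeout.
    nodes = list(request_spot(count, timeout_seconds))
    missing = count - len(nodes)
    if missing > 0:
        nodes.extend(request_on_demand(missing))
    return nodes
```

With this placement, losing every Spot node at once still leaves one copy of each block in HDFS.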


Agenda

• What is Qubole Data Service

• Hadoop as a Service in Cloud

• Hive as a Service in Cloud

Query History/Results


Cheap to Test

• Evaluate expressions on sample data

• Run Query on Sample

Fastest Hive SaaS
• Works with Small Files!
– Faster Split Computation (8x)
– Prefetching S3 files (30%)

• Stable JVM Reuse!
– Fix re-entrancy issues
– 1.2-2x speedup

• Direct writes to S3
– HIVE-1620

• Columnar Cache
– Use HDFS as cache for S3
– Up to 5x faster for JSON data

• NEW – Multi-Tenant Hive Server

Questions?

@Qubole
Free Trial: www.qubole.com
