February 2014 HUG: Hive on Tez

2. Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative
A broad, community-based effort to drive the next generation of Hive.

Stinger Project (announced February 2013)

Goals:
• Speed: Improve Hive query performance by 100x to allow for interactive query times (seconds).
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB.
• SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop… all IN Hadoop.

Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format

Hive 0.12, October 2013:
• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN

Coming Soon:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing

© Hortonworks Inc. 2013.
3. SQL: Enhancing SQL Semantics

Hive 0.12 provides a wide array of SQL data types and semantics so your existing tools integrate more seamlessly with Hadoop.

Hive SQL Datatypes (available in Hive 0.12):
INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY, MAP, STRUCT, UNION, DATE, VARCHAR
Roadmap: CHAR

Hive SQL Semantics (available in Hive 0.12):
• SELECT, INSERT
• GROUP BY, ORDER BY, SORT BY
• JOIN on explicit join key
• Inner, outer, cross and semi joins
• Sub-queries in FROM clause
• ROLLUP and CUBE
• UNION
• Windowing Functions (OVER, RANK, etc.)
• Custom Java UDFs
• Standard Aggregation (SUM, AVG, etc.)
• Advanced UDFs (ngram, XPath, URL)
Roadmap:
• Sub-queries in WHERE, HAVING
• Expanded JOIN Syntax
• SQL Compliant Security (GRANT, etc.)
• INSERT/UPDATE/DELETE (ACID)
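As a concrete taste of the windowing semantics listed above, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for Hive (SQLite 3.25+ supports the same OVER/RANK syntax; the `sales` table and its columns are made up for illustration):

```python
import sqlite3

# Stand-in for a Hive session: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("east", 300), ("west", 200)])

# RANK() over a per-region window, as in Hive 0.11+ analytics.
rows = conn.execute("""
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
""").fetchall()
print(rows)
```

The same query text runs unchanged in Hive, which is the point of the compliance push: existing SQL tools can emit standard analytic SQL.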
4. Stinger: Hive Performance

Feature | Description | Benefit
Tez Integration | Tez is a significantly better engine than MapReduce | Latency
Vectorized Query | Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time | Throughput
Query Planner | Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of the input (beyond partition pruning) | Latency
ORC File | Columnar, type-aware format with indices | Latency
Cost Based Optimizer (Optiq) | Join re-ordering and other optimizations based on column statistics, including histograms, etc. (future) | Latency
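The vectorized-query row of the table can be sketched in a few lines. This is an illustrative toy, not Hive's actual implementation: the idea is simply to evaluate a predicate over fixed-size blocks of values rather than one row at a time, keeping the inner loop tight and cache-friendly.

```python
# Toy sketch of block-at-a-time (vectorized) processing.
BLOCK_SIZE = 1024  # Hive's vectorized reader uses ~1024-row batches

def filter_rows(rows, threshold):
    # Row-at-a-time: one iterator step and branch per row.
    return [v for v in rows if v > threshold]

def filter_blocks(rows, threshold):
    # Block-at-a-time: a real engine runs a type-specialized loop
    # over a primitive array for each block.
    out = []
    for start in range(0, len(rows), BLOCK_SIZE):
        block = rows[start:start + BLOCK_SIZE]
        out.extend(v for v in block if v > threshold)
    return out

data = list(range(5000))
assert filter_blocks(data, 4000) == filter_rows(data, 4000)
```

In a JVM engine the win comes from tight loops over primitive arrays and better branch prediction, which Python cannot show; the sketch only illustrates the batching structure.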
5. Hive on Tez – Basics
• Hive/Tez integration in Hive 0.13 (Tez 0.2.0)
• Hive/Tez 0.3.0 available in branch
• Hive on MR works unchanged on Hadoop 1 and 2
• Hive on Tez is only available on Hadoop 2
• Turn on: hive.execution.engine=tez
• Hive looks and feels the same on both MR and Tez
  – CLI/JDBC/UDFs/SQL/Metastore
• Explain/reporting is similar to MR, but reflects differences in plan/execution
• Deployment is simple (Tez comes as a client-side library)
• Job client can submit MR via Tez
  – And emulate NES too (duck hunt anyone?)
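The engine switch mentioned above is an ordinary session-level Hive setting, so it can be toggled per query session:

```sql
-- Switch the current session to the Tez engine.
set hive.execution.engine=tez;

-- Fall back to classic MapReduce (works on Hadoop 1 and 2).
set hive.execution.engine=mr;
```

Because everything else (CLI, JDBC, UDFs, Metastore) is shared, this one property is the only change needed to move a workload between engines.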
6. Query 88
select *
from
(select count(*) h8_30_to_9
from store_sales
JOIN household_demographics ON store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
JOIN time_dim ON store_sales.ss_sold_time_sk = time_dim.t_time_sk
JOIN store ON store_sales.ss_store_sk = store.s_store_sk
where
time_dim.t_hour = 8
and time_dim.t_minute >= 30
and ((household_demographics.hd_dep_count = 3 and
household_demographics.hd_vehicle_count<=3+2) or
(household_demographics.hd_dep_count = 0 and
household_demographics.hd_vehicle_count<=0+2) or
(household_demographics.hd_dep_count = 1 and
household_demographics.hd_vehicle_count<=1+2))
and store.s_store_name = 'ese') s1 JOIN
(select count(*) h9_to_9_30
from store_sales
...
• 8 full table scans
7. Query 88: M/R
Total MapReduce jobs = 29
...
Total MapReduce CPU Time Spent: 0 days 2 hours 52 minutes 39 seconds 380 msec
OK
345617 687625 686131 1032842 1030364 606859 604232 692428
Time taken: 403.28 seconds, Fetched: 1 row(s)
8. Query 88: Tez
Map 1: 1/1
Map 11: 1/1
Map 12: 1/1
Map 14: 1/1
Map 15: 1/1
Map 16: 241/241
Map 19: 1/1
Map 2: 1/1
Map 20: 1/1
Map 22: 1/1
Map 23: 1/1
Map 24: 241/241
Map 27: 1/1
Map 28: 1/1
Map 29: 1/1
Map 30: 240/240 Map 32: 241/241 Map 34: 1/1
Map 36: 1/1
Map 37: 1/1
Map 38: 1/1
Map 42: 1/1
Map 43: 1/1
Map 44: 240/240
Reducer 10: 1/1 Reducer 17: 1/1 Reducer 25: 1/1
Reducer 33: 1/1 Reducer 4: 1/1 Reducer 40: 1/1
Reducer 45: 1/1 Reducer 47: 1/1 Reducer 5: 1/1
Reducer 7: 1/1 Reducer 8: 1/1 Reducer 9: 1/1
Status: Finished successfully
OK
345617 687625 686131 1032842 1030364 606859
Time taken: 90.233 seconds, Fetched: 1 row(s)
© Hortonworks Inc. 2013.
Map 13: 1/1
Map 18: 1/1
Map 21: 1/1
Map 26: 1/1
Map 3: 241/241
Map 35: 1/1
Map 39: 241/241
Map 46: 241/241
Reducer 31: 1/1
Reducer 41: 1/1
Reducer 6: 1/1
604232
692428
Page 8
9. HIVE-4660
Hive-on-MR vs. Hive-on-Tez

SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
ON (g1.x = g2.x)
ORDER BY avg;

[Diagram: the Hive-on-MR plan runs each step (GROUP a BY a.x, GROUP b BY b.x, JOIN (a,b), ORDER BY) as a separate MapReduce job and writes every intermediate result to HDFS; the Hive-on-Tez plan runs the same operators as a single DAG, streaming data between vertices. Tez avoids unnecessary writes to HDFS.]
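To make the query shape concrete (two group-bys feeding a join, then a global sort), here is the HIVE-4660 example run on Python's sqlite3 as a stand-in; the tables `a` and `b` and their contents are invented for illustration:

```python
import sqlite3

# Two tiny relations with the same shape as the slide's example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INTEGER, y REAL)")
conn.execute("CREATE TABLE b (x INTEGER, y REAL)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(1, 2.0), (1, 4.0), (2, 10.0)])
conn.executemany("INSERT INTO b VALUES (?, ?)", [(1, 7.0), (2, 8.0), (2, 9.0)])

# Same plan shape: GROUP BY on each side, JOIN, then ORDER BY.
rows = conn.execute("""
    SELECT g1.x, g1.avg, g2.cnt
    FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
    JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
      ON (g1.x = g2.x)
    ORDER BY avg
""").fetchall()
print(rows)
```

On MR each of those sub-results would hit HDFS before the next job; on Tez the group-by vertices feed the join vertex directly, which is the whole point of the diagram.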
10. Hive on Tez – Execution

Feature | Description | Benefit
Tez Session | Overcomes MapReduce job-launch latency by pre-launching the Tez AppMaster | Latency
Tez Container Pre-Launch | Overcomes MapReduce latency by pre-launching hot containers ready to serve queries | Latency
Tez Container Re-Use | Finished maps and reduces pick up more work rather than exiting; reduces latency and eliminates difficult split-size tuning. Out-of-the-box performance! | Latency
Runtime re-configuration of DAG | Runtime query tuning by picking aggregation parallelism using online query statistics | Throughput
Tez In-Memory Cache | Hot data kept in RAM for fast access | Latency
Complex DAGs | Tez Broadcast Edge and the Map-Reduce-Reduce pattern improve query scale and throughput | Throughput
11. Pipelined Splits
Reduce start-up latency by launching tasks as soon as they are ready.
hive.orc.compute.splits.num.threads=10
tez.am.grouping.split-waves=1.4
[Diagram: the Tez AM hands splits to map tasks as they are computed, instead of waiting for all split computation to finish.]
12. Pipelined Early Exit (Limit)
hive.orc.compute.splits.num.threads=10
tez.am.grouping.split-waves=1.4
[Diagram: with a LIMIT, a map task reports "Done!" and the query exits early once enough rows have been produced.]
13. HIVE-5775
Statistics and Cost-based Optimization
• Statistics:
  – Hive has table- and column-level statistics
  – Used to determine parallelism, join selection
• Optiq: open source, Apache-licensed query execution framework in Java
  – Used by Apache Drill, Apache Cascade, Lucene DB, …
  – Based on the Volcano paper
  – 20 man-years of development, more than 50 optimization rules
• Goals for Hive:
  – Ease of use: no manual tuning for queries; make choices automatically based on cost
  – View chaining / ad hoc queries involving multiple views
  – Help enable BI tools front-ending Hive
  – Emphasis on latency reduction
• Cost computation will be used for:
  – Join ordering
  – Join algorithm selection
  – Tez vertex boundary selection
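A toy illustration of the join-ordering item: pick the join order whose estimated intermediate results are smallest, given row counts and a join selectivity. The table names echo the TPC-DS query earlier in the deck, but the row counts and the single flat selectivity are entirely made up; a real optimizer like Optiq uses per-join selectivities derived from column statistics.

```python
from itertools import permutations

# Made-up cardinalities and one made-up flat join selectivity.
ROWS = {"store_sales": 10_000_000, "time_dim": 86_400, "store": 1_000}
SELECTIVITY = 1e-6

def plan_cost(order):
    # Cost model: total rows materialized by intermediate joins.
    cost, current = 0.0, ROWS[order[0]]
    for table in order[1:]:
        current = current * ROWS[table] * SELECTIVITY  # est. join output
        cost += current
    return cost

# Exhaustive search is fine for 3 tables; real optimizers prune.
best = min(permutations(ROWS), key=plan_cost)
print(best)
```

The cheapest plans join the small dimensions first and bring in the huge fact table last, which is exactly the behavior cost-based join re-ordering is meant to produce automatically.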
14. Broadcast Join
• Similar to map-join, without the need to build a hash table on the client
• Will work with any level of sub-query nesting
• Uses stats to determine if applicable
• How it works:
  – Broadcast result set is computed in parallel on the cluster
  – Join processors are spun up in parallel
  – Broadcast set is streamed to the join processors
  – Join processors build hash tables
  – The other relation is joined with the hash table
• Tez handles:
  – Best parallelism
  – Best data transfer of the hashed relation
  – Best scheduling to avoid latencies
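The "How it works" steps can be sketched as plain Python (illustrative only; Tez distributes this across many join processors, and the table/column choices below are invented):

```python
from collections import defaultdict

def broadcast_hash_join(small, large, key_small, key_large):
    # Each join processor builds a hash table from the broadcast set.
    table = defaultdict(list)
    for row in small:
        table[row[key_small]].append(row)
    # The other (large) relation streams through and probes the table.
    for row in large:
        for match in table.get(row[key_large], ()):
            yield match + row

dims = [(1, "store_a"), (2, "store_b")]    # small, broadcast side
facts = [(1, 9.99), (2, 4.50), (1, 1.25)]  # large, streamed side
result = list(broadcast_hash_join(dims, facts, 0, 0))
print(result)
```

Note the large side is never materialized: it is consumed as a stream, which is why the broadcast side must be the one small enough to hash.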
15. 1-1 Edge
• Typical star-schema queries involve joins between a large number of tables
• Dimensions aren’t always tiny (e.g. the Customer dimension)
• Might not be able to handle all dimensions in a single vertex as broadcast joins
• Tez allows streaming records from one processor to the next via a 1-1 edge
  – Transfer details (streaming, files, etc.) are handled transparently
  – Scheduling/cluster capacity is worked out by Tez
• Allows Hive to build a pipeline of in-memory joins which we can stream records through
16. Dynamically Partitioned Hash Join
• Kicks in when the large table is bucketed
  – Bucketed table
  – Dynamic as part of query processing
• Uses a custom edge to match the partitioning on the smaller table
• Allows hash joins in cases where the broadcast set would be too large
• Tez gives us the option of building custom edges and vertex managers
  – Fine-grained control over how the data is replicated and partitioned
  – Scheduling and actual data transfer are handled by Tez
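A minimal sketch of the idea, with invented data: the large table is already bucketed by join key, the smaller table is partitioned the same way at query time, and each task then hash-joins one bucket pair instead of broadcasting the whole small table everywhere.

```python
from collections import defaultdict

def partition(rows, key, n_buckets):
    # Same bucketing function on both sides so matching keys collide.
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % n_buckets].append(row)
    return buckets

N = 4
big_buckets = partition([(1, "a"), (2, "b"), (5, "c")], 0, N)  # pre-bucketed table
small_buckets = partition([(1, 10), (5, 50)], 0, N)            # partitioned at runtime

result = []
for b in range(N):  # in Tez, each bucket pair would be one task
    table = {row[0]: row for row in small_buckets.get(b, [])}
    for row in big_buckets.get(b, []):
        if row[0] in table:
            result.append(row + table[row[0]])
print(sorted(result))
```

Each task only needs the hash table for its own bucket of the smaller side, so the join stays in memory even when the full small table would not fit.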
17. Shuffle Join and Grouping Without Sort
• In the MR model, joins, grouping and window functions are typically mapped onto a shuffle phase
• That means sorting will be used to implement the operator
• With Tez it’s easy to switch out the implementation
  – Decouples partitioning, algorithm and transfer of the data
  – The nitty-gritty details are still abstracted away (data movement, scheduling, parallelism)
• With the right statistics, this lets Hive pick the right implementation for the query at hand
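The point is that grouping does not inherently require a sort. A hash-based aggregation keeps one accumulator per key in a single pass, where MR's shuffle would first sort all rows by key; sketched in Python (illustrative, not Hive code):

```python
from collections import defaultdict

def hash_group_sum(rows):
    # One accumulator per key; the input is never sorted.
    sums = defaultdict(int)
    for key, value in rows:
        sums[key] += value
    return dict(sums)

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
print(hash_group_sum(rows))
```

Sort-based grouping still wins when the data is pre-sorted or spills to disk; the Tez advantage is that the engine can choose either, per query, instead of always paying for the sort.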
18. Union All
• Common operation in decision-support queries
• Caused additional no-op stages in MR plans
  – The last stage spins up a multi-input mapper to write the result
  – Intermediate unions have to be materialized before additional processing
• Tez has a union that handles these cases transparently, without any intermediate steps
19. Multi-Insert Queries
• Allow the same input to be split and written to different tables or partitions
  – Avoids duplicate scans/processing
  – Useful for ETL
  – Similar to “splits” in Pig
• In MR, a “split” in the operator pipeline has to be written to HDFS and processed by multiple additional MR jobs
• Tez allows sending the multiple outputs directly to downstream processors
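For reference, Hive's multi-insert form puts the shared FROM clause first so the source is scanned once; the table and column names below are illustrative:

```sql
-- One scan of page_views feeds both outputs.
FROM page_views pv
INSERT OVERWRITE TABLE errors_by_url
  SELECT pv.url, pv.ts WHERE pv.status >= 500
INSERT OVERWRITE TABLE views_by_user
  SELECT pv.userid, pv.url WHERE pv.userid IS NOT NULL;
```

On Tez the two INSERT branches become downstream vertices of the single scan; on MR each branch beyond the first costs an HDFS materialization plus extra jobs.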
20. Putting It All Together
TPC-DS Query 27 with Hive on Tez. Callouts from the diagram:
• Tez Session populates the container pool
• Dimension table calculation and HDFS split generation run in parallel
• Dimension tables are broadcast to Hive MapJoin tasks
• The final reducer is pre-launched and fetches completed inputs
21. TPC-DS 10 TB Query Times (Tuned Queries)
Data: 10 TB data loaded into ORCFile using defaults.
Hardware: 20 nodes: 24 CPU threads, 64 GB RAM, 5 local SATA drives each.
Queries: TPC-DS queries with partition filters added.
22. TPC-DS 30 TB Query Times (Tuned Queries)
Data: 30 TB data loaded into ORCFile using defaults.
Hardware: 14 nodes: 2x Intel E5-2660 v2, 256GB RAM, 23x 7.2kRPM drives each.
Queries: TPC-DS queries with partition filters added.
23. Tez Concurrency Over 30 TB Dataset
• Test:
– Launch 20 queries concurrently and measure total time to completion.
– Used the 20 tuned TPC-DS queries from previous slide.
– Each query used once.
– Queries hit multiple different fact tables with different access patterns.
• Results:
– The 20 queries finished within 27.5 minutes.
– Average 82.65 seconds per query.
• Data and Hardware details:
– 30 Terabytes of data loaded into ORCFile using all defaults.
– Hardware: 14 nodes: 2x Intel E5-2660 v2, 256GB RAM, 23x 7.2kRPM drives each.
Editor's Notes: With Hive and Stinger we are focused on enabling the SQL ecosystem, and to do that we’ve put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types, as well as implementing common SQL semantics seen in most databases.