SlideShare a Scribd company logo
1 of 23
Hive on Tez

Gunther Hagleitner
(gunther@apache.org)

© Hortonworks Inc. 2013.
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative

A broad, community-based effort to
drive the next generation of HIVE

Stinger Project
(announced February 2013)
Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format

Goals:

Speed
Improve Hive query performance by 100X to
allow for interactive query times (seconds)

Scale
The only SQL interface to Hadoop designed
for queries that scale from TB to PB

Hive 0.12, October 2013:
•
•
•
•

VARCHAR, DATE Types
ORCFile predicate pushdown
Advanced Optimizations
Performance Boosts via YARN

Coming Soon:

SQL
Support broadest range of SQL semantics for
analytic applications running against Hadoop

…all IN Hadoop
© Hortonworks Inc. 2013.

•
•
•
•
•

Hive on Apache Tez
Query Service
Buffer Cache
Cost Based Optimizer (Optiq)
Vectorized Processing
SQL: Enhancing SQL Semantics
Hive SQL Datatypes

Hive SQL Semantics

SQL Compliance

INT

SELECT, INSERT

TINYINT/SMALLINT/BIGINT

GROUP BY, ORDER BY, SORT BY

BOOLEAN

JOIN on explicit join key

FLOAT

Inner, outer, cross and semi joins

DOUBLE

Sub-queries in FROM clause

Hive 12 provides a wide
array of SQL data types
and semantics so your
existing tools integrate
more seamlessly with
Hadoop

STRING

ROLLUP and CUBE

TIMESTAMP

UNION

BINARY

Windowing Functions (OVER, RANK, etc)

DECIMAL

Custom Java UDFs

ARRAY, MAP, STRUCT, UNION

Standard Aggregation (SUM, AVG, etc.)

DATE

Advanced UDFs (ngram, Xpath, URL)

VARCHAR

Sub-queries in WHERE, HAVING

CHAR

Expanded JOIN Syntax
SQL Compliant Security (GRANT, etc.)
INSERT/UPDATE/DELETE (ACID)
© Hortonworks Inc. 2013.

Available
Hive 0.12
Roadmap
Stinger: Hive performance

Feature

Description

Benefit

Tez Integration

Tez is significantly better engine than MapReduce

Latency

Vectorized Query

Take advantage of modern hardware by processing
thousand-row blocks rather than row-at-a-time.

Throughput

Query Planner
ORC File

Using extensive statistics now available in Metastore
to better plan and optimize query, including
predicate pushdown during compilation to eliminate
portions of input (beyond partition pruning)

Latency

Columnar, type aware format with indices

Latency

Cost Based Optimizer Join re-ordering and other optimizations based on
(Optiq)
column statistics including histograms etc. (future)

© Hortonworks Inc. 2013.

Latency

Page 4
Hive on Tez – Basics
•
•
•
•
•
•

Hive/Tez integration in Hive 0.13 (Tez 0.2.0)
Hive/Tez 0.3.0 available in branch
Hive on MR works unchanged on hadoop 1 and 2
Hive on Tez is only available on hadoop 2
Turn on: hive.execution.engine=tez
Hive looks and feels the same on both MR and Tez
– CLI/JDBC/UDFs/SQL/Metastore

• Explain/reporting is similar to MR, but reflects differences in
plan/execution
• Deployment is simple (Tez comes as a client-side library)
• Job client can submit MR via Tez
– And emulate NES too (duck hunt anyone?)

© Hortonworks Inc. 2013.

Page 5
Query 88
select *
from
(select count(*) h8_30_to_9
from store_sales
JOIN household_demographics ON store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
JOIN time_dim ON store_sales.ss_sold_time_sk = time_dim.t_time_sk
JOIN store ON store_sales.ss_store_sk = store.s_store_sk
where
time_dim.t_hour = 8
and time_dim.t_minute >= 30
and ((household_demographics.hd_dep_count = 3 and
household_demographics.hd_vehicle_count<=3+2) or
(household_demographics.hd_dep_count = 0 and
household_demographics.hd_vehicle_count<=0+2) or
(household_demographics.hd_dep_count = 1 and
household_demographics.hd_vehicle_count<=1+2))
and store.s_store_name = 'ese') s1 JOIN
(select count(*) h9_to_9_30
from store_sales
...

• 8 full table scans

© Hortonworks Inc. 2013.

Page 6
Query 88: M/R

Total MapReduce jobs = 29
...
Total MapReduce CPU Time Spent: 0 days 2 hours 52 minutes 39
seconds 380 msec
OK
345617 687625 686131 1032842 1030364 606859 604232 692428
Time taken: 403.28 seconds, Fetched: 1 row(s)

© Hortonworks Inc. 2013.

Page 7
Query 88: Tez
Map 1: 1/1
Map 11: 1/1
Map 12: 1/1
Map 14: 1/1
Map 15: 1/1
Map 16: 241/241
Map 19: 1/1
Map 2: 1/1
Map 20: 1/1
Map 22: 1/1
Map 23: 1/1
Map 24: 241/241
Map 27: 1/1
Map 28: 1/1
Map 29: 1/1
Map 30: 240/240 Map 32: 241/241 Map 34: 1/1
Map 36: 1/1
Map 37: 1/1
Map 38: 1/1
Map 42: 1/1
Map 43: 1/1
Map 44: 240/240
Reducer 10: 1/1 Reducer 17: 1/1 Reducer 25: 1/1
Reducer 33: 1/1 Reducer 4: 1/1 Reducer 40: 1/1
Reducer 45: 1/1 Reducer 47: 1/1 Reducer 5: 1/1
Reducer 7: 1/1 Reducer 8: 1/1 Reducer 9: 1/1
Status: Finished successfully
OK
345617 687625 686131 1032842 1030364 606859
Time taken: 90.233 seconds, Fetched: 1 row(s)

© Hortonworks Inc. 2013.

Map 13: 1/1
Map 18: 1/1
Map 21: 1/1
Map 26: 1/1
Map 3: 241/241
Map 35: 1/1
Map 39: 241/241
Map 46: 241/241
Reducer 31: 1/1
Reducer 41: 1/1
Reducer 6: 1/1

604232

692428

Page 8
HIVE-4660

Hive-on-MR vs. Hive-on-Tez
SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVERAGE(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS avg FROM b GROUP BY b.x) g2
ON (g1.x = g2.x)
ORDER BY avg;

Hive – MR
M

M

Hive – Tez
GROUP b BY b.x

M

M

GROUP a BY a.x
R

Tez avoids
unnecessary writes
to HDFS

GROUP BY x

M
M

R

M

M

M

R
R

HDFS

JOIN (a,b)

HDFS

M

GROUP BY a.x
JOIN (a,b)
R

M

R

R

ORDER BY
HDFS

M

ORDER BY
R

© Hortonworks Inc. 2013.

R

M
Hive on Tez - Execution
Feature
Tez Session
Tez Container PreLaunch

Description
Overcomes Map-Reduce job-launch latency by prelaunching Tez AppMaster

Latency

Overcomes Map-Reduce latency by pre-launching
hot containers ready to serve queries.

Latency

Finished maps and reduces pick up more work
Tez Container Re-Use rather than exiting. Reduces latency and eliminates
difficult split-size tuning. Out of box performance!
Runtime reRuntime query tuning by picking aggregation
configuration of DAG parallelism using online query statistics
Tez In-Memory Cache Hot data kept in RAM for fast access.
Complex DAGs
© Hortonworks Inc. 2013.

Benefit

Tez Broadcast Edge and Map-Reduce-Reduce
pattern improve query scale and throughput.

Latency

Throughput

Latency
Throughput
Page 10
Pipelined Splits
Reduce start-up latency by launching tasks as soon as they are ready

hive.orc.compute.splits.num.threads=10

Map task
Tez AM
tez.am.grouping.split-waves=1.4

© Hortonworks Inc. 2013.
Pipelined Early Exit (Limit)
hive.orc.compute.splits.num.threads=10
Map task

Done!
tez.am.grouping.split-waves=1.4

© Hortonworks Inc. 2013.
HIVE-5775

Statistics and Cost-based optimization
• Statistics:
– Hive has table and column level statistics
– Used to determine parallelism, join selection

• Optiq: Open source, Apache licensed query execution framework in Java
– Used by Apache Drill, Apache Cascade, Lucene DB, …
– Based on Volcano paper
– 20 man years dev, more than 50 optimization rules

• Goals for hive
–
–
–
–

Ease of Use – no manual tuning for queries, make choices automatically based on cost
View Chaining/Ad hoc queries involving multiple views
Help enable BI Tools front-ending Hive
Emphasis on latency reduction

• Cost computation will be used for
 Join ordering
 Join algorithm selection
 Tez vertex boundary selection

© Hortonworks Inc. 2013.

Page 13
Broadcast Join
• Similar to map-join w/o the need to build a hash table on the
client
• Will work with any level of sub-query nesting
• Uses stats to determine if applicable
• How it works:
– Broadcast result set is computed in parallel on the cluster
– Join processor are spun up in parallel
– Broadcast set is streamed to join processor
– Join processors build hash table
– Other relation is joined with hashtable

• Tez handles:
– Best parallelism
– Best data transfer of the hashed relation
– Best scheduling to avoid latencies

© Hortonworks Inc. 2013.
1-1 Edge
• Typical star schema join involve join between large number of
tables
• Dimension aren’t always tiny (Customer dimension)
• Might not be able to handle all dimensions in single vertex as
broadcast joins
• Tez allows streaming records from one processor to the next via
a 1-1 Edge
– Transfer details (streaming, files, etc) are handled transparently
– Scheduling/cluster capacity is worked out by Tez

• Allows hive to build a pipeline of in memory joins which we can
stream records through

© Hortonworks Inc. 2013.
Dynamically partitioned Hash join
• Kicks in when large table is bucketed
– Bucketed table
– Dynamic as part of query processing

• Will use custom edge to match the partitioning on the smaller
table
• Allows hash-join in cases where broadcast would be too large
• Tez gives us the option of building custom edges and vertex
managers
– Fine grained control over how the data is replicated and partitioned
– Scheduling and actual data transfer is handled by Tez

© Hortonworks Inc. 2013.
Shuffle join and Grouping w/o sort
• In the MR model joins, grouping and window functions are
typically mapped onto a shuffle phase
• That means sorting will be used to implement the operator
• With Tez it’s easy to switch out the implementation
– Decouples partitioning, algorithm and transfer of the data
– The nitty-gritty details are still abstracted away (data movement,
scheduling, parallelism)

• With the right statistics that let’s hive pick the right
implementation for the query at hand

© Hortonworks Inc. 2013.
Union all
• Common operation in decision support queries
• Caused additional no-op stages in MR plans
– Last stage spins up multi-input mapper to write result
– Intermediate unions have to be materialized before additional processing

• Tez has union that handles these cases transparently w/o any
intermediate steps

© Hortonworks Inc. 2013.
Multi-insert queries
• Allows the same input to be split and written to different tables
or partitions
– Avoids duplicate scans/processing
– Useful for ETL
– Similar to “Splits” in PIG

• In MR a “split” in the operator pipeline has to be written to HDFS
and processed by multiple additional MR jobs
• Tez allows to send the mulitple outputs directly to downstream
processors

© Hortonworks Inc. 2013.
Putting it all together
Tez Session populates
container pool

Dimension table
calculation and HDFS
split generation in
parallel

Dimension tables
broadcasted to Hive
MapJoin tasks

Final Reducer prelaunched and fetches
completed inputs

TPCDS – Query-27 with Hive on Tez
Architecting the Future of Big Data
© Hortonworks Inc. 2013

Page 20
TPC-DS 10 TB Query Times (Tuned Queries)

Data: 10 TB data loaded into ORCFile using defaults.
Hardware: 20 nodes: 24 CPU threads, 64 GB RAM, 5 local SATA drives each.
Queries: TPC-DS queries with partition filters added.
© Hortonworks Inc. 2013.

Page 21
TPC-DS 30 TB Query Times (Tuned Queries)

Data: 30 TB data loaded into ORCFile using defaults.
Hardware: 14 nodes: 2x Intel E5-2660 v2, 256GB RAM, 23x 7.2kRPM drives each.
Queries: TPC-DS queries with partition filters added.
© Hortonworks Inc. 2013.

Page 22
Tez Concurrency Over 30 TB Dataset
• Test:
– Launch 20 queries concurrently and measure total time to completion.
– Used the 20 tuned TPC-DS queries from previous slide.
– Each query used once.
– Queries hit multiple different fact tables with different access patterns.

• Results:
– The 20 queries finished within 27.5 minutes.
– Average 82.65 seconds per query.

• Data and Hardware details:
– 30 Terabytes of data loaded into ORCFile using all defaults.
– Hardware: 14 nodes: 2x Intel E5-2660 v2, 256GB RAM, 23x 7.2kRPM drives each.

© Hortonworks Inc. 2013.

Page 23

More Related Content

What's hot

Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
DataWorks Summit
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2
t3rmin4t0r
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
DataWorks Summit
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
Modern Data Stack France
 

What's hot (20)

Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Tune up Yarn and Hive
Tune up Yarn and HiveTune up Yarn and Hive
Tune up Yarn and Hive
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Yahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at ScaleYahoo's Experience Running Pig on Tez at Scale
Yahoo's Experience Running Pig on Tez at Scale
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
Performance Hive+Tez 2
Performance Hive+Tez 2Performance Hive+Tez 2
Performance Hive+Tez 2
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Achieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on TezAchieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on Tez
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processingHave your Cake and Eat it Too - Architecture for Batch and Real-time processing
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
 
Syncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScoreSyncsort et le retour d'expérience ComScore
Syncsort et le retour d'expérience ComScore
 

Viewers also liked

apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
Thejas Nair
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Yahoo Developer Network
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
DataWorks Summit
 

Viewers also liked (6)

Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your ApplicationHadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
Hadoop Summit 2010 Tuning Hadoop To Deliver Performance To Your Application
 
Apache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data ProcessingApache Tez - A unifying Framework for Hadoop Data Processing
Apache Tez - A unifying Framework for Hadoop Data Processing
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Optimizing MapReduce Job performance
Optimizing MapReduce Job performanceOptimizing MapReduce Job performance
Optimizing MapReduce Job performance
 

Similar to February 2014 HUG : Hive On Tez

Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
alanfgates
 
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterApril 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
Yahoo Developer Network
 

Similar to February 2014 HUG : Hive On Tez (20)

La big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixitLa big datacamp2014_vikram_dixit
La big datacamp2014_vikram_dixit
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in Hive
 
Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times FasterApril 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
April 2013 HUG: The Stinger Initiative - Making Apache Hive 100 Times Faster
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
Stinger Initiative: Leveraging Hive & Yarn for High-Performance/Interactive Q...
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014Big SQL 3.0 - Toronto Meetup -- May 2014
Big SQL 3.0 - Toronto Meetup -- May 2014
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019Hive Performance Dataworks Summit Melbourne February 2019
Hive Performance Dataworks Summit Melbourne February 2019
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
 
Apache Tez – Present and Future
Apache Tez – Present and FutureApache Tez – Present and Future
Apache Tez – Present and Future
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 

More from Yahoo Developer Network

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
Yahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
Yahoo Developer Network
 

More from Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Recently uploaded

1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
SanaAli374401
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
MateoGardella
 

Recently uploaded (20)

Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 

February 2014 HUG : Hive On Tez

  • 1. Hive on Tez Gunther Hagleitner (gunther@apache.org) © Hortonworks Inc. 2013.
  • 2. Batch AND Interactive SQL-IN-Hadoop Stinger Initiative A broad, community-based effort to drive the next generation of HIVE Stinger Project (announced February 2013) Hive 0.11, May 2013: • Base Optimizations • SQL Analytic Functions • ORCFile, Modern File Format Goals: Speed Improve Hive query performance by 100X to allow for interactive query times (seconds) Scale The only SQL interface to Hadoop designed for queries that scale from TB to PB Hive 0.12, October 2013: • • • • VARCHAR, DATE Types ORCFile predicate pushdown Advanced Optimizations Performance Boosts via YARN Coming Soon: SQL Support broadest range of SQL semantics for analytic applications running against Hadoop …all IN Hadoop © Hortonworks Inc. 2013. • • • • • Hive on Apache Tez Query Service Buffer Cache Cost Based Optimizer (Optiq) Vectorized Processing
  • 3. SQL: Enhancing SQL Semantics Hive SQL Datatypes Hive SQL Semantics SQL Compliance INT SELECT, INSERT TINYINT/SMALLINT/BIGINT GROUP BY, ORDER BY, SORT BY BOOLEAN JOIN on explicit join key FLOAT Inner, outer, cross and semi joins DOUBLE Sub-queries in FROM clause Hive 12 provides a wide array of SQL data types and semantics so your existing tools integrate more seamlessly with Hadoop STRING ROLLUP and CUBE TIMESTAMP UNION BINARY Windowing Functions (OVER, RANK, etc) DECIMAL Custom Java UDFs ARRAY, MAP, STRUCT, UNION Standard Aggregation (SUM, AVG, etc.) DATE Advanced UDFs (ngram, Xpath, URL) VARCHAR Sub-queries in WHERE, HAVING CHAR Expanded JOIN Syntax SQL Compliant Security (GRANT, etc.) INSERT/UPDATE/DELETE (ACID) © Hortonworks Inc. 2013. Available Hive 0.12 Roadmap
  • 4. Stinger: Hive performance Feature Description Benefit Tez Integration Tez is significantly better engine than MapReduce Latency Vectorized Query Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. Throughput Query Planner ORC File Using extensive statistics now available in Metastore to better plan and optimize query, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) Latency Columnar, type aware format with indices Latency Cost Based Optimizer Join re-ordering and other optimizations based on (Optiq) column statistics including histograms etc. (future) © Hortonworks Inc. 2013. Latency Page 4
  • 5. Hive on Tez – Basics • • • • • • Hive/Tez integration in Hive 0.13 (Tez 0.2.0) Hive/Tez 0.3.0 available in branch Hive on MR works unchanged on hadoop 1 and 2 Hive on Tez is only available on hadoop 2 Turn on: hive.execution.engine=tez Hive looks and feels the same on both MR and Tez – CLI/JDBC/UDFs/SQL/Metastore • Explain/reporting is similar to MR, but reflects differences in plan/execution • Deployment is simple (Tez comes as a client-side library) • Job client can submit MR via Tez – And emulate NES too (duck hunt anyone?) © Hortonworks Inc. 2013. Page 5
  • 6. Query 88 select * from (select count(*) h8_30_to_9 from store_sales JOIN household_demographics ON store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk JOIN time_dim ON store_sales.ss_sold_time_sk = time_dim.t_time_sk JOIN store ON store_sales.ss_store_sk = store.s_store_sk where time_dim.t_hour = 8 and time_dim.t_minute >= 30 and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count<=3+2) or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count<=0+2) or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count<=1+2)) and store.s_store_name = 'ese') s1 JOIN (select count(*) h9_to_9_30 from store_sales ... • 8 full table scans © Hortonworks Inc. 2013. Page 6
  • 7. Query 88: M/R Total MapReduce jobs = 29 ... Total MapReduce CPU Time Spent: 0 days 2 hours 52 minutes 39 seconds 380 msec OK 345617 687625 686131 1032842 1030364 606859 604232 692428 Time taken: 403.28 seconds, Fetched: 1 row(s) © Hortonworks Inc. 2013. Page 7
  • 8. Query 88: Tez Map 1: 1/1 Map 11: 1/1 Map 12: 1/1 Map 14: 1/1 Map 15: 1/1 Map 16: 241/241 Map 19: 1/1 Map 2: 1/1 Map 20: 1/1 Map 22: 1/1 Map 23: 1/1 Map 24: 241/241 Map 27: 1/1 Map 28: 1/1 Map 29: 1/1 Map 30: 240/240 Map 32: 241/241 Map 34: 1/1 Map 36: 1/1 Map 37: 1/1 Map 38: 1/1 Map 42: 1/1 Map 43: 1/1 Map 44: 240/240 Reducer 10: 1/1 Reducer 17: 1/1 Reducer 25: 1/1 Reducer 33: 1/1 Reducer 4: 1/1 Reducer 40: 1/1 Reducer 45: 1/1 Reducer 47: 1/1 Reducer 5: 1/1 Reducer 7: 1/1 Reducer 8: 1/1 Reducer 9: 1/1 Status: Finished successfully OK 345617 687625 686131 1032842 1030364 606859 Time taken: 90.233 seconds, Fetched: 1 row(s) © Hortonworks Inc. 2013. Map 13: 1/1 Map 18: 1/1 Map 21: 1/1 Map 26: 1/1 Map 3: 241/241 Map 35: 1/1 Map 39: 241/241 Map 46: 241/241 Reducer 31: 1/1 Reducer 41: 1/1 Reducer 6: 1/1 604232 692428 Page 8
  • 9. HIVE-4660 Hive-on-MR vs. Hive-on-Tez SELECT g1.x, g1.avg, g2.cnt FROM (SELECT a.x, AVERAGE(a.y) AS avg FROM a GROUP BY a.x) g1 JOIN (SELECT b.x, COUNT(b.y) AS avg FROM b GROUP BY b.x) g2 ON (g1.x = g2.x) ORDER BY avg; Hive – MR M M Hive – Tez GROUP b BY b.x M M GROUP a BY a.x R Tez avoids unnecessary writes to HDFS GROUP BY x M M R M M M R R HDFS JOIN (a,b) HDFS M GROUP BY a.x JOIN (a,b) R M R R ORDER BY HDFS M ORDER BY R © Hortonworks Inc. 2013. R M
  • 10. Hive on Tez - Execution Feature Tez Session Tez Container PreLaunch Description Overcomes Map-Reduce job-launch latency by prelaunching Tez AppMaster Latency Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. Latency Finished maps and reduces pick up more work Tez Container Re-Use rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! Runtime reRuntime query tuning by picking aggregation configuration of DAG parallelism using online query statistics Tez In-Memory Cache Hot data kept in RAM for fast access. Complex DAGs © Hortonworks Inc. 2013. Benefit Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput. Latency Throughput Latency Throughput Page 10
  • 11. Pipelined Splits Reduce start-up latency by launching tasks as soon as they are ready hive.orc.compute.splits.num.threads=10 Map task Tez AM tez.am.grouping.split-waves=1.4 © Hortonworks Inc. 2013.
  • 12. Pipelined Early Exit (Limit) hive.orc.compute.splits.num.threads=10 Map task Done! tez.am.grouping.split-waves=1.4 © Hortonworks Inc. 2013.
  • 13. HIVE-5775 Statistics and Cost-based optimization • Statistics: – Hive has table and column level statistics – Used to determine parallelism, join selection • Optiq: Open source, Apache licensed query execution framework in Java – Used by Apache Drill, Apache Cascade, Lucene DB, … – Based on Volcano paper – 20 man years dev, more than 50 optimization rules • Goals for hive – – – – Ease of Use – no manual tuning for queries, make choices automatically based on cost View Chaining/Ad hoc queries involving multiple views Help enable BI Tools front-ending Hive Emphasis on latency reduction • Cost computation will be used for  Join ordering  Join algorithm selection  Tez vertex boundary selection © Hortonworks Inc. 2013. Page 13
  • 14. Broadcast Join • Similar to map-join w/o the need to build a hash table on the client • Will work with any level of sub-query nesting • Uses stats to determine if applicable • How it works: – Broadcast result set is computed in parallel on the cluster – Join processor are spun up in parallel – Broadcast set is streamed to join processor – Join processors build hash table – Other relation is joined with hashtable • Tez handles: – Best parallelism – Best data transfer of the hashed relation – Best scheduling to avoid latencies © Hortonworks Inc. 2013.
  • 15. 1-1 Edge • Typical star schema join involve join between large number of tables • Dimension aren’t always tiny (Customer dimension) • Might not be able to handle all dimensions in single vertex as broadcast joins • Tez allows streaming records from one processor to the next via a 1-1 Edge – Transfer details (streaming, files, etc) are handled transparently – Scheduling/cluster capacity is worked out by Tez • Allows hive to build a pipeline of in memory joins which we can stream records through © Hortonworks Inc. 2013.
  • 16. Dynamically partitioned Hash join • Kicks in when large table is bucketed – Bucketed table – Dynamic as part of query processing • Will use custom edge to match the partitioning on the smaller table • Allows hash-join in cases where broadcast would be too large • Tez gives us the option of building custom edges and vertex managers – Fine grained control over how the data is replicated and partitioned – Scheduling and actual data transfer is handled by Tez © Hortonworks Inc. 2013.
  • 17. Shuffle join and Grouping w/o sort • In the MR model joins, grouping and window functions are typically mapped onto a shuffle phase • That means sorting will be used to implement the operator • With Tez it’s easy to switch out the implementation – Decouples partitioning, algorithm and transfer of the data – The nitty-gritty details are still abstracted away (data movement, scheduling, parallelism) • With the right statistics that let’s hive pick the right implementation for the query at hand © Hortonworks Inc. 2013.
  • 18. Union all • Common operation in decision support queries • Caused additional no-op stages in MR plans – Last stage spins up multi-input mapper to write result – Intermediate unions have to be materialized before additional processing • Tez has union that handles these cases transparently w/o any intermediate steps © Hortonworks Inc. 2013.
  • 19. Multi-insert queries • Allows the same input to be split and written to different tables or partitions – Avoids duplicate scans/processing – Useful for ETL – Similar to “Splits” in PIG • In MR a “split” in the operator pipeline has to be written to HDFS and processed by multiple additional MR jobs • Tez allows to send the mulitple outputs directly to downstream processors © Hortonworks Inc. 2013.
  • 20. Putting it all together Tez Session populates container pool Dimension table calculation and HDFS split generation in parallel Dimension tables broadcasted to Hive MapJoin tasks Final Reducer prelaunched and fetches completed inputs TPCDS – Query-27 with Hive on Tez Architecting the Future of Big Data © Hortonworks Inc. 2013 Page 20
  • 21. TPC-DS 10 TB Query Times (Tuned Queries) Data: 10 TB data loaded into ORCFile using defaults. Hardware: 20 nodes: 24 CPU threads, 64 GB RAM, 5 local SATA drives each. Queries: TPC-DS queries with partition filters added. © Hortonworks Inc. 2013. Page 21
  • 22. TPC-DS 30 TB Query Times (Tuned Queries) Data: 30 TB data loaded into ORCFile using defaults. Hardware: 14 nodes: 2x Intel E5-2660 v2, 256GB RAM, 23x 7.2kRPM drives each. Queries: TPC-DS queries with partition filters added. © Hortonworks Inc. 2013. Page 22
  • 23. Tez Concurrency Over 30 TB Dataset • Test: – Launch 20 queries concurrently and measure total time to completion. – Used the 20 tuned TPC-DS queries from previous slide. – Each query used once. – Queries hit multiple different fact tables with different access patterns. • Results: – The 20 queries finished within 27.5 minutes. – Average 82.65 seconds per query. • Data and Hardware details: – 30 Terabytes of data loaded into ORCFile using all defaults. – Hardware: 14 nodes: 2x Intel E5-2660 v2, 256GB RAM, 23x 7.2kRPM drives each. © Hortonworks Inc. 2013. Page 23

Editor's Notes

  1. With Hive and Stinger we are focused on enabling the SQL ecosystem and to do that we’ve put Hive on a clear roadmap to SQL compliance.That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases.