Apache Tez: Accelerating Hadoop Query Processing
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved 
Hortonworks. We do Hadoop.
Who am I?
Olivier Renault – orenault@hortonworks.com
Solution engineer – Hortonworks EMEA
Hadoop specialist:
- platform
- security
- tuning
Trying to tame the elephant!
Tez – Introduction 
Distributed execution framework targeted towards data-processing applications.
Based on expressing a computation as a dataflow graph.
Highly customizable to meet a broad spectrum of use cases.
Built on top of YARN – the resource management framework for Hadoop.
Open source Apache project and Apache licensed.
Hadoop 1 -> Hadoop 2 
HADOOP 1.0
- Hive (SQL), Pig (data flow), Others (Cascading)
- MapReduce (cluster resource management & data processing)
- HDFS (redundant, reliable storage)
Monolithic
- Resource management
- Execution Engine
- User API
HADOOP 2.0
- Data Flow: Pig; SQL: Hive; Others (Cascading), on Tez (execution engine)
- Batch: MapReduce
- Real Time Stream Processing: Storm
- Online Data Processing: HBase, Accumulo
- YARN (cluster resource management)
- HDFS2 (redundant, reliable storage)
Layered
- Resource Management – YARN
- Execution Engine – Tez
- User API – Hive, Pig, Cascading, …
Tez – Empowering Applications 
Tez solves the hard problems of running in a distributed Hadoop environment
Apps can focus on solving their domain-specific problems
This design is what lets Tez serve as a platform for a variety of applications
App
- Custom application logic
- Custom data format
- Custom data transfer technology
Tez
- Distributed parallel execution
- Negotiating resources from the Hadoop framework
- Fault tolerance and recovery
- Horizontal scalability
- Resource elasticity
- Shared library of ready-to-use components
- Built-in performance optimizations
- Security
Tez – End User Benefits 
Better performance of application 
- Built-in performance plus application-defined optimizations
Better predictability of results 
- Minimization of overheads and queuing delays 
Better utilization of compute capacity 
- Efficient use of allocated resources 
Reduced load on distributed filesystem (HDFS) 
- Reduce unnecessary replicated writes 
Reduced network usage 
- Better locality and data transfer using new data patterns 
Higher application developer productivity 
- Focus on application business logic rather than Hadoop internals 
Tez – Design considerations 
Leverage a discrete task-based compute model for elasticity, scalability and fault tolerance
Leverage several man-years of work in Hadoop MapReduce data shuffle operations
Leverage the proven resource sharing and multi-tenancy model of Hadoop and YARN
Leverage the built-in security mechanisms in Hadoop for privacy and isolation
Look to the Future with an eye on the Past
Tez – Problems that it addresses 
Expressing the computation 
- Direct and elegant representation of the data processing flow 
- Interfacing with application code and new technologies 
Performance 
- Late binding: Make decisions as late as possible 
- Leverage the resources of the cluster efficiently 
- Just works out of the box
- Customizable engine to let applications tailor the job to meet their specific requirements 
Operational simplicity
- Painless to operate, experiment and upgrade 
Tez – Simplifying Operations 
No deployments to do. No side effects. Easy and safe to try it out!
- Tez is a completely client-side application.
- Simply upload it to any accessible FileSystem and change the local Tez configuration to point to that location.
- Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
- Leverages YARN local resources.
[Diagram: TezClients on two client machines each reference their own Tez library (Tez Lib 1, Tez Lib 2) stored on HDFS; TezTasks run under the Node Managers.]
Tez – Expressing the computation 
Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs)
- Vertices in the graph represent data transformations
- Edges represent data movement from producers to consumers
[Diagram: Distributed sort. Stages: Preprocessor Stage, Partition Stage and Aggregate Stage (each with Task-1 and Task-2), plus a Sampler; edge labels: Samples, Ranges.]
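The vertices-and-edges idea can be sketched in a few lines of plain Java. This is a toy model, not the Tez API: the class `ToyDag`, its methods, and the exact wiring of the distributed-sort stages in `main` are assumptions made for illustration. It shows how a valid execution order follows from the graph alone, with every producer placed before its consumers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model of a job DAG; not the Tez API.
final class ToyDag {
    // vertex name -> names of downstream (consumer) vertices
    private final Map<String, List<String>> consumers = new LinkedHashMap<>();

    ToyDag addVertex(String name) {
        consumers.putIfAbsent(name, new ArrayList<>());
        return this;
    }

    ToyDag addEdge(String producer, String consumer) {
        addVertex(producer);
        addVertex(consumer);
        consumers.get(producer).add(consumer);
        return this;
    }

    // Kahn's algorithm: every producer appears before its consumers.
    List<String> executionOrder() {
        Map<String, Integer> pendingInputs = new LinkedHashMap<>();
        for (String v : consumers.keySet()) pendingInputs.put(v, 0);
        for (List<String> outs : consumers.values())
            for (String c : outs) pendingInputs.merge(c, 1, Integer::sum);

        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pendingInputs.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String c : consumers.get(v))
                if (pendingInputs.merge(c, -1, Integer::sum) == 0) ready.add(c);
        }
        if (order.size() != consumers.size())
            throw new IllegalStateException("graph has a cycle, so it is not a DAG");
        return order;
    }

    public static void main(String[] args) {
        // Assumed wiring of the distributed sort stages, for illustration only.
        ToyDag sort = new ToyDag()
            .addEdge("preprocessor", "sampler")    // samples
            .addEdge("sampler", "partition")       // ranges
            .addEdge("preprocessor", "partition")  // data
            .addEdge("partition", "aggregate");
        System.out.println(sort.executionOrder());
    }
}
```

A cyclic graph is rejected because no vertex ever becomes ready, which is the property that makes a DAG schedulable stage by stage.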
Tez – Expressing the computation 
Tez provides the following APIs to define the processing
DAG API
- Defines the structure of the data processing and the relationship between producers and consumers
- Enables definition of complex data flow pipelines using simple graph connection APIs. Tez expands the logical DAG at runtime
- Specifies all the tasks in the job
Runtime API
- Defines how the framework and app code interact with each other
- App code transforms data and moves it between tasks
- Specifies what actually executes in each task on the cluster nodes
Tez – Deep Dive – API 
// Simplified API as shown on the slide
// Define DAG
DAG dag = new DAG();
// Define Vertex
Vertex map1 = new Vertex(MapProcessor.class);
Vertex reduce1 = new Vertex(ReduceProcessor.class);
// Define Edge: map1 produces, reduce1 consumes
Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER,
    PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
// Connect them
dag.addVertex(map1).addVertex(reduce1).addEdge(edge1)…
[Diagram: Simple DAG definition API. Vertices map1, map2, reduce1, reduce2 and join1; edges labelled Scatter_Gather, Bipartite, Sequential.]
Tez – Deep Dive – API 
Edge properties define the connection between producer and consumer vertices in the DAG
Data movement – Defines routing of data between tasks
- One-To-One: Data from the ith producer task routes to the ith consumer task.
- Broadcast: Data from a producer task routes to all consumer tasks.
- Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task.
- Custom: Define your own.
Scheduling – Defines when a consumer task is scheduled
- Sequential: Consumer task may be scheduled after a producer task completes.
- Concurrent: Consumer task must be co-scheduled with a producer task.
Data source – Defines the lifetime/reliability of a task output
- Persisted: Output will be available after the task exits. Output may be lost later on.
- Persisted-Reliable: Output is reliably stored and will always be available.
- Ephemeral: Output is available only while the producer task is running.
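The three built-in data movement patterns reduce to a small routing rule. The sketch below is illustrative only and is not Tez code: the class `DataMovement`, the enum and the method `destinations` are hypothetical names. Given a producer task index and a shard index, it returns the consumer task indices that receive the data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the built-in data movement patterns; not Tez code.
final class DataMovement {
    enum Pattern { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    // Consumer task indices that receive shard `shard` of producer task `producer`.
    static List<Integer> destinations(Pattern pattern, int producer, int shard, int consumerTasks) {
        List<Integer> out = new ArrayList<>();
        switch (pattern) {
            case ONE_TO_ONE:
                // ith producer task -> ith consumer task
                if (producer >= consumerTasks)
                    throw new IllegalArgumentException("one-to-one needs a consumer per producer");
                out.add(producer);
                break;
            case BROADCAST:
                // every consumer task sees the producer's data
                for (int c = 0; c < consumerTasks; c++) out.add(c);
                break;
            case SCATTER_GATHER:
                // ith shard of every producer -> ith consumer task
                if (shard >= consumerTasks)
                    throw new IllegalArgumentException("shard index must map to a consumer task");
                out.add(shard);
                break;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(destinations(Pattern.ONE_TO_ONE, 2, 0, 3));
        System.out.println(destinations(Pattern.BROADCAST, 0, 0, 3));
        System.out.println(destinations(Pattern.SCATTER_GATHER, 2, 1, 3));
    }
}
```

Scatter-gather is the familiar MapReduce shuffle: the destination depends only on the shard index, never on which producer wrote it.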
Tez – Logical DAG expansion at Runtime 
[Diagram: the logical DAG (map1, map2, reduce1, reduce2, join1) is expanded at runtime into parallel tasks for each vertex.]
Tez – Runtime API 
Flexible Inputs-Processors-Outputs Model 
- Thin API to wrap around arbitrary application code 
- Compose inputs, processor and outputs to execute arbitrary processing
- Event routing based control plane architecture 
- Applications decide logical data format and data transfer technology 
- Customize for performance 
- Built-in implementation for Hadoop 2.0 data services – HDFS and YARN ShuffleService 
Input
- initialize(TezInputContext ctxt)
- Reader getReader()
- handleEvents(List<Event> evts)
- close()
Processor
- initialize(TezProcessorContext ctxt)
- run(List<Input> inputs, List<Output> outputs)
- handleEvents(List<Event> evts)
- close()
Output
- initialize(TezOutputContext ctxt)
- Writer getWriter()
- handleEvents(List<Event> evts)
- close()
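A minimal sketch of the Inputs-Processors-Outputs composition follows. The interfaces are deliberately simplified stand-ins, not the real Tez runtime API: `IpoSketch`, `Input`, `Output`, `Processor` and `UppercaseProcessor` are hypothetical names, and initialization, events and close handling are left out. The point is that the processor sees only readers and writers, so the data format and transport can be swapped without touching its logic.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, heavily simplified stand-ins; the real Tez runtime API differs.
final class IpoSketch {
    interface Input { Iterator<String> getReader(); }
    interface Output { void write(String record); }
    interface Processor { void run(List<Input> inputs, List<Output> outputs); }

    // Application logic: upper-case every record from every input.
    static final class UppercaseProcessor implements Processor {
        @Override
        public void run(List<Input> inputs, List<Output> outputs) {
            for (Input in : inputs) {
                Iterator<String> reader = in.getReader();
                while (reader.hasNext()) {
                    String record = reader.next().toUpperCase();
                    for (Output out : outputs) out.write(record);
                }
            }
        }
    }

    // Wires an in-memory input and output around the processor.
    static List<String> runOnce(List<String> records) {
        List<String> sink = new ArrayList<>();
        Input input = records::iterator;   // stands in for e.g. an HDFS or shuffle input
        Output output = sink::add;         // stands in for e.g. a sorted or HDFS output
        new UppercaseProcessor().run(List.of(input), List.of(output));
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(runOnce(List.of("tez", "yarn")));
    }
}
```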
Tez: Library of Inputs and Outputs 
Classical ‘Map’: HDFS Input -> Map Processor -> Sorted Output
Classical ‘Reduce’: Shuffle Input -> Reduce Processor -> HDFS Output
Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input -> Reduce Processor -> Sorted Output
What is built in?
- Hadoop InputFormat / OutputFormat
- SortedGroupedPartitioned Key-Value Input / Output
- UnsortedGroupedPartitioned Key-Value Input / Output
- Key-Value Input / Output
Tez - Performance 
Benefits of expressing the data processing as a DAG 
- Reducing overheads and queuing effects 
- Gives the system the global picture for better planning
Efficient use of resources 
- Re-use resources to maximize utilisation 
- Pre-Launch, pre-warm and cache 
- Locality & resource aware scheduling 
Support for application defined DAG modification at runtime 
- Change task concurrency 
- Change task scheduling 
- Change DAG edges 
- Change DAG Vertices 
Tez – Benefits of DAG execution 
Faster execution and higher predictability
- Eliminate replicated write barrier between successive computations.
- Eliminate job launch overhead of workflow jobs.
- Eliminate extra stage of map reads in every workflow job.
- Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
- Better locality because the engine has the overall picture.
[Diagram: Pig/Hive on MR compared with Pig/Hive on Tez.]
Hive-on-MR vs. Hive-on-Tez 
SELECT a.x, AVG(b.y) AS avg_y
FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
UNION SELECT x, AVG(y) AS avg_y
FROM c GROUP BY x
ORDER BY avg_y;
[Diagram: Hive – MR vs. Hive – Tez execution plans. Operators shown: SELECT a.state, c.itemId; SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state with COUNT(*) and AVERAGE(c.price). On MR the plan runs as several map (M) / reduce (R) jobs with HDFS between them; on Tez it runs as one graph of M and R stages.]
Tez avoids unneeded writes to HDFS
Tez – Container Re-Use 
- Reuse YARN containers/JVMs to launch new tasks 
- Reduce scheduling and launching delays 
- Shared JVM objects across tasks. 
- JVM JIT Friendly execution 
[Diagram: the Tez Application Master exchanges Start Task / Task Done / Start Task messages with a TezTask Host inside a YARN container; TezTask1 and TezTask2 run in that container and use its Shared Objects.]
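The effect of container re-use on launch overhead can be illustrated with a toy pool. This is a sketch under stated assumptions, not how the Tez Application Master is implemented: `ContainerPool` and its methods are hypothetical, and a container is reduced to an integer id. A new container is "launched" only when no idle one is available.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy pool; a real scheduler also weighs locality and idle timeouts.
final class ContainerPool {
    private final Deque<Integer> idle = new ArrayDeque<>();
    private int launched = 0;

    // Hands out an idle container if one exists, otherwise launches a new one.
    int acquire() {
        Integer id = idle.poll();
        return id != null ? id : launched++;
    }

    // Called when a task finishes: the container stays warm for the next task.
    void release(int id) {
        idle.push(id);
    }

    int launchedCount() {
        return launched;
    }

    public static void main(String[] args) {
        ContainerPool pool = new ContainerPool();
        for (int task = 0; task < 5; task++) {
            int container = pool.acquire();  // Start Task
            pool.release(container);         // Task Done
        }
        System.out.println("containers launched: " + pool.launchedCount());
    }
}
```

Five sequential tasks share a single container here, so the launch cost is paid once and the same JVM (with its JIT-compiled code and any shared objects) serves every task.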
Tez - Sessions 
Sessions
- Standard concepts of pre-launch and pre-warm applied
- Key for interactive queries
- Represents a connection between the user and the cluster
- Multiple DAGs executed in the same session
- Containers re-used across queries
- Takes care of data locality and releasing resources when idle
[Diagram: the Client starts a session and submits DAGs to the Application Master; labels: Task Scheduler, Container Pool, Pre-Warmed JVM, Shared Object Registry.]
Tez – Deep Dive – Scheduling 
[Diagram: scheduling flow for Vertex-1 and Vertex-2 involving the Vertex Manager, DAG Scheduler and Task Scheduler; arrow labels: Start vertex, Start tasks, Get Priority, Get container.]
Vertex Manager
- Determines task parallelism
- Determines when tasks in a vertex can start
DAG Scheduler
- Determines priority of task
Task Scheduler
- Allocates containers from YARN and assigns them to tasks
Tez – Event Based Control Plane 
Events are used to communicate between tasks and between tasks and the framework
A Data Movement event is used by a producer task to inform the consumer tasks about data location, size, etc.
An Input Error event is sent by a task to the engine to report errors in reading input. The engine then takes action by re-generating the input
Other events carry task completion notifications, data statistics and other control plane information
Tez – Automatic Reduce Parallelism 
[Diagram: Map Vertex, Reduce Vertex, Vertex Manager and App Master; labels: Data Size Statistics, Vertex State Machine, Set Parallelism, Cancel Task, Re-Route.]
Event Model
- Map tasks send data statistics events to the Reduce Vertex Manager.
Vertex Manager
- Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism.
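One plausible form of the "pluggable user logic" is sketched below. It is an assumption for illustration, not the policy Tez ships: the class `ReduceParallelism`, the method `choose` and the `desiredBytesPerReducer` parameter are hypothetical. The sketch sums the data sizes reported by map tasks and picks just enough reducers to keep each one near a target input size, never exceeding the configured count.

```java
// Hypothetical vertex-manager policy; names and the sizing rule are assumptions.
final class ReduceParallelism {
    // Picks a reducer count from reported map output sizes, capped at the configured value.
    static int choose(long[] mapOutputBytes, long desiredBytesPerReducer, int configuredReducers) {
        if (desiredBytesPerReducer <= 0 || configuredReducers <= 0)
            throw new IllegalArgumentException("target size and reducer count must be positive");
        long total = 0;
        for (long bytes : mapOutputBytes) total += bytes;
        // Ceiling division: enough reducers so that none exceeds the target size.
        long needed = (total + desiredBytesPerReducer - 1) / desiredBytesPerReducer;
        return (int) Math.max(1, Math.min(configuredReducers, needed));
    }

    public static void main(String[] args) {
        long[] stats = {100, 200, 300};  // bytes reported by three map tasks
        System.out.println(choose(stats, 250, 10));
    }
}
```

With 600 bytes of map output and a 250-byte target, three reducers are enough even though ten were configured; the surplus tasks would be the ones cancelled and re-routed.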
Theory to practice 
Tez – Performance 
30 TB scale factor – Hive 0.10 (RCFile) vs. Hive 0.13 (ORC)
Tez – Observations on Performance 
Number of stages in the DAG
- High number of stages in the DAG 
Cluster / Queue capacity 
- Congested queue - container re-use 
Size of intermediate output 
- Large size of intermediate output – less HDFS usage 
Size of data in the job 
- Small data and many stages – less overhead than MR
Offload work to the cluster 
- Use DAG – utilize parallelism and resources of the cluster 
Vertex caching 
- Reduce re-computation 
Tez – Adoption Path 
Pre-requisite: Hadoop 2 with YARN 
Simple client-side install (no admin support needed)
- Need a folder with write permission on HDFS
- No side effects or traces left behind on your cluster
Apache Hive – Available in 0.13
- Set “hive.execution.engine” to “tez”
Apache Pig – Available in 0.14
Cascading – Version 3.0
Run your MapReduce jobs using the Tez runtime
- Set “mapreduce.framework.name” to “yarn-tez”
Tez - Roadmap 
Richer DAG support 
- Addition of vertices at runtime 
- Shared edges for shared outputs 
- Enhance Input / Output library 
Performance optimizations 
- Improve support for high concurrency 
- Improve locality aware scheduling 
- Add framework level data statistics 
- HDFS memory storage integration 
Usability 
- Tez UI 
- API ease of use 

Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
 
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
VVIP Pune Call Girls Sinhagad WhatSapp Number 8005736733 With Elite Staff And...
 

Introduction to Tez by Olivier Renault of Hortonworks, meetup of 25/11/2014

  • 1. Apache Tez: Accelerating Hadoop Query Processing Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Hortonworks. We do Hadoop.
  • 2. Who am I? Olivier Renault – orenault@hortonworks.com. Solution engineer – Hortonworks EMEA. Hadoop specialist: platform, security, tuning. Trying to tame the elephant!
  • 3. Tez – Introduction: Distributed execution framework targeted towards data-processing applications. Based on expressing a computation as a dataflow graph. Highly customizable to meet a broad spectrum of use cases. Built on top of YARN, the resource management framework for Hadoop. Open source Apache project and Apache licensed.
  • 4. Hadoop 1 -> Hadoop 2: Hadoop 1.0 is monolithic (resource management, execution engine and user API in one stack): Hive (SQL), Pig (data flow) and others (Cascading) run on MapReduce (cluster resource management & data processing) over HDFS (redundant, reliable storage). Hadoop 2.0 is layered: Pig (data flow), Hive (SQL) and others (Cascading) run on Tez (execution engine), next to MapReduce (batch), Storm (real-time stream processing) and HBase, Accumulo (online data processing), all on YARN (cluster resource management) over HDFS2 (redundant, reliable storage). Layers: Resource Management – YARN; Execution Engine – Tez; User API – Hive, Pig, Cascading, …
  • 5. Tez – Empowering Applications: Tez solves the hard problems of running in a distributed Hadoop environment, so apps can focus on solving their domain-specific problems. This design is important for Tez to be a platform for a variety of applications. App: custom application logic; custom data format; custom data transfer technology. Tez: distributed parallel execution; negotiating resources from the Hadoop framework; fault tolerance and recovery; horizontal scalability; resource elasticity; shared library of ready-to-use components; built-in performance optimizations; security.
  • 6. Tez – End User Benefits: Better application performance (built-in performance plus application-defined optimizations). Better predictability of results (minimization of overheads and queuing delays). Better utilization of compute capacity (efficient use of allocated resources). Reduced load on the distributed filesystem, HDFS (fewer unnecessary replicated writes). Reduced network usage (better locality and data transfer using new data patterns). Higher application developer productivity (focus on application business logic rather than Hadoop internals).
  • 7. Tez – Design considerations: Leverage the discrete task-based compute model for elasticity, scalability and fault tolerance. Leverage several man-years of work in Hadoop MapReduce data shuffle operations. Leverage the proven resource sharing and multi-tenancy model for Hadoop and YARN. Leverage the built-in security mechanism in Hadoop for privacy and isolation. Look to the future with an eye on the past.
  • 8. Tez – Problems that it addresses: Expressing the computation: direct and elegant representation of the data processing flow; interfacing with application code and new technologies. Performance: late binding (make decisions as late as possible); leverage the resources of the cluster efficiently; just work out of the box; customizable engine to let applications tailor the job to meet their specific requirements. Operational simplicity: painless to operate, experiment and upgrade.
  • 9. Tez – Simplifying Operations: No deployments to do. No side effects. Easy and safe to try it out! Tez is a completely client-side application. Simply upload it to any accessible FileSystem and change the local Tez configuration to point to that location. This enables running different versions concurrently, so it is easy to test new functionality while keeping stable versions for production. Leverages YARN local resources. [Diagram labels: TezClient on each Client Machine, TezTask on each Node Manager, Tez Lib 1 and Tez Lib 2 on HDFS]
  • 10. Tez – Expressing the computation: Distributed data processing jobs typically look like DAGs (Directed Acyclic Graphs). Vertices in the graph represent data transformations. Edges represent data movement from producers to consumers. [Diagram: distributed sort with a Preprocessor Stage, Partition Stage and Aggregate Stage (Task-1 and Task-2 each) and a Sampler that turns samples into ranges]
  • 11. Tez – Expressing the computation: Tez provides the following APIs to define the processing. DAG API: defines the structure of the data processing and the relationship between producers and consumers; enables definition of complex data flow pipelines using simple graph connection APIs (Tez expands the logical DAG at runtime); specifies all the tasks in the job. Runtime API: defines how the framework and app code interact with each other; app code transforms data and moves it between tasks; specifies what actually executes in each task on the cluster nodes.
  • 12. Tez – Deep Dive – API: Simple DAG definition API
      // Define DAG
      DAG dag = new DAG();
      // Define Vertex
      Vertex map1 = new Vertex(MapProcessor.class);
      Vertex reduce1 = new Vertex(ReduceProcessor.class);
      // Define Edge
      Edge edge1 = new Edge(map1, reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, MOutput.class, RInput.class);
      // Connect them
      dag.addVertex(map1).addVertex(reduce1).addEdge(edge1)…
    [Diagram: vertices map1, map2, reduce1, reduce2 and join1; edges labeled Scatter_Gather, Bipartite, Sequential]
  • 13. Tez – Deep Dive – API: Edge properties define the connection between producer and consumer vertices in the DAG. • Data movement – Defines routing of data between tasks. – One-To-One: Data from the ith producer task routes to the ith consumer task. – Broadcast: Data from a producer task routes to all consumer tasks. – Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task. – Custom: Define your own. • Scheduling – Defines when a consumer task is scheduled. – Sequential: Consumer task may be scheduled after a producer task completes. – Concurrent: Consumer task must be co-scheduled with a producer task. • Data source – Defines the lifetime/reliability of a task output. – Persisted: Output will be available after the task exits. Output may be lost later on. – Persisted-Reliable: Output is reliably stored and will always be available. – Ephemeral: Output is available only while the producer task is running.
  • 14. Tez – Logical DAG expansion at Runtime [Diagram: the logical DAG of map1, map2, reduce1, reduce2 and join1 expanded into its runtime tasks]
  • 15. Tez – Runtime API: Flexible Inputs-Processors-Outputs model. Thin API to wrap around arbitrary application code. Compose inputs, processor and outputs to execute arbitrary processing. Event-routing-based control plane architecture. Applications decide the logical data format and data transfer technology. Customize for performance. Built-in implementation for Hadoop 2.0 data services: HDFS and YARN ShuffleService. Input: initialize(tezInputContext ctxt), reader getReader(), handleEvents(list <event> evts), close(). Processor: initialize(tezProcessorContext ctxt), run(List<input> inputs, List<output> outputs), handleEvents(list <event> evts), close(). Output: initialize(tezOutputContext ctxt), writer getWriter(), handleEvents(list <event> evts), close().
  • 16. Tez: Library of Inputs and Outputs. Classical ‘Map’: HDFS Input, Map Processor, Sorted Output. Classical ‘Reduce’: Shuffle Input, Reduce Processor, HDFS Output. Intermediate ‘Reduce’ for Map-Reduce-Reduce: Shuffle Input, Reduce Processor, Sorted Output. What is built in? – Hadoop InputFormat / OutputFormat – SortedGroupPartitioned Key-Value Input / Output – UnsortedGroupedPartitioned Key-Value Input / Output – Key-Value Input / Output
  • 17. Tez - Performance: Benefits of expressing the data processing as a DAG: reduces overheads and queuing effects; gives the system the global picture for better planning. Efficient use of resources: re-use resources to maximize utilisation; pre-launch, pre-warm and cache; locality and resource aware scheduling. Support for application-defined DAG modification at runtime: change task concurrency; change task scheduling; change DAG edges; change DAG vertices.
  • 18. Tez – Benefits of DAG execution: Faster execution and higher predictability. – Eliminate the replicated write barrier between successive computations. – Eliminate the job launch overhead of workflow jobs. – Eliminate the extra stage of map reads in every workflow job. – Eliminate the queue and resource contention suffered by workflow jobs that are started after a predecessor job completes. – Better locality because the engine has the overall picture. [Diagram: Pig/Hive on MR compared with Pig/Hive on Tez]
  • 19. Hive-on-MR vs. Hive-on-Tez: SELECT a.x, AVERAGE(b.y) AS avg FROM a JOIN b ON (a.id = b.id) GROUP BY a UNION SELECT x, AVERAGE(y) AS AVG FROM c GROUP BY x ORDER BY AVG; [Diagram: the Hive – MR plan is a chain of map (M) and reduce (R) jobs with HDFS between them; the Hive – Tez plan is a single graph of M and R vertices. Operator labels: SELECT a.state, c.itemId; SELECT b.id; JOIN (a, c); JOIN (a, b); GROUP BY a.state; COUNT(*); AVERAGE(c.price)] Tez avoids unneeded writes to HDFS.
  • 20. Tez – Container Re-Use: Reuse YARN containers/JVMs to launch new tasks. Reduce scheduling and launching delays. Shared JVM objects across tasks. JVM JIT-friendly execution. [Diagram labels: Tez Application Master; TezTask Host with TezTask1 and TezTask2 in a YARN Container holding Shared Objects; messages Start Task, Task Done, Start Task]
  • 21. Tez - Sessions: Standard concepts of pre-launch and pre-warm applied. Key for interactive queries. A session represents a connection between the user and the cluster. Multiple DAGs are executed in the same session. Containers are re-used across queries. Takes care of data locality and of releasing resources when idle. [Diagram labels: Client (Start Session, Submit DAG); Application Master with Task Scheduler; Container Pool; Pre Warmed JVM; Shared Object Registry]
  • 22. Tez – Deep Dive – Scheduling: Vertex Manager • Determines task parallelism • Determines when tasks in a vertex can start. DAG Scheduler • Determines priority of task. Task Scheduler • Allocates containers from YARN and assigns them to tasks. [Diagram labels: Vertex-1 and Vertex-2 (Start vertex); Vertex Manager (Start tasks); DAG Scheduler (Get Priority); Task Scheduler (Get container)]
  • 23. Tez – Event Based Control Plane: Events are used to communicate between tasks and between a task and the framework. A Data Movement Event is used by a producer task to inform the consumer tasks about data location, size, etc. An Input Error event is sent by a task to the engine to report errors in reading input; the engine then takes action by re-generating the input. Other events carry task completion notifications, data statistics and other control plane information.
  • 24. Tez – Automatic Reduce Parallelism: Event Model: Map tasks send data statistics events to the Reduce Vertex Manager. Vertex Manager: pluggable user logic that understands the data statistics and can formulate the correct parallelism; advises the vertex controller on parallelism. [Diagram labels: Map Vertex; Reduce Vertex; Vertex Manager; App Master; Vertex State Machine; Data Size Statistics; Set Parallelism; Cancel Task; Re-Route]
  • 25. Theory to practice
  • 26. Tez – Performance 30TB Scale factor – Hive 10 RC File, Hive 13 ORC
  • 27. Tez – Observations on Performance: Number of stages in the DAG - High number of stages in the DAG. Cluster / queue capacity - Congested queue - container re-use. Size of intermediate output - Large intermediate output – less HDFS usage. Size of data in the job - Small data and a lot of stages – less overhead than MR. Offload work to the cluster - Use the DAG – utilize the parallelism and resources of the cluster. Vertex caching - Reduce re-computation.
  • 28. Tez – Adoption Path: Pre-requisite: Hadoop 2 with YARN. Simple client-side install (no admin support needed) - Needs a folder with write permission on HDFS - No side effects or traces left behind on your cluster. Apache Hive – Available in 0.13 - Set "hive.execution.engine" to "tez". Apache Pig – Available in 0.14. Cascading – Version 3.0. Run your MapReduce jobs using the Tez runtime - Set "mapreduce.framework.name" to "yarn-tez".
  • 29. Tez - Roadmap: Richer DAG support - Addition of vertices at runtime - Shared edges for shared outputs - Enhance Input / Output library. Performance optimizations - Improve support for high concurrency - Improve locality aware scheduling - Add framework level data statistics - HDFS memory storage integration. Usability - Tez UI - API ease of use.
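
The data movement types described on slide 13 can be illustrated with a small routing function. This is a minimal sketch in plain Java, not the Tez API: the class `EdgeRoutingSketch`, the enum `DataMovement` and the method `consumersFor` are hypothetical names chosen for this example, and it assumes one shard per consumer task for scatter-gather.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: these names are hypothetical and are not the Tez API.
final class EdgeRoutingSketch {

    // The three built-in data movement types listed on slide 13.
    enum DataMovement { ONE_TO_ONE, BROADCAST, SCATTER_GATHER }

    // Returns the indices of the consumer tasks that receive the given shard
    // written by the given producer task.
    static List<Integer> consumersFor(DataMovement type, int producer, int shard, int numConsumers) {
        List<Integer> targets = new ArrayList<>();
        switch (type) {
            case ONE_TO_ONE:
                // The ith producer task feeds only the ith consumer task.
                requireIndex(producer, numConsumers);
                targets.add(producer);
                break;
            case BROADCAST:
                // Every consumer task receives the producer's data.
                for (int c = 0; c < numConsumers; c++) {
                    targets.add(c);
                }
                break;
            case SCATTER_GATHER:
                // The ith shard of every producer goes to the ith consumer task.
                requireIndex(shard, numConsumers);
                targets.add(shard);
                break;
        }
        return targets;
    }

    private static void requireIndex(int index, int numConsumers) {
        if (index < 0 || index >= numConsumers) {
            throw new IllegalArgumentException("no consumer task with index " + index);
        }
    }

    public static void main(String[] args) {
        System.out.println(consumersFor(DataMovement.ONE_TO_ONE, 2, 0, 4));     // [2]
        System.out.println(consumersFor(DataMovement.BROADCAST, 2, 0, 4));      // [0, 1, 2, 3]
        System.out.println(consumersFor(DataMovement.SCATTER_GATHER, 1, 3, 4)); // [3]
    }
}
```

In the scatter-gather case the producer index plays no role in the routing decision, which is why a consumer task ends up gathering the same shard number from every producer.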

Editor's notes

  1. Look to the future with an eye on the past: re-use what we have learned.
  2. Expressing the computation: with MapReduce it is sometimes hard to express the algorithm. Able to change the source / sink of data – advanced sources: RDBMs, RMA, … Performance: late binding = using cluster information and real data at runtime. Build into the framework a solution for users to change their mechanism.
  3. Really simple / really safe to deploy
  4. The user defines a workflow as a DAG, i.e. the movement of data from source to sink via a set of consumers and producers. Vertex – transformation of data – transform, filter, compute, … Edge – movement of data – could be writing to local disk, HDFS, streaming from one place to another, DB, … Preprocessor stage – e.g. text data – sent to the sampler – which calculates ranges for splitting the data into partitions a – c, d – f, … Partition to aggregate stage = Scatter / Gather
  5. DAG API: lets you define the graph. Runtime API: runs the customer's code.
  6. Logical plan
  7. Logical plan
  8. Physical DAG
  9. A task is divided into a triplet – Input / Processor / Output. Input – reads data and transforms it from the source format into an input that the Processor can understand. Processor – business logic. Output – writes data. You can simply swap any part, if you change
  10. You could switch the input or the output – for example, swap the reducer's HDFS output for a Kafka queue or an RDBMS. The input could be an in-memory DB for performance. Built-in inputs: HDFS / YARN shuffle service – read data from local disk
  11. Queuing effect on busy cluster
  12. If you have defined 100 reducers, the number can be shrunk down automatically (reduces resources)
  13. Enables the scheduling deep dive. Sends statistics on where tasks are launched – e.g. for colocation
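
Notes 9 and 10 describe a task as an Input / Processor / Output triplet whose parts can be swapped independently. The idea can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the interfaces `Input`, `Processor` and `Output` below are simplified, hypothetical stand-ins (records are plain strings, there are no contexts or events) and are not the Tez Runtime API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: simplified, hypothetical interfaces, not the Tez Runtime API.
final class IpoSketch {

    interface Input { Iterable<String> read(); }
    interface Output { void write(String record); }
    interface Processor { void run(Input in, Output out); }

    // An input backed by an in-memory list; an HDFS or shuffle input would
    // implement the same interface.
    static final class ListInput implements Input {
        private final List<String> records;
        ListInput(List<String> records) { this.records = records; }
        public Iterable<String> read() { return records; }
    }

    // An output that collects records in memory; it could be replaced by an
    // output writing to a file system, a queue or a database.
    static final class ListOutput implements Output {
        final List<String> written = new ArrayList<>();
        public void write(String record) { written.add(record); }
    }

    // The business logic: it only sees the Input and Output interfaces.
    static final class UpperCaseProcessor implements Processor {
        public void run(Input in, Output out) {
            for (String record : in.read()) {
                out.write(record.toUpperCase(Locale.ROOT));
            }
        }
    }

    // A task is the composition of one input, one processor and one output.
    static void runTask(Input in, Processor processor, Output out) {
        processor.run(in, out);
    }

    public static void main(String[] args) {
        ListOutput out = new ListOutput();
        runTask(new ListInput(Arrays.asList("a", "b")), new UpperCaseProcessor(), out);
        System.out.println(out.written); // [A, B]
    }
}
```

Because the processor depends only on the two interfaces, replacing `ListOutput` with another `Output` implementation changes where the data goes without touching the business logic.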