YARN Ready webinar series helps developers integrate their applications to YARN. Tez is one vehicle to do that. We take a deep dive including code review to help you get started.
SQL Database Design For Developers at php[tek] 2024
Â
YARN Ready: Integrating to YARN with Tez
1. Apache Tez : Accelerating Hadoop
Data Processing
Bikas Saha@bikassaha
Š Hortonworks Inc. 2013 Page 1
2. Tez â Introduction
Š Hortonworks Inc. 2013
Page 2
⢠Distributed execution framework targeted towards data-processing
applications.
⢠Based on expressing a computation as a dataflow graph.
⢠Highly customizable to meet a broad spectrum of use cases.
⢠Built on top of YARN â the resource management framework
for Hadoop.
⢠Open source Apache project and Apache licensed.
4. Tez â Empowering Applications
⢠Tez solves hard problems of running on a distributed Hadoop environment
⢠Apps can focus on solving their domain specific problems
⢠This design is important to be a platform for a variety of applications
Š Hortonworks Inc. 2013
Page 4
App
Tez
⢠Custom application logic
⢠Custom data format
⢠Custom data transfer technology
⢠Distributed parallel execution
⢠Negotiating resources from the Hadoop framework
⢠Fault tolerance and recovery
⢠Horizontal scalability
⢠Resource elasticity
⢠Shared library of ready-to-use components
⢠Built-in performance optimizations
⢠Security
5. Tez â End User Benefits
⢠Better performance of applications
⢠Built-in performance + Application define optimizations
⢠Better predictability of results
⢠Minimization of overheads and queuing delays
⢠Better utilization of compute capacity
⢠Efficient use of allocated resources
⢠Reduced load on distributed filesystem (HDFS)
⢠Reduce unnecessary replicated writes
Š Hortonworks Inc. 2013
⢠Reduced network usage
⢠Better locality and data transfer using new data patterns
⢠Higher application developer productivity
⢠Focus on application business logic rather than Hadoop internals
Page 5
6. Tez â Design considerations
Donât solve problems that have already been solved. Or else
you will have to solve them again!
⢠Leverage discrete task based compute model for elasticity, scalability
and fault tolerance
⢠Leverage several man years of work in Hadoop Map-Reduce data
shuffling operations
⢠Leverage proven resource sharing and multi-tenancy model for Hadoop
and YARN
⢠Leverage built-in security mechanisms in Hadoop for privacy and
isolation
Š Hortonworks Inc. 2013
Page 6
Look to the Future with an eye on the Past
7. Tez â Problems that it addresses
⢠Expressing the computation
⢠Direct and elegant representation of the data processing flow
⢠Interfacing with application code and new technologies
Š Hortonworks Inc. 2013
⢠Performance
⢠Late Binding : Make decisions as late as possible using real data from at
runtime
⢠Leverage the resources of the cluster efficiently
⢠Just work out of the box!
⢠Customizable engine to let applications tailor the job to meet their specific
requirements
⢠Operation simplicity
⢠Painless to operate, experiment and upgrade
Page 7
8. Tez â Simplifying Operations
⢠No deployments to do. No side effects. Easy and safe to try it out!
⢠Tez is a completely client side application.
⢠Simply upload to any accessible FileSystem and change local Tez configuration to
point to that.
⢠Enables running different versions concurrently. Easy to test new functionality
while keeping stable versions for production.
⢠Leverages YARN local resources.
TezClient TezTask
TezTask
Š Hortonworks Inc. 2013
Page 8
Client
Machine
Node
Manager
Node
Manager
HDFS
Tez Lib 1 Tez Lib 2
TezClient
Client
Machine
9. Tez â Expressing the computation
Distributed data processing jobs typically look like DAGs (Directed Acyclic
Graph).
⢠Vertices in the graph represent data transformations
⢠Edges represent data movement from producers to consumers
Š Hortonworks Inc. 2013
Page 9
Preprocessor Stage
Partition Stage
Aggregate Stage
Sampler
Task-1 Task-2
Task-1 Task-2
Task-1 Task-2
Samples
Ranges
Distributed Sort
10. Tez â Expressing the computation
Tez provides the following APIs to define the processing
⢠DAG API
⢠Defines the structure of the data processing and the relationship
between producers and consumers
⢠Enable definition of complex data flow pipelines using simple graph
connection APIâs. Tez expands the logical DAG at runtime
⢠This is how the connection of tasks in the job gets specified
Š Hortonworks Inc. 2013
⢠Runtime API
⢠Defines the interfaces using which the framework and app code interact
with each other
⢠App code transforms data and moves it between tasks
⢠This is how we specify what actually executes in each task on the cluster
nodes
Page 10
11. Š Hortonworks Inc. 2013
Tez â DAG API
// Define DAG
DAG dag = DAG.create();
// Define Vertex
Vertex Map1 = Vertex.create(Processor.class);
// Define Edge
Edge edge = Edge.create(Map1, Reduce1,
SCATTER_GATHER, PERSISTED, SEQUENTIAL,
Output.class, Input.class);
// Connect them
dag.addVertex(Map1).addEdge(edge)âŚ
Page 11
Defines the global processing flow
Map1 Map2
Scatter
Gather
Reduce1 Reduce2
Scatter
Gather
Join
12. Š Hortonworks Inc. 2013
Tez â DAG API
Data movement â Defines routing of data between
tasks
⢠One-To-One : Data from the ith producer task routes to
the ith consumer task.
⢠Broadcast : Data from a producer task routes to all
consumer tasks.
⢠Scatter-Gather : Producer tasks scatter data into shards
and consumer tasks gather the data. The ith shard from
all producer tasks routes to the ith consumer task.
Page 12
Edge properties define the connection between producer and
consumer tasks in the DAG
13. Tez â Logical DAG expansion at
Runtime
Š Hortonworks Inc. 2013
Page 13
Reduce1
Map2
Reduce2
Join
Map1
14. Tez â Runtime API
Flexible Inputs-Processor-Outputs Model
⢠Thin API layer to wrap around arbitrary application code
⢠Compose inputs, processor and outputs to execute arbitrary processing
⢠Applications decide logical data format and data transfer technology
⢠Customize for performance
⢠Built-in implementations for Hadoop 2.0 data services â HDFS and
YARN ShuffleService. Built on the same API. Your impls are as first class
as ours!
Š Hortonworks Inc. 2013
Page 14
15. Tez â Library of Inputs and Outputs
Sorted
Output
Š Hortonworks Inc. 2013
Classical âMapâ
Page 15
Classical âReduceâ
Map
Processor
HDFS
Input
Intermediate âReduceâ for
Map-Reduce-Reduce
Reduce
Processor
Shuffle
Input
HDFS
Output
Reduce
Processor
Shuffle
Input
Sorted
Output
⢠What is built in?
â Hadoop InputFormat/OutputFormat
â SortedGroupedPartitioned Key-Value
Input/Output
â UnsortedGroupedPartitioned Key-Value
Input/Output
â Key-Value Input/Output
16. Tez â Composable Task Model
Adopt Evolve Optimize
HDFS
Input
Remote
File
Server
Input
Š Hortonworks Inc. 2013
Native
DB
Input
Page 16
HDFS
Input
Remote
File
Server
Input
Hive Processor
HDFS
Output
Local
Disk
Output
Your Processor
HDFS
Output
Local
Disk
Output
RDMA
Input
Your Processor
Kakfa
Pub-Sub
Output
Amazon
S3
Output
17. Tez â Performance
⢠Benefits of expressing the data processing as a DAG
⢠Reducing overheads and queuing effects
⢠Gives system the global picture for better planning
⢠Efficient use of resources
⢠Re-use resources to maximize utilization
⢠Pre-launch, pre-warm and cache
⢠Locality & resource aware scheduling
⢠Support for application defined DAG modifications at runtime
for optimized execution
⢠Change task concurrency
⢠Change task scheduling
⢠Change DAG edges
⢠Change DAG vertices (TBD)
Š Hortonworks Inc. 2013
Page 17
18. Tez â Benefits of DAG execution
Faster Execution and Higher Predictability
⢠Eliminate replicated write barrier between successive computations.
⢠Eliminate job launch overhead of workflow jobs.
⢠Eliminate extra stage of map reads in every workflow job.
⢠Eliminate queue and resource contention suffered by workflow jobs that
are started after a predecessor job completes.
⢠Better locality because the engine has the global picture
Š Hortonworks Inc. 2013
Page 18
Pig/Hive - MR
Pig/Hive - Tez
19. Tez â Container Re-Use
⢠Reuse YARN containers/JVMs to launch new tasks
⢠Reduce scheduling and launching delays
⢠Shared in-memory data across tasks
⢠JVM JIT friendly execution
Š Hortonworks Inc. 2013
Page 19
TezTask Host
TezTask1
TezTask2
Shared Objects
YARN Container / JVM
Tez
Application Master
YARN Container
Start Task
Task Done
Start Task
20. Š Hortonworks Inc. 2013
Tez â Sessions
Page 20
Client
Application Master
Start
Session
Submit
DAG
Task Scheduler
Container Pool
Shared
Object
Registry
Pre
Warmed
JVM
Sessions
⢠Standard concepts of pre-launch
and pre-warm applied
⢠Key for interactive queries
⢠Represents a connection between
the user and the cluster
⢠Multiple DAGs executed in the
same session
⢠Containers re-used across queries
⢠Takes care of data locality and
releasing resources when idle
21. Tez â Current status
Š Hortonworks Inc. 2013
⢠Apache Project
âRapid development. Over 1100 jiras opened. Over 800 resolved
âGrowing community of contributors and users
⢠0.5 being released. In voting for release
âDeveloper release focused ease of development
â Stable API
âBetter debugging
âDocumentation and code samples
⢠Support for a vast topology of DAGs
⢠Being used by multiple applications such as Apache Hive,
Apache Pig, Cascading, Scalding
Page 21
22. Tez â Adoption Path
Pre-requisite : Hadoop 2 with YARN
Tez has zero deployment pain. No side effects or traces left
behind on your cluster. Low risk and low effort to try out.
⢠Using Hive, Pig, Cascading, Scalding
⢠Try them with Tez as execution engine
⢠Already have MapReduce based pipeline
⢠Use configuration to change MapReduce to run on Tez by setting
âmapreduce.framework.nameâ to âyarn-tezâ in mapred-site.xml
⢠Consolidate MR workflow into MR-DAG on Tez
⢠Change MR-DAG to use more efficient Tez constructs
Š Hortonworks Inc. 2013
⢠Have custom pipeline
⢠Wrap custom code/scripts into Tez inputs-processor-outputs
⢠Translate your custom pipeline topology into Tez DAG
⢠Change custom topology to use more efficient Tez constructs
Page 22
23. Tez â Theory to Practice
Š Hortonworks Inc. 2013
⢠Performance
⢠Scalability
Page 23
24. Tez â Hive TPC-DS Scale 200GB latency
Š Hortonworks Inc. 2013
25. Tez â Pig performance gains
⢠Demonstrates performance gains from a basic translation to a
Tez DAG
160
140
120
100
80
60
40
20
0
Prod script 1
25m vs 10m
5 MR Jobs
Prod script 2
34m vs 16m
5 MR Jobs
Prod script 3
1h 46m vs 48m
12 MR Jobs
Prod script 4
2h 22m vs 1h
21m
15 MR jobs
Time in mins
MR
Tez
26. Tez â Observations on Performance
Š Hortonworks Inc. 2013
⢠Number of stages in the DAG
⢠Higher the number of stages in the DAG, performance of Tez (over MR) will be
better.
⢠Cluster/queue capacity
⢠More congested a queue is, the performance of Tez (over MR) will be better due
to container reuse.
⢠Size of intermediate output
⢠More the size of intermediate output, the performance of Tez (over MR) will be
better due to reduced HDFS usage.
⢠Size of data in the job
⢠For smaller data and more stages, the performance of Tez (over MR) will be
better as percentage of launch overhead in the total time is high for smaller
jobs.
⢠Offload work to the cluster
⢠Move as much work as possible to the cluster by modelling it via the job DAG.
Exploit the parallelism and resources of the cluster. E.g. MR split calculation.
⢠Vertex caching
⢠The more re-computation can be avoided the better is the performance.
Page 26
27. Tez â Data at scale
Š Hortonworks Inc. 2013
Hive TPC-DS
Scale 10TB
Page 27
28. Tez â DAG definition at scale
Š Hortonworks Inc. 2013
Page 28
Hive : TPC-DS Query 64 Logical DAG
29. Tez â Container Reuse at Scale
⢠78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4)
Š Hortonworks Inc. 2013
Page 29
30. Tez â Bridging the Data Spectrum
Š Hortonworks Inc. 2013
Page 30
Fact Table
Dimension
Table 1
Result
Table 1
Dimension
Table 2
Result
Table 2
Dimension
Table 3
Result
Table 3
Broadcast
Join
Shuffle
Join
Typical pattern in a
TPC-DS query
Fact Table
Dimension
Table 1
Dimension
Table 1
Dimension
Table 1
Broadcast join
for small data sets
Based on data size,
the query optimizer
can run either plan as
a single Tez job
Broadcast
Join
31. Tez â Community
⢠Early adopters and code contributors welcome
â Adopters to drive more scenarios. Contributors to make them happen.
⢠Tez meetup for developers and users
â http://www.meetup.com/Apache-Tez-User-Group
Š Hortonworks Inc. 2013
⢠Technical blog series
â http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
⢠Useful links
âWork tracking: https://issues.apache.org/jira/browse/TEZ
â Code: https://github.com/apache/tez
â Developer list: dev@tez.apache.org
User list: user@tez.apache.org
Issues list: issues@tez.apache.org
Page 31
32. Tez â Takeaways
⢠Distributed execution framework that works on computations
represented as dataflow graphs
⢠Customizable execution architecture designed to enable
dynamic performance optimizations at runtime
â˘Works out of the box with the platform figuring out the hard
stuff
⢠Span the spectrum of interactive latency to batch, small to
large
⢠Open source Apache project â your use-cases and code are
welcome
⢠It works and is already being used by Hive and Pig
Š Hortonworks Inc. 2013
Page 32
33. Š Hortonworks Inc. 2013
Tez
Thanks for your time and attention!
Video with Deep Dive on Tez
http://goo.gl/BL67o7
http://www.infoq.com/presentations/apache-tez
Questions?
@bikassaha
Page 33
35. Next Steps
1. Review YARN Resources
2. Review webinar recording
3. Attend Office Hours
4. Attend the next YARN
webinar
36. Resources
Setup HDP 2.1 environment
⢠Leverage Sandbox: Hortonworks.com/Sandbox
Get Started with YARN
⢠http://hortonworks.com/get-started/YARN
Video Deep Dive on Tez
⢠http://goo.gl/BL67o7 and http://www.infoq.com/presentations/apache-tez
Tez meetup for developers and users
⢠http://www.meetup.com/Apache-Tez-User-Group
Technical blog series
http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing
Useful links
Work tracking: https://issues.apache.org/jira/browse/TEZ
Code: https://github.com/apache/tez
Developer list: dev@tez.apache.org
User list: user@tez.apache.org
37. Hortonworks Office Hours
YARN Office Hours
Dial in and chat with YARN experts
We plan Office Hours for September 11th and October 9th @ 10am PT (2nd
Thursdays)
Invitations will go out to those that attended or reviewed YARN webinars
These will also be posted to hortonworks.com/webinars
Registration required.
38. YARN Ready Webinar Schedule
Native
Integration
Slider
Office Hours
August 14
Tez
August 21
Ambari
Sept. 4
Office Hours
Sept. 11
Scalding
Sept. 18
Spark
Oct. 2
Office Hours
Oct. 9
Upcoming Webinars
Office Hours
Timeline
Recorded
Webinars:
Introduction
to YARN Ready
You can also visit
http://hortonworks.com/webinars/#librar
y
Sign up here:
For the Series, Individual webinar or office
hourshttp://info.hortonworks.com/YarnReady-BigData-Webcast-Series.html