SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Apache Tez : Accelerating Hadoop 
Data Processing 
Bikas Saha@bikassaha 
Š Hortonworks Inc. 2013 Page 1
Tez – Introduction 
Š Hortonworks Inc. 2013 
Page 2 
• Distributed execution framework targeted towards data-processing 
applications. 
• Based on expressing a computation as a dataflow graph. 
• Highly customizable to meet a broad spectrum of use cases. 
• Built on top of YARN – the resource management framework 
for Hadoop. 
• Open source Apache project and Apache licensed.
Tez – Hadoop 1 ---> Hadoop 2 Monolithic 
• Resource Management – MR 
• Execution Engine – MR 
Layered 
• Resource Management – YARN 
• Engines – Hive, Pig, Cascading, Your App!
Tez – Empowering Applications 
• Tez solves hard problems of running on a distributed Hadoop environment 
• Apps can focus on solving their domain specific problems 
• This design is important to be a platform for a variety of applications 
Š Hortonworks Inc. 2013 
Page 4 
App 
Tez 
• Custom application logic 
• Custom data format 
• Custom data transfer technology 
• Distributed parallel execution 
• Negotiating resources from the Hadoop framework 
• Fault tolerance and recovery 
• Horizontal scalability 
• Resource elasticity 
• Shared library of ready-to-use components 
• Built-in performance optimizations 
• Security
Tez – End User Benefits 
• Better performance of applications 
• Built-in performance + Application define optimizations 
• Better predictability of results 
• Minimization of overheads and queuing delays 
• Better utilization of compute capacity 
• Efficient use of allocated resources 
• Reduced load on distributed filesystem (HDFS) 
• Reduce unnecessary replicated writes 
Š Hortonworks Inc. 2013 
• Reduced network usage 
• Better locality and data transfer using new data patterns 
• Higher application developer productivity 
• Focus on application business logic rather than Hadoop internals 
Page 5
Tez – Design considerations 
Don’t solve problems that have already been solved. Or else 
you will have to solve them again! 
• Leverage discrete task based compute model for elasticity, scalability 
and fault tolerance 
• Leverage several man years of work in Hadoop Map-Reduce data 
shuffling operations 
• Leverage proven resource sharing and multi-tenancy model for Hadoop 
and YARN 
• Leverage built-in security mechanisms in Hadoop for privacy and 
isolation 
Š Hortonworks Inc. 2013 
Page 6 
Look to the Future with an eye on the Past
Tez – Problems that it addresses 
• Expressing the computation 
• Direct and elegant representation of the data processing flow 
• Interfacing with application code and new technologies 
Š Hortonworks Inc. 2013 
• Performance 
• Late Binding : Make decisions as late as possible using real data from at 
runtime 
• Leverage the resources of the cluster efficiently 
• Just work out of the box! 
• Customizable engine to let applications tailor the job to meet their specific 
requirements 
• Operation simplicity 
• Painless to operate, experiment and upgrade 
Page 7
Tez – Simplifying Operations 
• No deployments to do. No side effects. Easy and safe to try it out! 
• Tez is a completely client side application. 
• Simply upload to any accessible FileSystem and change local Tez configuration to 
point to that. 
• Enables running different versions concurrently. Easy to test new functionality 
while keeping stable versions for production. 
• Leverages YARN local resources. 
TezClient TezTask 
TezTask 
Š Hortonworks Inc. 2013 
Page 8 
Client 
Machine 
Node 
Manager 
Node 
Manager 
HDFS 
Tez Lib 1 Tez Lib 2 
TezClient 
Client 
Machine
Tez – Expressing the computation 
Distributed data processing jobs typically look like DAGs (Directed Acyclic 
Graph). 
• Vertices in the graph represent data transformations 
• Edges represent data movement from producers to consumers 
Š Hortonworks Inc. 2013 
Page 9 
Preprocessor Stage 
Partition Stage 
Aggregate Stage 
Sampler 
Task-1 Task-2 
Task-1 Task-2 
Task-1 Task-2 
Samples 
Ranges 
Distributed Sort
Tez – Expressing the computation 
Tez provides the following APIs to define the processing 
• DAG API 
• Defines the structure of the data processing and the relationship 
between producers and consumers 
• Enable definition of complex data flow pipelines using simple graph 
connection API’s. Tez expands the logical DAG at runtime 
• This is how the connection of tasks in the job gets specified 
Š Hortonworks Inc. 2013 
• Runtime API 
• Defines the interfaces using which the framework and app code interact 
with each other 
• App code transforms data and moves it between tasks 
• This is how we specify what actually executes in each task on the cluster 
nodes 
Page 10
Š Hortonworks Inc. 2013 
Tez – DAG API 
// Define DAG 
DAG dag = DAG.create(); 
// Define Vertex 
Vertex Map1 = Vertex.create(Processor.class); 
// Define Edge 
Edge edge = Edge.create(Map1, Reduce1, 
SCATTER_GATHER, PERSISTED, SEQUENTIAL, 
Output.class, Input.class); 
// Connect them 
dag.addVertex(Map1).addEdge(edge)… 
Page 11 
Defines the global processing flow 
Map1 Map2 
Scatter 
Gather 
Reduce1 Reduce2 
Scatter 
Gather 
Join
Š Hortonworks Inc. 2013 
Tez – DAG API 
Data movement – Defines routing of data between 
tasks 
• One-To-One : Data from the ith producer task routes to 
the ith consumer task. 
• Broadcast : Data from a producer task routes to all 
consumer tasks. 
• Scatter-Gather : Producer tasks scatter data into shards 
and consumer tasks gather the data. The ith shard from 
all producer tasks routes to the ith consumer task. 
Page 12 
Edge properties define the connection between producer and 
consumer tasks in the DAG
Tez – Logical DAG expansion at 
Runtime 
Š Hortonworks Inc. 2013 
Page 13 
Reduce1 
Map2 
Reduce2 
Join 
Map1
Tez – Runtime API 
Flexible Inputs-Processor-Outputs Model 
• Thin API layer to wrap around arbitrary application code 
• Compose inputs, processor and outputs to execute arbitrary processing 
• Applications decide logical data format and data transfer technology 
• Customize for performance 
• Built-in implementations for Hadoop 2.0 data services – HDFS and 
YARN ShuffleService. Built on the same API. Your impls are as first class 
as ours! 
Š Hortonworks Inc. 2013 
Page 14
Tez – Library of Inputs and Outputs 
Sorted 
Output 
Š Hortonworks Inc. 2013 
Classical ‘Map’ 
Page 15 
Classical ‘Reduce’ 
Map 
Processor 
HDFS 
Input 
Intermediate ‘Reduce’ for 
Map-Reduce-Reduce 
Reduce 
Processor 
Shuffle 
Input 
HDFS 
Output 
Reduce 
Processor 
Shuffle 
Input 
Sorted 
Output 
• What is built in? 
– Hadoop InputFormat/OutputFormat 
– SortedGroupedPartitioned Key-Value 
Input/Output 
– UnsortedGroupedPartitioned Key-Value 
Input/Output 
– Key-Value Input/Output
Tez – Composable Task Model 
Adopt Evolve Optimize 
HDFS 
Input 
Remote 
File 
Server 
Input 
Š Hortonworks Inc. 2013 
Native 
DB 
Input 
Page 16 
HDFS 
Input 
Remote 
File 
Server 
Input 
Hive Processor 
HDFS 
Output 
Local 
Disk 
Output 
Your Processor 
HDFS 
Output 
Local 
Disk 
Output 
RDMA 
Input 
Your Processor 
Kakfa 
Pub-Sub 
Output 
Amazon 
S3 
Output
Tez – Performance 
• Benefits of expressing the data processing as a DAG 
• Reducing overheads and queuing effects 
• Gives system the global picture for better planning 
• Efficient use of resources 
• Re-use resources to maximize utilization 
• Pre-launch, pre-warm and cache 
• Locality & resource aware scheduling 
• Support for application defined DAG modifications at runtime 
for optimized execution 
• Change task concurrency 
• Change task scheduling 
• Change DAG edges 
• Change DAG vertices (TBD) 
Š Hortonworks Inc. 2013 
Page 17
Tez – Benefits of DAG execution 
Faster Execution and Higher Predictability 
• Eliminate replicated write barrier between successive computations. 
• Eliminate job launch overhead of workflow jobs. 
• Eliminate extra stage of map reads in every workflow job. 
• Eliminate queue and resource contention suffered by workflow jobs that 
are started after a predecessor job completes. 
• Better locality because the engine has the global picture 
Š Hortonworks Inc. 2013 
Page 18 
Pig/Hive - MR 
Pig/Hive - Tez
Tez – Container Re-Use 
• Reuse YARN containers/JVMs to launch new tasks 
• Reduce scheduling and launching delays 
• Shared in-memory data across tasks 
• JVM JIT friendly execution 
Š Hortonworks Inc. 2013 
Page 19 
TezTask Host 
TezTask1 
TezTask2 
Shared Objects 
YARN Container / JVM 
Tez 
Application Master 
YARN Container 
Start Task 
Task Done 
Start Task
Š Hortonworks Inc. 2013 
Tez – Sessions 
Page 20 
Client 
Application Master 
Start 
Session 
Submit 
DAG 
Task Scheduler 
Container Pool 
Shared 
Object 
Registry 
Pre 
Warmed 
JVM 
Sessions 
• Standard concepts of pre-launch 
and pre-warm applied 
• Key for interactive queries 
• Represents a connection between 
the user and the cluster 
• Multiple DAGs executed in the 
same session 
• Containers re-used across queries 
• Takes care of data locality and 
releasing resources when idle
Tez – Current status 
Š Hortonworks Inc. 2013 
• Apache Project 
–Rapid development. Over 1100 jiras opened. Over 800 resolved 
–Growing community of contributors and users 
• 0.5 being released. In voting for release 
–Developer release focused ease of development 
– Stable API 
–Better debugging 
–Documentation and code samples 
• Support for a vast topology of DAGs 
• Being used by multiple applications such as Apache Hive, 
Apache Pig, Cascading, Scalding 
Page 21
Tez – Adoption Path 
Pre-requisite : Hadoop 2 with YARN 
Tez has zero deployment pain. No side effects or traces left 
behind on your cluster. Low risk and low effort to try out. 
• Using Hive, Pig, Cascading, Scalding 
• Try them with Tez as execution engine 
• Already have MapReduce based pipeline 
• Use configuration to change MapReduce to run on Tez by setting 
‘mapreduce.framework.name’ to ‘yarn-tez’ in mapred-site.xml 
• Consolidate MR workflow into MR-DAG on Tez 
• Change MR-DAG to use more efficient Tez constructs 
Š Hortonworks Inc. 2013 
• Have custom pipeline 
• Wrap custom code/scripts into Tez inputs-processor-outputs 
• Translate your custom pipeline topology into Tez DAG 
• Change custom topology to use more efficient Tez constructs 
Page 22
Tez – Theory to Practice 
Š Hortonworks Inc. 2013 
• Performance 
• Scalability 
Page 23
Tez – Hive TPC-DS Scale 200GB latency 
Š Hortonworks Inc. 2013
Tez – Pig performance gains 
• Demonstrates performance gains from a basic translation to a 
Tez DAG 
160 
140 
120 
100 
80 
60 
40 
20 
0 
Prod script 1 
25m vs 10m 
5 MR Jobs 
Prod script 2 
34m vs 16m 
5 MR Jobs 
Prod script 3 
1h 46m vs 48m 
12 MR Jobs 
Prod script 4 
2h 22m vs 1h 
21m 
15 MR jobs 
Time in mins 
MR 
Tez
Tez – Observations on Performance 
Š Hortonworks Inc. 2013 
• Number of stages in the DAG 
• Higher the number of stages in the DAG, performance of Tez (over MR) will be 
better. 
• Cluster/queue capacity 
• More congested a queue is, the performance of Tez (over MR) will be better due 
to container reuse. 
• Size of intermediate output 
• More the size of intermediate output, the performance of Tez (over MR) will be 
better due to reduced HDFS usage. 
• Size of data in the job 
• For smaller data and more stages, the performance of Tez (over MR) will be 
better as percentage of launch overhead in the total time is high for smaller 
jobs. 
• Offload work to the cluster 
• Move as much work as possible to the cluster by modelling it via the job DAG. 
Exploit the parallelism and resources of the cluster. E.g. MR split calculation. 
• Vertex caching 
• The more re-computation can be avoided the better is the performance. 
Page 26
Tez – Data at scale 
Š Hortonworks Inc. 2013 
Hive TPC-DS 
Scale 10TB 
Page 27
Tez – DAG definition at scale 
Š Hortonworks Inc. 2013 
Page 28 
Hive : TPC-DS Query 64 Logical DAG
Tez – Container Reuse at Scale 
• 78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4) 
Š Hortonworks Inc. 2013 
Page 29
Tez – Bridging the Data Spectrum 
Š Hortonworks Inc. 2013 
Page 30 
Fact Table 
Dimension 
Table 1 
Result 
Table 1 
Dimension 
Table 2 
Result 
Table 2 
Dimension 
Table 3 
Result 
Table 3 
Broadcast 
Join 
Shuffle 
Join 
Typical pattern in a 
TPC-DS query 
Fact Table 
Dimension 
Table 1 
Dimension 
Table 1 
Dimension 
Table 1 
Broadcast join 
for small data sets 
Based on data size, 
the query optimizer 
can run either plan as 
a single Tez job 
Broadcast 
Join
Tez – Community 
• Early adopters and code contributors welcome 
– Adopters to drive more scenarios. Contributors to make them happen. 
• Tez meetup for developers and users 
– http://www.meetup.com/Apache-Tez-User-Group 
Š Hortonworks Inc. 2013 
• Technical blog series 
– http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing 
• Useful links 
–Work tracking: https://issues.apache.org/jira/browse/TEZ 
– Code: https://github.com/apache/tez 
– Developer list: dev@tez.apache.org 
User list: user@tez.apache.org 
Issues list: issues@tez.apache.org 
Page 31
Tez – Takeaways 
• Distributed execution framework that works on computations 
represented as dataflow graphs 
• Customizable execution architecture designed to enable 
dynamic performance optimizations at runtime 
•Works out of the box with the platform figuring out the hard 
stuff 
• Span the spectrum of interactive latency to batch, small to 
large 
• Open source Apache project – your use-cases and code are 
welcome 
• It works and is already being used by Hive and Pig 
Š Hortonworks Inc. 2013 
Page 32
Š Hortonworks Inc. 2013 
Tez 
Thanks for your time and attention! 
Video with Deep Dive on Tez 
http://goo.gl/BL67o7 
http://www.infoq.com/presentations/apache-tez 
Questions? 
@bikassaha 
Page 33
Next Steps
Next Steps 
1. Review YARN Resources 
2. Review webinar recording 
3. Attend Office Hours 
4. Attend the next YARN 
webinar
Resources 
Setup HDP 2.1 environment 
• Leverage Sandbox: Hortonworks.com/Sandbox 
Get Started with YARN 
• http://hortonworks.com/get-started/YARN 
Video Deep Dive on Tez 
• http://goo.gl/BL67o7 and http://www.infoq.com/presentations/apache-tez 
Tez meetup for developers and users 
• http://www.meetup.com/Apache-Tez-User-Group 
Technical blog series 
http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing 
Useful links 
Work tracking: https://issues.apache.org/jira/browse/TEZ 
Code: https://github.com/apache/tez 
Developer list: dev@tez.apache.org 
User list: user@tez.apache.org
Hortonworks Office Hours 
YARN Office Hours 
Dial in and chat with YARN experts 
We plan Office Hours for September 11th and October 9th @ 10am PT (2nd 
Thursdays) 
Invitations will go out to those that attended or reviewed YARN webinars 
These will also be posted to hortonworks.com/webinars 
Registration required.
YARN Ready Webinar Schedule 
Native 
Integration 
Slider 
Office Hours 
August 14 
Tez 
August 21 
Ambari 
Sept. 4 
Office Hours 
Sept. 11 
Scalding 
Sept. 18 
Spark 
Oct. 2 
Office Hours 
Oct. 9 
Upcoming Webinars 
Office Hours 
Timeline 
Recorded 
Webinars: 
Introduction 
to YARN Ready 
You can also visit 
http://hortonworks.com/webinars/#librar 
y 
Sign up here: 
For the Series, Individual webinar or office 
hourshttp://info.hortonworks.com/YarnReady-BigData-Webcast-Series.html
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing DataWorks Summit
 
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話Yahoo!デベロッパーネットワーク
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
HBase replication
HBase replicationHBase replication
HBase replicationwchevreuil
 
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)NTT DATA Technology & Innovation
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveDataWorks Summit
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Cloudera Japan
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBasealexbaranau
 
Apache Spark の紹介(前半:Sparkのキホン)
Apache Spark の紹介(前半:Sparkのキホン)Apache Spark の紹介(前半:Sparkのキホン)
Apache Spark の紹介(前半:Sparkのキホン)NTT DATA OSS Professional Services
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimDatabricks
 
Cephのベンチマークをしました
CephのベンチマークをしましたCephのベンチマークをしました
CephのベンチマークをしましたOSSラボ株式会社
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcachedJurriaan Persyn
 
Apache Bigtopによるオープンなビッグデータ処理基盤の構築(オープンデベロッパーズカンファレンス 2021 Online 発表資料)
Apache Bigtopによるオープンなビッグデータ処理基盤の構築(オープンデベロッパーズカンファレンス 2021 Online 発表資料)Apache Bigtopによるオープンなビッグデータ処理基盤の構築(オープンデベロッパーズカンファレンス 2021 Online 発表資料)
Apache Bigtopによるオープンなビッグデータ処理基盤の構築(オープンデベロッパーズカンファレンス 2021 Online 発表資料)NTT DATA Technology & Innovation
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesDataWorks Summit
 
Hadoop -NameNode HAの仕組み-
Hadoop -NameNode HAの仕組み-Hadoop -NameNode HAの仕組み-
Hadoop -NameNode HAの仕組み-Yuki Gonda
 

Was ist angesagt? (20)

HDFS Selective Wire Encryption
HDFS Selective Wire EncryptionHDFS Selective Wire Encryption
HDFS Selective Wire Encryption
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
1000台規模のHadoopクラスタをHive/Tezアプリケーションにあわせてパフォーマンスチューニングした話
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
HBase replication
HBase replicationHBase replication
HBase replication
 
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)
BigtopでHadoopをビルドする(Open Source Conference 2021 Online/Spring 発表資料)
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
LLAP: long-lived execution in Hive
LLAP: long-lived execution in HiveLLAP: long-lived execution in Hive
LLAP: long-lived execution in Hive
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Apache Spark の紹介(前半:Sparkのキホン)
Apache Spark の紹介(前半:Sparkのキホン)Apache Spark の紹介(前半:Sparkのキホン)
Apache Spark の紹介(前半:Sparkのキホン)
 
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon KimHDFS on Kubernetes—Lessons Learned with Kimoon Kim
HDFS on Kubernetes—Lessons Learned with Kimoon Kim
 
Cephのベンチマークをしました
CephのベンチマークをしましたCephのベンチマークをしました
Cephのベンチマークをしました
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Apache Bigtopによるオープンなビッグデータ処理基盤の構築(オープンデベロッパーズカンファレンス 2021 Online 発表資料)
Apache Bigtopによるオープンなビッグデータ処理基盤の構築(オープンデベロッパーズカンファレンス 2021 Online 発表資料)Apache Bigtopによるオープンなビッグデータ処理基盤の構築(オープンデベロッパーズカンファレンス 2021 Online 発表資料)
Apache Bigtopによるオープンなビッグデータ処理基盤の構築(オープンデベロッパーズカンファレンス 2021 Online 発表資料)
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
Storage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on KubernetesStorage Requirements and Options for Running Spark on Kubernetes
Storage Requirements and Options for Running Spark on Kubernetes
 
Hadoop -NameNode HAの仕組み-
Hadoop -NameNode HAの仕組み-Hadoop -NameNode HAの仕組み-
Hadoop -NameNode HAの仕組み-
 

Andere mochten auch

Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...
Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...
Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...Hortonworks
 
Introduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramIntroduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramHortonworks
 
Hortonworks Yarn Code Walk Through January 2014
Hortonworks Yarn Code Walk Through January 2014Hortonworks Yarn Code Walk Through January 2014
Hortonworks Yarn Code Walk Through January 2014Hortonworks
 
YARN Ready - Integrating to YARN using Slider Webinar
YARN Ready - Integrating to YARN using Slider WebinarYARN Ready - Integrating to YARN using Slider Webinar
YARN Ready - Integrating to YARN using Slider WebinarHortonworks
 
Hortonworks Technical Workshop - build a yarn ready application with apache ...
Hortonworks Technical Workshop -  build a yarn ready application with apache ...Hortonworks Technical Workshop -  build a yarn ready application with apache ...
Hortonworks Technical Workshop - build a yarn ready application with apache ...Hortonworks
 
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextDiscover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextHortonworks
 
Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014Hortonworks
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSHortonworks
 
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopDiscover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopHortonworks
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopHortonworks
 
Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...
Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...
Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...Hortonworks
 
Discover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchDiscover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchHortonworks
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN ApplicationsHortonworks
 
Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25Hortonworks
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks
 
Hortonworks sqrrl webinar v5.pptx
Hortonworks sqrrl webinar v5.pptxHortonworks sqrrl webinar v5.pptx
Hortonworks sqrrl webinar v5.pptxHortonworks
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNHortonworks
 
Apache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNApache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNHortonworks
 
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...Hortonworks
 
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceHortonworks
 

Andere mochten auch (20)

Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...
Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...
Accelerate Big Data Application Development with Cascading and HDP, Hortonwor...
 
Introduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready ProgramIntroduction to the Hortonworks YARN Ready Program
Introduction to the Hortonworks YARN Ready Program
 
Hortonworks Yarn Code Walk Through January 2014
Hortonworks Yarn Code Walk Through January 2014Hortonworks Yarn Code Walk Through January 2014
Hortonworks Yarn Code Walk Through January 2014
 
YARN Ready - Integrating to YARN using Slider Webinar
YARN Ready - Integrating to YARN using Slider WebinarYARN Ready - Integrating to YARN using Slider Webinar
YARN Ready - Integrating to YARN using Slider Webinar
 
Hortonworks Technical Workshop - build a yarn ready application with apache ...
Hortonworks Technical Workshop -  build a yarn ready application with apache ...Hortonworks Technical Workshop -  build a yarn ready application with apache ...
Hortonworks Technical Workshop - build a yarn ready application with apache ...
 
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.nextDiscover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
Discover HDP 2.2: Even Faster SQL Queries with Apache Hive and Stinger.next
 
Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014Developing YARN Applications - Integrating natively to YARN July 24 2014
Developing YARN Applications - Integrating natively to YARN July 24 2014
 
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFSDiscover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
Discover HDP 2.1: Apache Hadoop 2.4.0, YARN & HDFS
 
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in HadoopDiscover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
Discover HDP2.1: Apache Storm for Stream Data Processing in Hadoop
 
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in HadoopDiscover HDP 2.1: Apache Falcon for Data Governance in Hadoop
Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
 
Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...
Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...
Discover hdp 2.2: Data storage innovations in Hadoop Distributed Filesystem (...
 
Discover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop SearchDiscover HDP 2.1: Apache Solr for Hadoop Search
Discover HDP 2.1: Apache Solr for Hadoop Search
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
 
Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25Dataguise hortonworks insurance_feb25
Dataguise hortonworks insurance_feb25
 
Hortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - WebinarHortonworks and Platfora in Financial Services - Webinar
Hortonworks and Platfora in Financial Services - Webinar
 
Hortonworks sqrrl webinar v5.pptx
Hortonworks sqrrl webinar v5.pptxHortonworks sqrrl webinar v5.pptx
Hortonworks sqrrl webinar v5.pptx
 
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARNYARN webinar series: Using Scalding to write applications to Hadoop and YARN
YARN webinar series: Using Scalding to write applications to Hadoop and YARN
 
Apache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARNApache Ambari: Managing Hadoop and YARN
Apache Ambari: Managing Hadoop and YARN
 
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
C-BAG Big Data Meetup Chennai Oct.29-2014 Hortonworks and Concurrent on Casca...
 
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data GovernanceDiscover HDP 2.2: Apache Falcon for Hadoop Data Governance
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
 

Ähnlich wie YARN Ready: Integrating to YARN with Tez

Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processinghitesh1892
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaData Con LA
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing enginebigdatagurus_meetup
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingTeddy Choi
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesYahoo Developer Network
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Data Con LA
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Modern Data Stack France
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : BeginnersShweta Patnaik
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutionssolarisyougood
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2Raul Chong
 

Ähnlich wie YARN Ready: Integrating to YARN with Tez (20)

Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Apache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data ProcessingApache Tez - Accelerating Hadoop Data Processing
Apache Tez - Accelerating Hadoop Data Processing
 
Tez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_sahaTez big datacamp-la-bikas_saha
Tez big datacamp-la-bikas_saha
 
Apache Tez -- A modern processing engine
Apache Tez -- A modern processing engineApache Tez -- A modern processing engine
Apache Tez -- A modern processing engine
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
February 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and InsidesFebruary 2014 HUG : Tez Details and Insides
February 2014 HUG : Tez Details and Insides
 
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
Big Data Day LA 2015 - What's new and next in Apache Tez by Bikas Saha of Hor...
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
Introduction sur Tez par Olivier RENAULT de HortonWorks Meetup du 25/11/2014
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Apache hadoop technology : Beginners
Apache hadoop technology : BeginnersApache hadoop technology : Beginners
Apache hadoop technology : Beginners
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Oracle big data appliance and solutions
Oracle big data appliance and solutionsOracle big data appliance and solutions
Oracle big data appliance and solutions
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
IBM - Introduction to Cloudant
IBM - Introduction to CloudantIBM - Introduction to Cloudant
IBM - Introduction to Cloudant
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 

Mehr von Hortonworks

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyHortonworks
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakHortonworks
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsHortonworks
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysHortonworks
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's NewHortonworks
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerHortonworks
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsHortonworks
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeHortonworks
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidHortonworks
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleHortonworks
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATAHortonworks
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Hortonworks
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseHortonworks
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseHortonworks
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationHortonworks
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementHortonworks
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHortonworks
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCHortonworks
 

Mehr von Hortonworks (20)

Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next LevelHortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
Hortonworks DataFlow (HDF) 3.3 - Taking Stream Processing to the Next Level
 
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT StrategyIoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
IoT Predictions for 2019 and Beyond: Data at the Heart of Your IoT Strategy
 
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with CloudbreakGetting the Most Out of Your Data in the Cloud with Cloudbreak
Getting the Most Out of Your Data in the Cloud with Cloudbreak
 
Johns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log EventsJohns Hopkins - Using Hadoop to Secure Access Log Events
Johns Hopkins - Using Hadoop to Secure Access Log Events
 
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad GuysCatch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
Catch a Hacker in Real-Time: Live Visuals of Bots and Bad Guys
 
HDF 3.2 - What's New
HDF 3.2 - What's NewHDF 3.2 - What's New
HDF 3.2 - What's New
 
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging ManagerCuring Kafka Blindness with Hortonworks Streams Messaging Manager
Curing Kafka Blindness with Hortonworks Streams Messaging Manager
 
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical EnvironmentsInterpretation Tool for Genomic Sequencing Data in Clinical Environments
Interpretation Tool for Genomic Sequencing Data in Clinical Environments
 
IBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data LandscapeIBM+Hortonworks = Transformation of the Big Data Landscape
IBM+Hortonworks = Transformation of the Big Data Landscape
 
Premier Inside-Out: Apache Druid
Premier Inside-Out: Apache DruidPremier Inside-Out: Apache Druid
Premier Inside-Out: Apache Druid
 
Accelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at ScaleAccelerating Data Science and Real Time Analytics at Scale
Accelerating Data Science and Real Time Analytics at Scale
 
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATATIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
TIME SERIES: APPLYING ADVANCED ANALYTICS TO INDUSTRIAL PROCESS DATA
 
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
Blockchain with Machine Learning Powered by Big Data: Trimble Transportation ...
 
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: ClearsenseDelivering Real-Time Streaming Data for Healthcare Customers: Clearsense
Delivering Real-Time Streaming Data for Healthcare Customers: Clearsense
 
Making Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with EaseMaking Enterprise Big Data Small with Ease
Making Enterprise Big Data Small with Ease
 
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World PresentationWebinewbie to Webinerd in 30 Days - Webinar World Presentation
Webinewbie to Webinerd in 30 Days - Webinar World Presentation
 
Driving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data ManagementDriving Digital Transformation Through Global Data Management
Driving Digital Transformation Through Global Data Management
 
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming FeaturesHDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
HDF 3.1 pt. 2: A Technical Deep-Dive on New Streaming Features
 
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
Hortonworks DataFlow (HDF) 3.1 - Redefining Data-In-Motion with Modern Data A...
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 

KĂźrzlich hochgeladen

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel AraĂşjo
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 

KĂźrzlich hochgeladen (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 

YARN Ready: Integrating to YARN with Tez

  • 1. Apache Tez : Accelerating Hadoop Data Processing Bikas Saha@bikassaha Š Hortonworks Inc. 2013 Page 1
  • 2. Tez – Introduction Š Hortonworks Inc. 2013 Page 2 • Distributed execution framework targeted towards data-processing applications. • Based on expressing a computation as a dataflow graph. • Highly customizable to meet a broad spectrum of use cases. • Built on top of YARN – the resource management framework for Hadoop. • Open source Apache project and Apache licensed.
  • 3. Tez – Hadoop 1 ---> Hadoop 2 Monolithic • Resource Management – MR • Execution Engine – MR Layered • Resource Management – YARN • Engines – Hive, Pig, Cascading, Your App!
  • 4. Tez – Empowering Applications • Tez solves hard problems of running on a distributed Hadoop environment • Apps can focus on solving their domain specific problems • This design is important to be a platform for a variety of applications Š Hortonworks Inc. 2013 Page 4 App Tez • Custom application logic • Custom data format • Custom data transfer technology • Distributed parallel execution • Negotiating resources from the Hadoop framework • Fault tolerance and recovery • Horizontal scalability • Resource elasticity • Shared library of ready-to-use components • Built-in performance optimizations • Security
  • 5. Tez – End User Benefits • Better performance of applications • Built-in performance + Application define optimizations • Better predictability of results • Minimization of overheads and queuing delays • Better utilization of compute capacity • Efficient use of allocated resources • Reduced load on distributed filesystem (HDFS) • Reduce unnecessary replicated writes Š Hortonworks Inc. 2013 • Reduced network usage • Better locality and data transfer using new data patterns • Higher application developer productivity • Focus on application business logic rather than Hadoop internals Page 5
  • 6. Tez – Design considerations Don’t solve problems that have already been solved. Or else you will have to solve them again! • Leverage discrete task based compute model for elasticity, scalability and fault tolerance • Leverage several man years of work in Hadoop Map-Reduce data shuffling operations • Leverage proven resource sharing and multi-tenancy model for Hadoop and YARN • Leverage built-in security mechanisms in Hadoop for privacy and isolation Š Hortonworks Inc. 2013 Page 6 Look to the Future with an eye on the Past
  • 7. Tez – Problems that it addresses • Expressing the computation • Direct and elegant representation of the data processing flow • Interfacing with application code and new technologies Š Hortonworks Inc. 2013 • Performance • Late Binding : Make decisions as late as possible using real data from at runtime • Leverage the resources of the cluster efficiently • Just work out of the box! • Customizable engine to let applications tailor the job to meet their specific requirements • Operation simplicity • Painless to operate, experiment and upgrade Page 7
  • 8. Tez – Simplifying Operations • No deployments to do. No side effects. Easy and safe to try it out! • Tez is a completely client side application. • Simply upload to any accessible FileSystem and change local Tez configuration to point to that. • Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production. • Leverages YARN local resources. TezClient TezTask TezTask Š Hortonworks Inc. 2013 Page 8 Client Machine Node Manager Node Manager HDFS Tez Lib 1 Tez Lib 2 TezClient Client Machine
  • 9. Tez – Expressing the computation Distributed data processing jobs typically look like DAGs (Directed Acyclic Graph). • Vertices in the graph represent data transformations • Edges represent data movement from producers to consumers Š Hortonworks Inc. 2013 Page 9 Preprocessor Stage Partition Stage Aggregate Stage Sampler Task-1 Task-2 Task-1 Task-2 Task-1 Task-2 Samples Ranges Distributed Sort
  • 10. Tez – Expressing the computation Tez provides the following APIs to define the processing • DAG API • Defines the structure of the data processing and the relationship between producers and consumers • Enable definition of complex data flow pipelines using simple graph connection API’s. Tez expands the logical DAG at runtime • This is how the connection of tasks in the job gets specified Š Hortonworks Inc. 2013 • Runtime API • Defines the interfaces using which the framework and app code interact with each other • App code transforms data and moves it between tasks • This is how we specify what actually executes in each task on the cluster nodes Page 10
  • 11. Š Hortonworks Inc. 2013 Tez – DAG API // Define DAG DAG dag = DAG.create(); // Define Vertex Vertex Map1 = Vertex.create(Processor.class); // Define Edge Edge edge = Edge.create(Map1, Reduce1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class); // Connect them dag.addVertex(Map1).addEdge(edge)… Page 11 Defines the global processing flow Map1 Map2 Scatter Gather Reduce1 Reduce2 Scatter Gather Join
  • 12. Š Hortonworks Inc. 2013 Tez – DAG API Data movement – Defines routing of data between tasks • One-To-One : Data from the ith producer task routes to the ith consumer task. • Broadcast : Data from a producer task routes to all consumer tasks. • Scatter-Gather : Producer tasks scatter data into shards and consumer tasks gather the data. The ith shard from all producer tasks routes to the ith consumer task. Page 12 Edge properties define the connection between producer and consumer tasks in the DAG
  • 13. Tez – Logical DAG expansion at Runtime Š Hortonworks Inc. 2013 Page 13 Reduce1 Map2 Reduce2 Join Map1
  • 14. Tez – Runtime API Flexible Inputs-Processor-Outputs Model • Thin API layer to wrap around arbitrary application code • Compose inputs, processor and outputs to execute arbitrary processing • Applications decide logical data format and data transfer technology • Customize for performance • Built-in implementations for Hadoop 2.0 data services – HDFS and YARN ShuffleService. Built on the same API. Your impls are as first class as ours! Š Hortonworks Inc. 2013 Page 14
  • 15. Tez – Library of Inputs and Outputs Sorted Output Š Hortonworks Inc. 2013 Classical ‘Map’ Page 15 Classical ‘Reduce’ Map Processor HDFS Input Intermediate ‘Reduce’ for Map-Reduce-Reduce Reduce Processor Shuffle Input HDFS Output Reduce Processor Shuffle Input Sorted Output • What is built in? – Hadoop InputFormat/OutputFormat – SortedGroupedPartitioned Key-Value Input/Output – UnsortedGroupedPartitioned Key-Value Input/Output – Key-Value Input/Output
  • 16. Tez – Composable Task Model Adopt Evolve Optimize HDFS Input Remote File Server Input Š Hortonworks Inc. 2013 Native DB Input Page 16 HDFS Input Remote File Server Input Hive Processor HDFS Output Local Disk Output Your Processor HDFS Output Local Disk Output RDMA Input Your Processor Kakfa Pub-Sub Output Amazon S3 Output
  • 17. Tez – Performance • Benefits of expressing the data processing as a DAG • Reducing overheads and queuing effects • Gives system the global picture for better planning • Efficient use of resources • Re-use resources to maximize utilization • Pre-launch, pre-warm and cache • Locality & resource aware scheduling • Support for application defined DAG modifications at runtime for optimized execution • Change task concurrency • Change task scheduling • Change DAG edges • Change DAG vertices (TBD) Š Hortonworks Inc. 2013 Page 17
  • 18. Tez – Benefits of DAG execution Faster Execution and Higher Predictability • Eliminate replicated write barrier between successive computations. • Eliminate job launch overhead of workflow jobs. • Eliminate extra stage of map reads in every workflow job. • Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes. • Better locality because the engine has the global picture Š Hortonworks Inc. 2013 Page 18 Pig/Hive - MR Pig/Hive - Tez
  • 19. Tez – Container Re-Use • Reuse YARN containers/JVMs to launch new tasks • Reduce scheduling and launching delays • Shared in-memory data across tasks • JVM JIT friendly execution Š Hortonworks Inc. 2013 Page 19 TezTask Host TezTask1 TezTask2 Shared Objects YARN Container / JVM Tez Application Master YARN Container Start Task Task Done Start Task
  • 20. Š Hortonworks Inc. 2013 Tez – Sessions Page 20 Client Application Master Start Session Submit DAG Task Scheduler Container Pool Shared Object Registry Pre Warmed JVM Sessions • Standard concepts of pre-launch and pre-warm applied • Key for interactive queries • Represents a connection between the user and the cluster • Multiple DAGs executed in the same session • Containers re-used across queries • Takes care of data locality and releasing resources when idle
  • 21. Tez – Current status Š Hortonworks Inc. 2013 • Apache Project –Rapid development. Over 1100 jiras opened. Over 800 resolved –Growing community of contributors and users • 0.5 being released. In voting for release –Developer release focused ease of development – Stable API –Better debugging –Documentation and code samples • Support for a vast topology of DAGs • Being used by multiple applications such as Apache Hive, Apache Pig, Cascading, Scalding Page 21
  • 22. Tez – Adoption Path Pre-requisite : Hadoop 2 with YARN Tez has zero deployment pain. No side effects or traces left behind on your cluster. Low risk and low effort to try out. • Using Hive, Pig, Cascading, Scalding • Try them with Tez as execution engine • Already have MapReduce based pipeline • Use configuration to change MapReduce to run on Tez by setting ‘mapreduce.framework.name’ to ‘yarn-tez’ in mapred-site.xml • Consolidate MR workflow into MR-DAG on Tez • Change MR-DAG to use more efficient Tez constructs Š Hortonworks Inc. 2013 • Have custom pipeline • Wrap custom code/scripts into Tez inputs-processor-outputs • Translate your custom pipeline topology into Tez DAG • Change custom topology to use more efficient Tez constructs Page 22
  • 23. Tez – Theory to Practice Š Hortonworks Inc. 2013 • Performance • Scalability Page 23
  • 24. Tez – Hive TPC-DS Scale 200GB latency Š Hortonworks Inc. 2013
  • 25. Tez – Pig performance gains • Demonstrates performance gains from a basic translation to a Tez DAG 160 140 120 100 80 60 40 20 0 Prod script 1 25m vs 10m 5 MR Jobs Prod script 2 34m vs 16m 5 MR Jobs Prod script 3 1h 46m vs 48m 12 MR Jobs Prod script 4 2h 22m vs 1h 21m 15 MR jobs Time in mins MR Tez
  • 26. Tez – Observations on Performance Š Hortonworks Inc. 2013 • Number of stages in the DAG • Higher the number of stages in the DAG, performance of Tez (over MR) will be better. • Cluster/queue capacity • More congested a queue is, the performance of Tez (over MR) will be better due to container reuse. • Size of intermediate output • More the size of intermediate output, the performance of Tez (over MR) will be better due to reduced HDFS usage. • Size of data in the job • For smaller data and more stages, the performance of Tez (over MR) will be better as percentage of launch overhead in the total time is high for smaller jobs. • Offload work to the cluster • Move as much work as possible to the cluster by modelling it via the job DAG. Exploit the parallelism and resources of the cluster. E.g. MR split calculation. • Vertex caching • The more re-computation can be avoided the better is the performance. Page 26
  • 27. Tez – Data at scale Š Hortonworks Inc. 2013 Hive TPC-DS Scale 10TB Page 27
  • 28. Tez – DAG definition at scale Š Hortonworks Inc. 2013 Page 28 Hive : TPC-DS Query 64 Logical DAG
  • 29. Tez – Container Reuse at Scale • 78 vertices + 8374 tasks on 50 containers (TPC-DS Query 4) Š Hortonworks Inc. 2013 Page 29
  • 30. Tez – Bridging the Data Spectrum Š Hortonworks Inc. 2013 Page 30 Fact Table Dimension Table 1 Result Table 1 Dimension Table 2 Result Table 2 Dimension Table 3 Result Table 3 Broadcast Join Shuffle Join Typical pattern in a TPC-DS query Fact Table Dimension Table 1 Dimension Table 1 Dimension Table 1 Broadcast join for small data sets Based on data size, the query optimizer can run either plan as a single Tez job Broadcast Join
  • 31. Tez – Community • Early adopters and code contributors welcome – Adopters to drive more scenarios. Contributors to make them happen. • Tez meetup for developers and users – http://www.meetup.com/Apache-Tez-User-Group Š Hortonworks Inc. 2013 • Technical blog series – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing • Useful links –Work tracking: https://issues.apache.org/jira/browse/TEZ – Code: https://github.com/apache/tez – Developer list: dev@tez.apache.org User list: user@tez.apache.org Issues list: issues@tez.apache.org Page 31
  • 32. Tez – Takeaways • Distributed execution framework that works on computations represented as dataflow graphs • Customizable execution architecture designed to enable dynamic performance optimizations at runtime •Works out of the box with the platform figuring out the hard stuff • Span the spectrum of interactive latency to batch, small to large • Open source Apache project – your use-cases and code are welcome • It works and is already being used by Hive and Pig Š Hortonworks Inc. 2013 Page 32
  • 33. Š Hortonworks Inc. 2013 Tez Thanks for your time and attention! Video with Deep Dive on Tez http://goo.gl/BL67o7 http://www.infoq.com/presentations/apache-tez Questions? @bikassaha Page 33
  • 35. Next Steps 1. Review YARN Resources 2. Review webinar recording 3. Attend Office Hours 4. Attend the next YARN webinar
  • 36. Resources Setup HDP 2.1 environment • Leverage Sandbox: Hortonworks.com/Sandbox Get Started with YARN • http://hortonworks.com/get-started/YARN Video Deep Dive on Tez • http://goo.gl/BL67o7 and http://www.infoq.com/presentations/apache-tez Tez meetup for developers and users • http://www.meetup.com/Apache-Tez-User-Group Technical blog series http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing Useful links Work tracking: https://issues.apache.org/jira/browse/TEZ Code: https://github.com/apache/tez Developer list: dev@tez.apache.org User list: user@tez.apache.org
  • 37. Hortonworks Office Hours YARN Office Hours Dial in and chat with YARN experts We plan Office Hours for September 11th and October 9th @ 10am PT (2nd Thursdays) Invitations will go out to those that attended or reviewed YARN webinars These will also be posted to hortonworks.com/webinars Registration required.
  • 38. YARN Ready Webinar Schedule Native Integration Slider Office Hours August 14 Tez August 21 Ambari Sept. 4 Office Hours Sept. 11 Scalding Sept. 18 Spark Oct. 2 Office Hours Oct. 9 Upcoming Webinars Office Hours Timeline Recorded Webinars: Introduction to YARN Ready You can also visit http://hortonworks.com/webinars/#librar y Sign up here: For the Series, Individual webinar or office hourshttp://info.hortonworks.com/YarnReady-BigData-Webcast-Series.html

Hinweis der Redaktion

  1. 1.5x to 3x speedup on some of the Pigmix queries.
  2. Register for Office Hours: https://hortonworks.webex.com/hortonworks/onstage/g.php?t=a&d=628190636