Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Apache Tez
Bikas Saha @bikassaha
Apache Hadoop YARN and HDFS
Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
The Data Operating System for Hadoop 2.x
Data Processing Engines Run Natively IN Hadoop
BATCH: MapReduce
LOG STORE: Kafka
STREAMING: Storm
IN-MEMORY: Spark
GRAPH: Giraph
SAS: LASR, HPA
ONLINE: HBase, Accumulo
OTHERS
HDFS: Redundant, Reliable Storage
YARN: Cluster Resource Management
Tez
• APIs and libraries to create data processing applications on YARN
• Customizable and adaptable DAG definition
• Orchestration framework to execute the DAG in a Hadoop cluster
• NOT a general purpose execution engine
Open Source
Apache Project
Tez – Goals
• Tez solves the hard problems of running on a distributed Hadoop environment
• Apps can focus on solving their domain specific problems
• Tez instantiates the physical execution structure. App fills in logic and behavior
• API targets data processing specified as a data flow graph
App
• Custom application logic
• Custom data format
• Custom data transfer technology
Tez
• Distributed parallel execution
• Negotiating resources from the Hadoop framework
• Fault tolerance and recovery
• Shared library of ready-to-use components
• Built-in performance optimizations
• Hadoop Security
Tez – Adoption
• Apache Hive
– Most popular SQL-like interface for data in Hadoop
• Apache Pig
– Scripting language used in some of the largest Hadoop installations
• Apache Flink (Stratosphere project from TU Berlin)
– General purpose engine with language integrated data processing API
• Cascading + Scalding
– Language integrated data processing API in Java/Scala
• Commercial Products
– Datameer, Syncsort, and others in progress
Tez – Performance benefits
• Apache Hive
– Order of magnitude improvement in performance
– Speed up mainly from flexible DAG definition and runtime graph reconfiguration
– Performance oriented orchestration layer and shared library components
[Figure: logical DAG of Hive TPC-DS Query 64]
Tez – Scale and Reliability
• Apache Pig
– Pig accounts for the majority of data processing jobs at Yahoo, on clusters of up to 5000 nodes
– Multi-petabyte jobs
– On track to use Pig with Tez for all production Pig jobs
– Yahoo already uses Hive with Tez for large-scale analytics
• Hortonworks customers
– All new customers default to Hive with Tez
• Cascading + Scalding
– Cascading 3.0 released with Tez integration
– Very promising results with beta users
http://scalding.io/2015/05/scalding-cascading-tez-♥/
© Hortonworks Inc. 2013
Tez – DAG API
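// Simplified pseudocode from the slide: the released Tez API additionally takes
// DAG/vertex names and descriptor objects rather than bare classes.
// Define DAG
DAG dag = DAG.create();
// Define Vertices
Vertex Scan1 = Vertex.create(Processor.class);
Vertex Partition1 = Vertex.create(Processor.class);
// Define Edge: data movement, data source, and scheduling type, then the output/input classes
Edge edge = Edge.create(Scan1, Partition1,
    SCATTER_GATHER, PERSISTED, SEQUENTIAL,
    Output.class, Input.class);
// Connect them
dag.addVertex(Scan1).addVertex(Partition1).addEdge(edge); // … and so on for the remaining vertices and edges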
Defines the global logical processing flow
[Diagram: Scan1 → Partition1 and Scan2 → Partition2 over scatter-gather edges; Partition1 and Partition2 feed Join]
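The logical flow on this slide can be modeled in a few lines of plain Java. This is a toy sketch rather than the Tez API: the class `LogicalDag` and its methods are hypothetical names introduced for illustration, and the Partition → Join edges are assumed from the diagram. It records vertices and edges and derives one order in which the vertices could start.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a logical DAG (hypothetical class, not part of Tez).
final class LogicalDag {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    LogicalDag addVertex(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
        return this;
    }

    LogicalDag addEdge(String from, String to) {
        addVertex(from).addVertex(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
        return this;
    }

    // Kahn's algorithm: a vertex is emitted once all of its producers have been emitted.
    List<String> topologicalOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((v, d) -> { if (d == 0) ready.add(v); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String next : downstream.get(v)) {
                if (remaining.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != downstream.size()) throw new IllegalStateException("graph has a cycle");
        return order;
    }

    // The five vertices from the slide; the Partition -> Join edges are an assumption.
    static LogicalDag slideExample() {
        return new LogicalDag()
            .addEdge("Scan1", "Partition1")
            .addEdge("Scan2", "Partition2")
            .addEdge("Partition1", "Join")
            .addEdge("Partition2", "Join");
    }

    public static void main(String[] args) {
        System.out.println(slideExample().topologicalOrder());
    }
}
```

The fluent `addVertex`/`addEdge` chaining mirrors the style of the slide's snippet; everything else is a simplification.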
Tez – Logical DAG expansion at Runtime
[Diagram: the logical vertices Scan1, Scan2, Partition1, Partition2, and Join, each expanded into parallel tasks at runtime]
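As a rough illustration of this expansion, the sketch below turns one logical edge into task-level connections. It assumes that a scatter-gather edge links every producer task to every consumer task; the class `DagExpansion`, its method names, and the parallelism values are hypothetical and not taken from Tez.

```java
import java.util.ArrayList;
import java.util.List;

// Toy expansion of one logical edge into task-level connections (hypothetical, not the Tez API).
final class DagExpansion {
    // Task names such as "Scan1[0]", "Scan1[1]", ... for a vertex with the given parallelism.
    static List<String> tasksOf(String vertex, int parallelism) {
        List<String> tasks = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) tasks.add(vertex + "[" + i + "]");
        return tasks;
    }

    // Scatter-gather: every producer task sends one partition to every consumer task.
    static List<String> scatterGather(List<String> producers, List<String> consumers) {
        List<String> links = new ArrayList<>();
        for (String p : producers) {
            for (String c : consumers) links.add(p + " -> " + c);
        }
        return links;
    }

    public static void main(String[] args) {
        List<String> scan = tasksOf("Scan1", 3);       // parallelism values are illustrative
        List<String> part = tasksOf("Partition1", 2);
        scatterGather(scan, part).forEach(System.out::println);
    }
}
```

With 3 producer tasks and 2 consumer tasks the single logical edge becomes 6 physical connections, which is why the logical DAG stays much smaller than what actually runs.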
Tez – Task Composition
[Diagram: logical DAG in which V-A feeds V-B and V-C. Task A = Processor-A with Output-1 and Output-3; Task B = Input-2 with Processor-B; Task C = Input-4 with Processor-C. Edge AB connects Output-1 to Input-2; Edge AC connects Output-3 to Input-4.]
V-A = { Processor-A.class }
V-B = { Processor-B.class }
V-C = { Processor-C.class }
Edge AB = { V-A, V-B, Output-1.class, Input-2.class }
Edge AC = { V-A, V-C, Output-3.class, Input-4.class }
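The definitions above are enough to work out what each task contains: the vertex supplies the processor, every outgoing edge supplies an output, and every incoming edge supplies an input. A minimal sketch of that derivation follows; `TaskComposition` and its nested `Edge` class are hypothetical and only model the idea, they are not Tez classes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy task composition (hypothetical classes, not the Tez API).
final class TaskComposition {
    static final class Edge {
        final String from, to, outputClass, inputClass;
        Edge(String from, String to, String outputClass, String inputClass) {
            this.from = from;
            this.to = to;
            this.outputClass = outputClass;
            this.inputClass = inputClass;
        }
    }

    // A task of `vertex` = inputs from incoming edges + the vertex processor + outputs from outgoing edges.
    static String compose(String vertex, String processorClass, List<Edge> edges) {
        List<String> inputs = new ArrayList<>();
        List<String> outputs = new ArrayList<>();
        for (Edge e : edges) {
            if (e.to.equals(vertex)) inputs.add(e.inputClass);
            if (e.from.equals(vertex)) outputs.add(e.outputClass);
        }
        return "inputs=" + inputs + " processor=" + processorClass + " outputs=" + outputs;
    }

    // The two edges defined on the slide.
    static List<Edge> slideEdges() {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge("V-A", "V-B", "Output-1", "Input-2")); // Edge AB
        edges.add(new Edge("V-A", "V-C", "Output-3", "Input-4")); // Edge AC
        return edges;
    }

    public static void main(String[] args) {
        List<Edge> edges = slideEdges();
        System.out.println("Task A: " + compose("V-A", "Processor-A", edges));
        System.out.println("Task B: " + compose("V-B", "Processor-B", edges));
        System.out.println("Task C: " + compose("V-C", "Processor-C", edges));
    }
}
```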
Tez – Composable Task Model
[Diagram: three task compositions, shown left to right under the labels Adopt, Evolve, Optimize]
• Hive Processor: HDFS Input and Remote File Server Input; HDFS Output and Local Disk Output
• Custom Processor: HDFS Input and Remote File Server Input; HDFS Output and Local Disk Output
• Custom Processor: RDMA Input and Native DB Input; Kafka Pub-Sub Output and Amazon S3 Output
Tez – Customizable Core Engine
[Diagram: for Vertex-1 and Vertex-2, the Vertex Manager starts the vertex and its tasks, the DAG Scheduler is asked for the task priority, and the Task Scheduler is asked for a container]
• Vertex Manager
– Determines task parallelism
– Determines when tasks in a vertex can start
• DAG Scheduler
– Determines priority of task
• Task Scheduler
– Allocates containers from YARN and assigns them to tasks
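One way to picture how the three pluggable components divide the work is the sketch below. All interface and method names are hypothetical, and the policies in `demo` (fixed parallelism, upstream vertex first, a fake container assignment) are assumptions chosen only to make the example run; none of this is the Tez plugin API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy wiring of the three pluggable components (all names hypothetical, not the Tez API).
final class CoreEngineSketch {
    interface VertexManager { int parallelism(String vertex); }           // how many tasks to run
    interface DagScheduler { int priority(String vertex); }               // which tasks go first
    interface TaskScheduler { String assign(String task, int priority); } // which container runs a task

    // Starts a vertex: ask each component in turn and record the resulting assignments.
    static List<String> startVertex(String vertex, VertexManager vm, DagScheduler ds, TaskScheduler ts) {
        List<String> assignments = new ArrayList<>();
        int tasks = vm.parallelism(vertex);
        int priority = ds.priority(vertex);
        for (int i = 0; i < tasks; i++) {
            assignments.add(ts.assign(vertex + "[" + i + "]", priority));
        }
        return assignments;
    }

    static List<String> demo() {
        VertexManager fixedTwo = vertex -> 2;                                  // assumed policy: fixed parallelism
        DagScheduler upstreamFirst = vertex -> vertex.equals("Vertex-1") ? 1 : 2; // assumed policy: upstream first
        TaskScheduler fake = (task, priority) -> task + " -> container (priority " + priority + ")";
        return startVertex("Vertex-1", fixedTwo, upstreamFirst, fake);
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Keeping each decision behind its own interface is what lets an application replace one policy, for example the parallelism rule, without touching the other two.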
Tez – Customizable core engine: graph reconfiguration
[Diagram: Hive – Dynamic Partition Pruning. Vertex 1 tasks send filtering values to the Input Initializer and Vertex Manager in the App Master, which apply the filter to prune the input partitions of Vertex 2 and reconfigure the vertex through the vertex state machine]
• Event Model
– Map tasks send data statistics events to the Reduce Vertex Manager.
• Vertex Manager
– Pluggable application logic that understands the data statistics and can formulate the correct parallelism.
– Advises vertex controller on parallelism
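The parallelism decision described for the Vertex Manager can be sketched as simple arithmetic over the reported statistics. The policy below (sum the reported output sizes, divide by a target size per downstream task, clamp to a maximum) is an assumption made for illustration; `ParallelismAdvisor` and its parameters are hypothetical and do not reproduce any built-in Tez vertex manager.

```java
// Toy vertex-manager policy (hypothetical, not a Tez class).
final class ParallelismAdvisor {
    // Sums the output sizes reported by upstream tasks and picks enough downstream
    // tasks so that each handles about targetBytesPerTask, capped at maxTasks.
    static int advise(long[] reportedOutputBytes, long targetBytesPerTask, int maxTasks) {
        if (targetBytesPerTask <= 0 || maxTasks < 1) {
            throw new IllegalArgumentException("invalid limits");
        }
        long total = 0;
        for (long b : reportedOutputBytes) total += b;
        long needed = (total + targetBytesPerTask - 1) / targetBytesPerTask; // ceiling division
        return (int) Math.max(1, Math.min(maxTasks, needed));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] stats = {300 * mb, 150 * mb, 50 * mb};   // illustrative statistics events
        System.out.println(advise(stats, 128 * mb, 100));
    }
}
```

Because the decision is made after the upstream tasks report, the downstream vertex can run with fewer tasks than a static estimate would have configured.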
Tez – Engineering optimizations
• Container re-use
• Support for user sessions
• Event-based control flow
Tez – Developer tools – Local Mode
• Fast prototyping – no Hadoop setup required
• Quick turnaround in unit testing – no overhead for allocating resources or launching JVMs
• Easy debuggability – single JVM
• Scheduling / RPC invocations skipped
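For orientation, the settings usually associated with local mode are collected below in a plain `java.util.Properties` object so that the sketch runs without Hadoop on the classpath. The class `LocalModeSettings` is hypothetical; in a real client these keys would be set on a `TezConfiguration`, and the key names should be checked against the Tez version in use.

```java
import java.util.Properties;

// Settings commonly used to enable Tez local mode, held in plain Properties for illustration.
// Assumption: key names as documented for Tez local mode; verify them for your Tez release.
final class LocalModeSettings {
    static Properties localMode() {
        Properties p = new Properties();
        p.setProperty("tez.local.mode", "true");                   // run the DAG inside the client JVM
        p.setProperty("fs.defaultFS", "file:///");                 // use the local file system instead of HDFS
        p.setProperty("tez.runtime.optimize.local.fetch", "true"); // read local shuffle data directly from disk
        return p;
    }

    public static void main(String[] args) {
        localMode().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```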
Tez – Developer Tools - Tez UI
• View status and progress of DAG/Vertex
• Diagnostics on failure
• View counters for DAG/Vertex
• View and compare counters across tasks/attempts
• View app specific information
Tez – Job Analysis tools - Swimlanes
• “$TEZ_HOME/tez-tools/swimlanes/yarn-swimlanes.sh <app_id>”
Tez – Job Analysis tools – Shuffle performance
• View shuffle performance between nodes
Tez – Hybrid Execution
• Run “compute where it’s most efficient”
• Building on the pluggable design of Tez, different vertices in the DAG can run in different execution environments
• Hive LLAP daemons can run initial scans, map joins etc., while large joins can run in YARN containers
• Best of both worlds, and the pattern can be repeated for Apache Phoenix or your MPP database
[Diagram: Vertex 1, Vertex 2, and Vertex 3 spread across MPP daemons and YARN containers; scan/filter work runs in the daemons and the join runs in YARN]
Tez – How can you help?
• Improve core Tez infrastructure
– Apache open source project. Your use cases and code are welcome
• Port DB ideas to Hive+Tez world
– Evolve distributed query optimization and execution
• Use Tez hybrid execution
– Use the Hive-LLAP pattern to get the best of both worlds with your execution environment
• Integrate your project with Tez
– Get benefits similar to Hive, Pig, Cascading, Flink. Takes between 1-6 months depending on the complexity of the target project
Tez – How to contribute
• Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/tez
– Developer list: dev@tez.apache.org
– User list: user@tez.apache.org
– Issues list: issues@tez.apache.org
Tez
Thanks for your time and attention!
Video with Deep Dive on Tez
http://goo.gl/BL67o7
http://www.infoq.com/presentations/apache-tez
Questions?
@bikassaha
Apache Tez - A unifying Framework for Hadoop Data Processing

  • 1. Page1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Apache Tez Bikas Saha @bikassaha
  • 2. Page2 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Apache Hadoop YARN and HDFS Flexible Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming Efficient Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service Shared Provides a stable, reliable, secure foundation and shared operational services across multiple workloads The Data Operating System for Hadoop 2.x Data Processing Engines Run Natively IN Hadoop BATCH MapReduce LOG STORE Kafka STREAMING Storm IN-MEMORY Spark GRAPH Giraph SAS LASR, HPA ONLINE HBase, Accumulo OTHERS HDFS: Redundant, Reliable Storage YARN: Cluster Resource Management
  • 3. Page3 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Tez •API’s and libraries to create data processing applications on YARN •Customizable and adaptable DAG definition •Orchestration framework to execute the DAG in a Hadoop cluster •NOT a general purpose execution engine Open Source Apache Project
  • 4. Page4 © Hortonworks Inc. 2011 – 2014. All Rights Reserved Tez – Goals • Tez solves the hard problems of running on a distributed Hadoop environment • Apps can focus on solving their domain specific problems • Tez instantiates the physical execution structure. App fills in logic and behavior • API targets data processing specified as a data flow graph App Tez • Custom application logic • Custom data format • Custom data transfer technology • Distributed parallel execution • Negotiating resources from the Hadoop framework • Fault tolerance and recovery • Shared library of ready-to-use components • Built-in performance optimizations • Hadoop Security
  • 5. Tez – Adoption
      • Apache Hive – Most popular SQL-like interface for data in Hadoop
      • Apache Pig – Scripting language used in some of the largest Hadoop installations
      • Apache Flink (Stratosphere project from TU Berlin) – General purpose engine with language integrated data processing API
      • Cascading + Scalding – Language integrated data processing API in Java/Scala
      • Commercial Products – Datameer, Syncsort and others in progress
  • 6. Tez – Performance benefits
      • Apache Hive
        – Order of magnitude improvement in performance
        – Speed up mainly from flexible DAG definition and runtime graph reconfiguration
        – Performance oriented orchestration layer and shared library components
      [Figure: Hive, TPC-DS Query 64 logical DAG]
  • 7. Tez – Scale and Reliability
      • Apache Pig
        – Predominant number of data processing jobs at Yahoo with up to 5000 node clusters
        – Multi-Petabyte jobs
        – On track for using Pig with Tez for all production Pig jobs
        – Already use Hive with Tez for large scale analytics
      • Hortonworks customers
        – All new customers default to Hive with Tez
      • Cascading + Scalding
        – Cascading 3.0 released with Tez integration
        – Very promising results with beta users: http://scalding.io/2015/05/scalding-cascading-tez-♥/
  • 8. Tez – DAG API: defines the global logical processing flow
      // Define DAG
      DAG dag = DAG.create();
      // Define Vertex
      Vertex Scan1 = Vertex.create(Processor.class);
      // Define Edge
      Edge edge = Edge.create(Scan1, Partition1, SCATTER_GATHER, PERSISTED, SEQUENTIAL, Output.class, Input.class);
      // Connect them
      dag.addVertex(Scan1).addEdge(edge)….
      [Figure: logical DAG with vertices Scan1, Scan2, Partition1, Partition2 and Join, connected by scatter-gather edges]
  • 9. Tez – Logical DAG expansion at Runtime [Figure: vertices Scan1, Scan2, Partition1, Partition2 and Join]
  • 10. Tez – Task Composition
      • Logical DAG: vertices V-A, V-B and V-C, joined by Edge AB and Edge AC
      • V-A = { Processor-A.class }
      • V-B = { Processor-B.class }
      • V-C = { Processor-C.class }
      • Edge AB = { V-A, V-B, Output-1.class, Input-2.class }
      • Edge AC = { V-A, V-C, Output-3.class, Input-4.class }
      • Task A = Processor-A with Output-1 and Output-3
      • Task B = Input-2 with Processor-B
      • Task C = Input-4 with Processor-C
  • 11. Tez – Composable Task Model
      • Hive Processor with HDFS Input, Remote File Server Input, HDFS Output, Local Disk Output
      • Custom Processor with HDFS Input, Remote File Server Input, HDFS Output, Local Disk Output
      • Custom Processor with RDMA Input, Native DB Input, Kafka Pub-Sub Output, Amazon S3 Output
      • Adopt → Evolve → Optimize
  • 12. Tez – Customizable Core Engine
      • Vertex Manager
        – Determines task parallelism
        – Determines when tasks in a vertex can start
      • DAG Scheduler: determines priority of task
      • Task Scheduler: allocates containers from YARN and assigns them to tasks
      [Figure: Vertex-1 and Vertex-2 interacting with the Vertex Manager (start vertex, start tasks), DAG Scheduler (get priority) and Task Scheduler (get container)]
  • 13. Tez – Customizable core engine: graph reconfiguration
      • Event Model: Map tasks send data statistics events to the Reduce Vertex Manager.
      • Vertex Manager: Pluggable application logic that understands the data statistics and can formulate the correct parallelism. Advises vertex controller on parallelism
      [Figure: Hive – Dynamic Partition Pruning. Labels: Vertex 1 tasks, Filtering values, App Master (Input Initializer + Vertex Manager), Apply Filter to Prune Input Partitions, Vertex State Machine, Reconfigure Vertex, Vertex 2 Input Data]
  • 14. Tez – Engineering optimizations
      • Container re-use
      • Support for user sessions
      • Event-based control flow
  • 15. Tez – Developer tools – Local Mode
      • Fast prototyping – no Hadoop setup required
      • Quick turnaround in unit testing – no overheads for allocating resources or launching JVMs
      • Easy debuggability – single JVM
      • Scheduling / RPC invocations skipped
  • 16. Tez – Developer Tools – Tez UI
      • View status and progress of DAG/Vertex
      • Diagnostics on failure
      • View counters for DAG/Vertex
      • View and compare counters across tasks/attempts
      • View app specific information
  • 17. Tez – Developer Tools – Tez UI
  • 18. Tez – Job Analysis tools – Swimlanes
      • “$TEZ_HOME/tez-tools/swimlanes/yarn-swimlanes.sh <app_id>”
  • 19. Tez – Job Analysis tools – Shuffle performance
      • View shuffle performance between nodes
  • 20. Tez – Job Analysis tools – Shuffle performance
      • View shuffle performance between nodes
  • 21. Tez – Hybrid Execution
      • Run “compute where it's most efficient”
      • Building on the pluggable design of Tez, different vertices in the DAG can run in different execution environments
      • Hive LLAP daemons can run initial scans, map joins etc. while large joins can run in YARN containers
      • Best of both worlds and the pattern can be repeated for Apache Phoenix or your MPP database
      [Figure: Vertex 1, Vertex 2 and Vertex 3 spread across MPP daemons and YARN containers; labels: Scan/Filter, Join]
  • 22. Tez – How can you help?
      • Improve core Tez infrastructure – Apache open source project. Your use cases and code are welcome
      • Port DB ideas to Hive+Tez world – Evolve distributed query optimization and execution
      • Use Tez hybrid execution – Use the Hive-LLAP pattern to get the best of both worlds with your execution environment
      • Integrate your project with Tez – Get benefits similar to Hive, Pig, Cascading, Flink. Takes 1-6 months depending on the complexity of the target project
  • 23. Tez – How to contribute
      • Useful links
        – Work tracking: https://issues.apache.org/jira/browse/TEZ
        – Code: https://github.com/apache/tez
        – Developer list: dev@tez.apache.org
        – User list: user@tez.apache.org
        – Issues list: issues@tez.apache.org
  • 24. Tez – Thanks for your time and attention!
      • Video with Deep Dive on Tez: http://goo.gl/BL67o7 http://www.infoq.com/presentations/apache-tez
      • Questions? @bikassaha
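
The DAG API on slide 8 describes a job as vertices connected by edges. The idea can be tried out without a cluster using a toy model like the one below. This is only a sketch: `MiniDag` is a hypothetical class, not part of the Tez API, and the topology (each Scan feeding one Partition, both Partitions feeding Join) is an assumption read off the slide's vertex names.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy model of a logical DAG; it does not use or mirror the real Tez classes. */
public class MiniDag {
    // vertex name -> names of the vertices that consume its output
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();

    public MiniDag addVertex(String name) {
        if (downstream.containsKey(name)) {
            throw new IllegalArgumentException("duplicate vertex: " + name);
        }
        downstream.put(name, new ArrayList<String>());
        return this;
    }

    public MiniDag addEdge(String from, String to) {
        if (!downstream.containsKey(from) || !downstream.containsKey(to)) {
            throw new IllegalArgumentException("unknown vertex in edge " + from + " -> " + to);
        }
        downstream.get(from).add(to);
        return this;
    }

    /** Kahn's algorithm: a vertex becomes ready once all of its upstream vertices are ordered. */
    public List<String> topologicalOrder() {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String v : downstream.keySet()) {
            inDegree.put(v, 0);
        }
        for (List<String> targets : downstream.values()) {
            for (String t : targets) {
                inDegree.merge(t, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String v = ready.remove();
            order.add(v);
            for (String t : downstream.get(v)) {
                if (inDegree.merge(t, -1, Integer::sum) == 0) {
                    ready.add(t);
                }
            }
        }
        if (order.size() != downstream.size()) {
            throw new IllegalStateException("graph contains a cycle, so it is not a DAG");
        }
        return order;
    }

    /** The five vertices named on the slide, wired with the assumed topology. */
    public static MiniDag example() {
        return new MiniDag()
                .addVertex("Scan1").addVertex("Scan2")
                .addVertex("Partition1").addVertex("Partition2")
                .addVertex("Join")
                .addEdge("Scan1", "Partition1").addEdge("Scan2", "Partition2")
                .addEdge("Partition1", "Join").addEdge("Partition2", "Join");
    }

    public static void main(String[] args) {
        // prints [Scan1, Scan2, Partition1, Partition2, Join]
        System.out.println(example().topologicalOrder());
    }
}
```

The ordering step is included because an orchestration layer has to know which vertices can run before which; rejecting cycles is what makes the graph a DAG in the first place.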
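
Slide 10 composes each task from a processor plus the input and output classes named on the vertex's edges. A minimal sketch of that composition follows, assuming hypothetical `Input`, `Output`, `Processor` and `InMemoryEdge` types that stand in for the real Tez interfaces; the records and the processor logic are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the task model on slide 10; none of these types are Tez classes. */
public class TaskCompositionSketch {
    interface Input { List<String> read(); }
    interface Output { void write(List<String> records); }
    interface Processor { List<String> process(List<String> records); }

    /** In-memory stand-in for an edge: the producer's Output and the consumer's Input share a buffer. */
    static final class InMemoryEdge {
        private final List<String> buffer = new ArrayList<>();
        Output output() { return records -> buffer.addAll(records); }
        Input input() { return () -> new ArrayList<>(buffer); }
    }

    /** A task is a processor wired to the inputs and outputs contributed by its vertex's edges. */
    static final class Task {
        private final List<Input> inputs;
        private final Processor processor;
        private final List<Output> outputs;

        Task(List<Input> inputs, Processor processor, List<Output> outputs) {
            this.inputs = inputs;
            this.processor = processor;
            this.outputs = outputs;
        }

        List<String> run() {
            List<String> records = new ArrayList<>();
            for (Input in : inputs) {
                records.addAll(in.read());
            }
            List<String> result = processor.process(records);
            for (Output out : outputs) {
                out.write(result);
            }
            return result;
        }
    }

    /** Builds Task A, B and C as on the slide and returns the results of Task B and Task C. */
    public static List<List<String>> runExample() {
        InMemoryEdge edgeAB = new InMemoryEdge();
        InMemoryEdge edgeAC = new InMemoryEdge();

        // Task A: Processor-A with Output-1 (edge AB) and Output-3 (edge AC); it has no inputs.
        Task taskA = new Task(new ArrayList<Input>(),
                records -> Arrays.asList("tez", "yarn"),
                Arrays.asList(edgeAB.output(), edgeAC.output()));
        // Task B: Input-2 (edge AB) with Processor-B, which upper-cases each record.
        Task taskB = new Task(Arrays.asList(edgeAB.input()),
                records -> {
                    List<String> upper = new ArrayList<>();
                    for (String r : records) {
                        upper.add(r.toUpperCase(Locale.ROOT));
                    }
                    return upper;
                },
                new ArrayList<Output>());
        // Task C: Input-4 (edge AC) with Processor-C, which counts the records.
        Task taskC = new Task(Arrays.asList(edgeAC.input()),
                records -> Arrays.asList(String.valueOf(records.size())),
                new ArrayList<Output>());

        taskA.run(); // upstream task runs first so the edges hold data
        return Arrays.asList(taskB.run(), taskC.run());
    }

    public static void main(String[] args) {
        System.out.println(runExample());
    }
}
```

Because the processor only sees the `Input` and `Output` interfaces, swapping an edge for a different transport would not touch the processor code, which is the point slide 11 makes with its adopt, evolve, optimize progression.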

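The dynamic partition pruning described on slide 13 boils down to dropping input partitions that cannot match values observed elsewhere in the DAG before any task is created for them. The sketch below shows only that filtering step. `PartitionPruningSketch`, the partition keys and the paths are all hypothetical, and the real event plumbing between tasks, input initializer and vertex manager is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of pruning input partitions from filter values; not Tez or Hive code. */
public class PartitionPruningSketch {

    /** Keeps only the partitions whose key is among the values reported by the upstream tasks. */
    public static List<String> prune(Map<String, String> pathByPartitionKey, Set<String> observedKeys) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : pathByPartitionKey.entrySet()) {
            if (observedKeys.contains(e.getKey())) {
                kept.add(e.getValue());
            }
        }
        return kept;
    }

    /** Three made-up monthly partitions, of which the upstream vertex only saw two months. */
    public static List<String> example() {
        Map<String, String> partitions = new LinkedHashMap<>();
        partitions.put("2015-01", "/warehouse/sales/month=2015-01");
        partitions.put("2015-02", "/warehouse/sales/month=2015-02");
        partitions.put("2015-03", "/warehouse/sales/month=2015-03");
        Set<String> observed = new HashSet<>(Arrays.asList("2015-02", "2015-03"));
        return prune(partitions, observed);
    }

    public static void main(String[] args) {
        // Only the surviving partitions would be turned into scan tasks.
        System.out.println(example());
    }
}
```

In this reading, the vertex is reconfigured so that its parallelism follows the number of surviving partitions rather than the number originally planned.
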
Editor's notes

  1. The first wave of Hadoop was about HDFS and MapReduce where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the Job Management capabilities built into it. The second wave of Hadoop is upon us and a component called YARN has emerged that generalizes Hadoop’s Cluster Resource Management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for “Yet Another Resource Negotiator”. [CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. And this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that’s been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
  2. For anyone who has been working on MapReduce, there is this age-old problem around “how do I figure out the correct number of reducers?”. We guess some number at compile-time and usually that turns out to be incorrect at run-time. Let’s see how we can use the Tez model to fix that. So here is this Map Vertex and this Reduce Vertex, which have these tasks running and you have the Vertex Manager running inside the framework … [CLICK] The Map Tasks can send Data Size Statistics to the Vertex Manager, which can then extrapolate those statistics to figure out “what would be the final size of the data when all of these Maps finish?”. Based on that, it can realize that the data size is actually smaller than expected, and that it can run two reduce tasks instead of three. [CLICK] The Vertex Manager sends a Set Parallelism command to the framework which changes the routing information in-between these two tasks and also cancels the last task.
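
The reducer-count adjustment described in the note above can be sketched as a small calculation: extrapolate the final map output size from the statistics received so far, then derive how many reduce tasks that amount of data needs. This is only an illustration under assumptions. `ReducerParallelismSketch`, its parameter names and the 128 MB per-reducer target are hypothetical, and the policy of only ever shrinking the configured count is a simplification rather than the documented behaviour of any Tez vertex manager.

```java
/** Hypothetical sketch of choosing reduce parallelism from partial map statistics; not Tez code. */
public class ReducerParallelismSketch {

    /**
     * @param bytesFromCompletedMaps output bytes reported so far by finished map tasks
     * @param completedMaps          number of map tasks that have reported
     * @param totalMaps              total number of map tasks in the vertex
     * @param targetBytesPerReducer  desired amount of input per reduce task (assumed tuning knob)
     * @param configuredReducers     reducer count guessed at compile time; used as an upper bound
     */
    public static int chooseReducers(long bytesFromCompletedMaps, int completedMaps, int totalMaps,
                                     long targetBytesPerReducer, int configuredReducers) {
        if (completedMaps <= 0 || totalMaps < completedMaps
                || targetBytesPerReducer <= 0 || configuredReducers < 1) {
            throw new IllegalArgumentException("invalid statistics or configuration");
        }
        // Extrapolate: assume the remaining maps produce output at the same average rate.
        double estimatedTotalBytes = (double) bytesFromCompletedMaps * totalMaps / completedMaps;
        long needed = (long) Math.ceil(estimatedTotalBytes / targetBytesPerReducer);
        // Never go below one reducer and never above the configured count.
        return (int) Math.max(1L, Math.min((long) configuredReducers, needed));
    }

    public static void main(String[] args) {
        long mb = 1L << 20;
        // 4 of 10 maps have produced 100 MB, so about 250 MB is expected in total;
        // with a 128 MB target that needs 2 reducers instead of the 3 configured.
        System.out.println(chooseReducers(100 * mb, 4, 10, 128 * mb, 3)); // prints 2
    }
}
```

In the scenario from the note, the vertex manager would then ask the framework to set the parallelism to the returned value, which re-routes the map outputs and cancels the surplus reduce task.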