Running Non-MapReduce Applications on Apache Hadoop
Hitesh Shah & Siddharth Seth
Hortonworks Inc.

© Hortonworks Inc. 2011

Who are we?
• Hitesh Shah
–Member of Technical Staff at Hortonworks Inc.
–Apache Hadoop PMC member and committer
–Apache Tez and Apache Ambari PPMC member and committer

• Siddharth Seth
–Member of Technical Staff at Hortonworks Inc.
–Apache Hadoop PMC member and committer
–Apache Tez PPMC member and committer

Agenda
• Apache Hadoop v1 to v2
• YARN
• Applications on YARN
• YARN Best Practices

Apache Hadoop v1
[Diagram: a Job Client submits a job to the JobTracker, which schedules work into fixed Map and Reduce slots on the TaskTrackers.]

Apache Hadoop v1
• Pros:
– A framework to run MapReduce jobs that lets you run the same piece of code on anything from a single-node cluster to one spanning thousands of machines (see the sketch below).

• Cons:
– It is a framework to run MapReduce jobs.
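To make the "same code, any cluster size" point concrete, here is a minimal Hadoop v1 driver sketch (not from the deck) using the classic org.apache.hadoop.mapred API. The class name, identity mapper/reducer choice and input/output paths are arbitrary placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PassThroughJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(PassThroughJob.class);
    conf.setJobName("pass-through"); // placeholder job name

    // Identity map/reduce: records flow through unchanged; the point is the
    // submission path, which is identical on one node or on thousands.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // The Job Client submits to the JobTracker, which schedules map and
    // reduce tasks into the fixed slots on the TaskTrackers.
    JobClient.runJob(conf);
  }
}
```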

Apache Giraph
• Iterative graph processing on a Hadoop cluster
• An iterative approach on MapReduce would require running multiple jobs.
• To avoid MR overheads, Giraph runs everything as a Map-only job.

[Diagram: one Map task runs the Giraph Master; the remaining Map tasks run the Workers.]

Apache Oozie
• Workflow scheduler system to manage Hadoop jobs.
• Running a Pig script through Oozie

[Diagram: Oozie submits a launcher job to the JobTracker; a Map task runs the Pig script launcher, which then submits the subsequent MR jobs.]

Apache Hadoop v2

YARN
The Operating System of a Hadoop cluster

The YARN Stack

YARN Glossary
• Installer
–Application Installer or Application Client

• Client
–Application Client

• Supervisor
–Application Master

• Workers
–Application Containers

YARN Architecture
[Diagram: Clients submit applications to the ResourceManager; NodeManagers on the worker nodes host the containers, including each application's App Master and its worker containers.]

YARN Application Flow
[Diagram: the Application Client talks to the ResourceManager over the Application Client Protocol (YarnClient library) and to its Application Master over an app-specific API; the Application Master talks to the ResourceManager over the Application Master Protocol (AMRMClient) and to NodeManagers over the Container Management Protocol (NMClient) to run app containers.]

YARN Protocols & Client Libraries
• Application Client Protocol: Client to RM interaction
– Library: YarnClient (see the sketch after this list)
– Application Lifecycle control
– Access Cluster Information

• Application Master Protocol: AM – RM interaction
– Library: AMRMClient / AMRMClientAsync
– Resource negotiation
– Heartbeat to the RM

• Container Management Protocol: AM to NM interaction
– Library: NMClient/NMClientAsync
– Launching allocated containers
– Stopping running containers
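As a rough illustration of the client side above, the sketch below submits an application through YarnClient. It assumes the Hadoop 2.x client libraries; the application name, queue, AM command (com.example.MyAppMaster) and the 1 GB / 1 vcore request are made-up placeholders, not from the talk.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class AppSubmitter {
  public static void main(String[] args) throws Exception {
    // Application Client Protocol, via the YarnClient library.
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the RM for a new application and fill in the submission context.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("my-yarn-app"); // placeholder name
    appContext.setQueue("default");

    // Describe how to launch the ApplicationMaster container.
    // The AM class and resource sizes are placeholders.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java com.example.MyAppMaster"));
    appContext.setAMContainerSpec(amContainer);
    appContext.setResource(Resource.newInstance(1024, 1));

    // Submit, then use the same client for lifecycle control and cluster info.
    ApplicationId appId = yarnClient.submitApplication(appContext);
    ApplicationReport report = yarnClient.getApplicationReport(appId);
    System.out.println(appId + " is in state " + report.getYarnApplicationState());
  }
}
```

A real client would also set up the AM's classpath, environment and local resources; those details are omitted here.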
Applications on YARN

YARN Applications
• Categorizing Applications
– What does the Application do?
– Application Lifetime
– How Applications accept work
– Language

• Application Lifetime
– Job submit to complete.
– Long-running Services

• Job Submissions
– One job : One Application
– Multiple jobs per application

Language considerations
• Hadoop RPC uses Google Protobuf
–Protobuf bindings: C/C++, Go, Java, Python…

• Accessing HDFS
–WebHDFS (see the sketch after this list)
–libhdfs for C
–Python client by Spotify Labs: Snakebite

• YARN Application Logic
–ApplicationMaster in Java and containers in any language
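Because WebHDFS is plain REST over HTTP, any language with an HTTP client can reach HDFS. A minimal sketch in Java (the host, port and path are hypothetical; 50070 is the common Hadoop 2 NameNode HTTP port, and on an unsecured cluster you may need to append &user.name=<user> for write operations):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsList {
  public static void main(String[] args) throws Exception {
    // LISTSTATUS is answered directly by the NameNode with a JSON
    // FileStatuses document -- no Hadoop client libraries involved.
    URL url = new URL(
        "http://namenode.example.com:50070/webhdfs/v1/user/me?op=LISTSTATUS");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");

    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line); // raw JSON directory listing
      }
    }
    conn.disconnect();
  }
}
```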

Tez (App Submission)
• Distributed execution framework – computation is expressed as a DAG
• Takes MapReduce – where each job was limited to a Map and/or Reduce stage – to the next level.

[Diagram: the Tez Client submits the DAG to YARN and monitors the job via the ResourceManager; the ResourceManager launches the Tez AM on a NodeManager, and the Tez AM (DAG execution logic, task co-ordination, local task scheduling) launches tasks on the NodeManagers.]

HOYA (Long-Running App)
• On-demand HBase cluster setup
• Share cluster resources – persist and shut down the cluster when not needed
• Dynamically handles node failures
• Allows resizing of a running HBase cluster

Samza on YARN (Failure-Handling App)
• Stream processing system – uses YARN as the execution framework
• Makes use of CGroups support in YARN for CPU isolation
• Uses Kafka as underlying store

[Diagram: the Samza AM obtains Task Containers from the ResourceManager and runs them via the NodeManagers; tasks consume and produce Kafka streams, and the AM is notified when a container finishes so it can replace it.]
YARN Eco-system
Powered by YARN:
• Apache Giraph – Graph Processing
• Apache Hama – BSP
• Apache Hadoop MapReduce – Batch
• Apache Tez – Batch/Interactive
• Apache S4 – Stream Processing
• Apache Samza – Stream Processing
• Apache Storm – Stream Processing
• Apache Spark – Iterative/Interactive applications
• Cloudera Llama
• DataTorrent
• HOYA – HBase on YARN
• RedPoint Data Management

YARN Utilities/Frameworks:
• Weave by Continuity
• REEF by Microsoft
• Spring support for Hadoop 2

YARN Best Practices

Best Practices
• Use provided Client libraries
• Resource Negotiation (see the sketch after this list)
– You may ask, but you may not get what you want immediately.
– Locality requests may not always be met.
– Resources like memory/CPU are guaranteed.

• Failure handling
– Remember, anything can fail (or YARN can preempt your containers).
– AM failures are handled by YARN, but container failures are handled by the application.

• Checkpointing
– Checkpoint AM state for AM recovery.
– If tasks are long running, checkpoint task state.
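A rough AM-side sketch of the practices above, assuming the standard AMRMClientAsync and NMClient libraries: allocations arrive incrementally through callbacks, and completed or preempted containers are the application's responsibility to replace. The resource size, priority, registration arguments and container command are placeholders.

```java
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SketchAppMaster implements AMRMClientAsync.CallbackHandler {
  private final YarnConfiguration conf = new YarnConfiguration();
  private final NMClient nmClient = NMClient.createNMClient();
  private AMRMClientAsync<ContainerRequest> rmClient;

  public void run() throws Exception {
    nmClient.init(conf);
    nmClient.start();

    // Heartbeats to the RM every second; allocations arrive via the callbacks below.
    rmClient = AMRMClientAsync.createAMRMClientAsync(1000, this);
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, ""); // host/port/tracking URL placeholders

    // Ask for 4 x (1 GB, 1 vcore) anywhere; locality is best effort.
    for (int i = 0; i < 4; i++) {
      rmClient.addContainerRequest(new ContainerRequest(
          Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));
    }
    // A real AM would later call unregisterApplicationMaster(...) before exiting.
  }

  @Override
  public void onContainersAllocated(List<Container> containers) {
    // You may get fewer containers than requested per heartbeat - keep going.
    for (Container c : containers) {
      ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
      ctx.setCommands(Collections.singletonList("sleep 30")); // placeholder worker command
      try {
        nmClient.startContainer(c, ctx); // Container Management Protocol, via NMClient
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }

  @Override
  public void onContainersCompleted(List<ContainerStatus> statuses) {
    // Container failures (and preemption) are the application's problem:
    // inspect the exit status and re-request the lost work if needed.
    for (ContainerStatus status : statuses) {
      if (status.getExitStatus() != 0) {
        rmClient.addContainerRequest(new ContainerRequest(
            Resource.newInstance(1024, 1), null, null, Priority.newInstance(0)));
      }
    }
  }

  @Override public void onShutdownRequest() { }
  @Override public void onNodesUpdated(List<NodeReport> updated) { }
  @Override public void onError(Throwable e) { }
  @Override public float getProgress() { return 0.5f; }
}
```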

Best Practices
• Cluster Dependencies
– Try to make zero assumptions about the cluster.
– Your application bundle should deploy everything required using YARN’s local resources (see the sketch after this list).

• Client-only installs if possible
– Simplifies cluster deployment and multi-version support

• Securing your Application
– YARN does not secure communications between the AM and its containers.
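A sketch of shipping the application bundle through YARN local resources so nothing needs to be pre-installed on the nodes. It assumes the bundle has already been copied to HDFS; the path and the "app.jar" resource key are hypothetical.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public class BundleLocalizer {
  // Attach an HDFS-hosted jar as a local resource so YARN downloads it onto
  // whichever node the container runs on -- no per-node install needed.
  public static void addAppJar(Configuration conf, ContainerLaunchContext ctx,
                               Path jarOnHdfs) throws Exception {
    FileStatus status = FileSystem.get(conf).getFileStatus(jarOnHdfs);
    LocalResource jar = LocalResource.newInstance(
        ConverterUtils.getYarnUrlFromPath(jarOnHdfs),
        LocalResourceType.FILE,               // use ARCHIVE for a tar/zip bundle
        LocalResourceVisibility.APPLICATION,  // visible only to this application
        status.getLen(), status.getModificationTime());
    ctx.setLocalResources(Collections.singletonMap("app.jar", jar));
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    addAppJar(conf, ctx, new Path("hdfs:///apps/myapp/app.jar")); // placeholder path
  }
}
```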

Testing/Debugging your Application
• MiniYARNCluster
– Regression tests (see the sketch after this list)

• Unmanaged AM
– Support to run the AM outside of a YARN cluster for manual testing.

• Logs
– Log aggregation support to push all logs into HDFS
– Accessible via the CLI or the UI.
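A minimal regression-test harness sketch using MiniYARNCluster (shipped with Hadoop's YARN test artifacts); the test name and cluster sizes are arbitrary. The yarn logs command in the trailing comment is the standard CLI for fetching aggregated logs.

```java
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.MiniYARNCluster;

public class MiniClusterHarness {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // One RM plus 2 NodeManagers, each with 1 local dir and 1 log dir,
    // all running as threads inside this JVM -- useful for regression
    // tests, but too slow for unit tests.
    MiniYARNCluster cluster = new MiniYARNCluster("regression-test", 2, 1, 1);
    cluster.init(conf);
    cluster.start();

    // Point your application client at cluster.getConfig() and submit as usual.
    // ... run the test ...

    cluster.stop();

    // On a real cluster with log aggregation enabled, fetch everything after
    // the app finishes with:  yarn logs -applicationId <applicationId>
  }
}
```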

Future work in YARN
• ResourceManager High Availability and Work-preserving restart
– Work-in-Progress

• Scheduler Enhancements
– SLA Driven Scheduling, Gang scheduling
– Multiple resource types – disk/network/GPUs/affinity

• Rolling upgrades
• Long running services
– Better support for running services like HBase
– Discovery of services, upgrades without downtime

• More utilities/libraries for Application Developers
– Failover/Checkpointing

Questions?



Speaker notes

  1. JobTracker and TaskTrackers with 4 map slots and 2 reduce slots each. As a job is submitted, the JobTracker runs work in assigned slots. As more jobs run, more and more slots in the cluster get used up.
  2. Talk about an MR job structure – input, map, sort, disk, shuffle, merge, reduce, output. In spite of all these restrictions, the Hadoop ecosystem has grown, coming up with innovative ways to do things.
  3. YARN takes what was a single-use data platform for batch processing and evolves it into a multi-use platform that enables batch, interactive, real-time and stream processing.
  4. YARN acts as the primary resource management layer and mediator of access to the cluster, giving anyone the capability to store data in a single place and then interact with it in multiple ways, simultaneously, with consistent levels of service.
  5. Now, compared to v1, a YARN-based cluster has the RM as the central controlling entity and NodeManagers on all the worker nodes. There is no notion of slots (maps/reduces); there are only controlled resources – memory/CPU today, disks/network IO in the future. Explain the application flow – the client submits an application, the RM launches the Application Master, and the App Master requests containers from the RM and then launches them. Explain how subsequent applications work – containers are launched as long as they fit within the resource pool.
  6. YarnClient: application lifecycle control – submit, query state, kill – e.g. submit, followed by polling state, and optionally connecting to the AM for further state. Cluster information – query node status, scheduler queues etc. AMRMClient: request resources from the RM – specifies a priority (within the app only), location information, whether strict locality is required, and the number of such containers. Hides the details of the protocol, which is absolute (i.e. each call to the RM must provide complete information about a priority level). Once a container is allocated, provides a utility method to link it back to the entity which generated the request (the entity is optionally passed in while asking for containers). The allocate call doubles as a heartbeat and must be sent out ever so often; it also provides some node-level information, available resources, etc. NMClient: once resources are obtained from the RM, an app uses this to communicate with the NM to start containers. Set up the ContainerLaunchContext – command line, environment, resources. Authorization information is part of the containers allocated by the RM, to prevent unauthorized usage. Only a single process can be launched on an allocated container, and it must be launched within a certain period after allocation, otherwise the allocation times out.
  7. Covered YARN architecture, what a YARN application looks like, and Hadoop 1 vs Hadoop 2.
  8. Application functionality – stream processing, batch, interactive, etc. The rest is the application's interaction with YARN. Short running – obtain, process, release. Long running – obtain, keep, scale up and down if required. How would we categorize applications? YARN is still nascent; we will see more and more applications being built on it. Today, based on the current ecosystem, there are various ways of categorizing applications: data-processing execution applications such as MapReduce and Tez, streaming applications like Samza and Storm, services like HBase on YARN, on-demand web server scaling. From a framework perspective, or from the perspective of building applications, they can be categorized according to how long the application runs, how the application accepts work, and the language it's written in. Application lifetime: 'short' running jobs are expected to finish their processing and complete – typically one YARN application per job – ask for adequate resources, process, and then release the resources once the job completes. Long running services are typically deployed via YARN to share cluster resources and will generally reserve certain resources on start-up. During the lifetime of the job, new resources can be acquired, and it's up to the application whether it wants to hang on to these or return them to the cluster. They may define their own job/work submission construct – e.g. Storm starts up the Storm master which then accepts work; each of these is not a separate YARN application.
  9. Hadoop is associated with Java. It is simpler to write non-Java apps now – but not trivial: implement the RPC mechanics in the language of choice, and then the HDFS/YARN protocols, OR use the listed options. Everyone associates Hadoop with Java, and while that's primarily the case, writing non-Java applications should be easier in Hadoop 2. Either build an RPC layer in the language being used, which can then be used to access the HDFS and YARN protocols (disclaimer: may not be a trivial exercise). HDFS – WebHDFS, which is a REST-based API, and libhdfs, which is a JNI-based C API for accessing HDFS; folks at Spotify Labs have built a Python library using Hadoop RPC mechanics and have also detailed the RPC interactions. YARN – experimental, early-stage Go RPC by Arun Murthy (note: this is experimental and very early stage). Alternately, write the application logic in Java but allow the rest of the application to be in any language. If a language binding exists, barring some caveats, you should be able to write YARN applications. Recently, Arun Murthy came up with a hello-world application built in Go. Please do take this with a bucket of salt as there are still some things such as tokens and security which need to be flushed out properly, but effectively, soon, we should have applications coming up that are written in languages other than Java.
  10. Computation expressed as a DAG. MR is limited to two stages; Tez allows multiple stages with different types of dataflow between stages. Launch AM, StartContainer – specify command line, environment, resources etc.
  11. Deploy HBase on demand. Old way: a separate node, or carve out part of each node. E.g. if HBase is not always required – hourly/daily jobs – spin it up on demand, or for running custom co-processors or test clusters. Stores HBase cluster state; CLI tool for all functionality; locality, dynamic resizing, failure handling. CLI tool + configuration to deploy HBase clusters on YARN. Rather than carving out separate nodes, or resources on each node up-front, this allows the cluster resources to be centrally managed by YARN. The HBase Master starts up directly within the HOYA ApplicationMaster. RegionServers are run within containers allocated by YARN, based on previously stored state or locality. The HOYA CLI tool is used to manage the cluster – restart, stop, resize etc. Example usage: bring up the cluster only when it's required, and then release resources back once it's complete – a daily analysis job, isolation between users? Resources for the cluster are obtained from YARN, allowing it to share resources with the rest of the cluster. E.g. for a daily analysis job, bring up a previously persisted HBase cluster, execute, and release resources. Provides APIs to grow/shrink a cluster. The HOYA AM is notified by YARN on container or node failure, at which point it can bring up a replacement RegionServer on another node.
  12. Stream processing – YARN resource scheduling, rather than starting daemons which also have to know how to handle resources. CGroups. Samza – a stream processing system built to make use of YARN as the execution framework – no separate daemon for scheduling resources. A Samza AM is spun up per job/dataflow. It takes care of making node-local requests for Samza containers and re-scheduling failed tasks – a new container from YARN.
  13. Graph processing – Giraph, Hama. Stream processing – Samza, Storm, Spark, DataTorrent. MapReduce. Tez – fast query execution. Weave/REEF – frameworks to help with writing applications. A list of some of the applications which already support YARN, in some form. Samza, Storm, S4 and DataTorrent are streaming frameworks. Giraph and Hama are graph processing systems. There are some GitHub projects – caching systems, on-demand web-server spin-up. Weave and REEF are frameworks on top of YARN to make writing applications easier.
  14. Recommendations / Points to keep in mind
  15. Client libs – mostly thin; AMRM hides details. Resource negotiation – contending with other apps. Failure handling – large cluster; failed node, disk, network etc. Checkpointing – depends on the app; long-running apps should maintain AM state, maybe task state. Use provided client libraries: for the most part, these are thin wrappers over the YARN protocols, but in some cases they do have additional functionality. Resource negotiation: on a shared cluster the app may be contending with others. Write your application to work with incremental resource allocations; be aware of this if the application MUST have certain resources. Locality requests are best effort unless strict locality is enabled – there are delays before falling back to rack-local or non-local, but no guarantees. Failure handling: design applications to manage node failure, container failure etc. YARN will restart AMs, and will notify the AM about container failure – it's up to the application to deal with this, which brings us to checkpointing. Checkpointing: in case of failure or restart, the AM will be restarted by YARN – recovered AMs can be written to restart from scratch, or can store state to recover from where they left off. Similarly, the AM may need to store task state. As an example, MapReduce stores task state in HDFS; on AM restart it's able to recover successful work from the previous run and continue the job from there.
  16. Trying to avoid complicated cluster deploys. Deploy everything with the app; client libraries – debugging. Security – on a secure cluster, YARN interaction uses Kerberos/tokens; client-to-AM and container-to-AM are up to the app. Cluster dependencies: the application can run on any node in the cluster. Be aware of what's available on the cluster, or ship all resources and binaries required by an application along with it. Typically, a shell and Java can be assumed. Client-only install: it is recommended to deploy applications via a client package – YARN will take care of making this available on the nodes where the application runs. This allows different versions of the same application to co-exist on the cluster, and also simplifies the initial cluster deployment. MapReduce is currently deployed on all nodes, but is moving to a model where it'll only be deployed on the client. Tez, HOYA etc. are already client-side-only applications. Quick note on security: YARN, in secure mode, uses Kerberos and tokens to secure RM and NM communication. Communication with the AM and between containers needs to be secured by the app writer. Hadoop RPC security can of course be used.
  17. MiniYARN – a cluster within the process, in threads (as with HDFS, MR, HBase) – avoid for unit tests – SLOW. Unmanaged AM – debugger, heap dumps, etc. Logs – enable by default – all application logs. MiniYARNCluster: for those familiar with MapReduce, HDFS, HBase – this is a YARN cluster which spins up within the same JVM. You can configure the number of NodeManagers etc. It is primarily meant for regression testing and should be avoided for unit tests considering the cost. Unmanaged AM: allows the AM to be run outside the cluster. Typically, when running on a cluster, the AM could run on any node, which can make it tough to debug. Helps with attaching debuggers, accessing AM logs etc. Log aggregation – aggregates logs onto HDFS. Recommend running with log aggregation enabled – enormously useful for debugging. You can fetch ALL logs for an application after it completes, or logs for specific containers, with a simple CLI command.
  18. HA and work-preserving restart – being actively worked on by the community. Scheduler – additional resources, specifically disk/network; gang scheduling. Rolling upgrades – upgrading a cluster typically involves downtime; the NM forgets containers across restarts. Long running – enhancements to log handling, security, multiple tasks per container, container resizing. HA and work-preserving restart are still being worked on in the community – YARN-128 and YARN-149. On scheduling – there have been requests for gang scheduling and meeting SLAs. Also TBD is support for scheduling additional resource types – disk/network. Rolling upgrades – some work pending. A big piece here, which ties in with work-preserving restart: restarting a NodeManager should not cause processes started by the previous NM to be killed. Long running services support – handling logs, security (specifically token expiry). Additional utility libraries to help app writers – primarily geared towards checkpointing in the AM and app history handling.