Hortonworks webinar: Get Started Building YARN Applications, December 2013. Covers YARN basics, benefits, getting started, and the roadmap. Actian shares their experience and recommendations from building a real-world YARN application.
14. Jeff Gullick – Principal Solutions Engineer
Shane Pratt - Sr. Director, Hadoop and Analytics COE
Jim Falgout – Chief Technologist
Actian and YARN
12/18/13
18. Developing with YARN
Getting started
• Investigation
Installed HDP 2.0 on development cluster
Read Hortonworks blogs on YARN (very informative!)
http://hortonworks.com/blog/introducing-apache-hadoop-yarn/
Looked at sample YARN application code
Browsed MapReduce source code
• Prototyping
Started by getting an Application Master spawned (a client-side sketch follows this slide)
A relatively easy way to get started with the YARN APIs
Also helped us learn about containers and shared resources
• Project implemented by two senior developers
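The prototyping step above, getting an Application Master spawned, amounts to a client-side application submission. The deck does not include Actian's code, so the following is a minimal sketch against the Hadoop 2 YarnClient API; the application name, AM class MyAppMaster, and the memory/core numbers are placeholders, and local-resource/classpath setup is omitted for brevity.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class LaunchAppMaster {
  public static void main(String[] args) throws Exception {
    // Connect to the ResourceManager configured in yarn-site.xml
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the RM for a new application id and a submission context
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("dataflow-prototype"); // placeholder name

    // The AM is just a shell command YARN runs in the first container.
    // MyAppMaster is a placeholder for your own AM main class; real code
    // would also set local resources and the classpath environment here.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java MyAppMaster"
        + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
        + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
    appContext.setAMContainerSpec(amContainer);

    // Resources for the AM container itself (illustrative sizing)
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(512);
    capability.setVirtualCores(1);
    appContext.setResource(capability);

    ApplicationId appId = appContext.getApplicationId();
    yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);
  }
}
```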
19. Developing with YARN
Design
• Using AMRMClientAsync
Handles communication with the ResourceManager
Provides callbacks for asynchronous container events (allocations, completions, …)
• Using NMClientAsync
Handles communication with multiple NodeManagers
Callbacks for asynchronous container events (sketches of both clients follow below)
• Configuration
Reusing existing Actian web application for configuration
• Application Specific History Service
Reusing existing Actian web application for job monitoring
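Actian's implementation isn't shown in the deck, but a minimal sketch of the first of the two clients named above might look like this inside the AM. The callback methods are the actual AMRMClientAsync.CallbackHandler interface from Hadoop 2; everything inside them, and the container sizing, is placeholder logic.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class DataflowAppMaster {

  /** Receives asynchronous container events from the ResourceManager. */
  static class RMCallback implements AMRMClientAsync.CallbackHandler {
    @Override
    public void onContainersAllocated(List<Container> containers) {
      // Hand each allocated container to NMClientAsync to launch a worker
    }
    @Override
    public void onContainersCompleted(List<ContainerStatus> statuses) {
      // Record exit status; optionally re-request containers on failure
    }
    @Override public void onShutdownRequest() { /* stop the job cleanly */ }
    @Override public void onNodesUpdated(List<NodeReport> updated) { }
    @Override public void onError(Throwable e) { /* fail the job */ }
    @Override public float getProgress() { return 0.0f; /* report real progress */ }
  }

  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Heartbeat to the RM every second; events arrive on the callback handler
    AMRMClientAsync<ContainerRequest> rmClient =
        AMRMClientAsync.createAMRMClientAsync(1000, new RMCallback());
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, ""); // host, RPC port, tracking URL

    // Ask for one worker container (illustrative sizing)
    Resource workerSize = Records.newRecord(Resource.class);
    workerSize.setMemory(1024);
    workerSize.setVirtualCores(1);
    rmClient.addContainerRequest(
        new ContainerRequest(workerSize, null, null, Priority.newInstance(0)));

    // ... run the job, then unregister with
    // rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```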
20. Developing with YARN
Design
• Application Master
Started per Actian Dataflow job (batch mode)
Determines resources needed; acquires from ResourceManager
Elastically allocates resources according to job needs
Launches worker containers via the NodeManager(s) (see the launch sketch after this slide)
Monitors progress and cleans up as job completes
• Application Containers
Execute distributed Dataflow graphs within launched container(s)
Provide runtime status and statistics to history server
Statistics include items like: records processed, I/O stats, …
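Launching a worker from the AM's onContainersAllocated callback then goes through NMClientAsync. Again a sketch, not Actian's code: the DataflowWorker command line is a placeholder, and the empty callback bodies would carry real status handling.

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;
import org.apache.hadoop.yarn.util.Records;

/** Launches Dataflow-style worker containers; an illustrative sketch. */
public class WorkerLauncher implements NMClientAsync.CallbackHandler {

  private final NMClientAsync nmClient;

  public WorkerLauncher(Configuration conf) {
    // A single NMClientAsync handles communication with every NodeManager
    nmClient = NMClientAsync.createNMClientAsync(this);
    nmClient.init(conf);
    nmClient.start();
  }

  /** Called from AMRMClientAsync's onContainersAllocated callback. */
  public void launchWorker(Container container) {
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    // Placeholder command; the real worker would execute its slice of the
    // distributed Dataflow graph and report stats to the history server
    ctx.setCommands(Collections.singletonList("$JAVA_HOME/bin/java DataflowWorker"));
    nmClient.startContainerAsync(container, ctx);
  }

  // NMClientAsync.CallbackHandler: asynchronous container lifecycle events
  @Override public void onContainerStarted(ContainerId id, Map<String, ByteBuffer> resp) { }
  @Override public void onContainerStatusReceived(ContainerId id, ContainerStatus status) { }
  @Override public void onContainerStopped(ContainerId id) { }
  @Override public void onStartContainerError(ContainerId id, Throwable t) { }
  @Override public void onGetContainerStatusError(ContainerId id, Throwable t) { }
  @Override public void onStopContainerError(ContainerId id, Throwable t) { }
}
```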
21. Developing with YARN
[Architecture diagram: a Client launches the Application Master through the YARN ResourceManager. The Application Master allocates resources from the ResourceManager and launches worker Application Containers via the NodeManagers on each node. A Config/History Server gets stats from the Application Master and the Application Containers; the YARN web app links to it.]
22. Developing with YARN
Phases of Development
• Job launching
Integrated Actian Dataflow client with YARN to launch application master
Built application master: allocate resources; launch workers
Built worker containers
Result: able to launch Dataflow jobs via YARN
1 senior developer; approximately 5 weeks (including investigation)
• Configuration and Monitoring
Modified existing web application to handle Dataflow configuration items specific to YARN
Collect and display runtime stats from executing jobs
Provide history service
Log viewing
1 senior developer; approximately 3 weeks
23. Developing with YARN
Lessons Learned
• The distributed cache allows frictionless install of Actian software on cluster worker nodes (see the sketch after this slide)
• The sample YARN application is too simple
• (Hortonworks now has a MemcacheD on YARN sample app)
• MapReduce code provides better coverage but is complex
• An application history server is required
We had hoped not to have to install or run any Actian servers on the cluster
A JIRA issue exists to provide a history service as part of YARN
• Configuration can be supplied via Hadoop config files
This is messy (hard to keep coherent across the cluster …)
Applications should integrate with Hadoop management layers (e.g., Ambari)
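The "frictionless install" lesson refers to YARN's LocalResource mechanism: the AM lists files staged on HDFS in each ContainerLaunchContext, and the NodeManager downloads and caches them before starting the container, so nothing needs to be pre-installed on worker nodes. A sketch, with a made-up HDFS path and link name:

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public class DistCacheSetup {
  // Hypothetical location where the worker binaries were staged on HDFS
  private static final Path APP_JAR = new Path("hdfs:///apps/dataflow/dataflow-worker.jar");

  static void addToDistributedCache(Configuration conf, ContainerLaunchContext ctx)
      throws Exception {
    FileStatus stat = FileSystem.get(conf).getFileStatus(APP_JAR);

    LocalResource jar = Records.newRecord(LocalResource.class);
    jar.setResource(ConverterUtils.getYarnUrlFromPath(APP_JAR));
    // Size and timestamp let the NodeManager validate its cached copy
    jar.setSize(stat.getLen());
    jar.setTimestamp(stat.getModificationTime());
    jar.setType(LocalResourceType.FILE);            // ARCHIVE would auto-unpack
    jar.setVisibility(LocalResourceVisibility.APPLICATION);

    // The NM downloads this into the container's working directory
    // under the link name "dataflow-worker.jar" before launching it
    ctx.setLocalResources(Collections.singletonMap("dataflow-worker.jar", jar));
  }
}
```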
30. 1-2-3 Getting Started with YARN
http://hortonworks.com/get-started/YARN
Get started with Hortonworks Sandbox
http://hortonworks.com/sandbox/
Code walk through – Jan. 22nd 2014 at 9am PT
Register at Hortonworks.com/webinars/yarn-code
Get involved! YARN is part of a community-driven open source project and you can help accelerate the innovation!
Follow Us:
@hortonworks @actiancorp
Editor's Notes
The first wave of Hadoop was about HDFS and MapReduce, where MapReduce had a split brain, so to speak: it was a framework for massive distributed data processing, but it also had all of the job management capabilities built into it. The second wave of Hadoop is upon us, and a component called YARN has emerged that generalizes Hadoop's cluster resource management in a way where MapReduce is now just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for "Yet Another Resource Negotiator".
[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop.
[CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service.
[CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years, and this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not-so-distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN-enabled. Spark is an in-memory data processing system built at Berkeley that was recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
As Arun mentioned, there are fewer JVMs to spin up per job (1 instead of 3), and RM and NM provisioning is faster. YARN was originally conceived and architected by the team at Yahoo!; Arun Murthy created the original JIRA in 2008 and led the PMC. The team at Hortonworks has been working on YARN for 4 years, and 90% of the code is from Hortonworks and Yahoo!. The YARN-based architecture has been running at scale at Yahoo!: deployed on 35,000 nodes for 6+ months, with a multitude of YARN applications.
One great public example of YARN in production is at Yahoo!, which outlined some performance gains in a keynote address at Hadoop Summit this year. Yahoo uses YARN for three use cases: stream processing, iterative processing, and shared storage. With Storm on YARN they stream data into a cluster and execute 5-second analytics windows. That cluster is only 320 nodes, but it processes 133,000 events per second across 12,000 threads. Their shared data cluster uses 1,900 nodes to store 2 PB of data. In all, Yahoo has over 30,000 nodes running YARN across over 365 PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time, and they estimate a 60–150% improvement in node usage per day. At this point, over 50,000 Hadoop nodes at Yahoo have been upgraded from Hadoop 1.0 to Hadoop 2, yielding a 50% improvement in cluster utilization and efficiency. This should be a big deal in terms of potential ROI.
HA and work-preserving restart: being actively worked on by the community (YARN-128 and YARN-149).
Scheduler: there have been requests for gang scheduling and for meeting SLAs; also TBD is support for scheduling additional resource types, specifically disk and network.
Rolling upgrades: upgrading a cluster typically involves downtime, and some work is pending. A big piece here, which ties in with work-preserving restart, is that restarting a NodeManager should not kill processes started by the previous NM (today the NM forgets containers across restarts).
Long-running services: enhancements to log handling, security (specifically token expiry), multiple tasks per container, and container resizing.
Additional utility libraries to help app writers, primarily geared towards checkpointing in the AM and app history handling.