Hortonworks webinar: Get Started Building YARN Applications, December 2013. Covers YARN basics, benefits, getting started, and the roadmap. Actian shares their experience and recommendations from building a real-world YARN application.
14. Jeff Gullick – Principal Solutions Engineer
Shane Pratt - Sr. Director, Hadoop and Analytics COE
Jim Falgout – Chief Technologist
Actian and YARN
12/18/13
18. Developing with YARN
Getting started
• Investigation
Installed HDP 2.0 on development cluster
Read Hortonworks blogs on YARN (very informative!)
http://hortonworks.com/blog/introducing-apache-hadoop-yarn/
Looked at sample YARN application code
Browsed MapReduce source code
• Prototyping
Started by getting an Application Master spawned (a client-side sketch follows this slide)
A relatively easy way to get started with the YARN APIs
Also helped us learn about containers and shared resources
• Project implemented by two senior developers
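The prototyping step above, getting an Application Master spawned, amounts to a client-side application submission. The deck does not include Actian's code, so the following is a minimal sketch against the Hadoop 2 YarnClient API; the application name, AM class MyAppMaster, and the memory/core numbers are placeholders, and local-resource/classpath setup is omitted for brevity.

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class LaunchAppMaster {
  public static void main(String[] args) throws Exception {
    // Connect to the ResourceManager configured in yarn-site.xml
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the RM for a new application id and a submission context
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
    appContext.setApplicationName("dataflow-prototype"); // placeholder name

    // The AM is just a shell command YARN runs in the first container.
    // MyAppMaster is a placeholder for your own AM main class; real code
    // would also set local resources and the classpath environment here.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "$JAVA_HOME/bin/java MyAppMaster"
        + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
        + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
    appContext.setAMContainerSpec(amContainer);

    // Resources for the AM container itself (illustrative sizing)
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(512);
    capability.setVirtualCores(1);
    appContext.setResource(capability);

    ApplicationId appId = appContext.getApplicationId();
    yarnClient.submitApplication(appContext);
    System.out.println("Submitted application " + appId);
  }
}
```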
19. Developing with YARN
Design
• Using AMRMClientAsync
Handles communication with the ResourceManager
Provides callbacks for asynchronous container events (allocations, completions, …)
• Using NMClientAsync
Handles communication with multiple NodeManagers
Callbacks for asynchronous container events (sketches of both clients follow below)
• Configuration
Reusing existing Actian web application for configuration
• Application Specific History Service
Reusing existing Actian web application for job monitoring
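Actian's implementation isn't shown in the deck, but a minimal sketch of the first of the two clients named above might look like this inside the AM. The callback methods are the actual AMRMClientAsync.CallbackHandler interface from Hadoop 2; everything inside them, and the container sizing, is placeholder logic.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class DataflowAppMaster {

  /** Receives asynchronous container events from the ResourceManager. */
  static class RMCallback implements AMRMClientAsync.CallbackHandler {
    @Override
    public void onContainersAllocated(List<Container> containers) {
      // Hand each allocated container to NMClientAsync to launch a worker
    }
    @Override
    public void onContainersCompleted(List<ContainerStatus> statuses) {
      // Record exit status; optionally re-request containers on failure
    }
    @Override public void onShutdownRequest() { /* stop the job cleanly */ }
    @Override public void onNodesUpdated(List<NodeReport> updated) { }
    @Override public void onError(Throwable e) { /* fail the job */ }
    @Override public float getProgress() { return 0.0f; /* report real progress */ }
  }

  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();

    // Heartbeat to the RM every second; events arrive on the callback handler
    AMRMClientAsync<ContainerRequest> rmClient =
        AMRMClientAsync.createAMRMClientAsync(1000, new RMCallback());
    rmClient.init(conf);
    rmClient.start();
    rmClient.registerApplicationMaster("", 0, ""); // host, RPC port, tracking URL

    // Ask for one worker container (illustrative sizing)
    Resource workerSize = Records.newRecord(Resource.class);
    workerSize.setMemory(1024);
    workerSize.setVirtualCores(1);
    rmClient.addContainerRequest(
        new ContainerRequest(workerSize, null, null, Priority.newInstance(0)));

    // ... run the job, then unregister with
    // rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
  }
}
```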
20. Developing with YARN
Design
• Application Master
Started per Actian Dataflow job (batch mode)
Determines resources needed; acquires from ResourceManager
Elastically allocates resources according to job needs
Launches worker containers via the NodeManager(s) (see the launch sketch after this slide)
Monitors progress and cleans up as job completes
• Application Containers
Execute distributed Dataflow graphs within launched container(s)
Provide runtime status and statistics to history server
Statistics include items like: records processed, I/O stats, …
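Launching a worker from the AM's onContainersAllocated callback then goes through NMClientAsync. Again a sketch, not Actian's code: the DataflowWorker command line is a placeholder, and the empty callback bodies would carry real status handling.

```java
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.ContainerStatus;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;
import org.apache.hadoop.yarn.util.Records;

/** Launches Dataflow-style worker containers; an illustrative sketch. */
public class WorkerLauncher implements NMClientAsync.CallbackHandler {

  private final NMClientAsync nmClient;

  public WorkerLauncher(Configuration conf) {
    // A single NMClientAsync handles communication with every NodeManager
    nmClient = NMClientAsync.createNMClientAsync(this);
    nmClient.init(conf);
    nmClient.start();
  }

  /** Called from AMRMClientAsync's onContainersAllocated callback. */
  public void launchWorker(Container container) {
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    // Placeholder command; the real worker would execute its slice of the
    // distributed Dataflow graph and report stats to the history server
    ctx.setCommands(Collections.singletonList("$JAVA_HOME/bin/java DataflowWorker"));
    nmClient.startContainerAsync(container, ctx);
  }

  // NMClientAsync.CallbackHandler: asynchronous container lifecycle events
  @Override public void onContainerStarted(ContainerId id, Map<String, ByteBuffer> resp) { }
  @Override public void onContainerStatusReceived(ContainerId id, ContainerStatus status) { }
  @Override public void onContainerStopped(ContainerId id) { }
  @Override public void onStartContainerError(ContainerId id, Throwable t) { }
  @Override public void onGetContainerStatusError(ContainerId id, Throwable t) { }
  @Override public void onStopContainerError(ContainerId id, Throwable t) { }
}
```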
21. Developing with YARN
[Architecture diagram: a Client launches the Application Master through the YARN ResourceManager. The Application Master allocates resources from the ResourceManager and launches worker Application Containers via the NodeManagers on each node. A Config/History Server gets stats from the Application Master and the Application Containers; the YARN web app links to it.]
22. Developing with YARN
Phases of Development
• Job launching
Integrated Actian Dataflow client with YARN to launch application master
Built application master: allocate resources; launch workers
Built worker containers
Result: able to launch Dataflow jobs via YARN
1 senior developer; approximately 5 weeks (including investigation)
• Configuration and Monitoring
Modified existing web application to handle Dataflow configuration items specific to YARN
Collect and display runtime stats from executing jobs
Provide history service
Log viewing
1 senior developer; approximately 3 weeks
23. Developing with YARN
Lessons Learned
• The distributed cache allows frictionless install of Actian software on cluster worker nodes (see the sketch after this slide)
• The sample YARN application is too simple
• (Hortonworks now has a MemcacheD on YARN sample app)
• MapReduce code provides better coverage but is complex
• An application history server is required
We had hoped not to have to install or run any Actian servers on the cluster
A JIRA issue exists to provide a history service as part of YARN
• Configuration can be supplied via Hadoop config files
This is messy (hard to keep coherent across the cluster …)
Applications should integrate with Hadoop management layers (e.g., Ambari)
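The "frictionless install" lesson refers to YARN's LocalResource mechanism: the AM lists files staged on HDFS in each ContainerLaunchContext, and the NodeManager downloads and caches them before starting the container, so nothing needs to be pre-installed on worker nodes. A sketch, with a made-up HDFS path and link name:

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public class DistCacheSetup {
  // Hypothetical location where the worker binaries were staged on HDFS
  private static final Path APP_JAR = new Path("hdfs:///apps/dataflow/dataflow-worker.jar");

  static void addToDistributedCache(Configuration conf, ContainerLaunchContext ctx)
      throws Exception {
    FileStatus stat = FileSystem.get(conf).getFileStatus(APP_JAR);

    LocalResource jar = Records.newRecord(LocalResource.class);
    jar.setResource(ConverterUtils.getYarnUrlFromPath(APP_JAR));
    // Size and timestamp let the NodeManager validate its cached copy
    jar.setSize(stat.getLen());
    jar.setTimestamp(stat.getModificationTime());
    jar.setType(LocalResourceType.FILE);            // ARCHIVE would auto-unpack
    jar.setVisibility(LocalResourceVisibility.APPLICATION);

    // The NM downloads this into the container's working directory
    // under the link name "dataflow-worker.jar" before launching it
    ctx.setLocalResources(Collections.singletonMap("dataflow-worker.jar", jar));
  }
}
```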
30. 1-2-3 Getting Started with YARN
http://hortonworks.com/get-started/YARN
Get started with Hortonworks Sandbox
http://hortonworks.com/sandbox/
Code walk through – Jan. 22nd 2014 at 9am PT
Register at Hortonworks.com/webinars/yarn-code
Get involved! YARN is part of a community-driven open source project and you can help accelerate the innovation!
Follow Us:
@hortonworks @actiancorp
Editor's Notes
The first wave of Hadoop was about HDFS and MapReduce, where MapReduce had a split brain, so to speak: it was a framework for massive distributed data processing, but it also had all of the job management capabilities built into it. The second wave of Hadoop is upon us, and a component called YARN has emerged that generalizes Hadoop's cluster resource management in a way where MapReduce is now just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for "Yet Another Resource Negotiator".
[CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop.
[CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service.
[CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years, and this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not-so-distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN-enabled. Spark is an in-memory data processing system built at Berkeley that was recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
As Arun mentioned, there are fewer JVMs to spin up per job (1 instead of 3), and RM and NM provisioning is faster. YARN was originally conceived and architected by the team at Yahoo!; Arun Murthy created the original JIRA in 2008 and led the PMC. The team at Hortonworks has been working on YARN for 4 years, and 90% of the code is from Hortonworks and Yahoo!. The YARN-based architecture has been running at scale at Yahoo!: deployed on 35,000 nodes for 6+ months, with a multitude of YARN applications.
One great public example of YARN in production is at Yahoo!, which outlined some performance gains in a keynote address at Hadoop Summit this year. Yahoo uses YARN for three use cases: stream processing, iterative processing, and shared storage. With Storm on YARN they stream data into a cluster and execute 5-second analytics windows. That cluster is only 320 nodes, but it processes 133,000 events per second across 12,000 threads. Their shared data cluster uses 1,900 nodes to store 2 PB of data. In all, Yahoo has over 30,000 nodes running YARN across over 365 PB of data. They calculate running about 400,000 jobs per day for about 10 million hours of compute time, and they estimate a 60–150% improvement in node usage per day. At this point, over 50,000 Hadoop nodes at Yahoo have been upgraded from Hadoop 1.0 to Hadoop 2, yielding a 50% improvement in cluster utilization and efficiency. This should be a big deal in terms of potential ROI.
HA and work-preserving restart: being actively worked on by the community (YARN-128 and YARN-149).
Scheduler: there have been requests for gang scheduling and for meeting SLAs; also TBD is support for scheduling additional resource types, specifically disk and network.
Rolling upgrades: upgrading a cluster typically involves downtime, and some work is pending. A big piece here, which ties in with work-preserving restart, is that restarting a NodeManager should not kill processes started by the previous NM (today the NM forgets containers across restarts).
Long-running services: enhancements to log handling, security (specifically token expiry), multiple tasks per container, and container resizing.
Additional utility libraries to help app writers, primarily geared towards checkpointing in the AM and app history handling.