5. What does the new architecture get us?
Scale
Compute Platform
6. Scale for a compute platform
• Application size
• Number of sub-tasks
• Application-level state
• e.g. counters
• Number of concurrent tasks in a single cluster
9. Why is there a limit on cluster size?
[Chart: Hadoop 1.0, cluster utilization vs. cluster size]
10. JobTracker (JIP, TIP, Scheduler)
[Diagram: Heartbeat Request → JobTracker → Heartbeat Response]
• Synchronous heartbeat processing
• JobTracker global lock
JT transaction rate limit: ~200 heartbeats/sec
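The bullet points above can be sketched in code. The following is a minimal, hypothetical illustration (class and method names are mine, not Hadoop's) of why a single global lock caps JobTracker throughput: every TaskTracker RPC thread serializes on one monitor, so total heartbeat rate is bounded by 1/(time spent under the lock), no matter how many RPC threads exist.

```java
// Hypothetical sketch of JobTracker-style synchronous heartbeat handling.
// All scheduling work happens under one global lock, so throughput is
// bounded regardless of how many RPC handler threads are configured.
public class SyncHeartbeatSketch {
    private final Object globalLock = new Object();
    private int processed = 0;

    // Each TaskTracker RPC thread blocks here; if the work under the lock
    // takes ~5 ms, the master tops out around 200 heartbeats/sec.
    public String heartbeat(String trackerId) {
        synchronized (globalLock) {
            processed++;
            // scheduling decisions and bookkeeping, all serialized
            return "commands-for-" + trackerId;
        }
    }

    public int processedCount() { return processed; }

    public static void main(String[] args) {
        SyncHeartbeatSketch jt = new SyncHeartbeatSketch();
        jt.heartbeat("tt-1");
        jt.heartbeat("tt-2");
        System.out.println(jt.processedCount()); // prints 2
    }
}
```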
11. Highly Concurrent Systems
• Scale much better (if done right)
• Make effective use of multi-core hardware
• Managing eventual consistency of state is hard
• Need a systemic framework to manage this
12. Event Model
[Diagram: Event Queue → Event Dispatcher → Components A, B, … N]
• Mutations only via events
• Components expose only read APIs
• Use re-entrant locks
• Components follow a clear lifecycle
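The event model on this slide can be sketched as follows. This is a minimal illustration in the spirit of YARN's dispatcher, with illustrative names (`Event`, `Dispatcher`, `CounterComponent` are mine, not YARN's): one central queue, a dispatcher that routes events to registered components, and components whose state mutates only inside their event handler while exposing a read-only API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal sketch of the event model: one central queue, a dispatcher,
// and components that mutate state only when handling events.
public class DispatcherSketch {
    record Event(String type, String payload) {}

    interface Handler { void handle(Event e); }

    static class Dispatcher {
        private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();
        private final Map<String, Handler> handlers = new HashMap<>();

        void register(String type, Handler h) { handlers.put(type, h); }
        void dispatch(Event e) { queue.add(e); }

        // In a real system this loop runs on its own thread;
        // here the queue is drained inline for simplicity.
        void drain() {
            Event e;
            while ((e = queue.poll()) != null) handlers.get(e.type()).handle(e);
        }
    }

    // A component: mutation happens only via events; state is read-only.
    static class CounterComponent implements Handler {
        private int count = 0;                    // mutated only in handle()
        public void handle(Event e) { count++; }
        public int count() { return count; }      // read API
    }

    public static void main(String[] args) {
        Dispatcher d = new Dispatcher();
        CounterComponent c = new CounterComponent();
        d.register("TASK_FINISHED", c);
        d.dispatch(new Event("TASK_FINISHED", "t1"));
        d.dispatch(new Event("TASK_FINISHED", "t2"));
        d.drain();
        System.out.println(c.count()); // prints 2
    }
}
```

Because all mutations funnel through the single dispatch path, writers never race each other, while readers can call the read APIs directly.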
13. Asynchronous Heartbeat Handling
[Diagram: Heartbeat Request → Heartbeat Listener → Event Queue → NodeManager Meta; Get commands → Heartbeat Response]
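A minimal sketch of asynchronous heartbeat handling (illustrative names, not YARN's actual classes): the RPC thread only enqueues an event and returns a precomputed response, so the fast path holds no global lock, and the expensive processing happens later, off the RPC path.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of asynchronous heartbeat handling: enqueue and reply immediately;
// a separate thread does the heavy lifting.
public class AsyncHeartbeatSketch {
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    private volatile String lastCommands = "NO_OP"; // precomputed by processor

    // Fast path: O(1), no global lock, so the heartbeat rate scales.
    public String heartbeat(String nodeId) {
        eventQueue.add("STATUS_UPDATE:" + nodeId);  // hand off to processor
        return lastCommands;                        // reply immediately
    }

    // Slow path: normally a dedicated thread drains the queue and
    // updates the commands to hand out on the next heartbeat.
    public void processOne() {
        String event = eventQueue.poll();
        if (event != null) lastCommands = "commands-after-" + event;
    }

    public static void main(String[] args) {
        AsyncHeartbeatSketch rm = new AsyncHeartbeatSketch();
        System.out.println(rm.heartbeat("node-1")); // prints NO_OP
        rm.processOne();                            // processed off the RPC path
        System.out.println(rm.heartbeat("node-2"));
    }
}
```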
18. Complex State Management
• Lightweight state machine library
• Declarative way of specifying the state transitions
• Invalid transitions are handled automatically
• Fits nicely with the event model
• Debuggability is drastically improved: the lineage of object states can easily be determined
• Handy while recovering the state
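The declarative approach above can be sketched with a plain transition table. This is a toy illustration in the spirit of YARN's lightweight state machine library; the state and event names here are illustrative, not YARN's. Valid transitions are declared up front, and anything not in the table is rejected uniformly, which is what makes invalid transitions "handled automatically".

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a declarative state-machine table: all valid transitions
// are listed up front; everything else is an error by construction.
public class StateMachineSketch {
    private static final Map<String, String> TRANSITIONS = new HashMap<>();
    static {
        // key = "currentState/event", value = next state
        TRANSITIONS.put("NEW/INIT", "INITED");
        TRANSITIONS.put("INITED/START", "RUNNING");
        TRANSITIONS.put("RUNNING/COMPLETE", "SUCCEEDED");
        TRANSITIONS.put("RUNNING/KILL", "KILLED");
    }

    public static String transition(String state, String event) {
        String next = TRANSITIONS.get(state + "/" + event);
        if (next == null)   // invalid transitions handled in one place
            throw new IllegalStateException("Invalid: " + event + " in " + state);
        return next;
    }

    public static void main(String[] args) {
        String s = transition("NEW", "INIT");          // INITED
        s = transition(s, "START");                    // RUNNING
        System.out.println(transition(s, "COMPLETE")); // prints SUCCEEDED
    }
}
```

Because the table is data rather than scattered conditionals, it can be reviewed, visualized, and argued about before any handler code is written.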
21. MR Application Master Recovery
• Hadoop 1.0
• Application needs to resubmit the job
• All completed tasks are lost
• YARN
• Application execution state is checkpointed in HDFS
• State is rebuilt by replaying the events
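The replay idea can be sketched as follows. This is a toy illustration under simplifying assumptions (an in-memory list stands in for the HDFS journal, and the transition function is invented): events are persisted before they are applied, so a restarted ApplicationMaster can rebuild identical state by running the same transitions over the surviving log.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of recovery by event replay: the event log is the source of
// truth; a fresh master replays it to reconstruct state.
public class ReplayRecoverySketch {
    private String jobState = "NEW";
    private final List<String> journal;          // stands in for the HDFS log

    ReplayRecoverySketch(List<String> journal) { this.journal = journal; }

    void apply(String event) {
        journal.add(event);                      // persist first
        jobState = next(jobState, event);        // then mutate in memory
    }

    static String next(String state, String event) {
        return switch (event) {
            case "INIT" -> "INITED";
            case "START" -> "RUNNING";
            case "COMPLETE" -> "SUCCEEDED";
            default -> state;
        };
    }

    // A restarted AM replays the journal to rebuild identical state,
    // so completed work is not lost.
    static String recover(List<String> journal) {
        String state = "NEW";
        for (String e : journal) state = next(state, e);
        return state;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        ReplayRecoverySketch am = new ReplayRecoverySketch(log);
        am.apply("INIT");
        am.apply("START");
        am.apply("COMPLETE");
        System.out.println(recover(log)); // prints SUCCEEDED
    }
}
```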
We will talk about how YARN is built fundamentally differently from Hadoop 1.0: what is the motivation for doing so, and what does it buy us? Hadoop 1.0 is classic MR; Hadoop 2.0 has MR on YARN.
I work primarily on the Map-Reduce side and was part of the team when YARN was conceptualized. I work at InMobi, which is a mobile advertising company, where I lead the development of big data platforms, from data collection to data analytics systems. I don't see many folks from India; I am the organizer of the Hadoop meetup group.
Quick primer on the Hadoop 1.0 architecture: a single master known as the JobTracker; slave daemons are called TaskTrackers. Clients submit jobs to the JobTracker, and individual jobs contain map and reduce definitions. The JobTracker knows about the cluster's resources and schedules the map and reduce tasks accordingly.
A single master known as the ResourceManager (RM) manages the resources of the cluster. Slave daemons known as NodeManagers manage the resources of individual nodes. Clients submit applications (jobs are now called applications in YARN) to the ResourceManager. Each application has its own master process, spawned when the application starts running; this process, called the ApplicationMaster, is responsible for managing the lifecycle of the application. If the ApplicationMaster wants to spawn more processes in the cluster, it asks the ResourceManager for them. The resource definition of a process to be launched in the cluster is a container: it specifies things like RAM, disk, CPU, etc. Fundamentally, application state management is distributed, and the RM is responsible only for cluster management.
What the new architecture gets us. TODO: put animation. Scale, and a general-purpose distributed compute platform. I will discuss scale first; let's understand what scale means in the Hadoop context.
For a distributed compute platform, scalability is at two levels: how big a single application can be, and the number of concurrently running tasks in a single cluster. Application size is the number of sub-tasks plus the application-level state. The number of concurrent tasks is essentially the cluster size.
In Hadoop 1.0, application size is constrained by the JobTracker. The JT is a huge monolithic master: it keeps cluster-level metadata, task-level metadata, and application-specific metadata. You see things like counter limits for the same reason. TODO: put the formula.
TODO: put animation. This is because application management is distributed. Let's look at the other dimension: the number of concurrent tasks.
TODO: animation. Why does nobody run more than, say, 3k or 4k nodes under a JT? Because as the cluster size grows, utilization drops; the steeper the curve, the more you sacrifice on utilization. We decided utilization is acceptable at 4k, so the cluster should not grow beyond that. Let's see why utilization drops.
Task scheduling happens in the heartbeat. The JobTracker has a global lock, and heartbeats are processed synchronously, so JT throughput is limited to, say, 200 heartbeats/sec. As the cluster size increases, the interval at which a TaskTracker can send a heartbeat increases. The JT is not very concurrent; we need to design for better concurrency.
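The back-of-envelope arithmetic behind that note can be made explicit. Assuming the ~200 heartbeats/sec figure from the slide, each TaskTracker's minimum heartbeat interval grows linearly with cluster size, which is why scheduling latency, and therefore the number of idle slots, grows with the cluster.

```java
// Back-of-envelope sketch: if the master processes at most a fixed number
// of heartbeats per second, the minimum per-tracker heartbeat interval
// grows linearly with cluster size.
public class HeartbeatIntervalSketch {
    static double minIntervalSeconds(int nodes, int heartbeatsPerSec) {
        return (double) nodes / heartbeatsPerSec;
    }

    public static void main(String[] args) {
        // At 1,000 nodes a tracker can heartbeat every ~5 s;
        // at 4,000 nodes the gap stretches to ~20 s.
        System.out.println(minIntervalSeconds(1000, 200)); // prints 5.0
        System.out.println(minIntervalSeconds(4000, 200)); // prints 20.0
    }
}
```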
Same as in slides
Asynchronous processing of events. Each component encapsulates its state; mutations happen only via events, while reads can happen directly.
In the ResourceManager, heartbeat processing is asynchronous, so it can handle a large number of heartbeats/sec. So what is the impact of this on utilization?
As the cluster size increases, the drop in utilization is much lower, so a YARN cluster can have a large number of nodes within a single cluster.
Let's look at the state-management aspects. State management is crucial in distributed systems where there are a lot of moving parts.
This is the state-transition picture for a Job; there are similar ones for the other entities, such as task and attempt, which have even more nodes.
This is a very small snippet of JobTracker code, and it is like this throughout. No one dares to touch it; it is very, very fragile.
For this reason, YARN has a very lightweight state machine library.
All valid state transitions are declared up front. Apart from the obvious benefits, one can now visualize, discuss, and argue about proposed changes to the state machine, which is not possible in current Hadoop 1.0. I remember the first version of all the state transitions we designed in a spreadsheet, in which we could see which valid transitions we were missing.
Let's look at the HA story.
Same as in slides
Since the work done by the ResourceManager is limited to cluster management and scheduling, it is much simpler to build HA in the RM than in the JT, which carries a huge amount of state.
There are several compute paradigms being built over YARN; this slide lists some of them, and there are several others as well.
Same as slide. YARN is a general-purpose distributed compute platform.