[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma

Apache Hadoop 3
Rohith Sharma K S
Hadoop PMC member
YARN @ Hortonworks
rohithsharmaks@hortonworks.com

An Abbreviated History of Hadoop Releases
Date        Release  Major Notes
2007-11-04  0.14.1   First release at the ASF
2011-12-27  1.0.0    Security, HBase support
2012-05-23  2.0.0    YARN, NameNode HA, wire compatibility
2014-11-18  2.6.0    HDFS encryption, rolling upgrade, node labels
2015-04-21  2.7.0    Most recent production-quality release line

Motivation for Hadoop 3
● Upgrade minimum Java version to Java 8
○ Java 7 end-of-life in April 2015
○ Many Java libraries now only support Java 8
● HDFS erasure coding
○ Major feature that refactored core pieces of HDFS
○ Too big to backport to 2.x
● YARN as Data/container cloud
○ Significant change to support Docker and native service in YARN
● Other miscellaneous incompatible bugfixes and improvements
○ Hadoop 2.x was branched in 2011
○ 6 years of changes waiting for 3.0

Hadoop 3 status and release plan
● A series of alphas and betas leading up to GA
● GA by the end of 2017
Release       Date
3.0.0-alpha1  2016-09-03 ✔
3.0.0-alpha2  2017-01-25 ✔
3.0.0-alpha3  2017-05-16 ✔
3.0.0-alpha4  2017-07-07 ✔
3.0.0-beta1   2017-10-03 ✔
3.0.0 GA      2017 Q4
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+3.0.0+release

Erasure coding (HDFS-7285)
● Data protection method that uses data striping
● Motivation: improve storage efficiency of HDFS
○ ~2x the storage efficiency compared to 3x replication
○ Reduction of overhead from 200% to 40%
● Uses Reed-Solomon(k,m) erasure codes instead of replication
○ Support for multiple erasure coding policies
○ RS(3,2), RS(6,3), RS(10,4)
● Missing blocks reconstructed from remaining blocks
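
The overhead figures above follow directly from the code parameters. As a minimal sketch (the function names are illustrative, not part of HDFS), RS(k,m) stores m parity units for every k data units, so the extra storage is m/k and any m lost units can be reconstructed:

```python
from fractions import Fraction


def storage_overhead(data_units: int, parity_units: int) -> Fraction:
    """Extra storage for RS(k, m), as a fraction of the original data size."""
    return Fraction(parity_units, data_units)


def replication_overhead(replicas: int) -> Fraction:
    """N-way replication stores N - 1 extra copies of every block."""
    return Fraction(replicas - 1, 1)


def report() -> dict:
    """Map each scheme to (overhead in percent, number of lost units tolerated)."""
    schemes = {
        "3x replication": (replication_overhead(3), 2),
        "RS(3,2)": (storage_overhead(3, 2), 2),
        "RS(6,3)": (storage_overhead(6, 3), 3),
        "RS(10,4)": (storage_overhead(10, 4), 4),
    }
    # Multiply while still exact, then convert, to avoid float rounding noise.
    return {name: (float(ovh * 100), lost) for name, (ovh, lost) in schemes.items()}


if __name__ == "__main__":
    for name, (percent, lost) in report().items():
        print(f"{name}: {percent:.0f}% overhead, tolerates {lost} lost units")
```

The 40% figure corresponds to RS(10,4); RS(6,3) comes out at 50% and RS(3,2) at roughly 67%, all well below the 200% of 3x replication.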

Classpath isolation (HADOOP-11656)
● Hadoop leaks lots of dependencies onto the application’s classpath
○ Known offenders: Guava, Protobuf, Jackson, Jetty, …
● No separate HDFS client jar means server jars are leaked
● YARN / MR clients not shaded
● HDFS-6200: Split HDFS client into separate JAR
● HADOOP-11804: Shaded hadoop-client dependency
● YARN-6466: Shade the task umbilical for a clean YARN container environment (ongoing)

Miscellaneous
● Shell script rewrite
● Support for multiple Standby NameNodes
● Intra-DataNode balancer
● Support for Microsoft Azure Data Lake and Aliyun OSS
● Move default ports out of the ephemeral range
● S3 consistency and performance improvements (ongoing)
● Tightening the Hadoop compatibility policy (ongoing)
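
To illustrate why the default ports were moved, here is a small sketch. It assumes the common Linux default ephemeral range of 32768-60999, and the old/new port pairs are the HDFS defaults as understood from the 3.0 release notes; check both against your own systems before relying on them. A daemon whose port sits inside the ephemeral range can fail to start because the OS has already handed that port to an outgoing connection.

```python
# Assumption: default Linux ephemeral range (net.ipv4.ip_local_port_range).
EPHEMERAL_PORTS = range(32768, 61000)

# Assumption: (Hadoop 2.x default, Hadoop 3.x default) per daemon endpoint.
PORT_CHANGES = {
    "NameNode HTTP": (50070, 9870),
    "NameNode HTTPS": (50470, 9871),
    "Secondary NameNode HTTP": (50090, 9868),
    "DataNode data transfer": (50010, 9866),
    "DataNode IPC": (50020, 9867),
    "DataNode HTTP": (50075, 9864),
}


def in_ephemeral_range(port: int) -> bool:
    """True if the OS may hand this port out for outgoing connections."""
    return port in EPHEMERAL_PORTS


if __name__ == "__main__":
    for endpoint, (old, new) in PORT_CHANGES.items():
        print(f"{endpoint}: {old} (ephemeral: {in_ephemeral_range(old)}) "
              f"-> {new} (ephemeral: {in_ephemeral_range(new)})")
```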

Apache Hadoop 3.0 - YARN Enhancements
● Built-in support for Long Running Services
● Better resource isolation and Docker!!
● YARN Scheduling Enhancements
● Re-architecture for YARN Timeline Service - ATS v2
● Better User Experiences
● Other Enhancements

Built-in support for long running service in YARN
● A native YARN framework - YARN-4692
○ Abstract common Framework for long running service
■ Similar to Slider
○ More simplified API
● Recognition of long running service
○ Affect the policy of preemption, container reservation, etc.
○ Auto-restart of containers
○ Long running containers that keep local state are retried on the same node
● Service/application upgrade support - YARN-4726
○ Services are expected to run long enough to cross versions
● Dynamic container configuration
● Service Discovery
○ Expose existing service information in YARN registry via DNS (YARN-4757)

Docker on YARN
● Docker support in LinuxContainerExecutor
○ YARN-3611 (Umbrella)
○ Multiple container types are supported in the same executor.
○ A new docker container runtime is introduced that manages docker containers
○ LinuxContainerExecutor can delegate to either runtime on a per application basis
○ Clients specify which container type they want to use
■ currently via environment variables but eventually through well-defined client APIs.
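
A rough sketch of the environment-variable mechanism follows. The two variable names are the ones used by the Docker runtime documentation for YARN; the helper functions, the image name, and the "default" label are hypothetical and only mimic the per-container delegation described above.

```python
from typing import Dict, Optional

RUNTIME_TYPE_VAR = "YARN_CONTAINER_RUNTIME_TYPE"
DOCKER_IMAGE_VAR = "YARN_CONTAINER_RUNTIME_DOCKER_IMAGE"


def docker_launch_env(image: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a container launch environment that requests the Docker runtime."""
    env = dict(base_env or {})
    env[RUNTIME_TYPE_VAR] = "docker"
    env[DOCKER_IMAGE_VAR] = image
    return env


def runtime_for(env: Dict[str, str]) -> str:
    """Hypothetical stand-in for the executor's choice of runtime for one container."""
    return "docker" if env.get(RUNTIME_TYPE_VAR) == "docker" else "default"


if __name__ == "__main__":
    env = docker_launch_env("library/centos:7", {"APP_MODE": "batch"})
    print(runtime_for(env), env[DOCKER_IMAGE_VAR])
    print(runtime_for({"APP_MODE": "batch"}))
```

Because the choice travels with each container's environment, Docker and non-Docker containers can share one NodeManager.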

Docker road to YARN on YARN
Can use YARN to test Hadoop!!

Scheduling Enhancements
● Generic Resource Types
○ Abstract ResourceTypes to allow new resources, like: GPU, Network, etc.
○ Resource profiles for containers
● Global Scheduling: YARN-5139
○ Replace scheduling triggered only on node heartbeats with a global scheduler that runs parallel threads
○ Globally optimal placement strategies
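
The two ideas above can be sketched together: resources become an open-ended name-to-amount mapping instead of a fixed (memory, vcores) pair, and placement considers every node at once rather than whichever node happens to heartbeat. This is a toy model under those assumptions; the names and the tightest-fit scoring rule are illustrative and are not YARN's actual algorithm.

```python
from typing import Dict, Optional

Resource = Dict[str, int]  # e.g. {"memory_mb": 4096, "vcores": 2, "gpu": 1}


def fits(request: Resource, available: Resource) -> bool:
    """A request fits if every named resource is available in sufficient amount."""
    return all(available.get(name, 0) >= amount for name, amount in request.items())


def tightness(request: Resource, available: Resource) -> float:
    """Largest requested/available ratio; higher means a tighter (better) fit."""
    ratios = [amount / available[name] for name, amount in request.items() if amount > 0]
    return max(ratios, default=0.0)


def place(request: Resource, nodes: Dict[str, Resource]) -> Optional[str]:
    """Look at all nodes and return the tightest-fitting one, or None."""
    candidates = [name for name, avail in nodes.items() if fits(request, avail)]
    if not candidates:
        return None
    return max(candidates, key=lambda name: tightness(request, nodes[name]))


if __name__ == "__main__":
    nodes = {
        "n1": {"memory_mb": 8192, "vcores": 8},
        "n2": {"memory_mb": 16384, "vcores": 16, "gpu": 2},
        "n3": {"memory_mb": 65536, "vcores": 32, "gpu": 8},
    }
    print(place({"memory_mb": 4096, "vcores": 2, "gpu": 1}, nodes))
    print(place({"memory_mb": 4096, "vcores": 2}, nodes))
```

Adding a new resource type, such as network bandwidth, only requires a new key in the mappings; the fitting logic stays unchanged.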

Scheduling Enhancements (Contd.)
● Other CapacityScheduler improvements
○ Queue Management Improvements (REST API support)
○ Absolute resource configuration support in queues
○ Priority Support in Application and Queue
○ Intra-queue Preemption
● FairScheduler improvements
○ Preemption improvements
○ Better defaults:
■ Assign multiple containers in a heartbeat based on resource availability

Application Timeline Service v2
● ATS: Captures system/application events/metrics
● v2 improvements:
○ Enhanced Data Model: first-class citizen for Flows, Config, etc.
○ Scalable backend: HBase
○ Distributed Reader/Writer
○ Others
■ Captures system metrics. E.g. memory/cpu usage per container over time
■ Efficient updates: just write a new version to the appropriate HBase cell
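
The "write a new version" idea can be sketched with a toy in-memory store. This is not the HBase client API and not the ATS v2 schema; the class, row key, and column name are hypothetical. The point is that a metric update is an append of a (timestamp, value) version to one cell, with no read-modify-write, and the version history doubles as the time series.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, int]  # (timestamp, value)


class VersionedCellStore:
    """Toy model of a store that keeps multiple timestamped versions per cell."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[str, str], List[Version]] = defaultdict(list)

    def put(self, row: str, column: str, timestamp: int, value: int) -> None:
        """Record a new version; existing versions are never read or rewritten."""
        self._cells[(row, column)].append((timestamp, value))

    def latest(self, row: str, column: str) -> Optional[int]:
        """Value of the newest version, or None if the cell is empty."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def series(self, row: str, column: str) -> List[Version]:
        """All versions in timestamp order, i.e. the metric over time."""
        return sorted(self._cells.get((row, column), []))


if __name__ == "__main__":
    store = VersionedCellStore()
    store.put("container_1", "memory_mb", 100, 512)
    store.put("container_1", "memory_mb", 200, 768)
    print(store.latest("container_1", "memory_mb"))
    print(store.series("container_1", "memory_mb"))
```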

YARN New WebUI
● Improved visibility into cluster usage
○ Memory, CPU
○ By queues and applications
○ Sunburst graphs for hierarchical queues
○ NodeManager heatmap
● ATSv2 integration
○ Plot container start/stop events
○ Easy to capture delays in app execution

Misc. YARN/MR improvements
● Opportunistic containers (YARN-2877 & YARN-5542)
○ Motivation: Resource utilization is typically low in most clusters
○ Solution: Run some containers at a lower priority and preempt them as needed to make room for Guaranteed containers
● YARN Federation (YARN-2915 & YARN-5597)
○ Allows YARN to scale to 100k nodes and beyond
● HA improvements
○ Better handling of transient network issues
○ ZK-store scalability: Limit number of children under a znode
● MapReduce Native Collector (MAPREDUCE-2841)
○ Native implementation of the map output collector
○ Up to 30% faster for shuffle-intensive jobs
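
The opportunistic-container idea from the first bullet can be sketched as a toy single-node model. It tracks memory only, and the class names and the newest-first victim order are assumptions made for illustration; a real NodeManager would typically queue an opportunistic container that does not fit, where this sketch simply raises.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    name: str
    memory_mb: int
    guaranteed: bool


@dataclass
class Node:
    capacity_mb: int
    running: List[Container] = field(default_factory=list)

    def used_mb(self) -> int:
        return sum(c.memory_mb for c in self.running)

    def start(self, container: Container) -> List[Container]:
        """Start a container and return the opportunistic containers it preempted."""
        free = self.capacity_mb - self.used_mb()
        preempted: List[Container] = []
        if container.guaranteed:
            # Only guaranteed containers may evict, and only opportunistic ones.
            opportunistic = [c for c in self.running if not c.guaranteed]
            for victim in reversed(opportunistic):  # most recently started first
                if free >= container.memory_mb:
                    break
                preempted.append(victim)
                free += victim.memory_mb
        if free < container.memory_mb:
            raise RuntimeError(f"not enough memory for {container.name}")
        self.running = [c for c in self.running if c not in preempted]
        self.running.append(container)
        return preempted


if __name__ == "__main__":
    node = Node(capacity_mb=4096)
    node.start(Container("o1", 2048, guaranteed=False))
    node.start(Container("o2", 1024, guaranteed=False))
    evicted = node.start(Container("g1", 2048, guaranteed=True))
    print([c.name for c in evicted], [c.name for c in node.running])
```

Idle capacity is put to work by the opportunistic containers, while guaranteed work still gets its resources on demand.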

Summary: What’s new in Hadoop 3.0?
● Storage Optimization
○ HDFS: Erasure codes
● Improved Utilization
○ YARN: Long Running Services
○ YARN: Scheduling Enhancements
● Additional Workloads
○ YARN: Docker & Isolation
● Easier to Use
○ New User Interface
● Refactor Base
○ Lots of Trunk content
○ JDK8 and newer dependent libraries

Compatibility
● Strong feedback from large users on the need for compatibility
● Preserves wire-compatibility with Hadoop 2 clients
○ Impossible to coordinate upgrading off-cluster Hadoop clients
● Will support rolling upgrade from Hadoop 2 to Hadoop 3
○ Can’t take downtime to upgrade a business-critical cluster
● Not fully preserving API compatibility!
○ Dependency version bumps
○ Removal of deprecated APIs and tools
○ Shell script rewrite, rework of Hadoop tools scripts
○ Incompatible bug fixes

Testing and validation
● Extended alpha → beta → GA plan designed for stabilization
● EC already has some users in production (700 nodes at Y! JP)
● Cloudera is rebasing CDH against upstream and running full test suite
○ Integration of Hadoop 3 with all components in CDH stack
○ Same integration tests used to validate CDH5
● Hortonworks is also integrating and testing Hadoop 3
● Microsoft has deployed the YARN federation feature in production
● Happy synergy between 2.8.x and 3.0.x lines
○ Shares much of the same code, fixes flow into both
○ Yahoo! doing scale testing of 2.8.0

Conclusion
● Expect Hadoop 3.0.0 GA by the end of December
● Shiny new features
○ HDFS Erasure Coding
○ YARN Docker and Native Service Support
○ YARN ATSv2
○ Client classpath isolation
○ YARN federation
● Great time to get involved in testing and validation
