Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup in Berlin on 31.08.2017 and at the Data2Day Meetup in Heidelberg on 28.09.2017.
3. Some Hadoop history
• Hadoop 1 (2006) — "Let there be batch!", the era of Silicon Valley Hadoop
  • HDFS: redundant, reliable storage
  • MapReduce: cluster resource management + data processing
• Hadoop 2 (Oct. 2013) — "Let there be YARN apps!", the era of Enterprise Hadoop
  • HDFS: redundant, reliable storage
  • YARN: cluster resource management
  • MapReduce: data processing
  • Hive: SQL
  • Spark: in-memory
  • …
• Hadoop 3 (late 2017) — "Let there be …?", the era of ?
  • IoT, Machine Learning, GPUs, TensorFlow, Data Science, Streaming Data, Cloud, FPGAs, Artificial Intelligence, Kafka
4. Why Hadoop 3.0?
• Deprecated APIs can only be removed in a major release
• Not fully preserving API compatibility
• Wire compatibility could be broken
  • Change of default ports
  • But wire compatibility with Hadoop 2 clients is preserved
  • Rolling upgrades from Hadoop 2 to Hadoop 3 will be supported
• Rewrite of the Hadoop command scripts
• Big features that need a major release to stabilize
5. What is Hadoop 3.0?
[Release timeline 2009–2017; source: Akira Ajisaka, with additions by Uwe Printz]
• branch-1 (branch-0.20): 0.20.1, 0.20.205, 1.0.0, 1.1.0, 1.2.1 (stable) — Hadoop 1, now EOL
• 0.21.0 (new append), 0.22.0 (security), branch-0.23: 0.23.0 … 0.23.11 (final)
• branch-2: 2.0.0-alpha, 2.1.0-beta, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.1 — Hadoop 2, bringing e.g. YARN, NameNode Federation & HA, HDFS Snapshots, NFSv3 support, HDFS ACLs, HDFS Rolling Upgrades, RM Automatic Failover, Heterogeneous Storage, HDFS In-Memory Caching, HDFS Extended Attributes, YARN Rolling Upgrades, Transparent Encryption, Archival Storage, Docker Containers in Linux, ATS 1.5, drop of JDK6 support, File Truncate API
• trunk: 3.0.0-alpha1 to 3.0.0-alpha4, 3.0.0-beta1 (15.09.), GA planned for 01.11. — Hadoop 3
• Hadoop 2 and 3 diverged 5+ years ago
6. Hadoop 3.0 in a nutshell
• HDFS
• Erasure codes
• Low-level performance enhancements with Intel ISA-L
• 2+ NameNodes
• Intra-DataNode Balancer
• YARN
• Better support for long-running services
• Improved isolation & Docker support
• Scheduler enhancements
• Application Timeline Service v2
• New UI
• MapReduce
• Task-level native optimization
• Derive heap-size automatically
• DevOps
• Drop JDK7 & Move to JDK8
• Change of default ports
• Library & Dependency Upgrade
• Client-side classpath Isolation
• Shell Script Rewrite & ShellDoc
• .hadooprc & .hadoop-env
• Metrics plugin for Kafka
8. HDFS - Current implementation
• 3 replicas by default
• Tolerate maximum of 2 failures
• Simple, scalable & robust
• 200% space overhead
[Diagram of the write path: the HDFS client sends a write request and gets a lease for the file from the NameNode, splits the data into blocks, requests DataNodes and receives a list of DataNodes; it calculates the checksum and writes each block + checksum to DataNode 1, which forwards it along the write pipeline to DataNode 2 and DataNode 3; ACKs travel back through the pipeline until the write is complete.]
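The replication behaviour above can be inspected with standard HDFS tooling; a minimal sketch (the path is illustrative, not from the slides):

  # Show blocks, replicas and their DataNode locations for a file
  hdfs fsck /data/events/part-00000 -files -blocks -locations
  # Change the replication factor of that file and wait for re-replication
  hdfs dfs -setrep -w 2 /data/events/part-00000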
9. Erasure Coding (EC)
• k data blocks + m parity blocks
• Example: Reed-Solomon (6,3)
[Diagram: raw data is split into stripes of data cells (d), which are encoded into parity cells (p); data and parity are then stored together.]
• Key points
  • XOR coding → saves space, slower recovery
  • Missing or corrupt data will be restored from available data and parity
  • Parity can be smaller than data
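For orientation, a back-of-the-envelope check of the space numbers used on the next slide (derived from k and m, not taken from the slides):

  space efficiency = k / (k + m),   storage overhead = m / k
  Reed-Solomon (6,3):   6 / 9  ≈ 67 % efficiency,  50 % overhead
  Reed-Solomon (10,4): 10 / 14 ≈ 71 % efficiency,  40 % overhead
  3x replication:       1 / 3  ≈ 33 % efficiency, 200 % overhead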
10. EC - Main characteristics
                           Replication    Replication    Reed-Solomon    Reed-Solomon
                           (Factor 1)     (Factor 3)     (6,3)           (10,4)
  Maximum fault tolerance  0              2              3               4
  Space efficiency         100 %          33 %           67 %            71 %

                           Replication                   Reed-Solomon
  Data locality            Yes                           No (Phase 1) / Yes (Phase 2)
  Write performance        High                          Low
  Read performance         High                          Medium
  Recovery costs           Low                           High

Pluggable implementation; Reed-Solomon (6,3) is the first choice.

  Storage tier             Hot            Warm           Cold            Frozen
  Medium                   Memory/SSD     Disk           Dense Disk      EC
  Typical access           20 x Day       5 x Week       5 x Month       2 x Year
11. EC - Contiguous blocks
• Approach 1: Retain block size and add parity
[Diagram: Files A, B and C occupy six 128 MB blocks (Block 1–6) on DataNodes 3, 2, 12, 7, 5 and 1; encoding produces three parity blocks on DataNodes 6, 4 and 8.]
• Pro: Better for locality
• Con: Significant overhead for smaller files, always 3 parity blocks needed
• Con: Client potentially needs to process GBs of data for encoding
12. EC - Striping
• Approach 2: Splitting blocks into smaller cells (1 MB)
[Diagram: the cells of Files A, B and C are written round-robin as Stripe 1 … Stripe n across Blocks 1–6 on DataNodes 7, 3, 4, 1, 6 and 12; each stripe is encoded into parity cells stored on DataNodes 10, 8 and 14.]
• Pro: Works for small files
• Pro: Allows parallel write
• Con: No data locality → increased read latency & more complicated recovery process
14. EC - Shell Command
• Create an EC zone on an empty directory
• All the files under a zone directory are automatically erasure coded
• Renames across zones with different EC schemas are disallowed
Usage: hdfs erasurecode [generic options]
[-getPolicy <path>]
[-help [cmd ...]]
[-listPolicies]
[-setPolicy [-p <policyName>] <path>]
-getPolicy <path> :
Get erasure coding policy information about at specified path
-listPolicies :
Get the list of erasure coding policies supported
-setPolicy [-p <policyName>] <path> :
Set a specified erasure coding policy to a directory
Options :
-p <policyName> erasure coding policy name to encode files. If not passed the
default policy will be used
<path> Path to a directory. Under this directory files will be
encoded using specified erasure coding policy
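A minimal usage sketch following the help text above (policy name and path are illustrative; the available policy names depend on the Hadoop 3 build):

  hdfs erasurecode -listPolicies
  hdfs erasurecode -setPolicy -p RS-6-3-64k /data/cold
  hdfs erasurecode -getPolicy /data/cold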
15. EC - Write Path
• Parallel write
  • Client writes to 9 data nodes at the same time
  • Calculate parity at client, at write time
• Durability
  • Reed-Solomon (6,3) can tolerate max. 3 failures
• Visibility
  • Read is supported for files being written
• Appendable
  • Files can be reopened for appending data
[Diagram: the HDFS client splits the data into 1 MB stripes and calculates parity; data cells go to DataNodes 1–6 and parity cells to DataNodes 7–9, each DataNode acknowledging its write.]
16. EC - Write Failure Handling
• Data node failure
  • Client ignores the failed data node and continues writing
  • Reed-Solomon (6,3) is able to tolerate 3 data node failures
  • Requires at least 6 data nodes
  • Missing blocks will be reconstructed later
[Diagram: same write path as before, but one DataNode has failed; the client skips it and the remaining DataNodes acknowledge their data and parity writes.]
17. EC - Read Path
• Read data from 6 data nodes in parallel
[Diagram: the HDFS client reads the 1 MB data cells of a block in parallel from DataNodes 1–6; the parity nodes (DataNodes 7–9) are not contacted.]
18. EC - Read Failure Handling
• Read data from 6 arbitrary data nodes in parallel
• Read parity block to reconstruct missing data block
[Diagram: one data cell is missing, so the client additionally reads a parity cell from DataNodes 7–9 and reconstructs the missing data block from the available data and parity.]
19. EC - Network behavior
• Pro’s
• Low latency because of parallel read & write
• Good for small file sizes
• Con’s
• Requires high network bandwidth between client & server
• Dead data nodes result in high network traffic and reconstruction
time
20. EC - Coder implementation
• Legacy coder
• From Facebook’s HDFS-RAID project
• [Umbrella] HADOOP-11264
• Pure Java coder
• Code improvements over HDFS-RAID
• HADOOP-11542
• Intel ISA-L coder
• Native coder with Intel’s Intelligent Storage Acceleration Library
• Accelerates EC-related linear algebra calculations by exploiting advanced hardware
instruction sets like SSE, AVX, and AVX2
• HADOOP-11540
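Whether a native coder such as ISA-L is actually available can be checked on a Hadoop 3 installation (the exact output lines may differ between builds):

  # Lists the native libraries Hadoop was built with, including ISA-L support
  hadoop checknative -a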
24. 2+ NameNodes (HDFS-6440)
• Hadoop 1
  • No built-in High Availability
  • Needed to be solved yourself, e.g. via VMware
• Hadoop 2
  • High Availability out-of-the-box via an Active-Passive pattern (one active, one standby NameNode)
  • Need to recover immediately after a failure, since there is only one standby
• Hadoop 3
  • 1 active NameNode with N standby NameNodes
  • Trade-off between operational costs vs. hardware costs
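A quick sketch of checking the NameNodes and their HA states from the command line (assuming a nameservice with the illustrative NameNode IDs nn1, nn2 and nn3):

  # List the NameNodes configured for the cluster
  hdfs getconf -namenodes
  # Query the HA state (active/standby) of each NameNode ID
  hdfs haadmin -getServiceState nn1
  hdfs haadmin -getServiceState nn2
  hdfs haadmin -getServiceState nn3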
25. Intra-DataNode Balancer (HDFS-1312)
• Hadoop already has a Balancer between
DataNodes
• Needs to be called manually by design
• Typically used after adding additional worker nodes
• The Disk Balancer lets administrators rebalance
data across multiple disks of a DataNode
• It is useful to correct skewed data distribution often seen after adding or replacing disks
• Adds an hdfs diskbalancer command that submits a plan; it does not wait for the plan to finish executing, the DataNode performs the moves itself
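A typical workflow sketch (hostname and plan file path are illustrative):

  # Create a plan describing the moves between the disks of one DataNode
  hdfs diskbalancer -plan datanode3.example.com
  # Submit the plan; the DataNode executes the moves itself
  hdfs diskbalancer -execute /system/diskbalancer/<date>/datanode3.example.com.plan.json
  # Check the progress of the plan on that DataNode
  hdfs diskbalancer -query datanode3.example.com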
27. YARN - Built-in support for long-running services
• Simplified and first-class support for services (YARN-4692)
  • Abstract common framework to support long-running services (similar to Apache Slider)
  • More simplified API for managing the service lifecycle of YARN apps
• Better support for long-running services
  • Recognition of long-running services (YARN-4725)
  • Auto-restart of containers
  • Containers of long-running services are retried on the same node in case of local state
• Service/application upgrade support (YARN-4726)
  • Hold on to containers during an upgrade of the YARN app
• Dynamic container resizing (YARN-1197)
  • Ask only for minimum resources at start and adjust them at runtime
  • Currently the only way is releasing containers and allocating new containers with the expected size
28. YARN - Resource Isolation & Docker
• Better Resource Isolation
• Support for disk isolation (YARN-2619)
• Support for network isolation (YARN-2140)
• Uses cgroups to give containers their fair share
• Docker support in LinuxContainerExecutor (YARN-3611)
• The LinuxContainerExecutor already provides functionality around localization, cgroups-based resource management and isolation for CPU, network, disk, etc., as well as security mechanisms
• Support Docker containers to be run inside of LinuxContainerExecutor
• Offers packaging and resource isolation
• Complements YARN’s support for long-running services
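A hedged sketch of running a YARN container as a Docker container via the distributed-shell example application (image name and jar path are illustrative; Docker support must be enabled for the LinuxContainerExecutor):

  yarn jar hadoop-yarn-applications-distributedshell-*.jar \
    -jar hadoop-yarn-applications-distributedshell-*.jar \
    -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
    -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=centos:7 \
    -shell_command "sleep 60" \
    -num_containers 1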
29. YARN - Service Discovery
• Services can run on any YARN node
• Dynamic IP, can change in case of node failures, etc.
• YARN Service Discovery via DNS (YARN-4757)
• The YARN service registry already provides facilities for applications to register their endpoints and for clients to discover them, but they are only available via a Java API and REST
• Expose service information via an already available discovery mechanism: DNS
• Current YARN Service Registry records need to be converted into DNS entries
• Discovery of the container IP and service port via standard DNS lookups
• Mapping of Applications, e.g.
zkapp1.griduser.yarncluster.com -> 172.17.0.2
• Mapping of containers, e.g.
container-e3741-1454001598828-0131-01000004.yarncluster.com -> 172.17.0.3
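Discovery then works with plain DNS tooling; a sketch reusing the example names above (the registry DNS server address is illustrative):

  # Resolve the application record via the YARN registry DNS server
  dig @registrydns.yarncluster.com zkapp1.griduser.yarncluster.com
  # Resolve an individual container
  dig @registrydns.yarncluster.com container-e3741-1454001598828-0131-01000004.yarncluster.com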
30. YARN - Scheduling enhancements
• Generic Resource Types
• Abstract ResourceTypes to allow new resources, like GPU, Network, etc.
• Resource profiles for containers, like small, medium, large, etc., similar to EC2 instance types
• Global Scheduling (YARN-5139)
• Currently YARN scheduling is done one node at a time on arrival of heartbeats, which can lead to suboptimal decisions
• With global scheduling, the YARN scheduler looks at more nodes and selects the best nodes based on application requirements, which leads to a globally optimal placement and enhanced container scheduling throughput
• Application priorities within a queue (YARN-1963)
• For example, in queue Marketing Hive jobs > MapReduce jobs
• Inter-Queue priorities (YARN-4945)
• Queue 1 > Queue 2, irrespective of demand & capacity
• Previously based only on unconsumed capacity
• Affinity / Anti-Affinity (YARN-1042)
• More fine-grained constraints on locations, e.g. do not allocate HBase RegionServers and Storm workers on the same host
• Gang Scheduling (YARN-624)
• Allow allocation of sets of containers, e.g. 1 container with 128GB of RAM and 16 cores OR 100 containers with 2GB of RAM and 1 core
• Can be achieved already by holding on to containers but might lead to deadlocks and decreased cluster utilization
31. YARN - Use the force!
[Diagram: MapReduce, Tez and Spark running on top of YARN]
33. Application Timeline Service v2 (YARN-2928)
Why?
• Scalability & Performance
• Single global instance of Writer/Reader
• Local disk based LevelDB storage
• Reliability
• Failure handling with local disk
• Single point-of-failure
• Usability
• Add configuration and metrics as first-class
members
• Better support for queries
• Flexibility
• Data model is more describable
Core Concepts
• Distributed write path
• Logical per app collector
• Separate reader instances
• Pluggable backend storage
• HBase
• Enhanced internal data model
• Metrics Aggregation
• Richer REST API for queries
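Reads go against the timeline reader's REST API; a hedged sketch (reader host, port and application ID are illustrative, and the exact paths may differ per version):

  curl "http://timelinereader.example.com:8188/ws/v2/timeline/apps/application_1503900000000_0001"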
35. Summary of Hadoop 3.0
• Major release, expect it end of 2017
• Shiny new features like Erasure Coding and better support for long-running services & Docker
• Expect some changes in administration of your existing Hadoop clusters
• But: Ambari & wire compatibility & rolling upgrade
36. Big Data Lake
[Diagram: a Big Data Lake built on Enterprise Hadoop]
• Data sources: traditional sources (RDBMS, OLTP, OLAP, …) and new sources (logs, sensor data, social media, …)
• Data systems: the Hadoop-based Big Data Lake next to traditional data systems and e.g. a SAS LASR Server
• Applications: Business Intelligence, Zeppelin
• Traditional data system requirements around HDFS & YARN (storage & processing): access via Apache Zeppelin and Ambari Views, operations via Apache Ambari and Cloudbreak, security via Apache Ranger and Apache Knox, governance via Apache Atlas and Falcon
37. Hadoop & Cloud
• Microsoft: HDInsight; Cloudbreak & Azure
• Amazon: Hortonworks Data Cloud (HDC); Cloudbreak & EC2
• Google Cloud: Cloudbreak & GCP
• They suggest spinning up a cluster per job
• YARN support for Docker
  • Think packaging of apps like TensorFlow, etc.
• Better support for long-running services
  • Think dynamic resizing of cluster resources
• Service Discovery
• Scheduling enhancements
  • Think automatic up- and downscaling
• Let's welcome Hadoop to the cloud age!
38. Hadoop & Streaming & IoT
• #1 Lambda Architecture
  [Diagram: a speed layer answering in ms–s alongside a batch layer answering in min–h]
• #2 Hadoop as long-term storage
• Better YARN support for long-running services
  • Think Spark Streaming
• Erasure Coding for more efficient cold storage
• Apache NiFi & HDF
• Rumors about integration of Apache Flink into HDP 3.0
39. Data Science & Machine Learning
• YARN support for GPUs and other resources
• YARN & Docker for packaging of apps
• Better integration of Spark & MLlib
• TensorFlow on the rise
https://unsplash.com/photos/iWYrCr8eGwU