1. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hadoop 3.0 in a Nutshell
Munich, Apr. 2017
Sanjay Radia, Junping Du
2.
About Speakers
Sanjay Radia
⬢ Chief Architect, Founder, Hortonworks
⬢ Part of the original Hadoop team at Yahoo! since 2007
– Chief Architect of Hadoop Core at Yahoo!
– Apache Hadoop PMC and Committer
⬢ Prior
– Data center automation, virtualization, Java, HA, OSs, File Systems
– Startup, Sun Microsystems, Inria …
– Ph.D., University of Waterloo
Junping Du
– Apache Hadoop Committer & PMC member
– Lead Software Engineer @ Hortonworks YARN Core Team
– 10+ years developing enterprise software (5+ years as a “Hadooper”)
3.
Why Hadoop 3.0
⬢ Lots of content in trunk that did not make it to the 2.x branch
⬢ JDK Upgrade – does not truly require
bumping major number
⬢ Hadoop command scripts rewrite
(incompatible)
⬢ Big features that need stabilizing major
release – Erasure codes
⬢ YARN: long running services
⬢ Ephemeral Ports (incompatible)
The Driving Reasons
Some features taking advantage of 3.0
4.
Apache Hadoop 3.0
⬢HDFS: Erasure codes
⬢YARN:
–Long running services,
– scheduler enhancements,
– Isolation & Docker
– UI
⬢Lots of Trunk content
⬢ JDK8 and newer dependent
libraries
⬢ 3.0.0-alpha1 - Sep/3/2016
⬢ Alpha2 - Jan/25/2017
⬢ Alpha3 - Q2 2017 (Estimated)
⬢ Beta/GA - Q3/Q4 2017 (Estimated)
Key Takeaways
Release Timeline
5.
⬢ Hadoop 3.0 Basis - Major changes you should know before upgrade
– JDK upgrade
– Dependency upgrade
– Change on default port for daemon/services
– Shell script rewrite
⬢ Features
– Hadoop Common
•Client-Side Classpath Isolation
– HDFS
•Erasure Coding
•Support for more than 2 NameNodes
– YARN
•Support for long running services
•Scheduling enhancements: App / Queue priorities, global scheduling, placement strategies
•New UI
•ATS v2
– MAPREDUCE
•Task-level native optimization (HADOOP-11264)
Agenda
6.
⬢ Minimum JDK for Hadoop 3.0.x is JDK8 (HADOOP-11858)
– Oracle JDK 7 reached EoL in April 2015!
⬢ Moving forward to use new features of JDK8
– Lambda Expressions – starting to use this
– Stream API
– security enhancements
– performance enhancement for HashMaps, IO/NIO, etc.
⬢ Hadoop’s evolution with JDK upgrades
– Hadoop 2.6.x - JDK 6, 7, 8 or later
– Hadoop 2.7.x/2.8.x/2.9.x - JDK 7, 8 or later
– Hadoop 3.0.x - JDK 8 or later
Hadoop Operation - JDK Upgrade
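As a rough illustration (not actual Hadoop code), this is the kind of JDK8 style — lambda expressions, the Stream API, method references — that Hadoop 3 code can now rely on; the class and method names here are made up for the example.

```java
import java.util.Arrays;
import java.util.List;

public class Jdk8Sketch {
    // Illustrative only: sums block sizes with the JDK8 Stream API and a
    // method reference instead of an explicit loop.
    static long totalSize(List<Long> blockSizes) {
        return blockSizes.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // Three blocks of 128, 128 and 64 (sizes in MB).
        System.out.println(totalSize(Arrays.asList(128L, 128L, 64L))); // 320
    }
}
```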
7.
⬢ Jersey: 1.9 to 1.19
–A root element whose content is an empty collection is now serialized as an empty object ({}) instead of null.
⬢ Grizzly-http-servlet: 2.1.2 to 2.2.21
⬢ Guice: 3.0 to 4.0
⬢ cglib: 2.2 to 3.2.0
⬢ asm: 3.2 to 5.0.4
⬢ netty-all: 4.0.23 to 4.1.x (in discussion)
⬢ Protocol Buffer: 2.5 to 3.x (in discussion)
Dependency Upgrade
8.
⬢ Previously, the default ports of multiple Hadoop services were in the Linux
ephemeral port range (32768-61000)
– Can conflict with other apps running on the same node
⬢ New ports:
– NameNode ports: 50470 → 9871, 50070 → 9870, 8020 → 9820
– Secondary NN ports: 50091 → 9869, 50090 → 9868
– DataNode ports: 50020 → 9867, 50010 → 9866, 50475 → 9865, 50075 → 9864
⬢ KMS service port: 16000 → 9600
Change of Default Ports for Hadoop Services
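If an upgraded cluster must keep the legacy ports (firewall rules, or clients that hard-code them), the usual HDFS configuration keys can pin them back; a sketch, assuming the standard hdfs-site.xml property names:

```xml
<!-- hdfs-site.xml: pin selected daemons back to their pre-3.0 ports. -->
<property>
  <name>dfs.namenode.http-address</name>
  <value>0.0.0.0:50070</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
</property>
```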
9.
Hadoop Common
Client-Side Classpath Isolation
10.
⬢ Problem
– Application code’s dependencies (including those of Apache Hive or other dependent projects) can conflict with Hadoop’s dependencies
⬢ Solution
– Separating Server-side jar and Client-side jar
•Like hbase-client, dependencies are shaded
Client-side classpath isolation
HADOOP-11656/HADOOP-13070
[Diagram: with a single unshaded jar, user code’s newer commons libraries conflict with the older commons pulled in by the Hadoop client and server; with a shaded hadoop-client jar, the two versions co-exist.]
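In practice this surfaces to downstream builds as shaded client artifacts; a sketch of how an application might depend on them, assuming the hadoop-client-api / hadoop-client-runtime artifact pair introduced for this work:

```xml
<!-- Compile against the shaded client API only; Hadoop's own transitive
     dependencies (guava, jersey, ...) stay out of the app classpath. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-api</artifactId>
  <version>3.0.0</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client-runtime</artifactId>
  <version>3.0.0</version>
  <scope>runtime</scope>
</dependency>
```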
11.
HDFS
Support for Three NameNodes for HA
Erasure coding
12.
Current (2.x) HDFS Replication Strategy
⬢ Three replicas by default
– 1st replica on local node, local rack or random node
– 2nd and 3rd replicas on the same remote rack
– Reliability: tolerate 2 failures
⬢ Good data locality, local shortcut
⬢ Multiple copies => Parallel IO for parallel compute
⬢ Very Fast block recovery and node recovery
– Parallel recovery – the bigger the cluster, the faster
– 10TB node recovery: 30 sec to a few hours
⬢ 3x storage overhead vs 1.4–1.6x for erasure coding
– Remember that Hadoop’s JBOD is much, much cheaper
– 1/10 – 1/20 the cost of SANs
– 1/10 – 1/5 the cost of NFS
[Diagram: replica r1 on a DataNode in Rack I; replicas r2 and r3 on a DataNode in Rack II.]
13.
Erasure Coding
⬢k data blocks + m parity blocks (k + m)
– Example: Reed-Solomon 6+3
⬢Reliability: tolerate m failures
⬢Save disk space
⬢Save I/O bandwidth on the write path
[Diagram: 6 data blocks (b1–b6) and 3 parity blocks (P1–P3).]
• 1.5x storage overhead
• Tolerate any 3 failures
                              3-replication    (6, 3) Reed-Solomon
Maximum fault tolerance       2                3
Disk usage (N bytes of data)  3N               1.5N
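The numbers in the table fall out of simple arithmetic; a minimal sketch (class and method names are illustrative):

```java
public class EcMath {
    // A (k, m) erasure code stores (k + m) / k bytes per byte of data and
    // tolerates any m failures; n-way replication stores n bytes per byte
    // and tolerates n - 1 failures.
    static double ecOverhead(int k, int m) {
        return (double) (k + m) / k;
    }

    static long ecDiskUsage(long dataBytes, int k, int m) {
        return dataBytes * (k + m) / k;
    }

    public static void main(String[] args) {
        System.out.println(ecOverhead(6, 3));       // 1.5x for RS(6, 3)
        System.out.println(ecDiskUsage(600, 6, 3)); // 900 bytes stored for 600
    }
}
```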
14.
Block Reconstruction
⬢ Block reconstruction overhead
– Higher network bandwidth cost
– Extra CPU overhead
• Local Reconstruction Codes (LRC), Hitchhiker
[Diagram: data blocks b1–b6 and parity blocks P1–P3 placed on nine different racks; reconstructing one block must read from many remote racks.]
Huang et al. Erasure Coding in Windows Azure Storage. USENIX ATC'12.
Sathiamoorthy et al. XORing elephants: novel erasure codes for big data. VLDB 2013.
Rashmi et al. A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers. SIGCOMM'14.
15.
Erasure Coding on Contiguous/Striped Blocks
⬢ EC on striped blocks
– Pros: Leverage multiple disks in parallel
– Pros: Works for small files
– Cons: No data locality for readers
[Diagram, striped layout: each stripe interleaves data cells (C1–C6, C7–C12, …) with parity cells (PC1–PC6, …) across a block group of 6 data blocks (b1–b6) and 3 parity blocks (P1–P3). Contiguous layout: files f1, f2, f3 keep whole data blocks, with parity blocks computed per file.]
Two Approaches
⬢ EC on contiguous blocks
– Pros: Better for locality
– Cons: Small files cannot be handled
16.
⬢ Starting with striping to deal with smaller files
⬢ Hadoop 3.0.0 implements Phase 1.1 and Phase 1.2
Apache Hadoop’s decision
17.
Erasure Coding Zone
⬢ Create a zone on an empty directory
– Shell command: hdfs erasurecode -createZone [-s <schemaName>] <path>
⬢ All the files under a zone directory are automatically erasure
coded
– Renames across zones with different EC schemas are disallowed
18.
Write Pipeline for Replicated Files
⬢ Write pipeline to datanodes
⬢ Durability
– Use 3 replicas to tolerate maximum 2 failures
⬢ Visibility
– Reads are supported on files being written
– Data can be made visible by hflush/hsync
⬢ Consistency
– Client can start reading from any replica and failover to any other replica to read the same data
⬢ Appendable
– Files can be reopened for append
* DN = DataNode
[Diagram: Writer → DN1 → DN2 → DN3 pipeline; data flows forward, acks flow back.]
19.
Parallel Write for EC Files
⬢ Parallel write
– Client writes to a group of 9 datanodes at the same time
– Calculate Parity bits at client side, at Write Time
⬢ Durability
– (6, 3)-Reed-Solomon can tolerate maximum 3 failures
⬢ Visibility (Same as replicated files)
– Reads are supported on files being written
– Data can be made visible by hflush/hsync
⬢ Consistency
– Client can start reading from any 6 of the 9 replicas
– When reading from a datanode fails, client can failover to
any other remaining replica to read the same data.
⬢ Appendable (Same as replicated files)
– Files can be reopened for append
[Diagram: Writer sends data to DN1–DN6 and parity to DN7–DN9 in parallel; each DN acks the writer directly. Stripe size: 1MB.]
20.
EC: Write Failure Handling
⬢ Datanode failure
– Client ignores the failed datanode and continues writing.
– Able to tolerate 3 failures.
– Requires at least 6 datanodes.
– Missing blocks will be reconstructed later.
[Diagram: same parallel write layout — data to DN1–DN6, parity to DN7–DN9 — with the failed datanode simply skipped.]
21.
Replication:
Slow Writers & Replace Datanode on Failure
⬢ Write pipeline for replicated files
– Datanode can be replaced in case of failure.
⬢ Slow writers
– A write pipeline may last for a long time
– The probability of datanode failures increases over time.
– Need to replace datanode on failure.
⬢ EC files
– Do not support replace-datanode-on-failure.
– Slow writer improved
[Diagram: replicated write pipeline Writer → DN1 → DN2 → DN3, with DN4 substituted for a failed datanode.]
22.
Reading with Parity Blocks
⬢ Parallel read
– Reads from the 6 datanodes holding data blocks
– Supports both stateful read and pread
⬢ Block reconstruction
– Read parity blocks to reconstruct missing blocks
[Diagram: Reader fetches Block1–Block6 in parallel from DN1–DN6; a missing block (e.g. Block3) is reconstructed using Parity1 from DN7.]
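HDFS uses Reed-Solomon, but the reconstruction idea can be sketched with XOR, the simplest erasure code (the m = 1 case): the parity block is the XOR of all data blocks, so any single missing block is the XOR of the parity and the surviving blocks. The code below is an illustration, not the actual HDFS implementation.

```java
import java.util.Arrays;

public class XorReconstruct {
    // parity = d0 ^ d1 ^ ... ^ d(k-1), cell by cell.
    static byte[] parity(byte[][] data) {
        byte[] p = new byte[data[0].length];
        for (byte[] block : data)
            for (int i = 0; i < p.length; i++) p[i] ^= block[i];
        return p;
    }

    // Rebuild the block at index `missing` by XOR-ing the parity with
    // every surviving data block.
    static byte[] reconstruct(byte[][] data, int missing, byte[] parity) {
        byte[] out = parity.clone();
        for (int b = 0; b < data.length; b++)
            if (b != missing)
                for (int i = 0; i < out.length; i++) out[i] ^= data[b][i];
        return out;
    }

    public static void main(String[] args) {
        byte[][] blocks = { {1, 2}, {3, 4}, {5, 6} };
        byte[] p = parity(blocks);
        // Pretend block 1 is lost; rebuild it from p and blocks 0, 2.
        byte[] rebuilt = reconstruct(blocks, 1, p);
        System.out.println(Arrays.equals(rebuilt, new byte[]{3, 4})); // true
    }
}
```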
23.
⬢ Pros
–Low latency because of parallel write/read
–Good for small-size files
⬢ Cons
–Requires high network bandwidth between client and server
–Higher reconstruction cost
–A dead DataNode implies high network traffic and long reconstruction time
Network traffic – Need good network bandwidth
Workload        3-replication       (6, 3) Reed-Solomon
Read 1 block    1 LN                1/6 LN + 5/6 RR
Write 1 block   1 LN + 1 LR + 1 RR  1/6 LN + 1/6 LR + 7/6 RR

LN: Local Node, LR: Local Rack, RR: Remote Rack
24.
YARN
YARN Scheduling Enhancements
Support for Long Running Services
Re-architecture for YARN Timeline Service - ATS v2
Better elasticity and resource utilization
Better resource isolation and Docker!!
Better User Experiences
Other Enhancements
25.
Scheduling Enhancements
Application priorities within a queue: YARN-1963
– In Queue A, App1 > App 2
Inter-Queue priorities
– Q1 > Q2 irrespective of demand / capacity
– Previously based on unconsumed capacity
Affinity / anti-affinity: YARN-1042
– More restraints on locations
Global Scheduling: YARN-5139
– Get rid of scheduling triggered on node heartbeat
– Replaced with global scheduler that has parallel threads
• Globally optimal placement
• Critical for long running services – they stick to the allocation – better be a good one
• Enhanced container scheduling throughput (6x)
26.
Key Drivers for Long Running Services
Consolidation of Infrastructure
Hadoop clusters have a lot of compute and storage resources (some unused)
Can’t I use Hadoop’s resources for non-Hadoop load?
OpenStack is hard to run, can I use YARN?
But does it support Docker? – yes, we heard you
Hadoop related Data Services that run outside a Hadoop cluster
Why can’t I run them in the Hadoop cluster?
Run Hadoop services (Hive, HBase) on YARN
Run Multiple instances
Benefit from YARN’s Elasticity and resource management
27.
Built-in support for long running Service in YARN
A native YARN framework. YARN-4692
A common abstract framework (similar to Slider) to support long running services
A simplified API (to manage the service lifecycle)
Better support for long running services
Recognition of long running services
Affects policies for preemption, container reservation, etc.
Auto-restart of containers
Containers for long running services are restarted on the same node when they have local state
Service/application upgrade support – YARN-4726
In general, services are expected to run long enough to cross versions
Dynamic container configuration
Ask only for just enough resources, and adjust them at runtime (memory is harder)
28.
Discovery services in YARN
Services can run on any YARN node; how do clients get a service’s IP?
– It can also move due to node failure
YARN Service Discovery via DNS: YARN-4757
– Expose existing service information in YARN registry via DNS
• Current YARN service registry’s records will be converted into DNS entries
– Discovery of container IP and service port via standard DNS lookups.
• Application
– zkapp1.user1.yarncluster.com -> 192.168.10.11:8080
• Container
– Container 1454001598828-0001-01-00004.yarncluster.com -> 192.168.10.18
29.
A More Powerful YARN
⬢ Elastic Resource Model
–Dynamic Resource Configuration
•YARN-291
•Allows tuning an NM’s resources down/up at runtime
–Graceful decommissioning of NodeManagers
•YARN-914
•Drains a node that’s being decommissioned to allow running containers to
finish
⬢ Efficient Resource Utilization
–Support for container resizing
•YARN-1197
•Allows applications to change the size of an existing container
30.
More Powerful YARN (Contd.)
⬢ Resource Isolation
–Resource isolation support for disk and network
•YARN-2619 (disk), YARN-2140 (network)
•Containers get a fair share of disk and network resources using Cgroups
–Docker support in LinuxContainerExecutor
•YARN-3611
•Support for launching Docker containers alongside process containers
•Packaging and resource isolation
• Complements YARN’s support for long running services
31.
Docker on YARN & YARN on YARN - YCloud
[Diagram: Hadoop apps (MR, Tez, Spark, TensorFlow) run on YARN; a second YARN instance, itself running MR, Tez and Spark, runs as an app on the outer YARN.]
Can use YARN to test Hadoop!!
33.
Timeline Service Revolution – Why ATS v2
⬢ Scalability & Performance
v1 limitations:
–Single global instance of writer/reader
–Local disk based LevelDB storage
⬢ Usability
–Handle flows as first-class concepts and
model aggregation
–Add configuration and metrics as first-class
members
–Better support for queries
⬢ Reliability
v1 limitations:
–Data is stored in a local disk
–Single point of failure (SPOF) for timeline
server
⬢ Flexibility
–More expressive data model
–Extensible with app-specific info
34.
Core Design for ATS v2
⬢ Distributed write path
– Logical per app collector + physical per
node writer
– Collector/Writer launched as an auxiliary
service in NM.
– Standalone writers will be added later.
⬢ Pluggable backend storage
– Built in with a scalable and reliable
implementation (HBase)
⬢ Enhanced data model
– Entity (bi-directional relation) with flow,
queue, etc.
– Configuration, Metric, Event, etc.
⬢ Separate reader instances
⬢ Aggregation & Accumulation
– Aggregation: rolling up the metric values to the
parent
•Online aggregation for apps and flow runs
•Offline aggregation for users, flows and
queues
– Accumulation: rolling up the metric values
across time interval
•Accumulated resource consumption for app,
flow, etc.
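The aggregation idea above — rolling child metric values up to a parent entity — can be sketched in a few lines; the map-based shape here is illustrative, not the actual timeline service API.

```java
import java.util.*;

public class MetricRollup {
    // Roll per-container metrics up to their parent app by summing values
    // with the same metric name (illustrative of ATS v2-style aggregation).
    static Map<String, Long> aggregate(List<Map<String, Long>> children) {
        Map<String, Long> parent = new HashMap<>();
        for (Map<String, Long> child : children)
            child.forEach((name, value) -> parent.merge(name, value, Long::sum));
        return parent;
    }

    public static void main(String[] args) {
        Map<String, Long> c1 = new HashMap<>();
        c1.put("memMB", 512L);
        Map<String, Long> c2 = new HashMap<>();
        c2.put("memMB", 1024L);
        System.out.println(aggregate(Arrays.asList(c1, c2)).get("memMB")); // 1536
    }
}
```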
35.
Other YARN work planned in Hadoop 3.X
⬢ Resource profiles
–YARN-3926
–Users can specify resource profile name instead of individual resources
–Resource types read via a config file
⬢ YARN federation
–YARN-2915
–Allows YARN to scale out to tens of thousands of nodes
–Cluster of clusters which appear as a single cluster to an end user
⬢ Gang Scheduling
–YARN-624
36.
Thank you!
Reminder: BoFs on Thursday at 5:50pm
Editor's Notes: Striping enables online EC, which bypasses the conversion phase and immediately saves storage space; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group.