SlideShare a Scribd company logo
1 of 54
Download to read offline
1© Cloudera, Inc. All rights reserved.
Apache Hadoop 3
Andrew Wang Daniel Templeton
andrew.wang@cloudera.com daniel@cloudera.com
2© Cloudera, Inc. All rights reserved.
Who We Are
Andrew Wang
● HDFS @ Cloudera
● Hadoop PMC Member
● Release Manager for Hadoop 3.0
Daniel Templeton
● YARN @ Cloudera
● Hadoop PMC Member
3© Cloudera, Inc. All rights reserved.
An Abbreviated History of Hadoop Releases
Date Release Major Notes
2007-11-04 0.14.1 First release at the ASF
2011-12-27 1.0.0 Security, HBase support
2012-05-23 2.0.0 YARN, NameNode HA, wire compat
2014-11-18 2.6.0 HDFS encryption, rolling upgrade, node labels
2015-04-21 2.7.0 Truncate, Variable-length blocks, YARN Global Caching,
2017-03-22 2.8.0 Cloud improvement, Azure Data Lake, and etc.
2017-11-17 2.9.0 Stability Improvement
2017-12-13 3.0.0 Java 8, Erasure Coding, S3Guard, YARN Timeline Service
4© Cloudera, Inc. All rights reserved.
Motivation for Hadoop 3
● Upgrade minimum Java version to Java 8
○ Java 7 end-of-life in April 2015
○ Many Java libraries now only support Java 8
● HDFS erasure coding
○ Major feature that refactored core pieces of HDFS
○ Too big to backport to 2.x
● Classpath isolation
○ Potentially impacts all clients
● Other miscellaneous incompatible bugfixes and improvements
○ Hadoop 2.x was branched in 2011
○ 6 years of changes waiting for 3.0
5© Cloudera, Inc. All rights reserved.
Hadoop 3 Status and Release Plan
● After four alphas and one beta, 3.0.0 is out!
● Took close to two years from inception
● 3.0.1 and 3.1.0 are already in progress
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+3.0.0+release
Release Date
3.0.0-alpha1 2016-09-03 ✔
3.0.0-alpha2 2017-01-25 ✔
3.0.0-alpha3 2017-05-26 ✔
3.0.0-alpha4 2017-07-07 ✔
3.0.0-beta1 2017-10-03 ✔
3.0.0 GA 2017-12-13 ✔
3.0.1 2017 Mar
6© Cloudera, Inc. All rights reserved.
HDFS & Hadoop Features
7© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
8© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
b1 b2 b3
b1 b2 b3
9© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
b1 b2 b3
b1 b2 b3
3 replicas
3 blocks
3 x 3 = 9 total replicas
9 / 3 = 200% overhead!
10© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
11© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
p1 p2
12© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
p1 p2
3 data blocks 2 parity blocks
3 + 2 = 5 replicas
5 / 3 = 67% overhead!
13© Cloudera, Inc. All rights reserved.
3x replication vs. Erasure coding
b1 b2 b3
/foo.csv - 3 block file
p1 p2
3 data blocks 2 parity blocks
3 + 2 = 5 replicas
5 / 3 = 67% overhead!
b1 b2 b10
/bigfoo.csv - 10 block file
p1 p4
10 data blocks 4 parity blocks
10 + 4 = 14 replicas
14 / 10 = 40% overhead!
... ...
14© Cloudera, Inc. All rights reserved.
EC Reconstruction
b1 b2 b3
/foo.csv - 3 block file
p1 p2 Reed-Solomon (3,2)
15© Cloudera, Inc. All rights reserved.
EC Reconstruction
b1 b2 b3
/foo.csv - 3 block file
p1 p2 Reed-Solomon (3,2)
X
16© Cloudera, Inc. All rights reserved.
EC Reconstruction
b1 b2 b3
/foo.csv - 3 block file
p1 p2 Reed-Solomon (3,2)
Read 3 remaining blocks
b3
Run RS to recover b3
New copy of b3 recovered
X
17© Cloudera, Inc. All rights reserved.
Erasure coding (HDFS-7285)
● Motivation: improve storage efficiency of HDFS
○ ~2x the storage efficiency compared to 3x replication
○ Reduction of overhead from 200% to 40%
● Uses Reed-Solomon(k,m) erasure codes instead of replication
○ Support for multiple erasure coding policies
○ RS(3,2), RS(6,3), RS(10,4)
● Can improves data durability
○ RS(6,3) can tolerate 3 failures
○ RS(10,4) can tolerate 4 failures
● Missing blocks reconstructed from remaining blocks
18© Cloudera, Inc. All rights reserved.
EC implications
● File data is striped across multiple nodes and racks
● Reads and writes are remote and cross-rack
● Reconstruction is network-intensive, reads m blocks cross-rack
● Important to use Intel’s optimized ISA-L for performance
○ 1+ GB/s encode/decode speed, much faster than Java implementation
● Combine data into larger files to avoid an explosion in # replicas
○ Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas)
○ Good: 10x1GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas)
● Works best for archival / cold data use cases
19© Cloudera, Inc. All rights reserved.
EC performance
20© Cloudera, Inc. All rights reserved.
EC performance
21© Cloudera, Inc. All rights reserved.
EC performance
22© Cloudera, Inc. All rights reserved.
Erasure coding status
● Massive development effort by the Hadoop community
○ 20+ contributors from many companies
■ Cloudera, Intel, Hortonworks, Huawei, Y! JP, …
○ 100s of commits over more than three years (started in 2014)
● Erasure coding is ready in 3.0.0 GA!
● Current focus is on testing and integration efforts
○ Want the complete Hadoop stack to work with HDFS erasure coding enabled
○ Ongoing stress / endurance testing to ensure stability at scale
23© Cloudera, Inc. All rights reserved.
● Hadoop leaks lots of dependencies
onto the application’s classpath
○ Known offenders: Guava,
Protobuf, Jackson, Jetty, …
● No separate HDFS client jar means
server jars are leaked
● YARN / MR clients not shaded
● HDFS-6200: Split HDFS client into
separate JAR
● HADOOP-11804: Shaded
hadoop-client dependency
● YARN-6466: Shade the task
umbilical for a clean YARN
container environment (ongoing)
Classpath isolation (HADOOP-11656)
24© Cloudera, Inc. All rights reserved.
Miscellaneous
● Supportability improvements
○ Shell script rewrite
○ Intra-DataNode balancer
○ Move default ports out of the ephemeral range
● Support for multiple Standby NameNodes
● Cloud enhancements
○ Support for Microsoft Azure Data Lake and Aliyun OSS
○ S3 consistency and performance improvements
● Tightened Hadoop compatibility policy
25© Cloudera, Inc. All rights reserved.
YARN Features
26© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
27© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
jobs
28© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
jobs
Job
History
Server
29© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
jobs
Job
History
Server
HDFS
Node
Manager
30© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
jobs
Job
History
Server
Spark
History
Server
31© Cloudera, Inc. All rights reserved.
Job History Server
Resource
Manager
jobs
Job
History
Server
Spark
History
Server
?
32© Cloudera, Inc. All rights reserved.
Application Timeline Service v2
● Store for application and system events and data
○ Distributed
○ Scalable
○ Structured Data Model
● Updated in real time
○ Application status
○ Application metrics
○ System metrics
● Fed by resource manager, node manager, and application masters
● REST API
33© Cloudera, Inc. All rights reserved.
Application Timeline Service v2
Resource
Manager
jobs
Application
Timeline
Service
HBase
34© Cloudera, Inc. All rights reserved.
Timeline
Reader
Timeline
Reader
Application Timeline Service v2
Resource
Manager
Timeline
Collecter
HBase Node
Manager
Application
Master
Timeline
Collecter
Timeline
Reader
35© Cloudera, Inc. All rights reserved.
Application Timeline Service v2 Flows
36© Cloudera, Inc. All rights reserved.
Application Timeline Service v2 Flows
37© Cloudera, Inc. All rights reserved.
Application Timeline Service v2 Flows
38© Cloudera, Inc. All rights reserved.
Old YARN UI
39© Cloudera, Inc. All rights reserved.
New YARN UI
● Rich client application
○ Built on Node.js and Ember
● Improved visibility into cluster usage
○ Memory, CPU
○ By queues and applications
○ Sunburst graphs for hierarchical queues
○ NodeManager heatmap
● ATSv2 integration
○ Plot container start/stop events
○ Easy to capture delays in app execution
40© Cloudera, Inc. All rights reserved.
New YARN UI: Cluster Overview
41© Cloudera, Inc. All rights reserved.
New YARN UI: Queues
42© Cloudera, Inc. All rights reserved.
● Before Hadoop 3 memory and CPU are the only managed resources
● Resource Types allows adding new managed resources
○ Countable resources: GPUs, Disks etc.
○ Static resources: Java version, Python version, hardware profile, ...
■ Still in proposal stage
● Resource profiles
○ Similar conceptually to EC2 instance types
○ Capture complex resource request
● DRF for scheduling
● Current virtual CPU cores and memory resources work as before
Resource Types
43© Cloudera, Inc. All rights reserved.
YARN Federation
● YARN scalability
○ Twitter runs a 10k node cluster with fair scheduler
○ Yahoo! runs 4k node cluster with capacity scheduler
● Federation
○ Restrict users to sub-clusters based on policy
○ Scalability to 100k nodes and beyond
○ Independent cluster scheduling
44© Cloudera, Inc. All rights reserved.
YARN Federation
Router
Resource
Manager
Node Manager
Node Manager
Node Manager
Node Manager
Resource
Manager
Node Manager
Node Manager
Node Manager
Node Manager
Policy
Admin
45© Cloudera, Inc. All rights reserved.
Opportunistic Containers
● Scheduler’s job is to keep all resources busy
● Scheduling gaps
○ Nothing to run
○ Resource contention
○ Resource reservations
● Opportunistic containers fill those gaps
○ Requested explicitly
○ Dedicated scheduler
○ Queued at the node managers
○ Scheduled locally when resources are available
○ Preempted when guaranteed containers need to run
● Coming in 2.9 and 3.0
46© Cloudera, Inc. All rights reserved.
Oversubscription
● Resource utilization is typically
low in most clusters (20-50%)
○ Provision for peak usage
● Usage < Allocation
○ Mean Usage = ½ Peak Usage
47© Cloudera, Inc. All rights reserved.
Oversubscription
● Oversubscription
○ Allocate opportunistic containers to use allocated-but-unused resources
○ Jobs automatically use these unless they opt-out
○ Threshold to control aggressiveness of oversubscription
○ Threshold to trigger preemption
● Currently in progress
48© Cloudera, Inc. All rights reserved.
● Long Running Services
○ Slider merging into YARN
○ Docker support
● Scheduler improvements
○ Capacity scheduler
■ Performance and preemption
improvements
■ Online scheduling (“global
scheduler”)
■ Queue management
○ Fair scheduler
■ Performance and preemption
improvements
● High availability improvements
○ Better handling of transient
network issues
○ ZK-store scalability: Limit number
of children under a znode
● MapReduce Native Collector
(MAPREDUCE-2841)
○ Native implementation of the map
output collector
○ Up to 30% faster for
shuffle-intensive jobs
Other YARN Improvements
49© Cloudera, Inc. All rights reserved.
Summary: What’s new in Hadoop 3.0?
● Storage Optimization
○ HDFS: Erasure codes
● Improved Visibility into Cluster Operations
○ YARN: ATSv2
○ YARN: New UI
● Scalability & Multi-tenancy
○ YARN: Federation
● Improved Utilization
○ YARN: Opportunistic Containers
○ YARN: Oversubscription
● Refactor Base
○ Lots of Trunk content
○ JDK8 and newer dependent libraries
50© Cloudera, Inc. All rights reserved.
Compatibility and Testing
51© Cloudera, Inc. All rights reserved.
Compatibility
● Strong feedback from large users on the need for compatibility
● Preserves wire-compatibility with Hadoop 2 clients
○ Impossible to coordinate upgrading off-cluster Hadoop clients
● Will support rolling upgrade from Hadoop 2 to Hadoop 3
○ Can’t take downtime to upgrade a business-critical cluster
● Not fully preserving API compatibility!
○ Dependency version bumps
○ Removal of deprecated APIs and tools
○ Shell script rewrite, rework of Hadoop tools scripts
○ Incompatible bug fixes
52© Cloudera, Inc. All rights reserved.
Testing and Validation
● Cloudera CDH 6 is based on upstream Hadoop 3.0.0
○ Running full test suite
○ Integration of Hadoop 3 with all components in CDH stack
○ Same integration tests used to validate CDH5
● Plans for extensive HDFS EC testing by Cloudera and Intel
● Happy synergy between 2.8.x and 3.0.x lines
○ Shares much of the same code, fixes flow into both
○ Yahoo! doing scale testing of 2.8.0
53© Cloudera, Inc. All rights reserved.
Conclusion
● Hadoop 3.0.0 GA is out!
● Shiny new features
○ HDFS erasure coding
○ Client classpath isolation
○ YARN ATSv2
○ YARN Federation
○ Opportunistic containers and oversubscription
● Great time to get involved in testing and validation
54© Cloudera, Inc. All rights reserved.
Thank you
Andrew Wang Daniel Templeton
andrew.wang@cloudera.com daniel@cloudera.com

More Related Content

What's hot

Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Simplilearn
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
tipanagiriharika
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 

What's hot (20)

Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Introduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-SystemIntroduction to Apache Hadoop Eco-System
Introduction to Apache Hadoop Eco-System
 
Presto
PrestoPresto
Presto
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Overview of new features in Apache Ranger
Overview of new features in Apache RangerOverview of new features in Apache Ranger
Overview of new features in Apache Ranger
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Road to NODES - Handling Neo4j Data with Apache Hop
Road to NODES - Handling Neo4j Data with Apache HopRoad to NODES - Handling Neo4j Data with Apache Hop
Road to NODES - Handling Neo4j Data with Apache Hop
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 

Similar to Apache Hadoop 3

Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Cloudera, Inc.
 

Similar to Apache Hadoop 3 (20)

Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017Hadoop 3 @ Hadoop Summit San Jose 2017
Hadoop 3 @ Hadoop Summit San Jose 2017
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalability
 
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
 
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NYApache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
Apache hadoop 3.x state of the union and upgrade guidance - Strata 2019 NY
 
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdfCloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf
 
DDN Product Update from SC13
DDN Product Update from SC13DDN Product Update from SC13
DDN Product Update from SC13
 
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
[Hadoop Meetup] Apache Hadoop 3 community update - Rohith Sharma
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
 
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...How to build leakproof stream processing pipelines with Apache Kafka and Apac...
How to build leakproof stream processing pipelines with Apache Kafka and Apac...
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 

More from Cloudera, Inc.

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
VishalKumarJha10
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
shinachiaurasa2
 

Recently uploaded (20)

AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456LEVEL 5   - SESSION 1 2023 (1).pptx - PDF 123456
LEVEL 5 - SESSION 1 2023 (1).pptx - PDF 123456
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verifiedSector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
Sector 18, Noida Call girls :8448380779 Model Escorts | 100% verified
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
The Guide to Integrating Generative AI into Unified Continuous Testing Platfo...
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdfThe Top App Development Trends Shaping the Industry in 2024-25 .pdf
The Top App Development Trends Shaping the Industry in 2024-25 .pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 

Apache Hadoop 3

  • 1. 1© Cloudera, Inc. All rights reserved. Apache Hadoop 3 Andrew Wang Daniel Templeton andrew.wang@cloudera.com daniel@cloudera.com
  • 2. 2© Cloudera, Inc. All rights reserved. Who We Are Andrew Wang ● HDFS @ Cloudera ● Hadoop PMC Member ● Release Manager for Hadoop 3.0 Daniel Templeton ● YARN @ Cloudera ● Hadoop PMC Member
  • 3. 3© Cloudera, Inc. All rights reserved. An Abbreviated History of Hadoop Releases Date Release Major Notes 2007-11-04 0.14.1 First release at the ASF 2011-12-27 1.0.0 Security, HBase support 2012-05-23 2.0.0 YARN, NameNode HA, wire compat 2014-11-18 2.6.0 HDFS encryption, rolling upgrade, node labels 2015-04-21 2.7.0 Truncate, Variable-length blocks, YARN Global Caching, 2017-03-22 2.8.0 Cloud improvement, Azure Data Lake, and etc. 2017-11-17 2.9.0 Stability Improvement 2017-12-13 3.0.0 Java 8, Erasure Coding, S3Guard, YARN Timeline Service
  • 4. 4© Cloudera, Inc. All rights reserved. Motivation for Hadoop 3 ● Upgrade minimum Java version to Java 8 ○ Java 7 end-of-life in April 2015 ○ Many Java libraries now only support Java 8 ● HDFS erasure coding ○ Major feature that refactored core pieces of HDFS ○ Too big to backport to 2.x ● Classpath isolation ○ Potentially impacts all clients ● Other miscellaneous incompatible bugfixes and improvements ○ Hadoop 2.x was branched in 2011 ○ 6 years of changes waiting for 3.0
  • 5. 5© Cloudera, Inc. All rights reserved. Hadoop 3 Status and Release Plan ● After four alphas and one beta, 3.0.0 is out! ● Took close to two years from inception ● 3.0.1 and 3.1.0 are already in progress https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+3.0.0+release Release Date 3.0.0-alpha1 2016-09-03 ✔ 3.0.0-alpha2 2017-01-25 ✔ 3.0.0-alpha3 2017-05-26 ✔ 3.0.0-alpha4 2017-07-07 ✔ 3.0.0-beta1 2017-10-03 ✔ 3.0.0 GA 2017-12-13 ✔ 3.0.1 2017 Mar
  • 6. 6© Cloudera, Inc. All rights reserved. HDFS & Hadoop Features
  • 7. 7© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file
  • 8. 8© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file b1 b2 b3 b1 b2 b3
  • 9. 9© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file b1 b2 b3 b1 b2 b3 3 replicas 3 blocks 3 x 3 = 9 total replicas 9 / 3 = 200% overhead!
  • 10. 10© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file
  • 11. 11© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file p1 p2
  • 12. 12© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file p1 p2 3 data blocks 2 parity blocks 3 + 2 = 5 replicas 5 / 3 = 67% overhead!
  • 13. 13© Cloudera, Inc. All rights reserved. 3x replication vs. Erasure coding b1 b2 b3 /foo.csv - 3 block file p1 p2 3 data blocks 2 parity blocks 3 + 2 = 5 replicas 5 / 3 = 67% overhead! b1 b2 b10 /bigfoo.csv - 10 block file p1 p4 10 data blocks 4 parity blocks 10 + 4 = 14 replicas 14 / 10 = 40% overhead! ... ...
  • 14. 14© Cloudera, Inc. All rights reserved. EC Reconstruction b1 b2 b3 /foo.csv - 3 block file p1 p2 Reed-Solomon (3,2)
  • 15. 15© Cloudera, Inc. All rights reserved. EC Reconstruction b1 b2 b3 /foo.csv - 3 block file p1 p2 Reed-Solomon (3,2) X
  • 16. 16© Cloudera, Inc. All rights reserved. EC Reconstruction b1 b2 b3 /foo.csv - 3 block file p1 p2 Reed-Solomon (3,2) Read 3 remaining blocks b3 Run RS to recover b3 New copy of b3 recovered X
  • 17. 17© Cloudera, Inc. All rights reserved. Erasure coding (HDFS-7285) ● Motivation: improve storage efficiency of HDFS ○ ~2x the storage efficiency compared to 3x replication ○ Reduction of overhead from 200% to 40% ● Uses Reed-Solomon(k,m) erasure codes instead of replication ○ Support for multiple erasure coding policies ○ RS(3,2), RS(6,3), RS(10,4) ● Can improves data durability ○ RS(6,3) can tolerate 3 failures ○ RS(10,4) can tolerate 4 failures ● Missing blocks reconstructed from remaining blocks
  • 18. 18© Cloudera, Inc. All rights reserved. EC implications ● File data is striped across multiple nodes and racks ● Reads and writes are remote and cross-rack ● Reconstruction is network-intensive, reads m blocks cross-rack ● Important to use Intel’s optimized ISA-L for performance ○ 1+ GB/s encode/decode speed, much faster than Java implementation ● Combine data into larger files to avoid an explosion in # replicas ○ Bad: 1x1GB file -> RS(10,4) -> 14x100MB EC blocks (4.6x # replicas) ○ Good: 10x1GB file -> RS(10,4) -> 14x1GB EC blocks (0.46x # replicas) ● Works best for archival / cold data use cases
  • 19. 19© Cloudera, Inc. All rights reserved. EC performance
  • 20. 20© Cloudera, Inc. All rights reserved. EC performance
  • 21. 21© Cloudera, Inc. All rights reserved. EC performance
  • 22. 22© Cloudera, Inc. All rights reserved. Erasure coding status ● Massive development effort by the Hadoop community ○ 20+ contributors from many companies ■ Cloudera, Intel, Hortonworks, Huawei, Y! JP, … ○ 100s of commits over more than three years (started in 2014) ● Erasure coding is ready in 3.0.0 GA! ● Current focus is on testing and integration efforts ○ Want the complete Hadoop stack to work with HDFS erasure coding enabled ○ Ongoing stress / endurance testing to ensure stability at scale
  • 23. 23© Cloudera, Inc. All rights reserved. ● Hadoop leaks lots of dependencies onto the application’s classpath ○ Known offenders: Guava, Protobuf, Jackson, Jetty, … ● No separate HDFS client jar means server jars are leaked ● YARN / MR clients not shaded ● HDFS-6200: Split HDFS client into separate JAR ● HADOOP-11804: Shaded hadoop-client dependency ● YARN-6466: Shade the task umbilical for a clean YARN container environment (ongoing) Classpath isolation (HADOOP-11656)
  • 24. 24© Cloudera, Inc. All rights reserved. Miscellaneous ● Supportability improvements ○ Shell script rewrite ○ Intra-DataNode balancer ○ Move default ports out of the ephemeral range ● Support for multiple Standby NameNodes ● Cloud enhancements ○ Support for Microsoft Azure Data Lake and Aliyun OSS ○ S3 consistency and performance improvements ● Tightened Hadoop compatibility policy
  • 25. 25© Cloudera, Inc. All rights reserved. YARN Features
  • 26. 26© Cloudera, Inc. All rights reserved. Job History Server Resource Manager
  • 27. 27© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs
  • 28. 28© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server
  • 29. 29© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server HDFS Node Manager
  • 30. 30© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server Spark History Server
  • 31. 31© Cloudera, Inc. All rights reserved. Job History Server Resource Manager jobs Job History Server Spark History Server ?
  • 32. 32© Cloudera, Inc. All rights reserved. Application Timeline Service v2 ● Store for application and system events and data ○ Distributed ○ Scalable ○ Structured Data Model ● Updated in real time ○ Application status ○ Application metrics ○ System metrics ● Fed by resource manager, node manager, and application masters ● REST API
  • 33. 33© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Resource Manager jobs Application Timeline Service HBase
  • 34. 34© Cloudera, Inc. All rights reserved. Timeline Reader Timeline Reader Application Timeline Service v2 Resource Manager Timeline Collecter HBase Node Manager Application Master Timeline Collecter Timeline Reader
  • 35. 35© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Flows
  • 36. 36© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Flows
  • 37. 37© Cloudera, Inc. All rights reserved. Application Timeline Service v2 Flows
  • 38. 38© Cloudera, Inc. All rights reserved. Old YARN UI
  • 39. 39© Cloudera, Inc. All rights reserved. New YARN UI ● Rich client application ○ Built on Node.js and Ember ● Improved visibility into cluster usage ○ Memory, CPU ○ By queues and applications ○ Sunburst graphs for hierarchical queues ○ NodeManager heatmap ● ATSv2 integration ○ Plot container start/stop events ○ Easy to capture delays in app execution
  • 40. 40© Cloudera, Inc. All rights reserved. New YARN UI: Cluster Overview
  • 41. 41© Cloudera, Inc. All rights reserved. New YARN UI: Queues
  • 42. 42© Cloudera, Inc. All rights reserved. ● Before Hadoop 3 memory and CPU are the only managed resources ● Resource Types allows adding new managed resources ○ Countable resources: GPUs, Disks etc. ○ Static resources: Java version, Python version, hardware profile, ... ■ Still in proposal stage ● Resource profiles ○ Similar conceptually to EC2 instance types ○ Capture complex resource request ● DRF for scheduling ● Current virtual CPU cores and memory resources work as before Resource Types
  • 43. 43© Cloudera, Inc. All rights reserved. YARN Federation ● YARN scalability ○ Twitter runs a 10k node cluster with fair scheduler ○ Yahoo! runs 4k node cluster with capacity scheduler ● Federation ○ Restrict users to sub-clusters based on policy ○ Scalability to 100k nodes and beyond ○ Independent cluster scheduling
  • 44. 44© Cloudera, Inc. All rights reserved. YARN Federation Router Resource Manager Node Manager Node Manager Node Manager Node Manager Resource Manager Node Manager Node Manager Node Manager Node Manager Policy Admin
  • 45. 45© Cloudera, Inc. All rights reserved. Opportunistic Containers ● Scheduler’s job is to keep all resources busy ● Scheduling gaps ○ Nothing to run ○ Resource contention ○ Resource reservations ● Opportunistic containers fill those gaps ○ Requested explicitly ○ Dedicated scheduler ○ Queued at the node managers ○ Scheduled locally when resources are available ○ Preempted when guaranteed containers need to run ● Coming in 2.9 and 3.0
  • 46. 46© Cloudera, Inc. All rights reserved. Oversubscription ● Resource utilization is typically low in most clusters (20-50%) ○ Provision for peak usage ● Usage < Allocation ○ Mean Usage = ½ Peak Usage
  • 47. 47© Cloudera, Inc. All rights reserved. Oversubscription ● Oversubscription ○ Allocate opportunistic containers to use allocated-but-unused resources ○ Jobs automatically use these unless they opt-out ○ Threshold to control aggressiveness of oversubscription ○ Threshold to trigger preemption ● Currently in progress
  • 48. 48© Cloudera, Inc. All rights reserved. ● Long Running Services ○ Slider merging into YARN ○ Docker support ● Scheduler improvements ○ Capacity scheduler ■ Performance and preemption improvements ■ Online scheduling (“global scheduler”) ■ Queue management ○ Fair scheduler ■ Performance and preemption improvements ● High availability improvements ○ Better handling of transient network issues ○ ZK-store scalability: Limit number of children under a znode ● MapReduce Native Collector (MAPREDUCE-2841) ○ Native implementation of the map output collector ○ Up to 30% faster for shuffle-intensive jobs Other YARN Improvements
  • 49. 49© Cloudera, Inc. All rights reserved. Summary: What’s new in Hadoop 3.0? ● Storage Optimization ○ HDFS: Erasure codes ● Improved Visibility into Cluster Operations ○ YARN: ATSv2 ○ YARN: New UI ● Scalability & Multi-tenancy ○ YARN: Federation ● Improved Utilization ○ YARN: Opportunistic Containers ○ YARN: Oversubscription ● Refactor Base ○ Lots of Trunk content ○ JDK8 and newer dependent libraries
  • 50. 50© Cloudera, Inc. All rights reserved. Compatibility and Testing
  • 51. 51© Cloudera, Inc. All rights reserved. Compatibility ● Strong feedback from large users on the need for compatibility ● Preserves wire-compatibility with Hadoop 2 clients ○ Impossible to coordinate upgrading off-cluster Hadoop clients ● Will support rolling upgrade from Hadoop 2 to Hadoop 3 ○ Can’t take downtime to upgrade a business-critical cluster ● Not fully preserving API compatibility! ○ Dependency version bumps ○ Removal of deprecated APIs and tools ○ Shell script rewrite, rework of Hadoop tools scripts ○ Incompatible bug fixes
  • 52. 52© Cloudera, Inc. All rights reserved. Testing and Validation ● Cloudera CDH 6 is based on upstream Hadoop 3.0.0 ○ Running full test suite ○ Integration of Hadoop 3 with all components in CDH stack ○ Same integration tests used to validate CDH5 ● Plans for extensive HDFS EC testing by Cloudera and Intel ● Happy synergy between 2.8.x and 3.0.x lines ○ Shares much of the same code, fixes flow into both ○ Yahoo! doing scale testing of 2.8.0
  • 53. 53© Cloudera, Inc. All rights reserved. Conclusion ● Hadoop 3.0.0 GA is out! ● Shiny new features ○ HDFS erasure coding ○ Client classpath isolation ○ YARN ATSv2 ○ YARN Federation ○ Opportunistic containers and oversubscription ● Great time to get involved in testing and validation
  • 54. 54© Cloudera, Inc. All rights reserved. Thank you Andrew Wang Daniel Templeton andrew.wang@cloudera.com daniel@cloudera.com