Hadoop 3.0
Revolution or evolution?
uweprintz
https://unsplash.com/photos/CIXoFys3gsw
/whoami & /disclaimer
Copyright by Uwe Printz
Some Hadoop history
2006: Hadoop 1 ("Let there be batch!", Era of Silicon Valley Hadoop)
• HDFS: Redundant, reliable storage
• MapReduce: Cluster resource mgmt. + data processing
Oct. 2013: Hadoop 2 ("Let there be YARN Apps!", Era of Enterprise Hadoop)
• HDFS: Redundant, reliable storage
• YARN: Cluster resource management
• MapReduce: Data processing
• Hive: SQL
• Spark: In-Memory
• …
Late 2017: Hadoop 3 ("Let there be …?", Era of ?)
• ? (IoT, Machine Learning, GPUs, TensorFlow, Data Science, Streaming Data, Cloud, FPGAs, Artificial Intelligence, Kafka)
Why Hadoop 3.0?
• Deprecated APIs can only be removed in a major release
• Not fully preserving API compatibility
• Wire-compatibility could be broken
• Change of default ports
• But preserves wire-compatibility with Hadoop 2 clients
• Will support rolling upgrade from Hadoop 2 to Hadoop 3
• Hadoop command scripts rewrite
• Big features that need a major release to stabilize
What is Hadoop 3.0?
Release timeline, 2009 to 2017 (figure):
• branch-1 (branch-0.20): 0.20.1, 0.20.205, 1.0.0, 1.1.0, 1.2.1 (Stable); Hadoop 1 (EOL)
• 0.21.0 and 0.22.0; feature labels: New append, Security
• branch-0.23: 0.23.0, 0.23.11 (Final)
• branch-2 (Hadoop 2): 2.0.0-alpha, 2.1.0-beta, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.1
• Feature labels on branch-2: NameNode Federation, YARN; NameNode HA; HDFS Snapshots; NFSv3 support; Heterogeneous storage; HDFS In-Memory Caching; HDFS ACLs; HDFS Rolling Upgrades; RM Automatic Failover; HDFS Extended attributes; YARN Rolling Upgrades; Transparent Encryption; Archival Storage; Drop JDK6 Support; File Truncate API; Docker Container in Linux; ATS 1.5
• trunk (Hadoop 3): 3.0.0-alpha1, 3.0.0-alpha2, 3.0.0-alpha3, 3.0.0-alpha4, 3.0.0-beta1 (15.09.), GA (01.11.)
• Hadoop 2 and 3 diverged 5+ years ago
Source: Akira Ajisaka (with additions by Uwe Printz)
Hadoop 3.0 in a nutshell
• HDFS
• Erasure codes
• Low-level performance enhancements with Intel ISA-L
• 2+ NameNodes
• Intra-DataNode Balancer
• YARN
• Better support for long-running services
• Improved isolation & Docker support
• Scheduler enhancements
• Application Timeline Service v2
• New UI
• MapReduce
• Task-level native optimization
• Derive heap-size automatically
• DevOps
• Drop JDK7 & Move to JDK8
• Change of default ports
• Library & Dependency Upgrade
• Client-side classpath Isolation
• Shell Script Rewrite & ShellDoc
• .hadooprc & .hadoop-env
• Metrics plugin for Kafka
HDFS
https://unsplash.com/photos/LHlwgjbSo3k
HDFS - Current implementation
• 3 replicas by default
• Tolerates a maximum of 2 failures
• Simple, scalable & robust
• 200% space overhead
Diagram (write path): the HDFS Client sends a write request to the NameNode, gets a lease for the file, splits the file into blocks, requests data nodes and receives a list of data nodes. It calculates the checksum and writes block + checksum to DataNode 1; the write pipeline forwards the block to DataNode 2 and DataNode 3, the data nodes ACK, and the write is complete.
Erasure Coding (EC)
• k data blocks + m parity blocks
• Example: Reed-Solomon (6,3)
Diagram: Raw data -> Splitting into data cells (d) -> Encoding adds parity cells (p) -> Store data and parity
• Key Points
• XOR Coding -> Saves space, slower recovery
• Missing or corrupt data will be restored from available data and parity
• Parity can be smaller than data
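The XOR key point above can be illustrated with a minimal sketch. It assumes the simplest possible scheme, k data blocks plus a single XOR parity block, which is not the Reed-Solomon coder Hadoop ships; the helper names `xor_blocks` and `recover_missing` are made up for this illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equally sized byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical example: k = 3 data blocks, m = 1 parity block.
data = [b"hadoop", b"erasur", b"coding"]
parity = xor_blocks(data)

# Pretend the second block was lost and rebuild it from the rest.
rebuilt = recover_missing([data[0], data[2]], parity)
```

One parity block protects three data blocks here, which is why parity can be smaller than the data it covers; the price is that recovery has to read every surviving block.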
EC - Main characteristics
                          Replication   Replication   Reed-Solomon   Reed-Solomon
                          (Factor 1)    (Factor 3)    (6,3)          (10,4)
Maximum fault tolerance   0             2             3              4
Space efficiency          100 %         33 %          67 %           71 %

                          Replication                 Reed-Solomon
Data locality             Yes                         No (Phase 1) / Yes (Phase 2)
Write performance         High                        Low
Read performance          High                        Medium
Recovery costs            Low                         High

Annotation: Pluggable implementation, first choice

Storage tier              Hot           Warm          Cold           Frozen
Medium                    Memory/SSD    Disk          Dense Disk     EC
Access frequency          20 x Day      5 x Week      5 x Month      2 x Year
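The fault tolerance and space efficiency figures in the table follow from simple arithmetic. The sketch below assumes each scheme is described by the number of data units and the number of redundant units (extra replicas or parity blocks); the function names are illustrative only.

```python
def space_efficiency(data_units, redundant_units):
    """Fraction of raw storage that holds user data."""
    return data_units / (data_units + redundant_units)


def space_overhead(data_units, redundant_units):
    """Extra storage needed, relative to the user data."""
    return redundant_units / data_units


# (data units, redundant units) per scheme; replication factor 3 is
# one copy of the data plus two additional replicas.
schemes = {
    "Replication (Factor 1)": (1, 0),
    "Replication (Factor 3)": (1, 2),
    "Reed-Solomon (6,3)": (6, 3),
    "Reed-Solomon (10,4)": (10, 4),
}

for name, (data_units, redundant_units) in schemes.items():
    efficiency = round(100 * space_efficiency(data_units, redundant_units))
    overhead = round(100 * space_overhead(data_units, redundant_units))
    # The number of redundant units is also the maximum fault tolerance.
    print(f"{name}: tolerates {redundant_units} failures, "
          f"{efficiency} % efficiency, {overhead} % overhead")
```

Replication with factor 3 comes out at 200 % overhead, the figure quoted for the current HDFS implementation, while Reed-Solomon (6,3) needs only 50 % overhead for a higher fault tolerance.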
EC - Contiguous blocks
• Approach 1: Retain block size and add parity
Diagram: Files A, B and C are laid out across six 128 MB blocks (Block 1 to Block 6) on DN 3, DN 2, DN 12, DN 7, DN 5, DN 1; encoding adds three parity blocks on DN 6, DN 4, DN 8.
• Pro: Better for locality
• Con: Significant overhead for smaller files, always 3 parity blocks needed
• Con: Client potentially needs to process GBs of data for encoding
EC - Striping
• Approach 2: Splitting blocks into smaller cells (1 MB)
Diagram: Files A, B and C are split into cells that are written round-robin, stripe by stripe (Stripe 1, Stripe 2, …, Stripe n), across Block 1 to Block 6 on DN 7, DN 3, DN 4, DN 1, DN 6, DN 12; encoding adds three parity blocks on DN 10, DN 8, DN 14.
• Pro: Works for small files
• Pro: Allows parallel write
• Con: No data locality -> Increased read latency & more complicated recovery process
EC - Apache Hadoop’s decision (HDFS-7285)
• Start from striping to deal with smaller files
Diagram (block layout vs. redundancy form):
                Replication                       Erasure Coding
Contiguous      HDFS                              Facebook f4, Azure
Striping        Ceph (before Firefly), Lustre     Ceph (with Firefly), QFS
Phases: Phase 1.1 (HDFS-7285), Phase 1.2 (HDFS-8031), Phase 2 (HDFS-8030), Phase 3 (Future Work)
Hadoop 3.0.x implements Phase 1.1
EC - Shell Command
• Create an EC zone on an empty directory
• All files under a zone directory are automatically erasure coded
• Renames across zones with different EC schemas are disallowed
Usage: hdfs erasurecode [generic options]
[-getPolicy <path>]
[-help [cmd ...]]
[-listPolicies]
[-setPolicy [-p <policyName>] <path>]
-getPolicy <path> :
Get erasure coding policy information about at specified path
-listPolicies :
Get the list of erasure coding policies supported
-setPolicy [-p <policyName>] <path> :
Set a specified erasure coding policy to a directory
Options :
-p <policyName> erasure coding policy name to encode files. If not passed the
default policy will be used
<path> Path to a directory. Under this directory files will be
encoded using specified erasure coding policy
EC - Write Path
• Parallel write
• Client writes to 9 data nodes at the same time
• Calculate parity at client, at write time
• Durability
• Reed-Solomon(6,3) can tolerate max. 3 failures
• Visibility
• Read is supported for files being written
• Appendable
• Files can be reopened for appending data
Diagram: the HDFS Client splits the data into stripes of 1 MB cells and calculates parity, then writes data cells to DataNode 1 to DataNode 6 and parity cells to DataNode 7 to DataNode 9; the data nodes return ACKs.
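How a striped write distributes data can be sketched as follows. The sketch assumes 1 MB cells laid out round-robin over 6 data units, as on the striping slide; parity calculation is left out, and `locate_cell` and `split_into_stripes` are hypothetical helpers, not HDFS client APIs.

```python
CELL_SIZE = 1024 * 1024  # 1 MB cells
DATA_UNITS = 6           # Reed-Solomon (6,3): 6 data units per stripe


def locate_cell(offset, cell_size=CELL_SIZE, data_units=DATA_UNITS):
    """Map a logical file offset to (stripe index, data unit index, offset in cell)."""
    cell = offset // cell_size
    return cell // data_units, cell % data_units, offset % cell_size


def split_into_stripes(payload, cell_size=CELL_SIZE, data_units=DATA_UNITS):
    """Cut a byte string into cells and group them round-robin into stripes."""
    cells = [payload[i:i + cell_size] for i in range(0, len(payload), cell_size)]
    return [cells[i:i + data_units] for i in range(0, len(cells), data_units)]


# Tiny illustrative values so the layout is easy to read:
# 2-byte cells, 3 data units per stripe.
stripes = split_into_stripes(b"abcdefghijklmn", cell_size=2, data_units=3)

# With the real sizes, byte 5 of the eighth megabyte lands in stripe 1, data unit 1.
position = locate_cell(7 * CELL_SIZE + 5)
```

Because consecutive cells go to different data units, a client can write to all of them at the same time, which is the parallel write the slide refers to.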
EC - Write Failure Handling
• Data node failure
• Client ignores the failed data node and continues writing
• Reed-Solomon(6,3) is able to tolerate 3 data node failures
• Requires at least 6 data nodes
• Missing blocks will be constructed later
Diagram: HDFS Client writing 1 MB data cells and parity cells to DataNode 1 to DataNode 9; the remaining data nodes return ACKs.
EC - Read Path
• Read data from 6 data nodes in parallel
Diagram: HDFS Client reads the 1 MB data cells of a block from DataNode 1 to DataNode 6 (DataNode 7 to DataNode 9 hold parity).
EC - Read Failure Handling
• Read data from 6 arbitrary data nodes in parallel
• Read parity block to reconstruct missing data block
Diagram: HDFS Client reads 1 MB data cells plus parity cells from the available data nodes and reconstructs the missing data block.
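The read decision for one stripe can be sketched as below. It assumes units 0 to 5 hold data and units 6 to 8 hold parity, and that any 6 of the 9 units are enough to serve the read; `plan_read` is a hypothetical helper and does no actual decoding.

```python
def plan_read(available_units, data_units=6, parity_units=3):
    """Pick the units to read for one stripe and report whether decoding is needed.

    Units 0..data_units-1 hold data, the remaining units hold parity.
    """
    total = data_units + parity_units
    alive = sorted(u for u in set(available_units) if 0 <= u < total)
    if len(alive) < data_units:
        raise ValueError("too many failures: stripe cannot be reconstructed")
    # Sorted order prefers data units, so parity is only read when necessary.
    chosen = alive[:data_units]
    needs_decoding = any(unit >= data_units for unit in chosen)
    return chosen, needs_decoding


healthy = plan_read(range(9))                         # all data nodes alive
degraded = plan_read([0, 1, 3, 4, 5, 6, 7, 8])        # unit 2 is missing
```

In the healthy case only the six data units are read; in the degraded case one parity unit replaces the missing data unit and the client has to decode.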
EC - Network behavior
• Pros
• Low latency because of parallel read & write
• Good for small file sizes
• Cons
• Requires high network bandwidth between client & server
• Dead data nodes result in high network traffic and reconstruction time
EC - Coder implementation
• Legacy coder
• From Facebook’s HDFS-RAID project
• [Umbrella] HADOOP-11264
• Pure Java coder
• Code improvements over HDFS-RAID
• HADOOP-11542
• Intel ISA-L coder
• Native coder with Intel’s Intelligent Storage Acceleration Library
• Accelerates EC-related linear algebra calculations by exploiting advanced hardware instruction sets like SSE, AVX, and AVX2
• HADOOP-11540
EC - Coder performance I
EC - Coder performance II
EC - Coder performance III
2+ Name Nodes (HDFS-6440)
• Hadoop 1
• No built-in High Availability
• Needed to be solved yourself, e.g. via VMware
• Hadoop 2
• High Availability out-of-the-box via Active-Passive pattern (1 Active NameNode, 1 Standby NameNode)
• A failed NameNode needed to be recovered immediately
• Hadoop 3
• 1 Active NameNode with N Standby NameNodes
• Trade-off between operation costs vs. hardware costs
Intra-DataNode Balancer (HDFS-1312)
• Hadoop already has a Balancer between DataNodes
• Needs to be called manually by design
• Typically used after adding additional worker nodes
• The Disk Balancer lets administrators rebalance data across multiple disks of a DataNode
• It is useful to correct skewed data distribution often seen after adding or replacing disks
• Adds the hdfs diskbalancer command, which submits a plan and returns without waiting for it to finish; the DataNode executes the moves itself
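What a rebalancing plan for the disks of one DataNode could look like is sketched below. This is a simple greedy illustration under the assumption that every disk should end up at the node's average utilization; it is not the planner HDFS actually uses, and `plan_moves` as well as the disk names are made up.

```python
def plan_moves(disks):
    """Build a list of (source disk, target disk, amount) moves.

    disks maps a disk name to (used, capacity) in the same unit.
    """
    total_used = sum(used for used, _ in disks.values())
    total_capacity = sum(capacity for _, capacity in disks.values())
    target = total_used / total_capacity  # average utilization of the node

    # Positive surplus: disk holds too much data; negative: it has room.
    surplus = {name: used - target * capacity
               for name, (used, capacity) in disks.items()}
    donors = sorted((n for n in surplus if surplus[n] > 0), key=lambda n: -surplus[n])
    takers = sorted((n for n in surplus if surplus[n] < 0), key=lambda n: surplus[n])

    moves = []
    for source in donors:
        for destination in takers:
            amount = min(surplus[source], -surplus[destination])
            if amount <= 0:
                continue
            moves.append((source, destination, amount))
            surplus[source] -= amount
            surplus[destination] += amount
    return moves


# Hypothetical node: two older disks and one freshly added, empty disk.
plan = plan_moves({"disk1": (900, 1000), "disk2": (600, 1000), "disk3": (0, 1000)})
```

The plan is only a description of the moves; in line with the slide, carrying it out would be left to the DataNode.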
YARN
https://unsplash.com/photos/WMWF9WcDBOw
YARN - Built-in support for long-running services
• Simplified and first-class support for services (YARN-4692)
• Abstract common framework to support long running service (similar to Apache Slider)
• More simplified API for managing the service lifecycle of YARN Apps
• Better support for long running service
• Recognition of long running service (YARN-4725)
• Auto-restart of containers
• Containers for long running service are retried at same node in case of local state
• Service/Application upgrade support (YARN-4726)
• Hold on to containers during an upgrade of the YARN App
• Dynamic container resizing (YARN-1197)
• Ask only for minimum resources at start and adjust them at runtime
• Currently the only way is releasing containers and allocating new containers with the expected size
YARN - Resource Isolation & Docker
• Better Resource Isolation
• Support for disk isolation (YARN-2619)
• Support for network isolation (YARN-2140)
• Uses cgroups to give containers their fair share
• Docker support in LinuxContainerExecutor (YARN-3611)
• The LinuxContainerExecutor already provides functionality around localization, cgroups based resource management and isolation for CPU, network, disk, etc. as well as security mechanisms
• Support Docker containers to be run inside of LinuxContainerExecutor
• Offers packaging and resource isolation
• Complements YARN’s support for long-running services
YARN - Service Discovery
• Services can run on any YARN node
• Dynamic IP, can change in case of node failures, etc.
• YARN Service Discovery via DNS (YARN-4757)
• The YARN service registry already provides facilities for applications to register their endpoints and for clients to discover them, but they are only available via Java API and REST
• Expose service information via an already available discovery mechanism: DNS
• Current YARN Service Registry records need to be converted into DNS entries
• Discovery of the container IP and service port via standard DNS lookups
• Mapping of Applications, e.g.
zkapp1.griduser.yarncluster.com -> 172.17.0.2
• Mapping of containers, e.g.
container-e3741-1454001598828-0131-01000004.yarncluster.com -> 172.17.0.3
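The two mappings above can be mimicked with a small sketch. It assumes that an application name is built as application, user and domain, and a container name as container ID and domain, which is read off the two examples rather than taken from a specification; the in-memory `records` dictionary stands in for a real DNS server.

```python
def application_dns_name(app_name, user, domain):
    """Assumed pattern: <application>.<user>.<domain>."""
    return ".".join([app_name, user, domain]).lower()


def container_dns_name(container_id, domain):
    """Assumed pattern: <container id>.<domain>."""
    return f"{container_id}.{domain}".lower()


# Stand-in for the DNS entries derived from the YARN Service Registry records.
records = {
    application_dns_name("zkapp1", "griduser", "yarncluster.com"): "172.17.0.2",
    container_dns_name("container-e3741-1454001598828-0131-01000004",
                       "yarncluster.com"): "172.17.0.3",
}


def resolve(name):
    """Look up an address the way a client would via a standard DNS query."""
    return records[name.lower()]
```

Against a real cluster the lookup would be an ordinary DNS query, so clients need neither the Java API nor the REST interface.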
YARN - Scheduling enhancements
• Generic Resource Types
• Abstract ResourceTypes to allow new resources, like GPU, Network, etc.
• Resource profiles for containers, like small, medium, large, etc., similar to EC2 instance types
• Global Scheduling (YARN-5139)
• Currently YARN scheduling is done one node at a time on arrival of heartbeats, which can lead to suboptimal decisions
• With global scheduling, the YARN scheduler looks at more nodes and selects the best nodes based on application requirements, which leads to a globally optimal placement and enhanced container scheduling throughput
• Application priorities within a queue (YARN-1963)
• For example, in queue Marketing Hive jobs > MapReduce jobs
• Inter-Queue priorities (YARN-4945)
• Queue 1 > Queue 2, irrespective of demand & capacity
• Previously based only on unconsumed capacity
• Affinity / Anti-Affinity (YARN-1042)
• More fine-grained constraints on locations, e.g. do not allocate HBase Region servers and Storm workers on the same host
• Gang Scheduling (YARN-624)
• Allow allocation of sets of containers, e.g. 1 container with 128GB of RAM and 16 cores OR 100 containers with 2GB of RAM and 1 core
• Can be achieved already by holding on to containers but might lead to deadlocks and decreased cluster utilization
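The difference between heartbeat-driven and global scheduling can be sketched with a toy placement function. The sketch assumes memory is the only resource and that "application requirements" means a set of preferred nodes; both functions are illustrative and unrelated to the actual YARN scheduler code.

```python
def place_on_heartbeat(request_mb, heartbeat_order, free_mb):
    """One node at a time: take the first heartbeating node that fits."""
    for node in heartbeat_order:
        if free_mb[node] >= request_mb:
            return node
    return None


def place_globally(request_mb, free_mb, preferred=()):
    """Global view: prefer requested nodes, then the tightest fit."""
    candidates = [node for node, free in free_mb.items() if free >= request_mb]
    if not candidates:
        return None
    return min(candidates,
               key=lambda node: (node not in preferred,
                                 free_mb[node] - request_mb,
                                 node))


# Hypothetical cluster state: free memory per node in MB.
free = {"node1": 8192, "node2": 4096, "node3": 2048}

first_fit = place_on_heartbeat(2048, ["node1", "node2", "node3"], free)
data_local = place_globally(2048, free, preferred=("node2",))
tight_fit = place_globally(2048, free)
```

The heartbeat variant takes the large node simply because it reported first, while the global variant can honor the preferred node or keep the large node free for bigger requests.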
YARN - Use the force!
YARN
MapReduce Tez Spark
YARN
MapReduce Tez Spark
YARN - New UI (YARN-3368)
Application Timeline Service v2 (YARN-2928)
Why?
• Scalability & Performance
• Single global instance of Writer/Reader
• Local disk based LevelDB storage
• Reliability
• Failure handling with local disk
• Single point-of-failure
• Usability
• Add configuration and metrics as first-class members
• Better support for queries
• Flexibility
• Data model is more descriptive
Core Concepts
• Distributed write path
• Logical per app collector
• Separate reader instances
• Pluggable backend storage
• HBase
• Enhanced internal data model
• Metrics Aggregation
• Richer REST API for queries
Revolution or evolution?
https://unsplash.com/photos/Cvf1IqUel9w
Summary of Hadoop 3.0
• Major release, expect it end of 2017
• Shiny new features like Erasure Coding and better support for long-running services & Docker
• Expect some changes in administration of your existing Hadoop clusters
• But Ambari & Wire compatibility & Rolling upgrade
Big Data Lake
Diagram (Data Sources -> Data Systems -> Applications):
• Data Sources: Traditional Sources (RDBMS, OLTP, OLAP, …) and New Sources (Logs, Sensor, Social Media, …)
• Data Systems: Enterprise Hadoop (Big Data Lake, Hadoop), Traditional Data System
• Applications: Business Intelligence, SAS LASR Server, Zeppelin
• Requirements: Governance, Security, Operations, Access
• Tools: Apache Zeppelin, Ambari Views, Apache Ambari, Cloudbreak, Apache Ranger, Apache Knox, Apache Atlas, Ambari Falcon
• Storage & Processing: HDFS & YARN
Hadoop & Cloud
• Microsoft
• HDInsight
• Cloudbreak & Azure
• Amazon
• Hortonworks Data Cloud (HDC)
• Cloudbreak & EC2
• Google Cloud
• Cloudbreak & GCP
• They suggest spinning up a cluster per job
• YARN Support for Docker
• Think packaging of apps like TensorFlow, etc.
• Better support for long-running services
• Think dynamic resizing of cluster resources
• Service Discovery
• Scheduling enhancements
• Think automatic up- and downscaling
• Let’s welcome Hadoop to the cloud age!
Hadoop & Streaming & IoT
Diagram: Batch Layer (min - h), Speed Layer (ms - s), …
#1 Lambda Architecture
#2 Hadoop as long-term storage
• Better YARN support for long-running services
• Think Spark Streaming
• Erasure Coding for more efficient cold storage
• Apache NiFi & HDF
• Rumors about integration of Apache Flink into HDP 3.0
Data Science & Machine Learning
• YARN support for GPUs and other resources
• YARN & Docker for packaging of apps
• Better integration of Spark & MLlib
• TensorFlow on the rise
https://unsplash.com/photos/iWYrCr8eGwU
Final Conclusion
…but it’s not a revolution!
https://imgflip.com/i/mkovb
Twitter: @uweprintz
Mail: uwe.printz@codecentric.de, uwe.seiler@codecentric.de
Phone: +49 176 1076531
XING: https://www.xing.com/profile/Uwe_Printz
Thank you!
Copyright by Uwe Printz
Slide 1: https://unsplash.com/photos/CIXoFys3gsw
Slide 2: Copyright by Uwe Printz
Slide 7: https://unsplash.com/photos/LHlwgjbSo3k
Slide 26: https://unsplash.com/photos/WMWF9WcDBOw
Slide 34: https://unsplash.com/photos/Cvf1IqUel9w
Slide 39: https://unsplash.com/photos/iWYrCr8eGwU
Slide 40: https://imgflip.com/i/mkovb
Slide 41: Copyright by Uwe Printz
All pictures CC0 or shot by the author

Weitere ähnliche Inhalte

Was ist angesagt?

From docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayFrom docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayDataWorks Summit
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldDataWorks Summit
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnhdhappy001
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideDouglas Bernardini
 
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges DataWorks Summit
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureDataWorks Summit
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownDataWorks Summit
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementDataWorks Summit/Hadoop Summit
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneErik Krogen
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationAlex Moundalexis
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
CBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFSCBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFSDataWorks Summit
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Simplilearn
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityDinesh Chitlangia
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 

Was ist angesagt? (20)

HDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFSHDFS Tiered Storage: Mounting Object Stores in HDFS
HDFS Tiered Storage: Mounting Object Stores in HDFS
 
Hadoop 1.x vs 2
Hadoop 1.x vs 2Hadoop 1.x vs 2
Hadoop 1.x vs 2
 
From docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayFrom docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native way
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarnBikas saha:the next generation of hadoop– hadoop 2 and yarn
Bikas saha:the next generation of hadoop– hadoop 2 and yarn
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config Guide
 
Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges Big data processing meets non-volatile memory: opportunities and challenges
Big data processing meets non-volatile memory: opportunities and challenges
 
High Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and FutureHigh Availability for HBase Tables - Past, Present, and Future
High Availability for HBase Tables - Past, Present, and Future
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
Hadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of OzoneHadoop Meetup Jan 2019 - Overview of Ozone
Hadoop Meetup Jan 2019 - Overview of Ozone
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Improving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux ConfigurationImproving Hadoop Cluster Performance via Linux Configuration
Improving Hadoop Cluster Performance via Linux Configuration
 
Apache Hadoop 0.22 and Other Versions
Apache Hadoop 0.22 and Other VersionsApache Hadoop 0.22 and Other Versions
Apache Hadoop 0.22 and Other Versions
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
CBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFSCBlocks - Posix compliant files systems for HDFS
CBlocks - Posix compliant files systems for HDFS
 
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Ozone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalabilityOzone - Evolution of hdfs scalability
Ozone - Evolution of hdfs scalability
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 

Ähnlich wie Hadoop 3.0 - Revolution or evolution?

Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateDataWorks Summit
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)VMware Tanzu
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Chris Nauroth
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInDataWorks Summit
 
Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the ElephantDataWorks Summit
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Etu Solution
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfsNAVER D2
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaDataWorks Summit
 
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...Ceph Community
 
Scalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsScalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsLars Nielsen
 

Ähnlich wie Hadoop 3.0 - Revolution or evolution? (20)

Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS What's new in Hadoop Common and HDFS
What's new in Hadoop Common and HDFS
 
MYSQL
MYSQLMYSQL
MYSQL
 
Lecture 2 part 1
Lecture 2 part 1Lecture 2 part 1
Lecture 2 part 1
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
Hadoop
HadoopHadoop
Hadoop
 
Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the Elephant
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
 
Kafka & Hadoop in Rakuten
Kafka & Hadoop in RakutenKafka & Hadoop in Rakuten
Kafka & Hadoop in Rakuten
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
Ceph Day London 2014 - Best Practices for Ceph-powered Implementations of Sto...
 
Scalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data SystemsScalable Storage for Massive Volume Data Systems
Scalable Storage for Massive Volume Data Systems
 

Mehr von Uwe Printz

Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureUwe Printz
 
Lightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesLightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesUwe Printz
 
Hadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceHadoop 2 - More than MapReduce
Hadoop 2 - More than MapReduceUwe Printz
 
MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)Uwe Printz
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceUwe Printz
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)Uwe Printz
 
MongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererMongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererUwe Printz
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter StormUwe Printz
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Map/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBMap/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBUwe Printz
 
First meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtFirst meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtUwe Printz
 




Hadoop 3.0 - Revolution or evolution?

  • 1. Hadoop 3.0 Revolution or evolution? uweprintz https://unsplash.com/photos/CIXoFys3gsw
  • 3. Some Hadoop history • 2006: Hadoop 1 (HDFS: redundant, reliable storage; MapReduce: cluster resource mgmt. + data processing). Let there be batch! Era of Silicon Valley Hadoop • Oct. 2013: Hadoop 2 (HDFS: redundant, reliable storage; YARN: cluster resource management; on top: MapReduce for data processing, Hive for SQL, Spark for in-memory, …). Let there be YARN apps! Era of Enterprise Hadoop • Late 2017: Hadoop 3 (?), surrounded by IoT, machine learning, GPUs, TensorFlow, data science, streaming data, cloud, FPGAs, artificial intelligence, Kafka. Let there be …? Era of ?
  • 4. Why Hadoop 3.0? • Deprecated APIs can only be removed in a major release • API compatibility is not fully preserved • Wire-compatibility could be broken • Change of default ports • But preserves wire-compatibility with Hadoop 2 clients • Will support rolling upgrades from Hadoop 2 to Hadoop 3 • Rewrite of the Hadoop command scripts • Big features that need a major release to stabilize
  • 5. What is Hadoop 3.0? [Figure: release timeline, 2009 to 2017. Source: Akira Ajisaka, with additions by Uwe Printz] • branch-1 (branch-0.20), Hadoop 1 (EOL): 0.20.1, 0.20.205, 1.0.0, 1.1.0, 1.2.1 (stable) • Other early releases: 0.21.0, 0.22.0, and branch-0.23 with 0.23.0 through 0.23.11 (final) • branch-2, Hadoop 2: 2.0.0-alpha, 2.1.0-beta, 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0, 2.7.0, 2.8.1 • trunk, Hadoop 3: 3.0.0-alpha1, 3.0.0-alpha2, 3.0.0-alpha3, 3.0.0-alpha4, 3.0.0-beta1 (15.09.), GA (01.11.) • Features marked along the timeline: security, new append, NameNode Federation, YARN, NameNode HA, HDFS snapshots, NFSv3 support, heterogeneous storage, HDFS in-memory caching, HDFS ACLs, HDFS rolling upgrades, RM automatic failover, HDFS extended attributes, YARN rolling upgrades, transparent encryption, archival storage, drop JDK6 support, file truncate API, Docker container in Linux, ATS 1.5 • Hadoop 2 and 3 diverged 5+ years ago
  • 6. Hadoop 3.0 in a nutshell • HDFS: erasure codes; low-level performance enhancements with Intel ISA-L; 2+ NameNodes; Intra-DataNode Balancer • YARN: better support for long-running services; improved isolation & Docker support; scheduler enhancements; Application Timeline Service v2; new UI • MapReduce: task-level native optimization; derive heap size automatically • DevOps: drop JDK7 & move to JDK8; change of default ports; library & dependency upgrade; client-side classpath isolation; shell script rewrite & ShellDoc; .hadooprc & .hadoop-env; metrics plugin for Kafka
  • 8. HDFS - Current implementation • 3 replicas by default • Tolerates a maximum of 2 failures • Simple, scalable & robust • 200% space overhead • [Figure: write path. The HDFS client sends a write request to the NameNode and gets a lease for the file, splits the file into blocks, requests data nodes and receives a list of data nodes, calculates the checksum, and writes block + checksum through a write pipeline from DataNode 1 to DataNode 2 to DataNode 3; ACKs flow back until the write is complete]
  • 9. Erasure Coding (EC) • k data blocks + m parity blocks • Example: Reed-Solomon (6,3) • [Figure: raw data is split into data cells (d), encoding adds parity cells (p), then data and parity are stored] • Key points • XOR coding -> saves space, slower recovery • Missing or corrupt data will be restored from available data and parity • Parity can be smaller than data
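The restore-from-parity idea on this slide can be sketched with plain XOR, the simplest erasure code. This is an illustration only, not HDFS code: the block contents and the `xor_blocks` helper are made up for the example. XOR parity survives just one lost block, whereas Reed-Solomon (6,3) keeps three parity blocks and survives three losses.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data blocks of equal length.
data = [b"HDFS", b"YARN", b"MR__"]

# One parity block protects all three data blocks,
# so the parity is smaller than the data it protects.
parity = xor_blocks(data)

# Simulate losing data[1]: XOR the surviving blocks with the parity.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)
```

Recovery has to read every surviving block plus the parity, which is why the slide lists slower recovery as the price for the saved space.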
  • 10. EC - Main characteristics • Maximum fault tolerance: Replication (factor 1) 0; Replication (factor 3) 2; Reed-Solomon (6,3) 3; Reed-Solomon (10,4) 4 • Space efficiency: Replication (factor 1) 100 %; Replication (factor 3) 33 %; Reed-Solomon (6,3) 67 %; Reed-Solomon (10,4) 71 % • Replication vs. erasure coding: data locality Yes vs. No (Phase 1) / Yes (Phase 2); write performance High vs. Low; read performance High vs. Medium; recovery costs Low vs. High • Pluggable implementation, first choice • Storage tiers: Hot (Memory/SSD, 20 x day); Warm (Disk, 5 x week); Cold (Dense Disk, 5 x month); Frozen (EC, 2 x year)
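The fault tolerance and space efficiency figures on this slide follow from simple arithmetic: n-fold replication stores n units per unit of data and survives n - 1 losses, while Reed-Solomon (k, m) stores k + m units per k units of data and survives m losses. A small sketch, with function names chosen only for this example:

```python
def replication(factor):
    """Return (data units, stored units, tolerated failures) for n-fold replication."""
    return 1, factor, factor - 1


def reed_solomon(k, m):
    """Return (data units, stored units, tolerated failures) for RS(k, m)."""
    return k, k + m, m


def summarize(scheme):
    """Turn a scheme tuple into rounded percentages as shown on the slide."""
    data, stored, tolerated = scheme
    return {
        "fault_tolerance": tolerated,
        "space_efficiency_pct": round(100 * data / stored),
        "space_overhead_pct": round(100 * (stored - data) / data),
    }


schemes = {
    "Replication (factor 1)": replication(1),
    "Replication (factor 3)": replication(3),
    "Reed-Solomon (6,3)": reed_solomon(6, 3),
    "Reed-Solomon (10,4)": reed_solomon(10, 4),
}

for name, scheme in schemes.items():
    print(name, summarize(scheme))
```

The same helper reproduces the 200% space overhead quoted for 3-fold replication, against 50% for Reed-Solomon (6,3).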
  • 11. EC - Contiguous blocks • Approach 1: Retain block size and add parity • Pro: Better for locality • Con: Significant overhead for smaller files, always 3 parity blocks needed • Con: Client potentially needs to process gigabytes of data for encoding • [Figure: files A, B and C are stored as six 128 MB blocks (Block 1 to Block 6) on DN 3, DN 2, DN 12, DN 7, DN 5 and DN 1; encoding adds three parity blocks on DN 6, DN 4 and DN 8]
  • 12. EC - Striping • Approach 2: Splitting blocks into smaller cells (1 MB) • Pro: Works for small files • Pro: Allows parallel write • Con: No data locality -> increased read latency & more complicated recovery process • [Figure: cells of files A, B and C are written round-robin in stripes (Stripe 1, Stripe 2, …, Stripe n) across six blocks on DN 7, DN 3, DN 4, DN 1, DN 6 and DN 12; encoding adds three parity blocks on DN 10, DN 8 and DN 14]
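The round-robin cell layout can be sketched as a mapping from a byte offset in the file to a stripe and a data block. This is a simplified model, not the HDFS implementation: it assumes the 1 MB cells from the slide, six data blocks as in Reed-Solomon (6,3), and a file that fits into a single group of blocks. The `locate` helper is hypothetical.

```python
CELL_SIZE = 1024 * 1024  # 1 MB cells, as on the slide
DATA_BLOCKS = 6          # assumption: RS(6,3), six data blocks per stripe


def locate(offset, cell_size=CELL_SIZE, data_blocks=DATA_BLOCKS):
    """Map a file offset to (stripe index, data block index, offset inside the cell)."""
    cell = offset // cell_size          # which cell of the file holds this byte
    stripe = cell // data_blocks        # cells fill one stripe before the next begins
    block = cell % data_blocks          # round-robin over the data blocks
    return stripe, block, offset % cell_size


# The first six cells land on six different blocks, which is what allows
# parallel writes; the seventh cell wraps around to the first block.
first_stripe = [locate(i * CELL_SIZE)[1] for i in range(6)]
wrapped = locate(6 * CELL_SIZE)
```

Because consecutive cells of one file sit on different data nodes, a reader of a contiguous range has to contact several nodes, which is the loss of data locality noted on the slide.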
  • 13. EC - Apache Hadoop’s decision (HDFS-7285) • Start from striping to deal with smaller files • [Figure: storage systems arranged by layout (contiguous vs. striping) and redundancy (replication vs. erasure coding): HDFS, Facebook f4, Azure, Ceph (before Firefly), Lustre, Ceph (with Firefly), QFS] • Phase 1.1: HDFS-7285 • Phase 1.2: HDFS-8031 • Phase 2: HDFS-8030 • Phase 3: future work • Hadoop 3.0.x implements Phase 1.1
  • 14. EC - Shell Command • Create an EC zone on an empty directory • All files under a zone directory are automatically erasure coded • Renames across zones with different EC schemas are disallowed • Usage: hdfs erasurecode [generic options] [-getPolicy <path>] [-help [cmd ...]] [-listPolicies] [-setPolicy [-p <policyName>] <path>] • -getPolicy <path>: get erasure coding policy information for the specified path • -listPolicies: get the list of supported erasure coding policies • -setPolicy [-p <policyName>] <path>: set the specified erasure coding policy on a directory • Options: -p <policyName>: name of the erasure coding policy used to encode files; if not passed, the default policy is used • <path>: path to a directory; files under this directory are encoded using the specified erasure coding policy
  • 15. EC - Write Path • Parallel write • Client writes to 9 data nodes at the same time • Calculate parity at client, at write time • Durability • Reed-Solomon (6,3) can tolerate max. 3 failures • Visibility • Read is supported for files being written • Appendable • Files can be reopened for appending data • [Figure: the HDFS client splits data into stripes of 1 MB cells and calculates parity, writes data cells to DataNode 1 through DataNode 6 and parity to DataNode 7, 8 and 9, and receives ACKs]
  • 16. EC - Write Failure Handling • Data node failure • Client ignores the failed data node and continues writing • Reed-Solomon (6,3) is able to tolerate 3 data node failures • Requires at least 6 data nodes • Missing blocks will be reconstructed later • [Figure: the HDFS client keeps writing 1 MB data and parity cells to the remaining data nodes and receives their ACKs]
  • 17. EC - Read Path • Read data from 6 data nodes in parallel • [Figure: the HDFS client reads 1 MB data cells from DataNode 1 through DataNode 6 and assembles the block]
  • 18. EC - Read Failure Handling • Read data from 6 arbitrary data nodes in parallel • Read parity block to reconstruct missing data block • [Figure: the HDFS client reads the available 1 MB data cells plus parity and reconstructs the missing data block]
  • 19. EC - Network behavior • Pros • Low latency because of parallel read & write • Good for small file sizes • Cons • Requires high network bandwidth between client & server • Dead data nodes result in high network traffic and long reconstruction time
  • 20. EC - Coder implementation • Legacy coder • From Facebook’s HDFS-RAID project • [Umbrella] HADOOP-11264 • Pure Java coder • Code improvements over HDFS-RAID • HADOOP-11542 • Intel ISA-L coder • Native coder with Intel’s Intelligent Storage Acceleration Library • Accelerates EC-related linear algebra calculations by exploiting advanced hardware instruction sets like SSE, AVX, and AVX2 • HADOOP-11540
  • 21. EC - Coder performance I
  • 22. EC - Coder performance II
  • 23. EC - Coder performance III
  • 24. 2+ NameNodes (HDFS-6440) • Hadoop 1 • No built-in high availability • You needed to solve it yourself, e.g. via VMware • Hadoop 2 • High availability out of the box via an active-passive pattern (one active NameNode, one standby NameNode) • A failed NameNode needed to be recovered immediately • Hadoop 3 • 1 active NameNode with N standby NameNodes • Trade-off between operation costs and hardware costs
  • 25. Intra-DataNode Balancer (HDFS-1312) • Hadoop already has a Balancer between DataNodes • Needs to be called manually by design • Typically used after adding additional worker nodes • The Disk Balancer lets administrators rebalance data across multiple disks of a DataNode • It is useful to correct the skewed data distribution often seen after adding or replacing disks • Adds the hdfs diskbalancer command, which submits a plan but does not wait for it to finish executing; the DataNode performs the moves itself
  • 27. YARN - Built-in support for long-running services • Simplified and first-class support for services (YARN-4692) • Abstract common framework to support long-running services (similar to Apache Slider) • Simpler API for managing the service lifecycle of YARN apps • Better support for long-running services • Recognition of long-running services (YARN-4725) • Auto-restart of containers • Containers of long-running services are retried on the same node if they have local state • Service/application upgrade support (YARN-4726) • Hold on to containers during an upgrade of the YARN app • Dynamic container resizing (YARN-1197) • Ask only for minimum resources at start and adjust them at runtime • Currently the only way is to release containers and allocate new containers with the expected size
  • 28. YARN - Resource Isolation & Docker • Better resource isolation • Support for disk isolation (YARN-2619) • Support for network isolation (YARN-2140) • Uses cgroups to give containers their fair share • Docker support in LinuxContainerExecutor (YARN-3611) • The LinuxContainerExecutor already provides functionality around localization, cgroups-based resource management and isolation for CPU, network, disk, etc., as well as security mechanisms • Supports running Docker containers inside the LinuxContainerExecutor • Offers packaging and resource isolation • Complements YARN’s support for long-running services
  • 29. YARN - Service Discovery • Services can run on any YARN node • Dynamic IP, can change in case of node failures, etc. • YARN Service Discovery via DNS (YARN-4757) • The YARN service registry already provides facilities for applications to register their endpoints and for clients to discover them, but they are only available via Java API and REST • Expose service information via an already available discovery mechanism: DNS • Current YARN Service Registry records need to be converted into DNS entries • Discovery of the container IP and service port via standard DNS lookups • Mapping of applications, e.g. zkapp1.griduser.yarncluster.com -> 172.17.0.2 • Mapping of containers, e.g. container-e3741-1454001598828-0131-01000004.yarncluster.com -> 172.17.0.3
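With DNS-based discovery, a client needs no YARN-specific library: it builds a host name and resolves it like any other. The sketch below assumes the naming pattern that the two examples on the slide suggest (application, then user, then cluster domain; container label, then cluster domain). The helper names are hypothetical, and `resolve` only works against a DNS server that actually serves such records.

```python
import socket


def app_dns_name(app, user, domain):
    """Build an application name such as zkapp1.griduser.yarncluster.com (assumed pattern)."""
    return f"{app}.{user}.{domain}"


def container_dns_name(container_label, domain):
    """Build a container name from an already DNS-safe container label (assumed pattern)."""
    return f"{container_label}.{domain}"


def resolve(name):
    """Standard DNS lookup; raises socket.gaierror if the name is unknown."""
    return socket.gethostbyname(name)


app_name = app_dns_name("zkapp1", "griduser", "yarncluster.com")
container_name = container_dns_name(
    "container-e3741-1454001598828-0131-01000004", "yarncluster.com"
)
# On a cluster with the DNS service enabled, resolve(app_name) would return
# the current IP of the service, even after a container has moved.
```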
  • 30. YARN - Scheduling enhancements • Generic resource types • Abstract ResourceTypes to allow new resources, like GPU, network, etc. • Resource profiles for containers, like small, medium, large, etc., similar to EC2 instance types • Global scheduling (YARN-5139) • Currently YARN scheduling is done one node at a time on arrival of heartbeats, which can lead to suboptimal decisions • With global scheduling, the YARN scheduler looks at more nodes and selects the best nodes based on application requirements, which leads to a globally optimal placement and enhanced container scheduling throughput • Application priorities within a queue (YARN-1963) • For example, in queue Marketing: Hive jobs > MapReduce jobs • Inter-queue priorities (YARN-4945) • Queue 1 > Queue 2, irrespective of demand & capacity • Previously based only on unconsumed capacity • Affinity / anti-affinity (YARN-1042) • Finer-grained constraints on locations, e.g. do not allocate HBase region servers and Storm workers on the same host • Gang scheduling (YARN-624) • Allow allocation of sets of containers, e.g. 1 container with 128 GB of RAM and 16 cores OR 100 containers with 2 GB of RAM and 1 core • Can already be achieved by holding on to containers, but this might lead to deadlocks and decreased cluster utilization
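The anti-affinity example from this slide (no HBase region server next to a Storm worker) can be modelled in a few lines. This is a toy model and not the YARN scheduler API: hosts, tags and the `pick_host` helper are invented for the illustration, and a real scheduler would also weigh free resources, queues and priorities.

```python
def pick_host(tag, hosts, conflicts):
    """Return the first host that runs nothing conflicting with `tag`, else None."""
    forbidden = conflicts.get(tag, set())
    for host, running in hosts.items():
        if not (running & forbidden):
            return host
    return None


# Hypothetical cluster state: which workload tags already run on which host.
hosts = {
    "node1": {"storm-worker"},
    "node2": {"hbase-regionserver"},
    "node3": set(),
}

# Anti-affinity rules, declared in both directions.
conflicts = {
    "hbase-regionserver": {"storm-worker"},
    "storm-worker": {"hbase-regionserver"},
}

hbase_host = pick_host("hbase-regionserver", hosts, conflicts)  # skips node1
storm_host = pick_host("storm-worker", hosts, conflicts)        # node1 is fine

# If every host runs a Storm worker, the region server cannot be placed.
crowded = {"node1": {"storm-worker"}, "node2": {"storm-worker"}}
unplaceable = pick_host("hbase-regionserver", crowded, conflicts)
```

The loop also hints at why global scheduling helps: a scheduler that only sees the node whose heartbeat just arrived cannot compare it against better candidates elsewhere.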
  • 31. YARN - Use the force! YARN MapReduce Tez Spark YARN MapReduce Tez Spark
  • 32. YARN - New UI (YARN-3368)
  • 33. Application Timeline Service v2 (YARN-2928) • Why? • Scalability & performance • Single global instance of writer/reader • Local-disk-based LevelDB storage • Reliability • Failure handling with local disk • Single point of failure • Usability • Add configuration and metrics as first-class members • Better support for queries • Flexibility • More descriptive data model • Core concepts • Distributed write path • Logical per-app collector • Separate reader instances • Pluggable backend storage • HBase • Enhanced internal data model • Metrics aggregation • Richer REST API for queries
  • 35. Summary of Hadoop 3.0 • Major release, expect it at the end of 2017 • Shiny new features like Erasure Coding and better support for long-running services & Docker • Expect some changes in the administration of your existing Hadoop clusters • But: Ambari, wire compatibility & rolling upgrades
  • 36. Big Data Lake [Figure: layers Data Sources, Data Systems, Applications] Traditional Sources: RDBMS, OLTP, OLAP, … Business Intelligence. New Sources: Logs, Sensor, …, Social Media. Enterprise Hadoop, SAS LASR Server, Zeppelin. Big Data Lake: Hadoop, Traditional Data System. Requirements: Governance, Security, Operations, Access. Apache Zeppelin, Ambari Views, Apache Ambari, Cloudbreak, Apache Ranger, Apache Knox, Apache Atlas, Ambari, Falcon. Storage & Processing: HDFS & YARN
  • 37. Hadoop & Cloud • Microsoft • HDInsight • Cloudbreak & Azure • Amazon • Hortonworks Data Cloud (HDC) • Cloudbreak & EC2 • Google Cloud • Cloudbreak & GCP • They suggest spinning up a cluster per job • YARN support for Docker • Think packaging of apps like TensorFlow, etc. • Better support for long-running services • Think dynamic resizing of cluster resources • Service discovery • Scheduling enhancements • Think automatic up- and downscaling • Let’s welcome Hadoop to the cloud age!
  • 38. Hadoop & Streaming & IoT Batch Layer Speed Layer … ms - s min - h #1 Lambda Architecture #2 Hadoop as long-term storage • Better YARN support for long-running services • Think Spark Streaming • Erasure Coding for more efficient cold storage • Apache NiFi & HDF • Rumors about integration of Apache Flink into HDP 3.0
  • 39. Data Science & Machine Learning • YARN support for GPUs and other resources • YARN & Docker for packing of apps • Better integration of Spark & MLlib • TensorFlow on the rise https://unsplash.com/photos/iWYrCr8eGwU
  • 40. Final Conclusion …but it’s not a revolution! https://imgflip.com/i/mkovb
  • 42. Slide 1: https://unsplash.com/photos/CIXoFys3gsw Slide 2: Copyright by Uwe Printz Slide 7: https://unsplash.com/photos/LHlwgjbSo3k Slide 26: https://unsplash.com/photos/WMWF9WcDBOw Slide 34: https://unsplash.com/photos/Cvf1IqUel9w Slide 39: https://unsplash.com/photos/iWYrCr8eGwU Slide 40: https://imgflip.com/i/mkovb Slide 41: Copyright by Uwe Printz All pictures CC0 or shot by the author