SlideShare ist ein Scribd-Unternehmen logo
1 von 71
Downloaden Sie, um offline zu lesen
25.03.2014
Welcome
to
uweseiler
25.03.2014
2
Your Travel Guide
Big Data Nerd
TravelpiratePhotography Enthusiast
Hadoop Trainer NoSQL Fan Boy
25.03.2014
2
Your Travel Agency
specializes on...
Big Data Nerds Agile Ninjas Continuous Delivery Gurus
Enterprise Java Specialists Performance Geeks
Join us!
25.03.2014
2
Your Travel Destination
Welcome Office
YARN Land
HDFS 2 Land
YARN App Land Enterprise Land
Goodbye Photo
25.03.2014
2
Stop #1: Welcome Office
Welcome Office
25.03.2014
2
…there was MapReduce
In the beginning of Hadoop
• It could handle data sizes way beyond those
of its competitors
• It was resilient in the face of failure
• It made it easy for users to bring their code
and algorithms to the data
25.03.2014
2
…but it was too low level
Hadoop 1 (2007)
25.03.2014
2
…but it was too rigid
Hadoop 1 (2007)
25.03.2014
2
HDFS
…but it was Batch
HDFS HDFS
Single App
Batch
Single App
Batch
Single App
Batch
Single App
Batch
Single App
Batch
Hadoop 1 (2007)
25.03.2014
2
…but it had limitations
Hadoop 1 (2007)
• Scalability
– Maximum cluster size ~ 4,500 nodes
– Maximum concurrent tasks – 40,000
– Coarse synchronization in JobTracker
• Availability
– Failure kills all queued and running jobs
• Hard partition of resources into map & reduce slots
– Low resource utilization
• Lacks support for alternate paradigms and services
25.03.2014
2
YARN to the rescue!
Hadoop 2 (2013)
25.03.2014
2
Stop #2: YARN Land
Welcome Office
YARN Land
25.03.2014
2
A brief history of Hadoop 2
• Originally conceived & architected by the
team at Yahoo!
– Arun Murthy created the original JIRA in 2008 and now is
the Hadoop 2 release manager
• The community has been working on
Hadoop 2 for over 4 years
• Hadoop 2 based architecture running at
scale at Yahoo!
– Deployed on 35,000+ nodes for 6+ months
25.03.2014
2
Hadoop 1
HDFS
Redundant, reliable
storage
Hadoop 2: Next-gen platform
MapReduce
Cluster resource mgmt.
+ data processing
Hadoop 2
HDFS 2
Redundant, reliable storage
MapReduce
Data processing
Single use system
Batch Apps
Multi-purpose platform
Batch, Interactive, Streaming, …
YARN
Cluster resource management
Others
Data processing
25.03.2014
2
Taking Hadoop beyond batch
Applications run natively in Hadoop
HDFS 2
Redundant, reliable storage
Batch
MapReduce
Store all data in one place
Interact with data in multiple ways
YARN
Cluster resource management
Interactive
Tez
Online
HOYA
Streaming
Storm, …
Graph
Giraph
In-Memory
Spark
Other
Search, …
25.03.2014
2
YARN: Design Goals
• Build a new abstraction layer by splitting up
the two major functions of the JobTracker
• Cluster resource management
• Application life-cycle management
• Allow other processing paradigms
• Flexible API for implementing YARN apps
• MapReduce becomes YARN app
• Lots of different YARN apps
25.03.2014
2
Hadoop: BigDataOS™
Storage:
File System
Traditional Operating
System
Execution / Scheduling:
Processes / Kernel
Storage:
HDFS
Hadoop
Execution / Scheduling:
YARN
25.03.2014
2
YARN: Architectural Overview
Split up the two major functions of the JobTracker
Cluster resource management & Application life-cycle management
ResourceManager
NodeManager NodeManager NodeManager NodeManager
NodeManager NodeManager NodeManager NodeManager
Scheduler
AM 1
Container 1.2
Container 1.1
AM 2
Container 2.1
Container 2.2
Container 2.3
25.03.2014
2
YARN: Multi-tenancy I
ResourceManager
NodeManager NodeManager NodeManager NodeManager
NodeManager NodeManager NodeManager NodeManager
Scheduler
MapReduce 1
map 1.2
map 1.1
MapReduce 2
map 2.1
map 2.2
reduce 2.1
NodeManager NodeManager NodeManager NodeManager
reduce 1.1 Tez map 2.3
reduce 2.2
vertex 1
vertex 2
vertex 3
vertex 4
HOYA
HBase Master
Region server 1
Region server 2
Region server 3 Storm
nimbus 1
nimbus 2
Different types of applications on the same cluster
25.03.2014
2
YARN: Multi-tenancy II
ResourceManager
NodeManager NodeManager NodeManager NodeManager
NodeManager NodeManager NodeManager NodeManager
Scheduler
MapReduce 1
map 1.2
map 1.1
MapReduce 2
map 2.1
map 2.2
reduce 2.1
NodeManager NodeManager NodeManager NodeManager
reduce 1.1 Tez map 2.3
reduce 2.2
vertex 1
vertex 2
vertex 3
vertex 4
HOYA
HBase Master
Region server 1
Region server 2
Region server 3 Storm
nimbus 1
nimbus 2
DWHUser
Ad-
Hoc
root
30% 60% 10%
Dev Prod
20% 80%
Dev Prod
Dev1 Dev2
25% 75%
60% 40%
Different users and
organizations on the
same cluster
25.03.2014
2
YARN: Your own app
ResourceManager
NodeManager NodeManager
NodeManager
Scheduler
Container
Container
MyApp
YARNClient
App specific
API
myAM
AMRMClient
NMClient
Application
Client
Protocol
Application
Master
Protocol
Container Management Protocol
DistributedShell is the
new WordCount!
25.03.2014
2
Stop #3: HDFS 2 Land
Welcome Office
YARN Land
HDFS 2 Land
25.03.2014
2
HDFS 2: In a nutshell
• Removes tight coupling of Block
Storage and Namespace
• Adds (built-in) High Availability
• Better Scalability & Isolation
• Increased performance
Details: https://issues.apache.org/jira/browse/HDFS-1052
25.03.2014
2
HDFS 2: Federation
NameNodes do not talk to each other
NameNodes manages
only slice of namespace
DataNodes can store
blocks managed by
any NameNode
NameNode 1
Namespace 1
Namespace
State
Block
Map
NameNode 2
Namespace 2
Block Pools
Pool 1 Pool 2
Block Storage as
generic storage service
Data Nodes
b3 b1
b2 b4
b2 b1
b5
b3 b2
b5 b4
JBOD JBOD JBOD
Namespace
State
Block
Map
Horizontally scale IO and storage
25.03.2014
2
HDFS 2: Architecture
Active NameNode Standby NameNode
DataNodeDataNodeDataNode DataNode DataNode
Maintains Block
Map and Edits File Simultaneously
reads and applies
the edits
Report to both NameNodes
Block
Map
Edits
File
Block
Map
Edits
File
NFS Shared state on NFS
OR
Quorum based storage
Journal
Node
Journal
Node
Journal
Node
Take orders
only from the
Active
or
25.03.2014
2
ZKFailover
Controller
ZKFailover
Controller
HDFS 2: High Availability
Active NameNode Standby NameNode
DataNodeDataNodeDataNode DataNode DataNode
Block
Map
Edits
File
Block
Map
Edits
File
ZooKeeper
Node
ZooKeeper
Node
ZooKeeper
Node
Send Heartbeats & Block Reports
Shared State
Monitors health
of NN, OS, HW
Heartbeat Heartbeat
Holds special
lock znode
Journal
Node
Journal
Node
Journal
Node
25.03.2014
2
HDFS 2: Write-Pipeline
• Earlier versions of HDFS
• Files were immutable
• Write-once-read-many model
• New features in HDFS 2
• Files can be reopened for append
• New primitives: hflush and hsync
• Replace data node on failure
• Read consistency
DataNode
1
DataNode
2
DataNode
3
DataNode
4
Writer
Add new node to
the pipeline
Reader
Data Data Data
Can read from any node and
then failover to any other node
25.03.2014
2
HDFS 2: Snapshots
• Admin can create point in time snapshots of HDFS
• Of the entire file system
• Of a specific data-set (sub-tree directory of file system)
• Restore state of entire file system or data-set to a
snapshot (like Apple Time Machine)
• Protect against user errors
• Snapshot diffs identify changes made to data set
• Keep track of how raw or derived/analytical data changes
over time
25.03.2014
2
HDFS 2: NFS Gateway
• Supports NFS v3 (NFS v4 is work in progress)
• Supports all HDFS commands
• List files
• Copy, move files
• Create and delete directories
• Ingest for large scale analytical workloads
• Load immutable files as source for analytical processing
• No random writes
• Stream files into HDFS
• Log ingest by applications writing directly to HDFS client
mount
25.03.2014
2
HDFS 2: Performance
• Many improvements
• New AppendableWrite-Pipeline
• Read path improvements for fewer memory copies
• Short-circuit local reads for 2-3x faster random
reads
• I/O improvements using posix_fadvise()
• libhdfs improvements for zero copy reads
• Significant improvements: I/O 2.5 - 5x faster
25.03.2014
2
Stop #4: YARN App Land
Welcome Office
YARN Land
HDFS 2 Land
YARN App Land
25.03.2014
2
YARN Apps: Overview
• MapReduce 2 Batch
• Tez DAG Processing
• Storm Stream Processing
• Samza Stream Processing
• Apache S4 Stream Processing
• Spark In-Memory
• HOYA HBase on YARN
• Apache Giraph Graph Processing
• Apache Hama Bulk Synchronous Parallel
• Elastic Search Scalable Search
25.03.2014
2
YARN Apps: Overview
• MapReduce 2 Batch
• Tez DAG Processing
• Storm Stream Processing
• Samza Stream Processing
• Apache S4 Stream Processing
• Spark In-Memory
• HOYA HBase on YARN
• Apache Giraph Graph Processing
• Apache Hama Bulk Synchronous Parallel
• Elastic Search Scalable Search
25.03.2014
2
MapReduce 2: In a nutshell
• MapReduce is now a YARN app
• No more map and reduce slots, it’s containers now
• No more JobTracker, it’s YarnAppmaster library now
• Multiple versions of MapReduce
• The older mapred APIs work without modification or recompilation
• The newer mapreduce APIs may need to be recompiled
• Still has one master server component: the Job History Server
• The Job History Server stores the execution of jobs
• Used to audit prior execution of jobs
• Will also be used by YARN framework to store charge backs at that level
• Better cluster utilization
• Increased scalability & availability
25.03.2014
2
MapReduce 2: Shuffle
• Faster Shuffle
• Better embedded server: Netty
• Encrypted Shuffle
• Secure the shuffle phase as data moves across the cluster
• Requires 2 way HTTPS, certificates on both sides
• Causes significant CPU overhead, reserve 1 core for this work
• Certificates stored on each node (provision with the cluster), refreshed every
10 secs
• Pluggable Shuffle Sort
• Shuffle is the first phase in MapReduce that is guaranteed to not be data-
local
• Pluggable Shuffle/Sort allows application developers or hardware
developers to intercept the network-heavy workload and optimize it
• Typical implementations have hardware components like fast networks and
software components like sorting algorithms
• API will change with future versions of Hadoop
25.03.2014
2
MapReduce 2: Performance
• Key Optimizations
• No hard segmentation of resource into map and reduce slots
• YARN scheduler is more efficient
• MR2 framework has become more efficient than MR1: shuffle
phase in MRv2 is more performant with the usage of Netty.
• 40.000+ nodes running YARN across over 365 PB of data.
• About 400.000 jobs per day for about 10 million hours of
compute time.
• Estimated 60% – 150% improvement on node usage per day
• Got rid of a whole 10,000 node datacenter because of their
increased utilization.
25.03.2014
2
YARN Apps: Overview
• MapReduce 2 Batch
• Tez DAG Processing
• Storm Stream Processing
• Samza Stream Processing
• Apache S4 Stream Processing
• Spark In-Memory
• HOYA HBase on YARN
• Apache Giraph Graph Processing
• Apache Hama Bulk Synchronous Parallel
• Elastic Search Scalable Search
25.03.2014
2
Apache Tez: In a nutshell
• Distributed execution framework that works on
computations represented as dataflow graphs
• Tez is Hindi for “speed”
• Naturally maps to execution plans
produced by query optimizers
• Highly customizable to meet a
broad spectrum of use cases and to
enable dynamic performance
optimizations at runtime
• Built on top of YARN
25.03.2014
2
Apache Tez: Architecture
• Task with pluggable Input, Processor & Output
Task
HDFS
Input
Map
Processor
Sorted
Output
„Classical“ Map
Task
Shuffle
Input
Reduce
Processor
HDFS
Output
„Classical“ Reduce YARN ApplicationMaster to run
DAG of Tez Tasks
25.03.2014
2
Apache Tez: Tez Service
• MapReduce Query Startup is expensive:
– Job launch & task-launch latencies are fatal for
short queries (in order of 5s to 30s)
• Solution:
– Tez Service (= Preallocated Application Master)
• Removes job-launch overhead (Application Master)
• Removes task-launch overhead (Pre-warmed Containers)
– Hive (or Pig)
• Submit query plan to Tez Service
– Native Hadoop service, not ad-hoc
25.03.2014
2
Hadoop 1
HDFS
Redundant, reliable storage
Apache Tez: The new primitive
MapReduce
Cluster resource mgmt. + data
processing
Hadoop 2
MapReduce as Base Apache Tez as Base
Pig Hive Other
HDFS
Redundant, reliable storage
YARN
Cluster resource management
Tez
Execution Engine
MR Pig Hive Real
time
Storm
O
t
h
e
r
25.03.2014
2
Apache Tez: Performance
SELECT a.state, COUNT(*),
AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
Existing Hive
Parse Query 0.5s
Create Plan 0.5s
Launch Map-
Reduce
20s
Process Map-
Reduce
10s
Total 31s
Hive/Tez
Parse Query 0.5s
Create Plan 0.5s
Launch Map-
Reduce
20s
Process Map-
Reduce
2s
Total 23s
Tez & Hive Service
Parse Query 0.5s
Create Plan 0.5s
Submit to Tez
Service
0.5s
Process Map-Reduce 2s
Total 3.5s
* No exact numbers, for illustration only
25.03.2014
2
Stinger Initiative: In a nutshell
25.03.2014
2
Stinger: Overall Performance
* Real numbers, but handle with care!
25.03.2014
2
YARN Apps: Overview
• MapReduce 2 Batch
• Tez DAG Processing
• Storm Stream Processing
• Samza Stream Processing
• Apache S4 Stream Processing
• Spark In-Memory
• HOYA HBase on YARN
• Apache Giraph Graph Processing
• Apache Hama Bulk Synchronous Parallel
• Elastic Search Scalable Search
25.03.2014
2
Storm: In a nutshell
• Stream-processing
• Real-time processing
• Developed as standalone application
• https://github.com/nathanmarz/storm
• Ported on YARN
• https://github.com/yahoo/storm-yarn
25.03.2014
2
Storm: Conceptual view
Spout
Spout
Spout
Source of streams
Bolt
Bolt
Bolt
Bolt
Bolt
Tuple
Bolt
• Consumer of streams
• Processing of tuples
• Possibly emits new
tuplesStream
Unbound sequence of
tuples
Tuple
Tuple
Tuple
List of name-value pairs
Topology
Network of spouts & bolts as the nodes and
streams as the edges
25.03.2014
2
YARN Apps: Overview
• MapReduce 2 Batch
• Tez DAG Processing
• Storm Stream Processing
• Samza Stream Processing
• Apache S4 Stream Processing
• Spark In-Memory
• HOYA HBase on YARN
• Apache Giraph Graph Processing
• Apache Hama Bulk Synchronous Parallel
• Elastic Search Scalable Search
25.03.2014
2
Spark: In a nutshell
• High-speed in-memory analytics over
Hadoop and Hive
• Separate MapReduce-like engine
– Speedup of up to 100x
– On-disk queries 5-10x faster
• Spark is now a top-level Apache project
– http://spark.apache.org
• Compatible with Hadoop‘s Storage API
• Spark can be run on top of YARN
– http://spark.apache.org/docs/0.9.0/running-on-yarn.html
25.03.2014
2
Spark: RDD
• Key idea: Resilient Distributed Datasets
(RDDs)
• Read-only partitioned collection of
records
• Optionally cached in memory across cluster
• Manipulated through parallel operators
• Support only coarse-grained operations
• Map
• Reduce
• Group-by transformations
• Automatically recomputed on failure
RDD
A11
A12
A13
25.03.2014
2
Spark: Data Sharing
25.03.2014
2
YARN Apps: Overview
• MapReduce 2 Batch
• Tez DAG Processing
• Storm Stream Processing
• Samza Stream Processing
• Apache S4 Stream Processing
• Spark In-Memory
• HOYA HBase on YARN
• Apache Giraph Graph Processing
• Apache Hama Bulk Synchronous Parallel
• Elastic Search Scalable Search
25.03.2014
2
HOYA: In a nutshell
• Create on-demand HBase clusters
• Small HBase cluster in large YARN cluster
• Dynamic HBase clusters
• Self-healing HBase Cluster
• Elastic HBase clusters
• Transient/intermittent clusters for workflows
• Configure custom configurations & versions
• Better isolation
• More efficient utilization/sharing of cluster
25.03.2014
2
HOYA: Creation of AppMaster
ResourceManager
NodeManager NodeManager
NodeManager
Scheduler
Container
Container
HOYA Client
YARNClient
HOYA
specific API
HOYA
Application
Master
Container
Container
Container
Container
25.03.2014
2
HOYA: Deployment of HBase
ResourceManager
NodeManager NodeManager
NodeManager
Scheduler
Container
Container
HOYA Client
YARNClient
HOYA
specific API
HOYA
Application
Master
Container
Container
Container
Container
HBase Master
Region Server
Region Server
25.03.2014
2
HOYA: Bind via ZooKeeper
ResourceManager
NodeManager NodeManager
NodeManager
Scheduler
Container
Container
HOYA Client
YARNClient
HOYA
specific API
HOYA
Application
Master
Container
Container
Container
Container
HBase Master
Region Server
Region Server
HBase
Client
ZooKeeper
25.03.2014
2
YARN Apps: Overview
• MapReduce 2 Batch
• Tez DAG Processing
• Storm Stream Processing
• Samza Stream Processing
• Apache S4 Stream Processing
• Spark In-Memory
• HOYA HBase on YARN
• Apache Giraph Graph Processing
• Apache Hama Bulk Synchronous Parallel
• Elastic Search Scalable Search
25.03.2014
2
Giraph: In a nutshell
• Giraph is a framework for processing semi-
structured graph data on a massive scale
• Giraph is loosely based upon Google's
Pregel
• Both systems are inspired by the Bulk
Synchronous Parallel model
• Giraph performs iterative calculations
on top of an existing Hadoop cluster
• Uses Single Map-only Job
• Apache top level project since 2012
– http://giraph.apache.org
25.03.2014
2
Stop #5: Enterprise Land
Welcome Office
YARN Land
HDFS 2 Land
YARN App Land Enterprise Land
25.03.2014
2
Falcon: In a nutshell
• A framework for managing data processing
in Hadoop Clusters
• Falcon runs as a standalone server as part of
the Hadoop cluster
• Key Features:
• Data Replication Handling
• Data Lifecycle Management
• Process Coordination & Scheduling
• Declarative Data Process Programming
• Apache Incubation Status
• http://falcon.incubator.apache.org
25.03.2014
2
Falcon: One-stop Shop
Data Management Needs Tool Orchestration
Data Processing
Replication
Retention
Scheduling
Reprocessing
Multi Cluster Mgmt.
Oozie
Sqoop
Distcp
Flume
MapReduce
Pig & Hive
25.03.2014
2
Falcon: Weblog Use Case
• Weblogs saved hourly to primary cluster
• HDFS location is /weblogs/{date}
• Desired Data Policy:
• Replicate weblogs to secondary cluster
• Evict weblogs from primary cluster after 2 days
• Evict weblogs from secondary cluster after 1
week
25.03.2014
2
Falcon: Weblog Use Case
<feed description=“" name=”feed-weblogs” xmlns="uri:falcon:feed:0.1">
<frequency>hours(1)</frequency>
<clusters>
<cluster name=”cluster-primary" type="source">
<validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
<retention limit="days(2)" action="delete"/>
</cluster>
<cluster name=”cluster-secondary" type="target">
<validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/>
<retention limit=”days(7)" action="delete"/>
</cluster>
</clusters>
<locations>
<location type="data” path="/weblogs/${YEAR}-${MONTH}-${DAY}-
${HOUR}"/>
</locations>
<ACL owner=”hdfs" group="users" permission="0755"/>
<schema location="/none" provider="none"/>
</feed>
25.03.2014
2
Knox: In a nutshell
• System that provides a single point of
authentication and access for Apache
Hadoop services in a cluster.
• The gateway runs as a server (or cluster of
servers) that provide centralized access to
one or more Hadoop clusters.
• The goal is to simplify Hadoop security for
both users and operators
• Apache Incubation Status
• http://knox.incubator.apache.org
25.03.2014
2
Knox: Architecture
ResourceManager
NodeManager NodeManager NodeManager NodeManager
NodeManager NodeManager NodeManager NodeManager
Scheduler
MapReduce 1
map 1.2
map 1.1
MapReduce
2
map 2.1
map 2.2
reduce 2.1
NodeManager NodeManager NodeManager NodeManager
reduce 1.1 Tez map 2.3
reduce 2.2
vertex 1
vertex 2
vertex 3
vertex 4
HOYA
HBase Master
Region server 1
Region server 2
Region server 3 Storm
nimbus 1
nimbus 2
ResourceManager
NodeManager NodeManager NodeManager NodeManager
NodeManager NodeManager NodeManager NodeManager
Scheduler
MapReduce 1
map 1.2
map 1.1
MapReduce
2
map 2.1
map 2.2
reduce 2.1
NodeManager NodeManager NodeManager NodeManager
reduce 1.1 Tez map 2.3
reduce 2.2
vertex 1
vertex 2
vertex 3
vertex 4
HOYA
HBase Master
Region server 1
Region server 2
Region server 3 Storm
nimbus 1
nimbus 2
Kerberos
SSO
Provider
Identity Providers
Secure Hadoop Cluster 1
Secure Hadoop Cluster 2
Gateway Cluster
#1 #3#2
Ambari / Hue
Server
Browser
Ambari
Client
REST
Client
JDBC
Client
DMZ
Firewall
Firewall
25.03.2014
2
Final Stop: Goodbye Photo
Welcome Office
YARN Land
HDFS2 Land
YARN App Land Enterprise Land
Goodbye Photo
25.03.2014
2
Hadoop 2: Summary
1. Scale
2. New programming models &
services
3. Beyond Java
4. Improved cluster utilization
5. Enterprise Readiness
25.03.2014
2
One more thing…
Let’s get started with
Hadoop 2!
25.03.2014
2
Hortonworks Sandbox 2.0
http://hortonworks.com/products/hortonworks-sandbox
25.03.2014
2
Trainings
Developer Training
• 19. - 22.05.2014, Düsseldorf
• 30.06 - 03.07.2014, München
• 04. - 07.08.2014, Frankfurt
Admin Training
• 26. - 28.05.2014, Düsseldorf
• 07. - 09.07.2014, München
• 11. - 13.08.2014, Frankfurt
Details:
https://www.codecentric.de/schulungen-und-workshops/
25.03.2014
2
Thanks for traveling with me

Weitere ähnliche Inhalte

Was ist angesagt?

Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuningVitthal Gogate
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopPrasanna Rajaperumal
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAmazon Web Services
 
From docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayFrom docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayDataWorks Summit
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Wei-Chiu Chuang
 
Apache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration storyApache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration storySunil Govindan
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopDataWorks Summit
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseDataWorks Summit/Hadoop Summit
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInDataWorks Summit
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDataWorks Summit
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersDataWorks Summit/Hadoop Summit
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Advanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterAdvanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterEdureka!
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit
 

Was ist angesagt? (20)

Apache Hadoop 3
Apache Hadoop 3Apache Hadoop 3
Apache Hadoop 3
 
Hadoop configuration & performance tuning
Hadoop configuration & performance tuningHadoop configuration & performance tuning
Hadoop configuration & performance tuning
 
Hoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoopHoodie: Incremental processing on hadoop
Hoodie: Incremental processing on hadoop
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
 
From docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native wayFrom docker to kubernetes: running Apache Hadoop in a cloud native way
From docker to kubernetes: running Apache Hadoop in a cloud native way
 
Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
 
Apache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration storyApache Hadoop 3 updates with migration story
Apache Hadoop 3 updates with migration story
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
 
A New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouseA New "Sparkitecture" for modernizing your data warehouse
A New "Sparkitecture" for modernizing your data warehouse
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
Scaling Hadoop at LinkedIn
Scaling Hadoop at LinkedInScaling Hadoop at LinkedIn
Scaling Hadoop at LinkedIn
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARNDeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise CustomersHadoop in the Cloud: Real World Lessons from Enterprise Customers
Hadoop in the Cloud: Real World Lessons from Enterprise Customers
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Advanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterAdvanced Security In Hadoop Cluster
Advanced Security In Hadoop Cluster
 
Spark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different RulesSpark SQL versus Apache Drill: Different Tools with Different Rules
Spark SQL versus Apache Drill: Different Tools with Different Rules
 

Andere mochten auch

MongoDB Evenings 2015: MongoDB Enterprise Data Hub
MongoDB Evenings 2015: MongoDB Enterprise Data HubMongoDB Evenings 2015: MongoDB Enterprise Data Hub
MongoDB Evenings 2015: MongoDB Enterprise Data HubDylan Tong
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureUwe Printz
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesTodd Palino
 
Webinar: When to Use MongoDB
Webinar: When to Use MongoDBWebinar: When to Use MongoDB
Webinar: When to Use MongoDBMongoDB
 
Deploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and KubernetesDeploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and KubernetesPetteriTeikariPhD
 
The Architect's Clue Bucket
The Architect's Clue BucketThe Architect's Clue Bucket
The Architect's Clue BucketRuth Malan
 

Andere mochten auch (6)

MongoDB Evenings 2015: MongoDB Enterprise Data Hub
MongoDB Evenings 2015: MongoDB Enterprise Data HubMongoDB Evenings 2015: MongoDB Enterprise Data Hub
MongoDB Evenings 2015: MongoDB Enterprise Data Hub
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Kafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier ArchitecturesKafka at Scale: Multi-Tier Architectures
Kafka at Scale: Multi-Tier Architectures
 
Webinar: When to Use MongoDB
Webinar: When to Use MongoDBWebinar: When to Use MongoDB
Webinar: When to Use MongoDB
 
Deploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and KubernetesDeploying deep learning models with Docker and Kubernetes
Deploying deep learning models with Docker and Kubernetes
 
The Architect's Clue Bucket
The Architect's Clue BucketThe Architect's Clue Bucket
The Architect's Clue Bucket
 

Ähnlich wie Welcome to Hadoop2Land!

Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceUwe Printz
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and FutureDataWorks Summit
 
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...DataWorks Summit
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Big Data Joe™ Rossi
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3tcloudcomputing-tw
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformBikas Saha
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopHortonworks
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingNandan Kumar
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfsshrey mehrotra
 
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)BigDataEverywhere
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with YarnDavid Kaiser
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN ApplicationsHortonworks
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecturesaipriyacoool
 

Ähnlich wie Welcome to Hadoop2Land! (20)

Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 
Huhadoop - v1.1
Huhadoop - v1.1Huhadoop - v1.1
Huhadoop - v1.1
 
Unit IV.pdf
Unit IV.pdfUnit IV.pdf
Unit IV.pdf
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...
Lessons Learned from Migration of a Large-analytics Platform from MPP Databas...
 
Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1Hadoop - Past, Present and Future - v1.1
Hadoop - Past, Present and Future - v1.1
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of HadoopApache Hadoop YARN: Understanding the Data Operating System of Hadoop
Apache Hadoop YARN: Understanding the Data Operating System of Hadoop
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
MHUG - YARN
MHUG - YARNMHUG - YARN
MHUG - YARN
 
Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0Hadoop 2.0 handout 5.0
Hadoop 2.0 handout 5.0
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch training
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
 
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with YarnScale 12 x   Efficient Multi-tenant Hadoop 2 Workloads with Yarn
Scale 12 x Efficient Multi-tenant Hadoop 2 Workloads with Yarn
 
Get Started Building YARN Applications
Get Started Building YARN ApplicationsGet Started Building YARN Applications
Get Started Building YARN Applications
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 

Mehr von Uwe Printz

Lightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesLightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesUwe Printz
 
MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)Uwe Printz
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)Uwe Printz
 
MongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererMongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererUwe Printz
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Uwe Printz
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter StormUwe Printz
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Map/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBMap/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBUwe Printz
 
First meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtFirst meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtUwe Printz
 

Mehr von Uwe Printz (13)

Apache Spark
Apache SparkApache Spark
Apache Spark
 
Lightning Talk: Agility & Databases
Lightning Talk: Agility & DatabasesLightning Talk: Agility & Databases
Lightning Talk: Agility & Databases
 
MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)MongoDB für Java Programmierer (JUGKA, 11.12.13)
MongoDB für Java Programmierer (JUGKA, 11.12.13)
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)MongoDB for Coder Training (Coding Serbia 2013)
MongoDB for Coder Training (Coding Serbia 2013)
 
MongoDB für Java-Programmierer
MongoDB für Java-ProgrammiererMongoDB für Java-Programmierer
MongoDB für Java-Programmierer
 
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
Introduction to the Hadoop Ecosystem with Hadoop 2.0 aka YARN (Java Serbia Ed...
 
Introduction to Twitter Storm
Introduction to Twitter StormIntroduction to Twitter Storm
Introduction to Twitter Storm
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Map/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDBMap/Confused? A practical approach to Map/Reduce with MongoDB
Map/Confused? A practical approach to Map/Reduce with MongoDB
 
First meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group FrankfurtFirst meetup of the MongoDB User Group Frankfurt
First meetup of the MongoDB User Group Frankfurt
 

Kürzlich hochgeladen

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Kürzlich hochgeladen (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Welcome to Hadoop2Land!

  • 2. 25.03.2014 2 Your Travel Guide Big Data Nerd TravelpiratePhotography Enthusiast Hadoop Trainer NoSQL Fan Boy
  • 3. 25.03.2014 2 Your Travel Agency specializes on... Big Data Nerds Agile Ninjas Continuous Delivery Gurus Enterprise Java Specialists Performance Geeks Join us!
  • 4. 25.03.2014 2 Your Travel Destination Welcome Office YARN Land HDFS 2 Land YARN App Land Enterprise Land Goodbye Photo
  • 5. 25.03.2014 2 Stop #1: Welcome Office Welcome Office
  • 6. 25.03.2014 2 …there was MapReduce In the beginning of Hadoop • It could handle data sizes way beyond those of its competitors • It was resilient in the face of failure • It made it easy for users to bring their code and algorithms to the data
  • 7. 25.03.2014 2 …but it was too low level Hadoop 1 (2007)
  • 8. 25.03.2014 2 …but it was too rigid Hadoop 1 (2007)
  • 9. 25.03.2014 2 HDFS …but it was Batch HDFS HDFS Single App Batch Single App Batch Single App Batch Single App Batch Single App Batch Hadoop 1 (2007)
  • 10. 25.03.2014 2 …but it had limitations Hadoop 1 (2007) • Scalability – Maximum cluster size ~ 4,500 nodes – Maximum concurrent tasks – 40,000 – Coarse synchronization in JobTracker • Availability – Failure kills all queued and running jobs • Hard partition of resources into map & reduce slots – Low resource utilization • Lacks support for alternate paradigms and services
  • 11. 25.03.2014 2 YARN to the rescue! Hadoop 2 (2013)
  • 12. 25.03.2014 2 Stop #2: YARN Land Welcome Office YARN Land
  • 13. 25.03.2014 2 A brief history of Hadoop 2 • Originally conceived & architected by the team at Yahoo! – Arun Murthy created the original JIRA in 2008 and now is the Hadoop 2 release manager • The community has been working on Hadoop 2 for over 4 years • Hadoop 2 based architecture running at scale at Yahoo! – Deployed on 35,000+ nodes for 6+ months
  • 14. 25.03.2014 2 Hadoop 1 HDFS Redundant, reliable storage Hadoop 2: Next-gen platform MapReduce Cluster resource mgmt. + data processing Hadoop 2 HDFS 2 Redundant, reliable storage MapReduce Data processing Single use system Batch Apps Multi-purpose platform Batch, Interactive, Streaming, … YARN Cluster resource management Others Data processing
  • 15. 25.03.2014 2 Taking Hadoop beyond batch Applications run natively in Hadoop HDFS 2 Redundant, reliable storage Batch MapReduce Store all data in one place Interact with data in multiple ways YARN Cluster resource management Interactive Tez Online HOYA Streaming Storm, … Graph Giraph In-Memory Spark Other Search, …
  • 16. 25.03.2014 2 YARN: Design Goals • Build a new abstraction layer by splitting up the two major functions of the JobTracker • Cluster resource management • Application life-cycle management • Allow other processing paradigms • Flexible API for implementing YARN apps • MapReduce becomes YARN app • Lots of different YARN apps
  • 17. 25.03.2014 2 Hadoop: BigDataOS™ Storage: File System Traditional Operating System Execution / Scheduling: Processes / Kernel Storage: HDFS Hadoop Execution / Scheduling: YARN
  • 18. 25.03.2014 2 YARN: Architectural Overview Split up the two major functions of the JobTracker Cluster resource management & Application life-cycle management ResourceManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager Scheduler AM 1 Container 1.2 Container 1.1 AM 2 Container 2.1 Container 2.2 Container 2.3
  • 19. 25.03.2014 2 YARN: Multi-tenancy I ResourceManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager Scheduler MapReduce 1 map 1.2 map 1.1 MapReduce 2 map 2.1 map 2.2 reduce 2.1 NodeManager NodeManager NodeManager NodeManager reduce 1.1 Tez map 2.3 reduce 2.2 vertex 1 vertex 2 vertex 3 vertex 4 HOYA HBase Master Region server 1 Region server 2 Region server 3 Storm nimbus 1 nimbus 2 Different types of applications on the same cluster
  • 20. 25.03.2014 2 YARN: Multi-tenancy II ResourceManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager Scheduler MapReduce 1 map 1.2 map 1.1 MapReduce 2 map 2.1 map 2.2 reduce 2.1 NodeManager NodeManager NodeManager NodeManager reduce 1.1 Tez map 2.3 reduce 2.2 vertex 1 vertex 2 vertex 3 vertex 4 HOYA HBase Master Region server 1 Region server 2 Region server 3 Storm nimbus 1 nimbus 2 DWHUser Ad- Hoc root 30% 60% 10% Dev Prod 20% 80% Dev Prod Dev1 Dev2 25% 75% 60% 40% Different users and organizations on the same cluster
  • 21. 25.03.2014 2 YARN: Your own app ResourceManager NodeManager NodeManager NodeManager Scheduler Container Container MyApp YARNClient App specific API myAM AMRMClient NMClient Application Client Protocol Application Master Protocol Container Management Protocol DistributedShell is the new WordCount!
  • 22. 25.03.2014 2 Stop #3: HDFS 2 Land Welcome Office YARN Land HDFS 2 Land
  • 23. 25.03.2014 2 HDFS 2: In a nutshell • Removes tight coupling of Block Storage and Namespace • Adds (built-in) High Availability • Better Scalability & Isolation • Increased performance Details: https://issues.apache.org/jira/browse/HDFS-1052
  • 24. 25.03.2014 2 HDFS 2: Federation NameNodes do not talk to each other NameNodes manages only slice of namespace DataNodes can store blocks managed by any NameNode NameNode 1 Namespace 1 Namespace State Block Map NameNode 2 Namespace 2 Block Pools Pool 1 Pool 2 Block Storage as generic storage service Data Nodes b3 b1 b2 b4 b2 b1 b5 b3 b2 b5 b4 JBOD JBOD JBOD Namespace State Block Map Horizontally scale IO and storage
  • 25. 25.03.2014 2 HDFS 2: Architecture Active NameNode Standby NameNode DataNodeDataNodeDataNode DataNode DataNode Maintains Block Map and Edits File Simultaneously reads and applies the edits Report to both NameNodes Block Map Edits File Block Map Edits File NFS Shared state on NFS OR Quorum based storage Journal Node Journal Node Journal Node Take orders only from the Active or
  • 26. 25.03.2014 2 ZKFailover Controller ZKFailover Controller HDFS 2: High Availability Active NameNode Standby NameNode DataNodeDataNodeDataNode DataNode DataNode Block Map Edits File Block Map Edits File ZooKeeper Node ZooKeeper Node ZooKeeper Node Send Heartbeats & Block Reports Shared State Monitors health of NN, OS, HW Heartbeat Heartbeat Holds special lock znode Journal Node Journal Node Journal Node
  • 27. 25.03.2014 2 HDFS 2: Write-Pipeline • Earlier versions of HDFS • Files were immutable • Write-once-read-many model • New features in HDFS 2 • Files can be reopened for append • New primitives: hflush and hsync • Replace data node on failure • Read consistency DataNode 1 DataNode 2 DataNode 3 DataNode 4 Writer Add new node to the pipeline Reader Data Data Data Can read from any node and then failover to any other node
  • 28. 25.03.2014 2 HDFS 2: Snapshots • Admin can create point in time snapshots of HDFS • Of the entire file system • Of a specific data-set (sub-tree directory of file system) • Restore state of entire file system or data-set to a snapshot (like Apple Time Machine) • Protect against user errors • Snapshot diffs identify changes made to data set • Keep track of how raw or derived/analytical data changes over time
  • 29. 25.03.2014 2 HDFS 2: NFS Gateway • Supports NFS v3 (NFS v4 is work in progress) • Supports all HDFS commands • List files • Copy, move files • Create and delete directories • Ingest for large scale analytical workloads • Load immutable files as source for analytical processing • No random writes • Stream files into HDFS • Log ingest by applications writing directly to HDFS client mount
  • 30. 25.03.2014 2 HDFS 2: Performance • Many improvements • New AppendableWrite-Pipeline • Read path improvements for fewer memory copies • Short-circuit local reads for 2-3x faster random reads • I/O improvements using posix_fadvise() • libhdfs improvements for zero copy reads • Significant improvements: I/O 2.5 - 5x faster
  • 31. 25.03.2014 2 Stop #4: YARN App Land Welcome Office YARN Land HDFS 2 Land YARN App Land
  • 32. 25.03.2014 2 YARN Apps: Overview • MapReduce 2 Batch • Tez DAG Processing • Storm Stream Processing • Samza Stream Processing • Apache S4 Stream Processing • Spark In-Memory • HOYA HBase on YARN • Apache Giraph Graph Processing • Apache Hama Bulk Synchronous Parallel • Elastic Search Scalable Search
  • 33. 25.03.2014 2 YARN Apps: Overview • MapReduce 2 Batch • Tez DAG Processing • Storm Stream Processing • Samza Stream Processing • Apache S4 Stream Processing • Spark In-Memory • HOYA HBase on YARN • Apache Giraph Graph Processing • Apache Hama Bulk Synchronous Parallel • Elastic Search Scalable Search
  • 34. 25.03.2014 2 MapReduce 2: In a nutshell • MapReduce is now a YARN app • No more map and reduce slots, it’s containers now • No more JobTracker, it’s YarnAppmaster library now • Multiple versions of MapReduce • The older mapred APIs work without modification or recompilation • The newer mapreduce APIs may need to be recompiled • Still has one master server component: the Job History Server • The Job History Server stores the execution of jobs • Used to audit prior execution of jobs • Will also be used by YARN framework to store charge backs at that level • Better cluster utilization • Increased scalability & availability
  • 35. 25.03.2014 2 MapReduce 2: Shuffle • Faster Shuffle • Better embedded server: Netty • Encrypted Shuffle • Secure the shuffle phase as data moves across the cluster • Requires 2 way HTTPS, certificates on both sides • Causes significant CPU overhead, reserve 1 core for this work • Certificates stored on each node (provision with the cluster), refreshed every 10 secs • Pluggable Shuffle Sort • Shuffle is the first phase in MapReduce that is guaranteed to not be data- local • Pluggable Shuffle/Sort allows application developers or hardware developers to intercept the network-heavy workload and optimize it • Typical implementations have hardware components like fast networks and software components like sorting algorithms • API will change with future versions of Hadoop
  • 36. 25.03.2014 2 MapReduce 2: Performance • Key Optimizations • No hard segmentation of resource into map and reduce slots • YARN scheduler is more efficient • MR2 framework has become more efficient than MR1: shuffle phase in MRv2 is more performant with the usage of Netty. • 40.000+ nodes running YARN across over 365 PB of data. • About 400.000 jobs per day for about 10 million hours of compute time. • Estimated 60% – 150% improvement on node usage per day • Got rid of a whole 10,000 node datacenter because of their increased utilization.
  • 37. 25.03.2014 2 YARN Apps: Overview • MapReduce 2 Batch • Tez DAG Processing • Storm Stream Processing • Samza Stream Processing • Apache S4 Stream Processing • Spark In-Memory • HOYA HBase on YARN • Apache Giraph Graph Processing • Apache Hama Bulk Synchronous Parallel • Elastic Search Scalable Search
  • 38. 25.03.2014 2 Apache Tez: In a nutshell • Distributed execution framework that works on computations represented as dataflow graphs • Tez is Hindi for “speed” • Naturally maps to execution plans produced by query optimizers • Highly customizable to meet a broad spectrum of use cases and to enable dynamic performance optimizations at runtime • Built on top of YARN
  • 39. 25.03.2014 2 Apache Tez: Architecture • Task with pluggable Input, Processor & Output Task HDFS Input Map Processor Sorted Output „Classical“ Map Task Shuffle Input Reduce Processor HDFS Output „Classical“ Reduce YARN ApplicationMaster to run DAG of Tez Tasks
  • 40. 25.03.2014 2 Apache Tez: Tez Service • MapReduce Query Startup is expensive: – Job launch & task-launch latencies are fatal for short queries (in order of 5s to 30s) • Solution: – Tez Service (= Preallocated Application Master) • Removes job-launch overhead (Application Master) • Removes task-launch overhead (Pre-warmed Containers) – Hive (or Pig) • Submit query plan to Tez Service – Native Hadoop service, not ad-hoc
  • 41. 25.03.2014 2 Hadoop 1 HDFS Redundant, reliable storage Apache Tez: The new primitive MapReduce Cluster resource mgmt. + data processing Hadoop 2 MapReduce as Base Apache Tez as Base Pig Hive Other HDFS Redundant, reliable storage YARN Cluster resource management Tez Execution Engine MR Pig Hive Real time Storm O t h e r
  • 42. 25.03.2014 2 Apache Tez: Performance SELECT a.state, COUNT(*), AVERAGE(c.price) FROM a JOIN b ON (a.id = b.id) JOIN c ON (a.itemId = c.itemId) GROUP BY a.state Existing Hive Parse Query 0.5s Create Plan 0.5s Launch Map- Reduce 20s Process Map- Reduce 10s Total 31s Hive/Tez Parse Query 0.5s Create Plan 0.5s Launch Map- Reduce 20s Process Map- Reduce 2s Total 23s Tez & Hive Service Parse Query 0.5s Create Plan 0.5s Submit to Tez Service 0.5s Process Map-Reduce 2s Total 3.5s * No exact numbers, for illustration only
  • 44. 25.03.2014 2 Stinger: Overall Performance * Real numbers, but handle with care!
  • 45. 25.03.2014 2 YARN Apps: Overview • MapReduce 2 Batch • Tez DAG Processing • Storm Stream Processing • Samza Stream Processing • Apache S4 Stream Processing • Spark In-Memory • HOYA HBase on YARN • Apache Giraph Graph Processing • Apache Hama Bulk Synchronous Parallel • Elastic Search Scalable Search
  • 46. 25.03.2014 2 Storm: In a nutshell • Stream-processing • Real-time processing • Developed as standalone application • https://github.com/nathanmarz/storm • Ported on YARN • https://github.com/yahoo/storm-yarn
  • 47. 25.03.2014 2 Storm: Conceptual view Spout Spout Spout Source of streams Bolt Bolt Bolt Bolt Bolt Tuple Bolt • Consumer of streams • Processing of tuples • Possibly emits new tuplesStream Unbound sequence of tuples Tuple Tuple Tuple List of name-value pairs Topology Network of spouts & bolts as the nodes and streams as the edges
  • 48. 25.03.2014 2 YARN Apps: Overview • MapReduce 2 Batch • Tez DAG Processing • Storm Stream Processing • Samza Stream Processing • Apache S4 Stream Processing • Spark In-Memory • HOYA HBase on YARN • Apache Giraph Graph Processing • Apache Hama Bulk Synchronous Parallel • Elastic Search Scalable Search
  • 49. 25.03.2014 2 Spark: In a nutshell • High-speed in-memory analytics over Hadoop and Hive • Separate MapReduce-like engine – Speedup of up to 100x – On-disk queries 5-10x faster • Spark is now a top-level Apache project – http://spark.apache.org • Compatible with Hadoop‘s Storage API • Spark can be run on top of YARN – http://spark.apache.org/docs/0.9.0/running-on-yarn.html
  • 50. 25.03.2014 2 Spark: RDD • Key idea: Resilient Distributed Datasets (RDDs) • Read-only partitioned collection of records • Optionally cached in memory across cluster • Manipulated through parallel operators • Support only coarse-grained operations • Map • Reduce • Group-by transformations • Automatically recomputed on failure RDD A11 A12 A13
  • 52. 25.03.2014 2 YARN Apps: Overview • MapReduce 2 Batch • Tez DAG Processing • Storm Stream Processing • Samza Stream Processing • Apache S4 Stream Processing • Spark In-Memory • HOYA HBase on YARN • Apache Giraph Graph Processing • Apache Hama Bulk Synchronous Parallel • Elastic Search Scalable Search
  • 53. 25.03.2014 2 HOYA: In a nutshell • Create on-demand HBase clusters • Small HBase cluster in large YARN cluster • Dynamic HBase clusters • Self-healing HBase Cluster • Elastic HBase clusters • Transient/intermittent clusters for workflows • Configure custom configurations & versions • Better isolation • More efficient utilization/sharing of cluster
  • 54. 25.03.2014 2 HOYA: Creation of AppMaster ResourceManager NodeManager NodeManager NodeManager Scheduler Container Container HOYA Client YARNClient HOYA specific API HOYA Application Master Container Container Container Container
  • 55. 25.03.2014 2 HOYA: Deployment of HBase ResourceManager NodeManager NodeManager NodeManager Scheduler Container Container HOYA Client YARNClient HOYA specific API HOYA Application Master Container Container Container Container HBase Master Region Server Region Server
  • 56. 25.03.2014 2 HOYA: Bind via ZooKeeper ResourceManager NodeManager NodeManager NodeManager Scheduler Container Container HOYA Client YARNClient HOYA specific API HOYA Application Master Container Container Container Container HBase Master Region Server Region Server HBase Client ZooKeeper
  • 57. 25.03.2014 2 YARN Apps: Overview • MapReduce 2 Batch • Tez DAG Processing • Storm Stream Processing • Samza Stream Processing • Apache S4 Stream Processing • Spark In-Memory • HOYA HBase on YARN • Apache Giraph Graph Processing • Apache Hama Bulk Synchronous Parallel • Elastic Search Scalable Search
  • 58. 25.03.2014 2 Giraph: In a nutshell • Giraph is a framework for processing semi- structured graph data on a massive scale • Giraph is loosely based upon Google's Pregel • Both systems are inspired by the Bulk Synchronous Parallel model • Giraph performs iterative calculations on top of an existing Hadoop cluster • Uses Single Map-only Job • Apache top level project since 2012 – http://giraph.apache.org
  • 59. 25.03.2014 2 Stop #5: Enterprise Land Welcome Office YARN Land HDFS 2 Land YARN App Land Enterprise Land
  • 60. 25.03.2014 2 Falcon: In a nutshell • A framework for managing data processing in Hadoop Clusters • Falcon runs as a standalone server as part of the Hadoop cluster • Key Features: • Data Replication Handling • Data Lifecycle Management • Process Coordination & Scheduling • Declarative Data Process Programming • Apache Incubation Status • http://falcon.incubator.apache.org
  • 61. 25.03.2014 2 Falcon: One-stop Shop Data Management Needs Tool Orchestration Data Processing Replication Retention Scheduling Reprocessing Multi Cluster Mgmt. Oozie Sqoop Distcp Flume MapReduce Pig & Hive
  • 62. 25.03.2014 2 Falcon: Weblog Use Case • Weblogs saved hourly to primary cluster • HDFS location is /weblogs/{date} • Desired Data Policy: • Replicate weblogs to secondary cluster • Evict weblogs from primary cluster after 2 days • Evict weblogs from secondary cluster after 1 week
  • 63. 25.03.2014 2 Falcon: Weblog Use Case <feed description=“" name=”feed-weblogs” xmlns="uri:falcon:feed:0.1"> <frequency>hours(1)</frequency> <clusters> <cluster name=”cluster-primary" type="source"> <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/> <retention limit="days(2)" action="delete"/> </cluster> <cluster name=”cluster-secondary" type="target"> <validity start="2012-01-01T00:00Z" end="2099-12-31T00:00Z"/> <retention limit=”days(7)" action="delete"/> </cluster> </clusters> <locations> <location type="data” path="/weblogs/${YEAR}-${MONTH}-${DAY}- ${HOUR}"/> </locations> <ACL owner=”hdfs" group="users" permission="0755"/> <schema location="/none" provider="none"/> </feed>
  • 64. 25.03.2014 2 Knox: In a nutshell • System that provides a single point of authentication and access for Apache Hadoop services in a cluster. • The gateway runs as a server (or cluster of servers) that provide centralized access to one or more Hadoop clusters. • The goal is to simplify Hadoop security for both users and operators • Apache Incubation Status • http://knox.incubator.apache.org
  • 65. 25.03.2014 2 Knox: Architecture ResourceManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager Scheduler MapReduce 1 map 1.2 map 1.1 MapReduce 2 map 2.1 map 2.2 reduce 2.1 NodeManager NodeManager NodeManager NodeManager reduce 1.1 Tez map 2.3 reduce 2.2 vertex 1 vertex 2 vertex 3 vertex 4 HOYA HBase Master Region server 1 Region server 2 Region server 3 Storm nimbus 1 nimbus 2 ResourceManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager NodeManager Scheduler MapReduce 1 map 1.2 map 1.1 MapReduce 2 map 2.1 map 2.2 reduce 2.1 NodeManager NodeManager NodeManager NodeManager reduce 1.1 Tez map 2.3 reduce 2.2 vertex 1 vertex 2 vertex 3 vertex 4 HOYA HBase Master Region server 1 Region server 2 Region server 3 Storm nimbus 1 nimbus 2 Kerberos SSO Provider Identity Providers Secure Hadoop Cluster 1 Secure Hadoop Cluster 2 Gateway Cluster #1 #3#2 Ambari / Hue Server Browser Ambari Client REST Client JDBC Client DMZ Firewall Firewall
  • 66. 25.03.2014 2 Final Stop: Goodbye Photo Welcome Office YARN Land HDFS2 Land YARN App Land Enterprise Land Goodbye Photo
  • 67. 25.03.2014 2 Hadoop 2: Summary 1. Scale 2. New programming models & services 3. Beyond Java 4. Improved cluster utilization 5. Enterprise Readiness
  • 68. 25.03.2014 2 One more thing… Let’s get started with Hadoop 2!
  • 70. 25.03.2014 2 Trainings Developer Training • 19. - 22.05.2014, Düsseldorf • 30.06 - 03.07.2014, München • 04. - 07.08.2014, Frankfurt Admin Training • 26. - 28.05.2014, Düsseldorf • 07. - 09.07.2014, München • 11. - 13.08.2014, Frankfurt Details: https://www.codecentric.de/schulungen-und-workshops/