Junping Du (堵俊平): Hadoop Virtualization Extensions
1. Big Data in Cloud
Junping Du (堵俊平)
Apache Hadoop Committer
Staff Engineer, VMware
2. Bio: 堵俊平 (Junping Du)
- Joined VMware in 2008, working first on cloud products
- Initiated VMware's earliest big data effort in 2010
- Automated Hadoop deployment on vSphere, which later became the open source project Serengeti
- Has contributed to the Apache Hadoop community since 2012
- Recently became an Apache Hadoop committer, currently the only one in the UTC+8 time zone
3. Agenda
- Virtualization, SDDC and Cloud
- Trends from my observation in Big Data
- YARN: resource hub for Big Data applications
- YARN in the Cloud
4. What is Virtualization?
- See VMware's vSphere
[Diagram: vSphere architecture, with labels Guest (TCP/IP, File System), Monitor, Virtual NIC, Virtual SCSI, VMkernel (Scheduler, Memory Manager, Virtual Switch, File System, NIC Drivers, I/O Drivers), Physical Hardware]
The monitor emulates physical devices: CPU, memory, I/O
CPU is controlled by the scheduler and virtualized by the monitor
Memory is allocated by the VMkernel and virtualized by the monitor
Network and I/O devices are emulated and proxied through native device drivers
5. Server Virtualization Adoption on Path to 80% Over Next 5 Years
[Charts: "% Virtualized of x86 Workloads" and "Total x86 Workloads" (in millions, with x86 % physical servers unvirtualized). Annotations: IDC, 2012 to 2016 change = +12 pts; Gartner, 2012 to 2016 change = +22 pts; IDC + VMware estimate, workloads (note 1), 2012 to 2016 CAGR = 21%]
Source(s): IDC: Annual Virtualization Forecast, Feb-13; Gartner: x86 Server Virtualization, Worldwide, 3Q12 Update; Gartner: Forecast x86 Server Virtualization, Worldwide, 2008-2018, Jul-11; VMware estimates
Note: Server workloads only. (1) Installed base totals assume 5-year refresh.
6. Apps on Traditional Infrastructure
[Diagram: application types on traditional infrastructure: Windows, Linux, Databases, Mission Critical, HPC, Big Data]
7. Apps on Software-Defined Data Center
[Diagram: Windows, Linux, Mission Critical, Databases, HPC and Big Data apps run in VDCs on the Software-Defined Data Center]
Software-Defined Data Center Services: Abstract, Pool, Automate
8. Infrastructure for Traditional Apps
Traditional Applications: 83M (2012) to 141M (2016), 70% growth
Infrastructure for Traditional Enterprise Apps
Existing applications bound to vendor-specific hardware
Hardware-based resiliency
Hardware-based QoS
Hard to automate
Complex to scale
9. Infrastructure for New Apps
Infrastructure for New/Cloud/Data Apps
Application-specific network and storage
Next Gen Cloud Applications: 6M (2012) to 48M (2016), 700% growth
Software-based infrastructure
Transformational economics
Automation and agility
Designed for scale
10. SDDC Delivers Single Architecture for New and Existing Apps
Infrastructure for New/Cloud/Data Apps
Application Specific Network and Storage
Any Application
Infrastructure for Existing Enterprise Apps
Existing Application bound to vendor specific HW
Any Hardware
11. Let's get back to Big Data …
New trends in Big Data from my observation
- With Hadoop 2.0, YARN acts as the key resource hub of the big data ecosystem
- MapReduce is not good enough; we need faster engines such as Tez and Spark
- HDFS is adding support for more scenarios, e.g. caching for low-latency apps, snapshots for disaster recovery, storage-tier awareness, etc.
- More Hadoop-based SQL engines: Apache Drill, Impala, Stinger, HAWQ, etc.
- To be enterprise-ready, more effort is going into security, HA, QoS, and monitoring & management
13. MapReduce v1 Limitations
• Scalability
– The JobTracker both manages cluster resources and schedules jobs
• SPOF (Single Point Of Failure)
– A JobTracker failure causes all queued and running jobs to fail
– Restart is very tricky due to complex state
• Hard partition of resources into map and reduce slots
– Low resource utilization
• Lacks support for alternate paradigms
• Lacks wire-compatible protocols
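The "hard partition" point can be illustrated with a small sketch. It is not Hadoop code: the class names, and the assumption that one task fits exactly one slot or one equally sized container, are made up for illustration. It only shows why typed slots can sit idle while a single pool of containers does not.

```python
from dataclasses import dataclass


@dataclass
class SlotNode:
    """MRv1-style node (hypothetical model): capacity is split into typed slots."""
    map_slots: int
    reduce_slots: int

    def running(self, map_tasks: int, reduce_tasks: int) -> int:
        # Map tasks may only use map slots, reduce tasks only reduce slots.
        return min(map_tasks, self.map_slots) + min(reduce_tasks, self.reduce_slots)


@dataclass
class ContainerNode:
    """YARN-style node (hypothetical model): one pool of generic containers."""
    containers: int

    def running(self, map_tasks: int, reduce_tasks: int) -> int:
        # Any task may use any free container.
        return min(map_tasks + reduce_tasks, self.containers)


# A map-heavy phase: 12 map tasks waiting, no reduce tasks yet.
slot_node = SlotNode(map_slots=8, reduce_slots=4)
container_node = ContainerNode(containers=12)

slot_running = slot_node.running(map_tasks=12, reduce_tasks=0)
container_running = container_node.running(map_tasks=12, reduce_tasks=0)

print(f"slots:      {slot_running}/12 busy")       # 4 reduce slots stay idle
print(f"containers: {container_running}/12 busy")
```

With the same total capacity, the slot-based node leaves its four reduce slots unused during a map-only phase, which is the low utilization the slide refers to.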
14. YARN Architecture
• Splits up the two major functions of the JobTracker
– ResourceManager (RM): cluster resource management
– ApplicationMaster (AM): task scheduling and monitoring
• NodeManager (NM): a new per-node slave, responsible for
– launching the applications' containers
– monitoring their resource usage (CPU, memory) and reporting it to the ResourceManager
• YARN maintains compatibility with existing MapReduce applications and supports other kinds of applications
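The division of labour above can be sketched as a toy model. This is an illustration only, not the YARN API: the class and method names are hypothetical, allocation here is a synchronous first-fit loop over memory, and real YARN allocates asynchronously through pluggable schedulers driven by NodeManager heartbeats.

```python
class NodeManager:
    """Per-node agent: launches containers and tracks resource usage."""

    def __init__(self, name: str, memory_mb: int):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []  # (app_id, memory_mb) pairs running on this node

    def launch(self, app_id: str, memory_mb: int) -> None:
        self.free_mb -= memory_mb
        self.containers.append((app_id, memory_mb))


class ResourceManager:
    """Cluster-wide resource management only; knows nothing about tasks."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id: str, memory_mb: int):
        # First-fit placement: a simplification chosen for this sketch.
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.launch(app_id, memory_mb)
                return node.name
        return None  # no capacity; a real AM would retry later


class ApplicationMaster:
    """Per-application: decides which tasks to run and asks the RM for containers."""

    def __init__(self, app_id: str, rm: ResourceManager):
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, task_sizes_mb):
        return [self.rm.allocate(self.app_id, size) for size in task_sizes_mb]


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
am = ApplicationMaster("app-1", rm)
placements = am.run_tasks([2048, 2048, 2048, 2048])
print(placements)  # the fourth task finds no free capacity
```

The point of the split is that the RM only hands out resources, so an AM for a non-MapReduce framework can reuse the same cluster without the RM changing.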
15. YARN – Hub for Big Data Applications
OpenMPI
Impala
HBase
Distributed Shell
Spark
MapReduce
Tez
Storm
YARN
HDFS
• App-specific AM
• HOYA (HBase On YArn)
– Long running services (YARN-896)
• LLAMA (Low Latency Application MAster)
– Gang Scheduler (YARN-624)
16. YARN and Cloud
• Two different perspectives:
– YARN-centric perspective
• YARN is the key platform for apps
• YARN is independent of infrastructure; running on top of Cloud shows YARN's generality
– Cloud-centric perspective
• YARN is an umbrella for one class of applications
• Supporting YARN shows Cloud's generality
17. YARN and Cloud: YARN-centric Perspective
• YARN is the “OS”
• Infrastructure (whether physical or cloud) is the “hardware”
[Diagram: Big Data Apps (HBase, Open MPI, Distributed Shell, Spark, Impala, MapReduce, Tez, Storm, …) run on YARN; YARN runs on Infrastructure: bare-metal machines, Cloud Infrastructure (VMware, OpenStack, …), …]
18. YARN and Cloud: Cloud-centric Perspective
• Cloud Infrastructure is the “OS”
• YARN is a group of “processes”
[Diagram: Legacy Apps, other Big Data Apps, and YARN Apps (Open MPI, Distributed Shell, Spark, Impala, HBase, MapReduce, Tez, Storm, …, on YARN) all run on Cloud Infrastructure (VMware, OpenStack, etc.)]
19. YARN vs. Cloud
• Similarity
– Both aim to share resources across applications
– Both provide global resource management
• YARN vs. Cloud
– YARN manages resources at the OS layer, while Cloud manages resources at the hypervisor layer (not directly comparable, but a hypervisor provides stronger isolation than an OS)
– Apps managed by YARN need a specific AppMaster; apps managed by Cloud run exactly as they do on physical machines (Cloud +1)
– The YARN layer is close to the big data app, so it can better understand and estimate the app's requirements (YARN +1)
– The Cloud layer is close to the hardware resources, so it is easier to track real-time, global resource utilization (Cloud +1)
20. YARN + Cloud
• Why YARN + Cloud?
– Leverage virtualization's strong isolation, fine-grained resource sharing and other benefits
– Uniform infrastructure to simplify enterprise IT
• What does it look like?
– Run YARN NMs inside VMs managed by the Cloud Infrastructure
– Build a communication channel between the YARN RM and the Cloud Resource Manager for coordination
• How do we do it?
– The first item above is easy and straightforward
– The second can be achieved in two ways
• YARN becomes aware of, and can manipulate, Cloud resource changes
• YARN provides a generic resource notification mechanism that the Cloud Manager can use when resources change
21. Elastic YARN Node in the Cloud
[Diagram: a virtualization host runs a virtual YARN node (NodeManager with containers, Datanode, VMDK) alongside other workloads. The resources of a VM can grow or shrink. Open questions: add/remove resources for containers? Grow/shrink by tens of GB of memory?]
22. Elastic YARN Node in the Cloud
• A VM's resource boundary can be elastic
– CPU is easy: time slicing (with constraints)
– Memory is harder: page sharing and memory ballooning
– In case of contention, enforce limits and proportional sharing
– “Stealing” resources behind the apps' backs can cause bad performance (paging)
– App-aware resource management could address these issues
• Hadoop YARN resource model
– Dynamic in that nodes can be added and removed
– But static per node
• In this case, shall we enable resource elasticity on the VM?
– If yes, performance is low when resource contention happens
– If no, utilization is as low as on physical boxes, because free resources cannot be used by other busy VMs
• We need a better answer.
23. HVE provides the answer!
• Hadoop Virtualization Extensions
– A project initiated by VMware to enhance Hadoop running on virtualized infrastructure
– A “driver” for the Hadoop “OS” running on cloud “hardware”
• Goal: make Hadoop cloud-ready
– Provide virtualization awareness to Hadoop, e.g. virtual topology, virtual resources, etc.
– Deliver generic utilities that can be leveraged by any virtualized platform
• Independent of virtualization platform and cloud infrastructure
• 100% contributed to the Apache Hadoop community
24. HVE
• Philosophy
– Make infrastructure-related components abstract
– Deliver different implementations that can be selected through configuration
• E.g.
[Diagram: an abstract BlockPlacementPolicy with two implementations: BlockPlacementPolicy Default and BlockPlacementPolicy For Virtualization]
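The "abstract component plus configurable implementations" idea can be sketched as follows. This is a Python analogy, not the Hadoop classes: every name here (`PlacementPolicy`, the `host` field, the `placement.policy` key) is hypothetical, and the "default" policy is a deliberately naive stand-in that just takes the first nodes offered.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str  # datanode running inside a VM
    host: str  # physical host the VM sits on (the extra, virtual topology layer)


class PlacementPolicy(ABC):
    """Abstract, infrastructure-related component."""

    @abstractmethod
    def choose_targets(self, nodes, replicas):
        ...


class DefaultPolicy(PlacementPolicy):
    def choose_targets(self, nodes, replicas):
        # Naive stand-in: unaware that several nodes may share one physical host.
        return list(nodes[:replicas])


class VirtualizationAwarePolicy(PlacementPolicy):
    def choose_targets(self, nodes, replicas):
        # Never put two replicas on VMs that share a physical host.
        chosen, used_hosts = [], set()
        for node in nodes:
            if len(chosen) == replicas:
                break
            if node.host not in used_hosts:
                chosen.append(node)
                used_hosts.add(node.host)
        return chosen


POLICIES = {"default": DefaultPolicy, "virtualization": VirtualizationAwarePolicy}


def load_policy(config):
    """Pick the implementation by configuration (key name is an assumption)."""
    return POLICIES[config.get("placement.policy", "default")]()


nodes = [Node("vm1", "hostA"), Node("vm2", "hostA"), Node("vm3", "hostB")]
default_targets = load_policy({}).choose_targets(nodes, 2)
virtual_targets = load_policy({"placement.policy": "virtualization"}).choose_targets(nodes, 2)
print([n.name for n in default_targets])
print([n.name for n in virtual_targets])
```

In this sketch the default policy puts both replicas on VMs of the same physical host, so one host failure loses both copies; the virtualization-aware policy spreads them across hosts without the caller changing anything except configuration.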
25. Elastic YARN Node in the Cloud
• In this case, shall we enable resource elasticity on the VM?
• Yes, and we try to get rid of resource contention
– Notify YARN that the node's resources have changed
– The YARN RM scheduler won't schedule new tasks on congested nodes
– The YARN scheduler preempts low-priority tasks if necessary
– This work is tracked in YARN-291
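The three steps above (notify, stop scheduling on congested nodes, preempt if necessary) can be sketched as a toy model. It is not the YARN-291 implementation: the names are hypothetical, only memory is tracked, and "larger number means higher priority" is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    task: str
    memory_mb: int
    priority: int  # larger = more important (assumption of this sketch)


@dataclass
class Node:
    total_mb: int
    containers: list = field(default_factory=list)

    @property
    def used_mb(self) -> int:
        return sum(c.memory_mb for c in self.containers)

    def can_schedule(self, memory_mb: int) -> bool:
        # The scheduler skips nodes whose current capacity is exhausted.
        return self.used_mb + memory_mb <= self.total_mb

    def update_resource(self, new_total_mb: int) -> list:
        """Notification from the cloud layer: the VM grew or shrank.

        Returns the containers preempted to fit the new capacity.
        """
        self.total_mb = new_total_mb
        preempted = []
        # Preempt lowest-priority containers first, only while over capacity.
        for c in sorted(self.containers, key=lambda c: c.priority):
            if self.used_mb <= self.total_mb:
                break
            self.containers.remove(c)
            preempted.append(c)
        return preempted


node = Node(total_mb=8192)
node.containers += [Container("etl", 4096, priority=1), Container("query", 2048, priority=5)]

preempted = node.update_resource(4096)   # the VM shrinks under host contention
print([c.task for c in preempted])       # the low-priority task is preempted
print(node.can_schedule(4096))           # no room for a large new task

node.update_resource(8192)               # the VM grows back
print(node.can_schedule(4096))           # freed capacity is usable again
```

Because the scheduler sees the updated per-node capacity, the VM can stay elastic without silently taking memory away from running tasks.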
27. Contributions to Apache Hadoop are welcome!
• Hadoop is the key platform
– For architecting Big Data
– Even a small contribution can change the world!
• An open source project is a great platform
– For people from different organizations to share great ideas and work
– The community is a great workplace
• Companies and individuals get credit
– For the work and resources they put in
– It is also an easy way to build an ecosystem and show expertise
• Big Data has many challenges, like building the Tower of Babel
– Open source is the common language that ensures we can work together
28. Key messages in today's talk
• SDDC and Cloud are the future of enterprise IT architecture
• New trend in big data: YARN acts as an “OS” for big data apps
• At VMware, we try to support any “OS”, including YARN
• HVE acts as a “driver” to enable Hadoop on virtualization/cloud
• Contribute to Apache Hadoop