1. SAP Virtualization Week 2012: TRND04
SAP DKOM 2012: NA 6747
The Lego Cloud
Benoit Hudzia Sr. Researcher; SAP Research CEC Belfast (UK)
Aidan Shribman Sr. Researcher; SAP Research Israel
4. Evolution of Virtualization
Resources Disaggregation (True Utility Cloud)
Flexible Resources Management (Cloud)
Basic Consolidation
No virtualization
© 2012 SAP AG. All rights reserved. 4
5. Why Disaggregate Resources?
Better Performance
Replacing slow local devices (e.g. disk) with fast remote devices (e.g. DRAM).
Many remote devices working in parallel (e.g. DRAM, disk, compute)
Superior Scalability
Going beyond boundaries of the single node
Improved Economics
Do more with existing hardware
Reach better hardware utilization levels
6. The Hecatonchire Project
Hecatonchires in Greek mythology means “Hundred Handed
Ones” – the original idea: provide Distributed Shared Memory
(DSM) capabilities to the cloud
Strategic goal: full resource liberation brought to the cloud by:
Breaking down physical nodes to their core elements (CPU, Memory, I/O)
Extending the existing cloud software stack (KVM, QEMU, libvirt, OpenStack)
without degrading any existing capabilities
Using commodity cloud hardware and standard interconnects
Initiated by Benoit Hudzia in 2011. Currently developed by two
SAP Research TI Practice teams located in Belfast and Ra’anana
Hecatonchire is not a monolithic project but a set of separate
capabilities. We are currently identifying stakeholders and
defining use cases for each capability.
7. Hecatonchire Architecture
Cluster Servers
Commodity hosts (e.g. 64 GB 16 core)
Commodity network adapters:
– Standard: softiwarp over 1 GbE
– Enterprise: RoCE/iWARP over 10 GbE or native IB
A modified version of QEMU/KVM hypervisor
An RDMA remote memory kernel module
Guests / VMs
Use resources from one or several underlying hosts
Existing OS/application can run transparently
– Not exactly … but we will get to this later
[Diagram: guest VMs (App, OS, H/W) spanning Server #1, Server #2 … Server #n; each server contributes CPUs, Memory and I/O, linked by Fast RDMA Communication]
8. The Team - Panoramic View
10. DRAM Latency Has Remained Constant
CPU clock speed and memory bandwidth
increased steadily while memory latency
remained constant
As a result, local memory appears slower
from the CPU's perspective
Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
11. CPU Cores Stopped Getting Faster
Moore’s law prevailed until 2005 when cores
hit a practical limit of about 3.4 GHz
The “single threaded free lunch” (as coined by
Herb Sutter) is over
Source: http://www.intel.com/pressroom/kits/quickrefyr.htm
So CPU cores have stopped getting faster -
but you do get more cores now
Source: “The Free Lunch Is Over..” by Herb Sutter
12. But … Interconnects Continue to Evolve
(providing higher bandwidth and lower latency)
13. Result: Remote Nodes Are Becoming “Closer”
Accessing DRAM on a remote host via IB interconnects is only 20x slower than local DRAM.
Remote DRAM is 100x or 5000x faster than local SSD or HDD devices respectively.
Source: HANA Performance Analysis, Intel Westmere (formerly Nehalem-C) and IB QDR, Chaim Bendelac, 2011
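The ratios on this slide can be checked with simple arithmetic. The sketch below assumes the round figures quoted elsewhere in this deck (100 ns for local DRAM, 2,000 ns for remote DRAM over IB, 10,000,000 ns for HDD). The SSD latency is not stated in the deck; it is back-derived here from the 100x claim and should be treated as an assumption.

```python
# Latency figures in nanoseconds. DRAM and HDD values are the round numbers
# quoted in this deck; SSD_NS is an assumption back-derived from the 100x claim.
LOCAL_DRAM_NS = 100
REMOTE_DRAM_NS = 2_000
HDD_NS = 10_000_000
SSD_NS = 100 * REMOTE_DRAM_NS  # assumed, not stated in the deck


def slowdown(slow_ns: int, fast_ns: int) -> float:
    """Return how many times slower `slow_ns` is than `fast_ns`."""
    return slow_ns / fast_ns


if __name__ == "__main__":
    print(f"remote DRAM vs local DRAM: {slowdown(REMOTE_DRAM_NS, LOCAL_DRAM_NS):.0f}x slower")
    print(f"local SSD vs remote DRAM:  {slowdown(SSD_NS, REMOTE_DRAM_NS):.0f}x slower")
    print(f"local HDD vs remote DRAM:  {slowdown(HDD_NS, REMOTE_DRAM_NS):.0f}x slower")
```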
14. Result: Blurring the Boundaries of the Physical Host
15ns-80ns 60ns-100ns 10,000,000 ns
2,000ns 2,000ns 10,000,000 ns
2,000ns 2,000ns 10,000,000 ns
16. Enabling Live Migration of SAP Workloads
Business Problem
Typical SAP workloads such as SAP ERP are
transactional, large, with a fast rate of memory writes.
Classic live migration fails for such workloads as rapid
memory writes cause memory pages to be re-sent over
and over again
Hecatonchire’s Solution
Enable live migration by reducing both the number of
pages re-sent and the cost of a page re-send
Across-the-board improvement of live migration metrics
– Downtime: reduced
– Service degradation: reduced
– Total migration time: reduced
17. Classic Pre-Copy Live Migration
Pre-migration process: VM active on host A; destination host selected (block devices mirrored)
Reservation process: Initialize container on target host
Iterative pre-copy: Copy dirty pages in successive rounds
Stop and copy
• Suspend on host A
• Redirect network traffic
• Synch remaining state
Commitment
• Activate on host B
• VM state on host A released
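To see why iterative pre-copy struggles with write-heavy workloads, the stages above can be modelled with a few lines of arithmetic. This is a minimal sketch, not the QEMU algorithm: it assumes a constant dirty rate and constant bandwidth (both in pages per second), and the function name and parameters are hypothetical.

```python
def precopy_rounds(total_pages, bandwidth_pps, dirty_pps, stop_threshold, max_rounds=30):
    """Simulate iterative pre-copy under a constant dirty rate.

    Returns (pages_sent_per_round, pages_left_for_stop_and_copy).
    """
    to_send = total_pages  # round 1 sends the whole VM memory
    sent_per_round = []
    for _ in range(max_rounds):
        if to_send <= stop_threshold:
            break  # small enough to finish during the stop-and-copy phase
        sent_per_round.append(to_send)
        # Pages dirtied while this round was on the wire:
        # dirty_pps * (to_send / bandwidth_pps), kept in integer arithmetic.
        to_send = min(total_pages, to_send * dirty_pps // bandwidth_pps)
    return sent_per_round, to_send


if __name__ == "__main__":
    # Dirty rate well below bandwidth: each round shrinks by 10x.
    print(precopy_rounds(1_000_000, 100_000, 10_000, 1_000))
    # Dirty rate above bandwidth: every round re-sends the full memory.
    print(precopy_rounds(1_000_000, 100_000, 120_000, 1_000, max_rounds=5))
```

In this model the rounds shrink only while the dirty rate stays below the link bandwidth; otherwise the same pages are re-sent until the round limit is reached.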
18. Hecatonchire Pre-copy Live Migration
Reducing number of page re-sends
Page LRU reordering such that pages with a low
chance of being re-dirtied are sent first
Contribution to QEMU planned for 2012
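One way to picture the reordering idea is a queue of dirty pages kept in least-recently-written-first order. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes that a page written long ago is less likely to be re-dirtied than one written recently. It does not reflect the actual QEMU patch.

```python
from collections import OrderedDict


class DirtyPageLRU:
    """Tracks dirty pages, oldest write first (hypothetical helper)."""

    def __init__(self):
        self._pages = OrderedDict()  # page number -> None, oldest write first

    def mark_dirty(self, page):
        # Re-dirtying moves the page to the most-recently-written end.
        self._pages.pop(page, None)
        self._pages[page] = None

    def next_batch(self, count):
        """Pop up to `count` pages, least recently written first."""
        batch = []
        while self._pages and len(batch) < count:
            page, _ = self._pages.popitem(last=False)
            batch.append(page)
        return batch


if __name__ == "__main__":
    lru = DirtyPageLRU()
    for page in [1, 2, 3, 2, 1, 4]:
        lru.mark_dirty(page)
    print(lru.next_batch(2))  # pages written longest ago are sent first
```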
Reducing the cost of a page re-send
Using the XBZRLE delta encoder, page changes can be
represented much more efficiently
Contributed to QEMU during 2011
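XBZRLE stands for XOR-based zero run-length encoding: the new page is XORed with the previously sent copy, and the long runs of zeros (unchanged bytes) are collapsed. The sketch below shows the principle only. The pair-based output is an assumption made for readability and is not QEMU's wire format; the function names are hypothetical.

```python
def xor_zrle_encode(old: bytes, new: bytes) -> list[tuple[int, bytes]]:
    """Encode `new` relative to `old` as (zero_run_length, xor_bytes) pairs."""
    if len(old) != len(new):
        raise ValueError("pages must have equal size")
    delta = bytes(a ^ b for a, b in zip(old, new))  # zero where unchanged
    pairs = []
    i, n = 0, len(delta)
    while i < n:
        start = i
        while i < n and delta[i] == 0:  # skip the unchanged run
            i += 1
        zero_run = i - start
        start = i
        while i < n and delta[i] != 0:  # collect the changed run
            i += 1
        pairs.append((zero_run, delta[start:i]))
    return pairs


def xor_zrle_decode(old: bytes, pairs: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the new page from the old copy and the encoded pairs."""
    page = bytearray(old)
    pos = 0
    for zero_run, changed in pairs:
        pos += zero_run
        for offset, value in enumerate(changed):
            page[pos + offset] ^= value  # XOR restores the new byte
        pos += len(changed)
    return bytes(page)


if __name__ == "__main__":
    old = bytes(4096)
    new = bytearray(old)
    new[100:104] = b"\x01\x02\x03\x04"
    encoded = xor_zrle_encode(old, bytes(new))
    print(encoded[0], encoded[1][0])
    assert xor_zrle_decode(old, encoded) == bytes(new)
```

A page with a handful of changed bytes encodes to a few short pairs instead of a full 4 KiB payload, which is the saving the slide refers to.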
19. More Than One Way to Live Migrate…
Pre-Copy Live-Migration
Phases: Pre-migrate; Reservation -> Iterative Pre-copy (X Rounds) -> Stop and Copy -> Commit
Timeline (Total Migration Time): Live on A | Downtime | Live on B
Post-Copy Live-Migration
Phases: Pre-migrate; Reservation -> Stop and Copy -> Page Pushing (1 Round) -> Commit
Timeline (Total Migration Time): Live on A | Downtime | Degraded on B | Live on B
Hybrid Post-Copy Live-Migration
Phases: Pre-migrate; Reservation -> Iterative Pre-Copy (X Rounds) -> Stop and Copy -> Page Pushing (1 Round) -> Commit
Timeline (Total Migration Time): Live on A | Downtime | Degraded on B | Live on B
20. Hecatonchire Post-copy Live Migration
In post-copy live migration we reverse the order
1. Transfer of state: transfer the VM running state from A to
B and immediately activate the VM on B
2. Transfer of memory: when B touches a page it does not yet hold, it raises
a network-bound page fault that A services; in the background, A actively
pushes the remaining memory to B until completion
Post-copy has some unique advantages
Downtime is minimal as only a few MBs for a GB sized VM
need to be transferred before re-activation
Total migration time is minimal and predictable
Hecatonchire unique enhancements
Low latency RDMA page transfer protocol
Demand pre-paging (pre-fetching) mechanism
Full Linux MMU integration
Hybrid post-copy supported
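The interplay of demand paging, pre-fetching and background pushing can be sketched with an in-memory model. This is an illustration under stated assumptions, not the Hecatonchire kernel module: a plain dictionary stands in for host A's memory, the pre-fetch policy simply pulls the next few page numbers, and all names are hypothetical.

```python
class PostCopyDestination:
    """Toy model of host B's memory during post-copy migration."""

    def __init__(self, source_pages: dict[int, bytes], prefetch: int = 2):
        self.source = source_pages            # stands in for host A's memory
        self.resident: dict[int, bytes] = {}  # pages already present on B
        self.prefetch = prefetch              # extra pages pulled per fault
        self.remote_faults = 0

    def _fetch(self, page: int) -> None:
        if page in self.source and page not in self.resident:
            self.resident[page] = self.source[page]

    def read(self, page: int) -> bytes:
        if page not in self.resident:
            # Remote page fault: fetch the page plus the following ones.
            self.remote_faults += 1
            for p in range(page, page + 1 + self.prefetch):
                self._fetch(p)
        return self.resident[page]

    def background_push(self, batch: int) -> bool:
        """Push up to `batch` missing pages; return True once all are resident."""
        missing = [p for p in sorted(self.source) if p not in self.resident]
        for p in missing[:batch]:
            self._fetch(p)
        return len(self.resident) == len(self.source)


if __name__ == "__main__":
    dst = PostCopyDestination({i: bytes([i]) for i in range(8)}, prefetch=2)
    dst.read(0)  # faults, also pulls pages 1 and 2
    dst.read(1)  # already resident thanks to pre-fetching
    print(dst.remote_faults, sorted(dst.resident))
```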
24. Automated Elasticity
Elasticity is the basis of cloud economics
You can scale-up or scale-down on-demand
You only pay for what you use
Chart depicts scaling evolution
Scale-up approach: purchase bigger machines to meet
rising demands
Traditional scale-out approach: reconfigure the cluster
size according to demand
Automated elasticity: grow and shrink your resources
automatically responding to changing demands
represented by monitored metrics
If you can’t respond fast enough you may either
miss business opportunities or have to increase
your margin of purchased resources
Amazon Web Services - Guide
25. Hecatonchire Flash Cloning
Business Problem
AWS auto scaling (and others) take minutes to scale-up:
– Disk image clone from a template (AMI) image
– Full boot up sequence of VM
– Acquiring an IP address via DHCP
– Starting up the application
Hecatonchire Solution
Provide just-in-time (sub-second) scaling according to demand
– Clone a paused source VM using Copy-on-Write (CoW), including:
disk image, VM memory, VM state (registers, etc.)
– Use a post-copy live-migration scheme, including page faulting to
fetch missing pages, with background active page pushing
– Create a private network switch per clone (to avoid
assigning a new MAC address and reconfiguring the IP address)
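The copy-on-write part of flash cloning can be illustrated with a small model: every clone reads through to the paused source image and keeps only the pages it has written itself. This is a sketch of the general CoW technique, not of Hecatonchire's implementation; the class name and the dictionary-based page store are assumptions.

```python
class CowClone:
    """A clone that shares the source VM's pages until it writes to them."""

    def __init__(self, base_pages: dict[int, bytes]):
        self._base = base_pages               # shared, never modified by clones
        self._private: dict[int, bytes] = {}  # pages this clone has written

    def read(self, page: int) -> bytes:
        # Private copy wins; otherwise fall through to the shared source.
        return self._private.get(page, self._base[page])

    def write(self, page: int, data: bytes) -> None:
        # The write lands in the clone's private store, leaving the source intact.
        self._private[page] = data

    def private_page_count(self) -> int:
        return len(self._private)


if __name__ == "__main__":
    source = {0: b"kernel", 1: b"app"}
    clone_a, clone_b = CowClone(source), CowClone(source)
    clone_a.write(1, b"app-state-a")
    print(clone_a.read(1), clone_b.read(1), clone_a.private_page_count())
```

Creating a clone costs almost nothing up front because no pages are copied until a write happens, which is what makes sub-second scaling plausible.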
27. Hecatonchire Breakthrough Capability
Breaking the Memory Box Barrier for Memory Intensive Applications
[Chart: access speed (nsec, usec, msec) plotted against capacity (MB, GB, TB, PB) for embedded, local, and networked resources (SSD, local disk, NAS, SAN), with a performance barrier marked]
28. The Memory Cloud
Turns memory into a distributed memory service
[Diagram: Server 1, Server 2 and Server 3, each hosting a VM, with rows for Applications (App), Memory (RAM) and Storage]
Business Problem
Large amounts of DRAM required on-demand – from shared cloud hosts
Current cloud offerings are limited by the size of their physical host -
AWS can’t go beyond 68 GB DRAM as these large memory
instances fully occupy the physical host
Hecatonchire Solution
Access remote DRAM via low-latency RDMA stack (using pre-pushing to hide latency)
MMU integration for transparent consumption by applications and
VMs. As a result it also supports: compression (zcache),
de-duplication (KSM), N-tier storage
No hardware investment needed! No need for dedicated servers!
29. RRAIM : Remote Redundant Array of Inexpensive Memory
Memory Fault Tolerance as Part of a Full HA Solution
RRAIM-1 (Mirroring): Hecatonchire, Active / Active RAM
VM High Availability: KVM Kemari / Xen Remus, Master / Slave App
[Diagram: a cloud management stack over many physical nodes hosting a variety of VMs, combining VM High Availability with Hecatonchire RRAIM]
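The deck does not describe the RRAIM-1 protocol, so the following is only a generic sketch of memory mirroring by analogy with RAID-1: each write goes to every live replica, and a read is served by any surviving one. The class name, the replica count and the failure model are all assumptions.

```python
class MirroredMemory:
    """RAID-1 style mirroring of pages across replicas (illustrative only)."""

    def __init__(self, replica_count: int = 2):
        self.replicas = [dict() for _ in range(replica_count)]
        self.alive = [True] * replica_count

    def write(self, page: int, data: bytes) -> None:
        # Every live replica receives the write, keeping the copies identical.
        for index, replica in enumerate(self.replicas):
            if self.alive[index]:
                replica[page] = data

    def fail(self, index: int) -> None:
        # Simulate the loss of one remote memory node.
        self.alive[index] = False
        self.replicas[index].clear()

    def read(self, page: int) -> bytes:
        # Serve the read from the first surviving replica that holds the page.
        for index, replica in enumerate(self.replicas):
            if self.alive[index] and page in replica:
                return replica[page]
        raise KeyError(page)


if __name__ == "__main__":
    memory = MirroredMemory(replica_count=2)
    memory.write(7, b"payload")
    memory.fail(0)
    print(memory.read(7))  # still readable from the second replica
```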
31. Cache-Coherent Non Uniform Memory Access (ccNUMA)
Traditional cluster | ccNUMA
Distributed memory | Cache coherent shared memory
Standard interconnects | Fast interconnects
OS instance on each node | One OS instance
Distribution handled by application | Distribution handled by hardware/hypervisor
33. Hecatonchire DSM – Cache Coherency (CC) Challenge
Standard ccNUMA
Inter-node (2000ns) cache-coherency takes too long
Inter-node read is expensive while the processor cache is not large enough
Adding COMA (Cache Only Memory Architecture)
Can help to improve performance for multi-read scenarios
COMA implementation requires a 4k cache line, leading to false data sharing
NUMA Topology / Dynamic NUMA Topology
Application NUMA-aware implementation may not be complete
Dynamic changes in NUMA will not be supported by most current apps
We need to attempt to hide some of the performance challenges (so that we
can expose a fixed NUMA topology)
Adding vCPU live migration
Compact vCPU state (only several KB) can be live migrated
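The false-sharing cost of a 4k coherence unit can be made concrete with a small counting model. The sketch assumes a simple single-owner protocol in which a write by a different node transfers ownership of the whole unit; the function name and the trace format are hypothetical.

```python
def ownership_transfers(writes, unit_size):
    """Count how often a coherence unit changes owner.

    `writes` is an iterable of (node, address) pairs in program order.
    """
    owner = {}  # coherence unit index -> node that wrote it last
    transfers = 0
    for node, address in writes:
        unit = address // unit_size
        previous = owner.get(unit)
        if previous is not None and previous != node:
            transfers += 1  # the unit must move to the writing node
        owner[unit] = node
    return transfers


if __name__ == "__main__":
    # Node 0 writes address 0 and node 1 writes address 1024, alternating.
    # The two variables are unrelated but sit inside the same 4 KiB page.
    trace = [(i % 2, 0 if i % 2 == 0 else 1024) for i in range(10)]
    print("64 B units: ", ownership_transfers(trace, 64))
    print("4 KiB units:", ownership_transfers(trace, 4096))
```

With 64-byte units the two variables never collide, while with 4 KiB units every alternating write forces a transfer even though no data is actually shared.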
35. Roadmap
2011
• Live Migration
• Pre-copy XBZRLE Delta Encoding
• Pre-copy LRU page reordering
• Post-copy using RDMA interconnects
2012
• Memory Cloud
• Memory Pooling
• Memory Fault Tolerance (RRAIM)
• Flash Cloning
2013
• Lego Landscape
• Distributed Shared Memory
• Flexible resource management
36. Key takeaways
Hecatonchire extends the standard Linux stack and requires
only standard commodity hardware
With Hecatonchire, unmodified applications or VMs
(which are NUMA-aware) can tap into remote resources
transparently
To be released as open source under GPLv2 and LGPL
licenses to the QEMU and Linux communities
Developed by SAP Research Technology Infrastructure
(TI) Practice
37. Thank you
Benoit Hudzia; Sr. Researcher;
SAP Research CEC Belfast
benoit.hudzia@sap.com
Aidan Shribman; Sr. Researcher;
SAP Research Israel
aidan.Shribman@sap.com
39. Communication Stacks have Become Leaner
Traditional network interface
Application / OS context switches
Intermediate buffer copies
OS handling transport processing
RDMA adapters
Zero copy directly from/to
application physical memory
Offloading of transport processing
to RDMA adapter and effectively
bypassing OS and CPU
A standard interface OFED “Verbs”
supporting all RDMA adapters (IB,
RoCE, iWARP)
40. Linux Kernel-based Virtual Machine (KVM)
Released as a Linux Kernel Module (LKM)
under GPLv2 license in 2007 by Qumranet
Full virtualization via Intel VT-x and AMD-V
virtualization extensions to the x86 instruction
set
Uses Qemu for invoking KVM, for handling of
I/O and for advanced capabilities such as VM
live migration
KVM is considered the primary hypervisor on
most major Linux distributions such as
Red Hat and SUSE
41. Remote Page Faulting Architecture Comparison
Hecatonchire | Yabusame
No context switches | Context switches into user mode
Zero-copy | Use standard TCP/IP transport
Use iWARP RDMA |
Hudzia and Shribman, SYSTOR 2012 | Hirofuchi and Yamahata, KVM Forum 2011
42. Hecatonchire DSM VM – ccNUMA Challenge
Linux NUMA topology
Linux is aware of NUMA topology (which cores
and memory banks reside in each zone/node).
Linux exposes this topology for applications to
make use of it.
But it is up to the application to be NUMA-aware … if not,
it may suffer when running on a NUMA topology
And even if the application is NUMA-aware, the longer
time needed for cache coherency (cc) may hurt performance
Inter-core: L3 Cache 20 ns
Inter-socket: Main Memory 100 ns
Inter-node (IB): Remote Memory 2,000 ns
Intel Nehalem Memory Hierarchy