SAP Virtualization Week 2012: TRND04
SAP DKOM 2012: NA 6747

The Lego Cloud
Benoit Hudzia Sr. Researcher; SAP Research CEC Belfast (UK)
Aidan Shribman Sr. Researcher; SAP Research Israel
Agenda


Introduction
Hardware Trends
Live Migration
Flash Cloning
Memory Pooling
Distributed Shared Memory
Summary



© 2012 SAP AG. All rights reserved.   2
Introduction
The evolution of the datacenter
Evolution of Virtualization



(Figure: staircase of virtualization maturity)
 No virtualization
 Basic consolidation
 Flexible resources management (Cloud)
 Resources disaggregation (True Utility Cloud)
Why Disaggregate Resources?


Better Performance
Replacing slow local devices (e.g. disk) with fast remote devices (e.g. DRAM).
Many remote devices working in parallel (e.g. DRAM, disk, compute)

Superior Scalability
Going beyond boundaries of the single node

Improved Economics
Do more with existing hardware
Reach better hardware utilization levels




The Hecatonchire Project
In Greek mythology the Hecatonchires are the “Hundred-Handed
Ones” – the original idea: bring Distributed Shared Memory
(DSM) capabilities to the cloud

Strategic goal: full resource liberation brought to the cloud by:
 Breaking down physical nodes into their core elements (CPU, memory, I/O)
 Extending the existing cloud software stack (KVM, QEMU, libvirt, OpenStack)
  without degrading any existing capabilities
 Using commodity cloud hardware and standard interconnects

Initiated by Benoit Hudzia in 2011. Currently developed by two
SAP Research TI Practice teams located in Belfast and Ra’anana

Hecatonchire is not a monolithic project but a set of separate
capabilities. We are currently identifying stakeholders and
defining use cases for each capability.

Hecatonchire Architecture
Cluster Servers
 Commodity hosts (e.g. 64 GB, 16 cores)
 Commodity network adapters:
  – Standard: SoftiWARP over 1 GbE
  – Enterprise: RoCE/iWARP over 10 GbE or native IB
 A modified version of the QEMU/KVM hypervisor
 An RDMA remote-memory kernel module

Guests / VMs
 Use resources from one or several underlying hosts
 Existing OS/applications can run transparently
  – Not exactly … but we will get to this later

(Figure: guest VMs spanning servers #1 to #n, each server contributing
CPUs, memory and I/O over fast RDMA communication)
The Team - Panoramic View




Hardware Trends
The blurring of physical host boundaries
DRAM Latency Has Remained Constant


CPU clock speed and memory bandwidth have
increased steadily while memory latency has
remained constant

As a result, local memory appears slower
from the CPU’s perspective




                                              Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010

CPU Cores Stopped Getting Faster


Moore’s law prevailed until 2005 when cores
hit a practical limit of about 3.4 GHz

The “single threaded free lunch” (as coined by
Herb Sutter) is over

                                                 Source: http://www.intel.com/pressroom/kits/quickrefyr.htm
So CPU cores have stopped getting faster -
but you do get more cores now




                                                 Source: “The Free Lunch Is Over..” by Herb Sutter

But … Interconnects Continue to Evolve
(providing higher bandwidth and lower latency)




Result: Remote Nodes Are Becoming “Closer”

Accessing DRAM on a remote host via IB interconnects is only 20x slower than local DRAM.
Remote DRAM is 100x or 5000x faster than local SSD or HDD devices respectively.




                 HANA Performance Analysis, Intel Westmere (formerly Nehalem-C) and IB QDR, Chaim Bendelac, 2011

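The ratios above can be sanity-checked with a little arithmetic. A minimal sketch, using the slide's approximate figures (local DRAM ~100 ns, remote DRAM over IB ~2,000 ns, SSD ~200 µs, HDD ~10 ms); the dictionary keys and the 100 ns baseline are illustrative assumptions, not measurements:

```python
# Rough latency figures consistent with the slide (illustrative, not measured here)
LATENCY_NS = {
    "local_dram": 100,         # local DRAM access
    "remote_dram_ib": 2_000,   # remote DRAM over IB (~20x local DRAM)
    "local_ssd": 200_000,      # local SSD
    "local_hdd": 10_000_000,   # local HDD
}

def ratio(slow: str, fast: str) -> float:
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]

print(ratio("remote_dram_ib", "local_dram"))  # 20.0  (remote DRAM vs. local DRAM)
print(ratio("local_ssd", "remote_dram_ib"))   # 100.0 (remote DRAM beats local SSD)
print(ratio("local_hdd", "remote_dram_ib"))   # 5000.0 (remote DRAM beats local HDD)
```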
Result: Blurring the Boundaries of the Physical Host



(Figure: access latencies within and across hosts — CPU caches 15–80 ns,
local DRAM 60–100 ns, remote DRAM over RDMA ~2,000 ns, disk ~10,000,000 ns)

Live Migration
Serving as a platform to evaluate remote page faulting
Enabling Live Migration of SAP Workloads

Business Problem
 Typical SAP workloads such as SAP ERP are
  transactional, large, with a fast rate of memory writes.
 Classic live migration fails for such workloads as rapid
  memory writes cause memory pages to be re-sent over
  and over again

Hecatonchire’s Solution
 Enable live migration by reducing both the number of
  pages re-sent and the cost of a page re-send
 Across-the-board improvement of live migration metrics:
  – Downtime: reduced
  – Service degradation: reduced
  – Total migration time: reduced



Classic Pre-Copy Live Migration


Pre-migration process

Reservation process
 • Destination host selected
 • Initialize container on target host

Iterative pre-copy
 • VM active on host A
 • Copy dirty pages in successive rounds

Stop and copy
 • Suspend VM on host A
 • Redirect network traffic
 • Synch remaining state (block devices mirrored)

Commitment
 • Activate on host B
 • VM state on host A released
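The iterative pre-copy loop above, and why it fails for write-heavy SAP workloads, can be sketched as a small simulation. This is a hypothetical model (real QEMU tracks dirtied pages with a kernel dirty-page bitmap; the threshold and round cap here are made-up parameters):

```python
# Hypothetical sketch of iterative pre-copy live migration.
def pre_copy_migrate(all_pages, writes_per_round, stop_threshold=2, max_rounds=30):
    """Return (pages_sent, rounds) for a simulated pre-copy migration.

    writes_per_round yields, per round, the set of pages the guest
    dirtied while that round's transfer was in flight."""
    dirty = set(all_pages)          # round 0: every page must be sent
    sent, rounds = 0, 0
    while len(dirty) > stop_threshold and rounds < max_rounds:
        sent += len(dirty)          # transfer the current dirty set
        dirty = set(next(writes_per_round, set()))  # re-dirtied meanwhile
        rounds += 1
    sent += len(dirty)              # stop-and-copy: suspend, send the rest
    return sent, rounds

# A write-heavy guest re-dirties the same 50 pages every round, so the
# dirty set never shrinks below the threshold and we hit the round cap:
hot = set(range(50))
sent, rounds = pre_copy_migrate(range(100), iter([hot] * 100))
print(sent, rounds)  # 1600 30 -> 16x the VM's 100 pages were re-sent
```

For a quiescent guest the loop converges in a round or two; for a transactional workload it degenerates into re-sending the same hot pages until the round cap forces a long stop-and-copy, which is exactly the failure mode described above.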
Hecatonchire Pre-copy Live Migration


Reducing the number of page re-sends
 Page LRU reordering such that pages with a low
  chance of being re-dirtied are sent first
 Contribution to QEMU planned for 2012

Reducing the cost of a page re-send
 The XBZRLE delta encoder represents page
  changes much more efficiently
 Contributed to QEMU during 2011




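The idea behind XBZRLE (XOR-based zero run-length encoding) can be sketched in a few lines. This is a minimal illustration of the principle, not QEMU's actual wire format: XOR the old and new page and keep only the non-zero runs, since a re-dirtied page usually changes in a few small spots:

```python
# Minimal sketch of XOR-based zero run-length encoding (not QEMU's format).
def xbzrle_encode(old: bytes, new: bytes):
    xor = bytes(a ^ b for a, b in zip(old, new))
    runs, i = [], 0
    while i < len(xor):
        if xor[i] == 0:
            i += 1                      # skip zero run (unchanged bytes)
            continue
        j = i
        while j < len(xor) and xor[j] != 0:
            j += 1
        runs.append((i, xor[i:j]))      # store (offset, changed XOR bytes)
        i = j
    return runs

def xbzrle_decode(old: bytes, runs) -> bytes:
    page = bytearray(old)
    for off, run in runs:
        for k, b in enumerate(run):
            page[off + k] ^= b          # re-apply the XOR delta
    return bytes(page)

old = bytes(4096)                       # a 4 KB page
new = bytearray(old); new[100:104] = b"ABCD"; new = bytes(new)
runs = xbzrle_encode(old, new)
assert xbzrle_decode(old, runs) == new  # round-trips
# 4 payload bytes instead of a full 4 KB page re-send
```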
More Than One Way to Live Migrate…
Pre-Copy Live-Migration:
  Pre-migrate; Reservation → Iterative Pre-copy X Rounds → Stop and Copy → Commit
  Live on A | Downtime | Live on B   (total migration time spans all phases)

Post-Copy Live-Migration:
  Pre-migrate; Reservation → Stop and Copy → Page Pushing 1 Round → Commit
  Live on A | Downtime | Degraded on B | Live on B

Hybrid Post-Copy Live-Migration:
  Pre-migrate; Reservation → Iterative Pre-Copy X Rounds → Stop and Copy →
  Page Pushing 1 Round → Commit
  Live on A | Downtime | Degraded on B | Live on B
Hecatonchire Post-copy Live Migration

In post-copy live migration we reverse the order:
1. Transfer of state: transfer the VM running state from A to
   B and immediately activate the VM on B
2. Transfer of memory: B can initiate a network-bound page
   fault handled by A; memory is actively pushed in the
   background from A to B until completion

Post-copy has some unique advantages
 Downtime is minimal as only a few MBs for a GB sized VM
  need to be transferred before re-activation
 Total migration time is minimal and predictable

Hecatonchire unique enhancements
   Low latency RDMA page transfer protocol
   Demand pre-paging (pre-fetching) mechanism
   Full Linux MMU integration
   Hybrid post-copy supported


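The two post-copy transfer paths (demand faulting plus background pushing) can be sketched as a toy model. The class and page names are illustrative assumptions; the real mechanism is an RDMA kernel module integrated with the Linux MMU:

```python
# Hypothetical sketch of post-copy memory transfer: the VM runs on
# destination B immediately; first-touch accesses trigger network-bound
# page faults served by source A, while a background pusher streams the
# remaining pages so later accesses fault less and less.
class PostCopyMemory:
    def __init__(self, source_pages):
        self.remote = dict(source_pages)  # pages still on host A
        self.local = {}                   # pages already resident on host B
        self.faults = 0

    def read(self, pid):
        if pid not in self.local:         # network page fault -> fetch from A
            self.faults += 1
            self.local[pid] = self.remote.pop(pid)
        return self.local[pid]

    def background_push(self, n=1):
        """Host A actively pushes up to n not-yet-transferred pages."""
        for pid in list(self.remote)[:n]:
            self.local[pid] = self.remote.pop(pid)

mem = PostCopyMemory({0: "code", 1: "heap", 2: "stack"})
mem.read(1)                # first touch: one network page fault
mem.background_push(n=2)   # pusher transfers the remaining pages
mem.read(0); mem.read(2)   # already local: no further faults
print(mem.faults, len(mem.remote))  # 1 0
```

Demand pre-paging (pre-fetching) would extend `read` to also pull the pages around the faulting address, hiding latency for sequential access patterns.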
Demo
Flash Cloning
Sub-second elastic auto scaling
Automated Elasticity

Elasticity is the basis of cloud economics
 You can scale up or scale down on demand
 You only pay for what you use

The chart depicts the evolution of scaling:
 Scale-up approach: purchase bigger machines to meet rising demands
 Traditional scale-out approach: reconfigure the cluster size
  according to demand
 Automated elasticity: grow and shrink your resources automatically,
  responding to changing demands represented by monitored metrics

If you can’t respond fast enough you may either
miss business opportunities or have to increase
your margin of purchased resources
                                                          Amazon Web Services - Guide
Hecatonchire Flash Cloning

Business Problem
 AWS auto scaling (and others) takes minutes to scale up:
  – Disk image cloned from a template (AMI) image
  – Full boot-up sequence of the VM
  – Acquiring an IP address via DHCP
  – Starting up the application

Hecatonchire Solution
 Provide just-in-time (sub-second) scaling according to demand:
  – Clone a paused source VM copy-on-write (CoW), including
    disk image, VM memory and VM state (registers, etc.)
  – Use a post-copy live-migration schema: page-faulting to
    fetch missing pages plus background active page pushing
  – Create a private network switch per clone (avoiding the need
    to assign a new MAC and reconfigure IP)


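The copy-on-write cloning step above can be sketched as a toy model (illustrative names and pages, not the actual QEMU/KVM mechanism): each clone shares the paused template VM's pages and privately copies a page only on its first write, so creating the clone itself costs almost nothing:

```python
# Toy copy-on-write sketch of flash cloning (illustrative only).
class CowClone:
    def __init__(self, template_pages):
        self.template = template_pages  # frozen pages of the paused source VM
        self.private = {}               # pages this clone has written

    def read(self, pid):
        return self.private.get(pid, self.template[pid])

    def write(self, pid, value):
        self.private[pid] = value       # copy-on-write: template untouched

template = {0: "boot", 1: "heap"}
a, b = CowClone(template), CowClone(template)   # sub-second "flash clones"
a.write(1, "heap-a")
print(a.read(1), b.read(1), template[1])  # heap-a heap heap
```

Missing pages not yet materialized on the clone's host would be fetched via the same post-copy page-faulting path described above.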
Memory Pooling
Tapping into unused memory resources of remote hosts
Hecatonchire Breakthrough Capability
Breaking the Memory Box Barrier for Memory Intensive Applications


(Figure: access speed (nsec–msec) vs. capacity (MB–PB) — embedded
resources (DRAM: nsec, MB–GB) are fastest but smallest; local resources
(SSD and local disk: usec–msec, GB–TB) sit in the middle; networked
resources (NAS/SAN: msec, TB–PB) offer the most capacity, behind a
performance barrier)
The Memory Cloud
Turns memory into a distributed memory service



(Figure: VMs on servers 1–3, each contributing application RAM and
storage to a shared distributed memory service)

Business Problem
 Large amounts of DRAM required on-demand from shared cloud hosts
 Current cloud offerings are limited by the size of their physical
  host – AWS can’t go beyond 68 GB DRAM, as these large-memory
  instances fully occupy the physical host

Hecatonchire Solution
 Access remote DRAM via a low-latency RDMA stack (using pre-pushing
  to hide latency)
 MMU integration for transparent consumption by applications and
  VMs – as a result, also supports compression (zcache),
  de-duplication (KSM) and N-tier storage
 No hardware investment needed! No need for dedicated servers!
RRAIM : Remote Redundant Array of Inexpensive Memory
Memory Fault Tolerance as Part of a Full HA Solution

RRAIM-1 (Mirroring) – Hecatonchire
 Two active hosts mirror the VM’s RAM (active–active)

VM High Availability – KVM Kemari / Xen Remus
 A master VM is replicated to a slave VM (master–slave)

(Figure: many physical nodes hosting a variety of VMs, combining VM high
availability with Hecatonchire RRAIM under a cloud management stack)
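RRAIM-1 mirroring can be sketched as a toy model (class and host structure are illustrative assumptions; the real system mirrors pages over RDMA): every page write is replicated to two memory hosts, so reads survive the loss of either:

```python
# Hypothetical sketch of RRAIM-1 (mirrored remote memory).
class Rraim1:
    def __init__(self):
        self.hosts = [dict(), dict()]   # two active memory-sponsor hosts

    def write(self, pid, value):
        for h in self.hosts:            # mirror the write to both hosts
            if h is not None:
                h[pid] = value

    def read(self, pid):
        for h in self.hosts:            # read from any surviving mirror
            if h is not None and pid in h:
                return h[pid]
        raise KeyError(pid)

    def fail(self, idx):
        self.hosts[idx] = None          # simulate losing one memory host

mem = Rraim1()
mem.write(7, "page-7")
mem.fail(0)                 # one mirror goes down...
print(mem.read(7))          # page-7  ...reads still succeed
```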
Distributed Shared Memory
Our next challenge
Cache-Coherent Non Uniform Memory Access (ccNUMA)
Traditional cluster                       ccNUMA
   Distributed memory                       Cache coherent shared memory
   Standard interconnects                   Fast interconnects
   OS instance on each node                 One OS instance
   Distribution handled by application      Distribution handled by hardware/hypervisor




Hecatonchire Distributed Shared Memory (DSM) VM




Hecatonchire DSM – Cache Coherency (CC) Challenge

Standard ccNUMA
 Inter-node (2,000 ns) cache coherency takes too long
 Inter-node reads are expensive, while the processor cache is not
  large enough

Adding COMA (Cache Only Memory Access)
 Can help to improve performance for multi-read scenarios
 A COMA implementation requires 4 KB cache lines, leading to false
  data sharing

NUMA Topology / Dynamic NUMA Topology
 An application’s NUMA-aware implementation may not be complete
 Dynamic changes in NUMA topology will not be supported by most
  current apps
 We need to attempt to hide some of the performance challenges (so
  that we can expose a fixed NUMA topology)

Adding vCPU live migration
 Compact vCPU state (only several KB) can be live migrated
Summary
Roadmap
                         • Live Migration
                           • Pre-copy XBZRLE Delta Encoding
                           • Pre-copy LRU page reordering
         2011              • Post-copy using RDMA interconnects


                         • Memory Cloud
                           • Memory Pooling
                           • Memory Fault Tolerance (RRAIM)
         2012            • Flash Cloning



                         • Lego Landscape
                           • Distributed Shared Memory
         2013              • Flexible resource management




Key takeaways


Hecatonchire extends the standard Linux stack, requiring
only commodity hardware

With Hecatonchire, unmodified applications or VMs
(which are NUMA-aware) can tap into remote resources
transparently

To be released as open source under the GPLv2 and LGPL
licenses to the QEMU and Linux communities

Developed by the SAP Research Technology Infrastructure
(TI) Practice




Thank you
Benoit Hudzia; Sr. Researcher;
SAP Research CEC Belfast
benoit.hudzia@sap.com

Aidan Shribman; Sr. Researcher;
SAP Research Israel
aidan.Shribman@sap.com
Appendix
Communication Stacks have Become Leaner

Traditional network interface
  Application / OS context switches
  Intermediate buffer copies
  OS handling transport processing

RDMA adapters
  Zero copy directly from/to
   application physical memory
  Offloading of transport processing
   to RDMA adapter and effectively
   bypassing OS and CPU
  A standard interface OFED “Verbs”
   supporting all RDMA adapters (IB,
   RoCE, iWARP)


Linux Kernel Virtual Machine (KVM)


Released as a Linux Kernel Module (LKM)
under GPLv2 license in 2007 by Qumranet

Full virtualization via Intel VT-x and AMD-V
virtualization extensions to the x86 instruction
set

Uses Qemu for invoking KVM, for handling of
I/O and for advanced capabilities such as VM
live migration

KVM is considered the primary hypervisor on
most major Linux distributions, such as
Red Hat and SUSE


Remote Page Faulting Architecture Comparison


Hecatonchire                                        Yobusame
 No context switches                                 Context switches into user mode
 Zero-copy                                           Uses standard TCP/IP transport
 Uses iWARP RDMA




                 Hudzia and Shribman, SYSTOR 2012         Horofuchi and Yamahata, KVM Forum 2011
Hecatonchire DSM VM – ccNUMA Challenge

Linux NUMA topology
 Linux is aware of NUMA topology (which cores
  and memory banks reside in each zone/node).
 Linux exposes this topology for applications to
  make use of it.


But it is up to the application to be NUMA-
aware … if not, it may suffer when
running on a NUMA topology

And even if the application is NUMA-
aware, the longer time needed for cache
coherency (cc) may hurt performance
 Inter-core: L3 Cache 20 ns
 Inter-socket: Main Memory 100 ns
 Inter-node (IB): Remote Memory 2,000 ns
                                                    Intel Nehalem Memory Hierarchy

Legal Disclaimer

The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of
SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP
has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or
release any functionality mentioned therein. This document, or any related presentation and SAP's strategy and possible future
developments, products and or platforms directions and functionality are all subject to change and may be changed by SAP at
any time for any reason without notice. The information on this document is not a commitment, promise or legal obligation to
deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied,
including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This
document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or
omissions in this document, except if such damages were caused by SAP intentionally or grossly negligent.
All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially
from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as
of their dates, and they should not be relied upon in making purchasing decisions.





IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Kürzlich hochgeladen (20)

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

 Using commodity cloud hardware and standard interconnects

Initiated by Benoit Hudzia in 2011. Currently developed by two SAP Research TI Practice teams, located in Belfast and Ra’anana.

Hecatonchire is not a monolithic project but a set of separate capabilities. We are currently identifying stakeholders and defining use cases for each capability.
Hecatonchire Architecture

Cluster servers
 Commodity hosts (e.g. 64 GB, 16 cores)
 Commodity network adapters:
– Standard: SoftiWARP over 1 GbE
– Enterprise: RoCE/iWARP over 10 GbE, or native IB
 A modified version of the QEMU/KVM hypervisor
 An RDMA remote-memory kernel module

Guests / VMs
 Use resources from one or several underlying hosts
 Existing OS/applications can run transparently
– Not exactly … but we will get to this later

[Diagram: servers #1 to #n, each contributing CPUs, memory and I/O to the guest VMs over fast RDMA communication.]
The Team - Panoramic View
Hardware Trends
The blurring of physical host boundaries
DRAM Latency Has Remained Constant

CPU clock speed and memory bandwidth have increased steadily while memory latency has remained constant. As a result, local memory appears slower from the CPU's perspective.

Source: J. Karstens: In-Memory Technology at SAP. DKOM 2010
CPU Cores Stopped Getting Faster

Moore's law prevailed until 2005, when cores hit a practical limit of about 3.4 GHz. The "single-threaded free lunch" (as coined by Herb Sutter) is over. So CPU cores have stopped getting faster, but you do get more cores now.

Sources: http://www.intel.com/pressroom/kits/quickrefyr.htm; "The Free Lunch Is Over…" by Herb Sutter
But … Interconnects Continue to Evolve
(providing higher bandwidth and lower latency)
Result: Remote Nodes Are Becoming "Closer"

Accessing DRAM on a remote host via IB interconnects is only about 20x slower than accessing local DRAM. Remote DRAM is about 100x or 5,000x faster than a local SSD or HDD device, respectively.

Source: HANA Performance Analysis, Intel Westmere (formerly Nehalem-C) and IB QDR, Chaim Bendelac, 2011
Result: Blurring the Boundaries of the Physical Host

[Diagram: access latencies across hosts, ranging from 15-80 ns and 60-100 ns for local memory, through 2,000 ns for remote memory over the interconnect, to 10,000,000 ns for disk.]
Live Migration
Serving as a platform to evaluate remote page faulting
Enabling Live Migration of SAP Workloads

Business Problem
 Typical SAP workloads such as SAP ERP are transactional, large, and have a fast rate of memory writes.
 Classic live migration fails for such workloads, as rapid memory writes cause memory pages to be re-sent over and over again.

Hecatonchire's Solution
 Enable live migration by reducing both the number of pages re-sent and the cost of a page re-send
 Across-the-board improvement of live migration metrics:
– Downtime: reduced
– Service degradation: reduced
– Total migration time: reduced
Classic Pre-Copy Live Migration

1. Pre-migration: VM active on host A; destination host selected (block devices mirrored)
2. Reservation: initialize container on target host
3. Iterative pre-copy: copy dirty pages in successive rounds
4. Stop and copy: suspend VM on host A; redirect network traffic; synch remaining VM state
5. Commitment: VM state on host A released; VM activated on host B
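The failure mode described above, pages being re-sent over and over for write-heavy workloads, can be shown with a small simulation. This is an illustrative sketch: the page counts, dirty rates and stop threshold are made-up parameters, not measurements of any real hypervisor.

```python
import random

def precopy_migrate(num_pages, dirty_rate, stop_threshold=8, max_rounds=30,
                    rng=None):
    """Simulate iterative pre-copy: each round re-sends the pages that
    were dirtied while the previous round was being transferred.
    Returns (pages_sent_live, pages_sent_in_downtime, rounds)."""
    rng = rng or random.Random(0)
    to_send = set(range(num_pages))        # round 1: send everything
    sent_live, rounds = 0, 0
    while len(to_send) > stop_threshold and rounds < max_rounds:
        sent_live += len(to_send)
        rounds += 1
        # While we were sending, the guest dirtied a fraction of its pages.
        to_send = {p for p in range(num_pages) if rng.random() < dirty_rate}
    # Stop-and-copy: suspend the VM and send whatever is still dirty.
    return sent_live, len(to_send), rounds

# A write-heavy workload (like SAP ERP) keeps re-dirtying pages, so far
# more than num_pages end up being sent before the VM can be suspended.
live, downtime, rounds = precopy_migrate(num_pages=1000, dirty_rate=0.3)
```

With a near-zero dirty rate the loop converges after one round; with a 30% dirty rate it never converges and only the round cap stops it, which is exactly why the pre-send reductions on the next slide matter.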
Hecatonchire Pre-Copy Live Migration

Reducing the number of page re-sends
 LRU page reordering, such that pages with a low chance of being re-dirtied are sent first
 Contribution to QEMU planned for 2012

Reducing the cost of a page re-send
 Using the XBZRLE delta encoder, we can represent page changes much more efficiently
 Contributed to QEMU during 2011
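The idea behind XBZRLE can be illustrated with a small delta coder: encode only the byte runs that changed relative to a cached copy of the page. This is a sketch of the concept, not QEMU's actual wire format.

```python
def xbzrle_encode(old: bytes, new: bytes):
    """Zero-run-length delta: emit (unchanged-run-length, changed-bytes)
    pairs, so a page with a few modified bytes costs a few bytes to send."""
    assert len(old) == len(new)
    out, i, n = [], 0, len(new)
    while i < n:
        start = i
        while i < n and new[i] == old[i]:   # skip the unchanged run
            i += 1
        skip = i - start
        start = i
        while i < n and new[i] != old[i]:   # collect the changed run
            i += 1
        if i > start:
            out.append((skip, new[start:i]))
    return out

def xbzrle_decode(old: bytes, delta):
    """Rebuild the new page from the cached old page plus the delta."""
    page, pos = bytearray(old), 0
    for skip, literal in delta:
        pos += skip
        page[pos:pos + len(literal)] = literal
        pos += len(literal)
    return bytes(page)

old = bytes(4096)                  # previously sent copy of the page
new = bytearray(old)
new[100:108] = b"modified"         # guest touched only a few bytes
delta = xbzrle_encode(old, bytes(new))
assert xbzrle_decode(old, delta) == bytes(new)
# The delta carries 8 literal bytes instead of a full 4 KiB page.
```

The destination keeps the previously received version of each page as the "old" cache, so only the delta travels on a re-send.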
More Than One Way to Live Migrate…

Iterative pre-copy live migration: pre-migrate and reservation; X iterative pre-copy rounds; stop and copy; commit. The VM is live on A, then briefly down, then live on B.

Post-copy live migration: pre-migrate and reservation; stop and copy; one page-pushing round; commit. The VM is live on A, briefly down, degraded on B while pages arrive, then live on B.

Hybrid post-copy live migration: pre-migrate and reservation; X iterative pre-copy rounds; stop and copy; one page-pushing round; commit. The VM is live on A, briefly down, degraded on B, then live on B.

In all cases the total migration time spans from the start of pre-migration to commit.
Hecatonchire Post-Copy Live Migration

In post-copy live migration we reverse the order:
1. Transfer of state: transfer the VM running state from A to B and immediately activate the VM on B
2. Transfer of memory: B can initiate a network-bound page fault handled by A; in the background, A actively pushes memory to B until completion

Post-copy has some unique advantages
 Downtime is minimal, as only a few MBs (for a GB-sized VM) need to be transferred before re-activation
 Total migration time is minimal and predictable

Hecatonchire's unique enhancements
 Low-latency RDMA page transfer protocol
 Demand pre-paging (pre-fetching) mechanism
 Full Linux MMU integration
 Hybrid post-copy supported
Demo
Automated Elasticity

Elasticity is the basis of cloud economics:
 You can scale up or scale down on demand
 You only pay for what you use

The chart depicts the evolution of scaling approaches:
 Scale-up approach: purchase bigger machines to meet rising demand
 Traditional scale-out approach: reconfigure the cluster size according to demand
 Automated elasticity: grow and shrink your resources automatically, responding to changing demand as represented by monitored metrics

If you can't respond fast enough, you may either miss business opportunities or have to increase your margin of purchased resources.

Source: Amazon Web Services - Guide
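The "grow and shrink automatically" loop can be sketched as a simple threshold autoscaler. The thresholds, capacity figure and load samples below are illustrative inventions, not any provider's actual scaling policy.

```python
def autoscale(load_samples, capacity_per_node, scale_out_at=0.8,
              scale_in_at=0.4, nodes=1):
    """Reactive threshold autoscaler: add a node when utilization is high,
    remove one when it is low. Returns the node count after each sample."""
    history = []
    for load in load_samples:
        utilization = load / (nodes * capacity_per_node)
        if utilization > scale_out_at:
            nodes += 1                      # scale out
        elif utilization < scale_in_at and nodes > 1:
            nodes -= 1                      # scale in
        history.append(nodes)
    return history

# Demand ramps up and back down; the cluster follows with a one-step lag.
print(autoscale([50, 90, 170, 250, 120, 40], capacity_per_node=100))
# → [1, 2, 3, 4, 3, 2]
```

The lag between a threshold breach and a usable new instance is exactly the minutes-long provisioning delay that flash cloning is meant to eliminate.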
Hecatonchire Flash Cloning

Business Problem
 AWS auto scaling (and others) takes minutes to scale up:
– Disk image cloned from a template (AMI) image
– Full boot-up sequence of the VM
– Acquiring an IP address via DHCP
– Starting up the application

Hecatonchire Solution
 Provide just-in-time (sub-second) scaling according to demand:
– Clone a paused source VM copy-on-write (CoW), including disk image, VM memory and VM state (registers, etc.)
– Use a post-copy live-migration schema, including page faulting to fetch missing pages with background active page pushing
– Create a private network switch per clone (to avoid assigning a new MAC and performing IP reconfiguration)
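The copy-on-write idea behind sub-second cloning can be sketched as follows. This is a toy model of the mechanism, not Hecatonchire's implementation: the clone shares the parent's pages and copies a page only on first write, so creating the clone costs O(1) regardless of VM size.

```python
class CowClone:
    """Copy-on-write clone: reads fall through to the shared parent
    pages; the first write to a page gives the clone a private copy."""

    def __init__(self, parent_pages):
        self.parent = parent_pages     # shared, treated as read-only
        self.private = {}              # pages this clone has written

    def read(self, page_no):
        return self.private.get(page_no, self.parent[page_no])

    def write(self, page_no, data):
        self.private[page_no] = data   # first write triggers the copy

parent = {n: b"\0" * 4096 for n in range(1024)}   # 4 MiB "VM memory"
clone = CowClone(parent)                          # instant: nothing copied
clone.write(5, b"A" * 4096)
assert clone.read(5) == b"A" * 4096               # clone sees its write
assert parent[5] == b"\0" * 4096                  # parent is untouched
assert len(clone.private) == 1                    # only one page copied
```

Combined with post-copy page faulting, the clone can even start on another host before any of the shared pages have moved.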
Memory Pooling
Tapping into unused memory resources of remote hosts
Hecatonchire Breakthrough Capability
Breaking the Memory Box Barrier for Memory-Intensive Applications

[Chart: access speed (nsec to msec) versus capacity (MB to PB) for embedded resources, networked resources, SSD, local disk, NAS and SAN, showing the performance barrier that memory pooling crosses.]
The Memory Cloud
Turns memory into a distributed memory service

[Diagram: applications and VMs on servers 1-3 drawing on a pooled layer of RAM and storage spanning all hosts.]

Business Problem
 Large amounts of DRAM required on demand, from shared cloud hosts
 Current cloud offerings are limited by the size of their physical host: AWS can't go beyond 68 GB DRAM, as these large-memory instances fully occupy the physical host

Hecatonchire Solution
 Access remote DRAM via a low-latency RDMA stack (using pre-pushing to hide latency)
 MMU integration for transparent consumption by applications and VMs; as a result, also supports compression (zcache), de-duplication (KSM) and N-tier storage
 No hardware investment needed! No need for dedicated servers!
RRAIM: Remote Redundant Array of Inexpensive Memory
Memory Fault Tolerance as Part of a Full HA Solution

[Diagram: RRAIM-1 (mirroring) keeps a VM's RAM replicated between a master and a slave host; Hecatonchire RRAIM sits alongside VM high-availability schemes (KVM Kemari / Xen Remus) under the cloud management stack, across many physical nodes hosting a variety of VMs.]
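A minimal sketch of RRAIM-1-style mirroring, with a hypothetical interface invented for illustration: every page write goes to two memory hosts, so a read survives the loss of either one.

```python
class MirroredMemory:
    """RRAIM-1 (mirroring) sketch: synchronous writes to two replica
    memory hosts; reads fall back to whichever replica is still alive."""

    def __init__(self):
        self.replicas = [dict(), dict()]   # two remote memory hosts

    def write(self, page_no, data):
        for replica in self.replicas:      # synchronous mirrored write
            replica[page_no] = data

    def read(self, page_no):
        for replica in self.replicas:      # first reachable copy wins
            if replica is not None and page_no in replica:
                return replica[page_no]
        raise KeyError(page_no)

    def fail(self, idx):
        self.replicas[idx] = None          # simulate losing a memory host

mem = MirroredMemory()
mem.write(42, b"payload")
mem.fail(0)                                # primary memory host dies
assert mem.read(42) == b"payload"          # served from the mirror
```

Mirroring the memory pool complements VM-level HA: the VM replication scheme protects the running state, while RRAIM protects the remote RAM it depends on.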
Cache-Coherent Non-Uniform Memory Access (ccNUMA)

Traditional cluster
 Distributed memory
 Standard interconnects
 OS instance on each node
 Distribution handled by the application

ccNUMA
 Cache-coherent shared memory
 Fast interconnects
 One OS instance
 Distribution handled by hardware/hypervisor
Hecatonchire Distributed Shared Memory (DSM) VM
Hecatonchire DSM – Cache Coherency (CC) Challenge

Standard ccNUMA
 Inter-node cache coherency (2,000 ns) takes too long
 Inter-node reads are expensive, while the processor cache is not large enough

Adding COMA (Cache Only Memory Access)
 Can help to improve performance in multi-read scenarios
 A COMA implementation requires a 4 KB cache line, leading to false sharing

NUMA topology / dynamic NUMA topology
 An application's NUMA-aware implementation may not be complete
 Dynamic changes in the NUMA topology will not be supported by most current applications
 We need to hide some of the performance challenges, so that we can expose a fixed NUMA topology

Adding vCPU live migration
 The compact vCPU state (only several KB) can be live-migrated
Roadmap

2011: Live Migration
 Pre-copy XBZRLE delta encoding
 Pre-copy LRU page reordering
 Post-copy using RDMA interconnects

2012: Memory Cloud
 Memory pooling
 Memory fault tolerance (RRAIM)
 Flash cloning

2013: Lego Landscape
 Distributed shared memory
 Flexible resource management
Key Takeaways

Hecatonchire extends the standard Linux stack, requiring only standard commodity hardware.

With Hecatonchire, unmodified applications or VMs (which are NUMA-aware) can tap into remote resources transparently.

To be released as open source under the GPLv2 and LGPL licenses to the QEMU and Linux communities.

Developed by the SAP Research Technology Infrastructure (TI) Practice.
Thank You

Benoit Hudzia; Sr. Researcher; SAP Research CEC Belfast
benoit.hudzia@sap.com

Aidan Shribman; Sr. Researcher; SAP Research Israel
aidan.Shribman@sap.com
Communication Stacks Have Become Leaner

Traditional network interface
 Application/OS context switches
 Intermediate buffer copies
 OS handles transport processing

RDMA adapters
 Zero copy, directly from/to application physical memory
 Transport processing offloaded to the RDMA adapter, effectively bypassing the OS and CPU
 A standard interface, OFED "Verbs", supporting all RDMA adapter types (IB, RoCE, iWARP)
Linux Kernel Virtual Machine (KVM)

Released as a Linux kernel module (LKM) under the GPLv2 license in 2007 by Qumranet.

Provides full virtualization via the Intel VT-x and AMD-V virtualization extensions to the x86 instruction set.

Uses QEMU for invoking KVM, for handling I/O, and for advanced capabilities such as VM live migration.

KVM is considered the primary hypervisor on most major Linux distributions, such as Red Hat and SUSE.
Remote Page Faulting: Architecture Comparison

Hecatonchire (Hudzia and Shribman, SYSTOR 2012)
 No context switches
 Zero copy
 Uses iWARP RDMA

Yobusame (Horofuchi and Yamahata, KVM Forum 2011)
 Context switches into user mode
 Uses the standard TCP/IP transport
Hecatonchire DSM VM – ccNUMA Challenge

Linux NUMA topology
 Linux is aware of the NUMA topology (which cores and memory banks reside in each zone/node).
 Linux exposes this topology for applications to make use of it.

But it is up to the application to be NUMA-aware; if it is not, it may suffer when running on a NUMA topology. And even if the application is NUMA-aware, the longer time needed for cache coherency (CC) may hurt performance:
 Inter-core: L3 cache, 20 ns
 Inter-socket: main memory, 100 ns
 Inter-node (IB): remote memory, 2,000 ns

[Diagram: Intel Nehalem memory hierarchy]
Legal Disclaimer

The information in this presentation is confidential and proprietary to SAP and may not be disclosed without the permission of SAP. This presentation is not subject to your license agreement or any other service or subscription agreement with SAP. SAP has no obligation to pursue any course of business outlined in this document or any related presentation, or to develop or release any functionality mentioned therein. This document, or any related presentation, and SAP's strategy and possible future developments, products and/or platform directions and functionality are all subject to change and may be changed by SAP at any time for any reason without notice. The information in this document is not a commitment, promise or legal obligation to deliver any material, code or functionality. This document is provided without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This document is for informational purposes and may not be incorporated into a contract. SAP assumes no responsibility for errors or omissions in this document, except if such damages were caused by SAP intentionally or through gross negligence.

All forward-looking statements are subject to various risks and uncertainties that could cause actual results to differ materially from expectations. Readers are cautioned not to place undue reliance on these forward-looking statements, which speak only as of their dates, and they should not be relied upon in making purchasing decisions.