Building A KVM-based Hypervisor for A
Heterogeneous System Architecture
Compliant System
National Chiao Tung University & National Tsing Hua University & National Taiwan University
Yu-Ju Huang, Hsuan-Heng Wu,
Yeh-Ching Chung, Wei-Chung Hsu
Agenda
• Motivation
• Background
  • HSA features
  • AMD’s implementation on Kaveri, the HSA-compliant platform
• Design and Implementation
• Evaluation
• Conclusion
Motivation
• Problem of heterogeneous computing
  • Data communication between CPU & GPU
    • Inefficiency
    • Programmability inconvenience
• Heterogeneous System Architecture (HSA)
  • Developed by the HSA Foundation
  • Goals
    • Improving computation efficiency for heterogeneous computing
    • Lowering the programmability barrier
• Let virtual machines benefit from HSA too!
[Figure: HSA hardware at the bottom, a hypervisor above it, and two guest OSes on top, each running apps that use HSA.]
HSA Features
• Shared virtual memory
• I/O page faulting
• User-level queueing
• Memory based signaling
[Figure: Before HSA, the CPU and GPU each have their own memory and data is copied between them; with HSA, the CPU and GPU share physical memory through one virtual memory space. Before HSA, an application's queues reach the GPU through the operating system and the GPU driver; with HSA, the GPU reads the application's queues directly.]
Shared Virtual Memory - IOMMU
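• The process page table is registered with the IOMMU, which carries out virtual-to-physical address translation
• The CPU and GPU share the same process page table
[Figure: The CPU reaches system memory through the MMU and the GPU through the IOMMU; both use the same process page table.]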
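The idea on this slide can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `PageTable` and `TranslationUnit` classes, the single-level table, and the 4 KiB page size are all hypothetical illustrations, not AMD's hardware interface. It shows that when the MMU and the IOMMU point at the same page table object, a pointer means the same thing to the CPU and to the GPU.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


class PageTable:
    """Toy single-level page table: virtual page number -> physical frame number."""

    def __init__(self):
        self.entries = {}

    def map(self, vpn, pfn):
        self.entries[vpn] = pfn

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        return self.entries[vpn] * PAGE_SIZE + offset


class TranslationUnit:
    """Stands in for either the CPU MMU or the IOMMU (hypothetical model)."""

    def __init__(self, name, page_table):
        self.name = name
        self.page_table = page_table  # a shared reference, not a copy

    def translate(self, vaddr):
        return self.page_table.translate(vaddr)


# One process page table, registered with both translation units.
process_pt = PageTable()
process_pt.map(vpn=0x10, pfn=0x80)
cpu_mmu = TranslationUnit("MMU", process_pt)
iommu = TranslationUnit("IOMMU", process_pt)

pointer = 0x10 * PAGE_SIZE + 0x24
cpu_view = cpu_mmu.translate(pointer)
gpu_view = iommu.translate(pointer)
```

Because both units hold a reference to the same table, a mapping added later for the CPU is visible to the GPU without any copy or resynchronisation step.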
I/O Page Faulting - PPR
• A PPR (peripheral page service request) is issued by the IOMMU as an interrupt
• The PPR log contains the faulting process ID and the faulting address
• The get_user_pages API can be used to fix the page fault
[Figure: I/O page fault handling between the IOMMU and the CPU, steps 1 to 5: PPR interrupt, call PPR handler, get PPR logs, fix page fault, COMPLETE command.]
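The five-step flow on this slide can be modelled as a small simulation. Everything here is a hypothetical stand-in: `Iommu`, `Process`, and `fix_fault` are illustrative names, `fix_fault` only plays the role that `get_user_pages` plays in the real handler, and the frame numbers are made up. The sketch shows the order of events: a failed translation is logged, the handler drains the log, repairs the mapping, and reports completion so the device can retry.

```python
from collections import deque

PAGE_SIZE = 4096  # assumed page size for the sketch


class Process:
    def __init__(self, pasid):
        self.pasid = pasid            # process address space ID
        self.page_table = {}          # vpn -> pfn
        self.next_free_pfn = 0x100    # made-up frame allocator


class Iommu:
    def __init__(self):
        self.ppr_log = deque()        # (pasid, faulting address) entries
        self.completed = []           # faults the handler reported as fixed

    def access(self, process, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in process.page_table:
            # Step 1: translation failed, log it (the real IOMMU also interrupts the CPU).
            self.ppr_log.append((process.pasid, vaddr))
            return None
        return process.page_table[vpn] * PAGE_SIZE + offset

    def complete(self, pasid, vaddr):
        # Step 5: COMPLETE command from the handler.
        self.completed.append((pasid, vaddr))


def fix_fault(process, vaddr):
    """Hypothetical stand-in for get_user_pages: make the page resident."""
    process.page_table[vaddr // PAGE_SIZE] = process.next_free_pfn
    process.next_free_pfn += 1


def ppr_handler(iommu, processes):
    # Steps 2 to 4: handler runs, reads the log, fixes each fault.
    while iommu.ppr_log:
        pasid, vaddr = iommu.ppr_log.popleft()
        fix_fault(processes[pasid], vaddr)
        iommu.complete(pasid, vaddr)


proc = Process(pasid=7)
iommu = Iommu()
first_try = iommu.access(proc, 0x5008)    # faults
ppr_handler(iommu, {proc.pasid: proc})
second_try = iommu.access(proc, 0x5008)   # succeeds after the fix
```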
User Level Queueing -
Kernel Fusion Driver (KFD)
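• Helps applications pass the addresses of their user-level queues to the GPU
[Figure: In user space, the application owns the user-level queues. The KFD in kernel space passes the address of the user-level queue to the GPU; during computation the GPU works on the user-level queues directly.]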
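A minimal sketch of this division of labour, assuming a toy model in which a Python list plays the user-level queue: the `Kfd` and `Gpu` classes and their method names are hypothetical and are not the real KFD interface. The point it illustrates is that the driver is called once, at queue creation, and every later dispatch is just a write into memory the GPU already knows about.

```python
class Gpu:
    """Toy GPU that consumes tasks from every queue it has been told about."""

    def __init__(self):
        self.queues = []
        self.executed = []

    def register_queue(self, queue):
        self.queues.append(queue)

    def run(self):
        for queue in self.queues:
            while queue:
                self.executed.append(queue.pop(0))


class Kfd:
    """Hypothetical stand-in for the kernel driver; counts how often it is entered."""

    def __init__(self, gpu):
        self.gpu = gpu
        self.calls = 0

    def create_queue(self, queue):
        self.calls += 1
        self.gpu.register_queue(queue)


gpu = Gpu()
kfd = Kfd(gpu)

# Initialization: the only point where the driver is involved.
user_queue = []
kfd.create_queue(user_queue)

# Computation: the application enqueues directly, with no driver call.
for task in ("kernel_a", "kernel_b", "kernel_c"):
    user_queue.append(task)
gpu.run()
```

After three dispatches the driver call count is still one, which is the latency saving the slide is describing.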
Design - How to Virtualize
• User-level queueing
  • VirtIO-KFD
• Shared virtual memory
  • Shadow page table
  • Why not hardware-assisted nested paging?
• I/O page faulting
  • Shadow PPR
  • VirtIO-IOMMU
Virtualize User Level Queueing
VirtIO-KFD
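[Figure: Guest applications in the guest OS use the HSA runtime library and the VirtIO-KFD front-end. The front-end shares a virtqueue with the VirtIO-KFD back-end in QEMU, which runs in the host OS alongside KVM and the KFD. The KFD programs the GPU. Steps 1 to 4 are marked along this path.]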
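The forwarding path in this figure might be sketched as follows. This is a toy model, not the authors' code: a `deque` stands in for the shared virtqueue, the request dictionary format and the returned queue ID are invented for illustration, and the "kick" is reduced to a direct method call where the real system would switch to the host.

```python
from collections import deque


class HostKfd:
    """Hypothetical stand-in for the real KFD in the host kernel."""

    def __init__(self):
        self.gpu_queue_addrs = []   # addresses handed to the GPU

    def create_queue(self, queue_addr):
        self.gpu_queue_addrs.append(queue_addr)
        return len(self.gpu_queue_addrs) - 1   # invented queue ID


class VirtioKfdBackend:
    """Runs in QEMU: drains the virtqueue and calls the real KFD."""

    def __init__(self, virtqueue, host_kfd):
        self.virtqueue = virtqueue
        self.host_kfd = host_kfd

    def process(self):
        replies = []
        while self.virtqueue:
            request = self.virtqueue.popleft()
            if request["op"] == "create_queue":
                replies.append(self.host_kfd.create_queue(request["queue_addr"]))
        return replies


class VirtioKfdFrontend:
    """Runs in the guest kernel: looks like the KFD to the guest's HSA runtime."""

    def __init__(self, virtqueue, backend):
        self.virtqueue = virtqueue
        self.backend = backend

    def create_queue(self, queue_addr):
        self.virtqueue.append({"op": "create_queue", "queue_addr": queue_addr})
        # "Kick": in the real system this is where the world switch to the host happens.
        return self.backend.process()[0]


virtqueue = deque()
host_kfd = HostKfd()
backend = VirtioKfdBackend(virtqueue, host_kfd)
frontend = VirtioKfdFrontend(virtqueue, backend)

guest_queue_addr = 0x7F0000001000   # made-up address of a guest application's queue
queue_id = frontend.create_queue(guest_queue_addr)
```

The address is passed through unchanged in this sketch; making it meaningful to the GPU is the job of the shared virtual memory support described on the next slides.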
Virtualize Shared Virtual Memory
Shadow Page Table
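[Figure: The same components as the VirtIO-KFD figure (guest applications, HSA runtime library, VirtIO-KFD front-end and back-end sharing a virtqueue, QEMU, KFD, KVM), plus the IOMMU driver in the host OS and the IOMMU. The address of the shadow page table is passed to the IOMMU. Steps 1 to 6 are marked along this path.]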
IOMMU
Memory
ID System Page table
1 Host, process 1 Addr of PT
2 Guest 1,
process 1
Addr of SPT
Page
Table
ID=1
HVA
MPA
Native ScenarioGuest Scenario
 More guest processes in different guest OSes are also allowed.
11
IOMMU Snapshot During GPU Execution
GVA
MPA
ID=2
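The snapshot can be reproduced as a small data-structure sketch. The assumptions are mine: page tables are flat dictionaries, the shadow page table is built by composing a guest page table (GVA to GPA) with a hypervisor map (GPA to MPA), and all page and frame numbers are made up. The sketch shows why the IOMMU needs nothing new for guests: a shadow page table is just another entry in its ID table.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


def compose_shadow(guest_pt, gpa_to_mpa):
    """Compose GVA->GPA with GPA->MPA into a direct GVA->MPA shadow table."""
    return {
        gvpn: gpa_to_mpa[gpfn]
        for gvpn, gpfn in guest_pt.items()
        if gpfn in gpa_to_mpa
    }


host_pt = {0x10: 0x200}      # host process 1: HVA page -> MPA frame
guest_pt = {0x10: 0x05}      # guest 1, process 1: GVA page -> GPA frame
gpa_to_mpa = {0x05: 0x300}   # hypervisor's view: GPA frame -> MPA frame
shadow_pt = compose_shadow(guest_pt, gpa_to_mpa)

# The IOMMU's table from the slide: address space ID -> page table to walk.
iommu_table = {
    1: host_pt,     # native scenario
    2: shadow_pt,   # guest scenario
}


def iommu_translate(address_space_id, vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    return iommu_table[address_space_id][vpn] * PAGE_SIZE + offset


same_virtual_address = 0x10 * PAGE_SIZE + 4
native_mpa = iommu_translate(1, same_virtual_address)
guest_mpa = iommu_translate(2, same_virtual_address)
```

The same virtual address resolves to different machine frames depending on the ID, and adding a third process from another guest is just one more dictionary entry.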
Virtualize I/O Page Faulting
VirtIO-IOMMU, Shadow PPR
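[Figure: I/O page fault path for a guest, steps 1 to 5. The IOMMU raises an interrupt to the IOMMU driver in the host OS; shadow PPR and KVM in the host OS pass the fault to the guest OS, where VirtIO-IOMMU handles it. QEMU, the HSA runtime library, and the guest applications are also shown.]
PPR: Peripheral Page Request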
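A rough sketch of the routing decision, with hypothetical names throughout (`GuestVm`, `host_ppr_handler`, the tuple format of a log entry): the host handler checks whether a fault belongs to a guest, and if so stores it in that guest's shadow PPR and injects a virtual interrupt instead of fixing it itself.

```python
from collections import deque


class GuestVm:
    def __init__(self, name):
        self.name = name
        self.shadow_ppr = deque()       # fault records readable by the guest
        self.pending_virtual_irqs = 0
        self.fixed = []                 # faults the guest has repaired

    def inject_virtual_interrupt(self):
        # Stands in for KVM delivering a virtual interrupt to the guest.
        self.pending_virtual_irqs += 1

    def virtio_iommu_handle(self):
        # Guest side: read fault records from shadow PPR and fix them.
        while self.shadow_ppr:
            self.fixed.append(self.shadow_ppr.popleft())
        self.pending_virtual_irqs = 0


def host_ppr_handler(ppr_log, guest_by_id, host_fixed):
    """Route each fault: host faults are fixed in place, guest faults are forwarded."""
    for address_space_id, fault_addr in ppr_log:
        guest = guest_by_id.get(address_space_id)
        if guest is None:
            host_fixed.append((address_space_id, fault_addr))
        else:
            guest.shadow_ppr.append((address_space_id, fault_addr))
            guest.inject_virtual_interrupt()


guest = GuestVm("guest1")
guest_by_id = {2: guest}   # ID 2 belongs to a guest process, as in the snapshot slide
host_fixed = []

host_ppr_handler([(1, 0x1000), (2, 0x5000)], guest_by_id, host_fixed)
pending_before = guest.pending_virtual_irqs
guest.virtio_iommu_handle()
```

Keeping a per-guest shadow log means the guest never touches the real PPR log region, which it is not allowed to access.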
System Architecture
[Figure: Overall system architecture. Guest OS: guest applications, HSA runtime library, VirtIO-KFD, VirtIO-IOMMU. Host OS: QEMU (a host process) with VirtIO-KFD and VirtIO-IOMMU, KVM, shadow PPR, KFD, IOMMU driver. Hardware: IOMMU and GPU. The paths are labeled user-level queueing, shared virtual memory, and I/O page faulting.]
• KFD: Kernel Fusion Driver
• PPR: Peripheral Page Request
Evaluation
• Queue initialization time
  • Measures the overhead of VirtIO-KFD
• GPU execution time
  • Measures the overhead of the shadow page table and shadow PPR

Configuration       Native         Guest
Hardware platform   Kaveri         Kaveri
Memory              8G             4G
Number of CPUs      4              4
OS                  Ubuntu 13.10   Ubuntu 13.10
Queue Initialization Time
About a 30% performance drop on average, relative to native.
GPU Execution Time
Achieves about 95% of native performance in most cases.

GPU time (sec)
Benchmark              Native    Guest
BinarySearch           0.0108    0.0113
FastWalshTransform     0.0018    0.0019
BitonicSort            0.014     0.016
FloydWarshall          16.094    16.603
MatrixMultiplication   8.012     8.286
MatrixTranspose        0.502     0.538
MonteCarloAsian        17.458    18.342

[Figure: Timeline of a small benchmark. The guest application enqueues a task, kicks the GPU, and waits for the signal, and it does so many times. When a world switch to the host happens while it waits, the signal is only seen after switching back, which adds a signal delay.]
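The "95% of native" figure can be checked directly from the table. The short script below uses only the GPU times from this slide; the choice to express relative performance as native time divided by guest time is my assumption about how the percentage was derived.

```python
# GPU times in seconds, copied from the table on this slide.
native = {
    "BinarySearch": 0.0108,
    "FastWalshTransform": 0.0018,
    "BitonicSort": 0.014,
    "FloydWarshall": 16.094,
    "MatrixMultiplication": 8.012,
    "MatrixTranspose": 0.502,
    "MonteCarloAsian": 17.458,
}
guest = {
    "BinarySearch": 0.0113,
    "FastWalshTransform": 0.0019,
    "BitonicSort": 0.016,
    "FloydWarshall": 16.603,
    "MatrixMultiplication": 8.286,
    "MatrixTranspose": 0.538,
    "MonteCarloAsian": 18.342,
}

# Relative performance: 1.0 means the guest is as fast as native.
relative = {name: native[name] / guest[name] for name in native}
slowest = min(relative, key=relative.get)
mean_relative = sum(relative.values()) / len(relative)

for name, value in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{name:22s} {value:6.1%}")
print(f"{'mean':22s} {mean_relative:6.1%}")
```

By this measure most benchmarks land between roughly 93% and 97% of native, and BitonicSort is the lowest at 87.5%.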
Conclusion
• We successfully implemented a hypervisor that virtualizes the HSA features.
• Guest systems can benefit from HSA and carry out heterogeneous computing.
• The GPU in Kaveri can be shared between multiple guest OSes and the host OS.
Thanks!
Q&A
gic4107@gmail.com


Editor's notes

  1. Hello everyone. My name is Yu-Ju Huang. Here is the author list: me, my partner, and two professors. We are all from Taiwan, a country in East Asia. This is my topic today. It’s a little long, right? :D So now I’m going to give you a brief introduction to this work and a picture of what it does. I hope you enjoy it! In this work, our target is a special hardware architecture called Heterogeneous System Architecture, or HSA for short. HSA mainly focuses on making heterogeneous computing systems more powerful and more efficient. Given an HSA-compliant hardware platform, we implement a hypervisor that runs on top of it. The hypervisor virtualizes the features provided by HSA so that virtual machines can also get the benefits of HSA.
  2. To begin, I’ll introduce the motivation for this work, and then give a brief background on HSA, including the HSA features and AMD’s implementation on Kaveri, which is the first HSA-compliant platform and also our target platform. After that, we can talk about our design and implementation, and then the evaluation and conclusion.
  3. For the motivation, we start from heterogeneous computing. The heterogeneous computing programming model requires data communication between devices. This communication causes inefficiency and makes programming inconvenient. So the HSA Foundation proposed the HSA architecture to resolve these problems. The motivation of our own work is this: if we believe heterogeneous computing will become more and more popular, then there must be a hypervisor that lets virtual machines get the benefits of HSA. Although our discussion is based on HSA and the implementation is based on AMD’s platform, our design philosophy can also be applied to other platforms, or even to other architectures that try to improve heterogeneous computing systems.
  4. OK, let’s start introducing HSA. As I described, HSA tries to solve the communication inefficiency and inconvenience. Here is HSA’s solution. It proposes many features, and the list here shows the ones that determine how a program is able to execute. These are also the features we need to virtualize. First, shared virtual memory. Before HSA, the CPU and GPU used different memory and different address spaces, so data copies were required. With HSA, all the computing resources, such as the CPU, the GPU, and other HSA-aware devices, see the same virtual address space, so they can access system memory with virtual addresses. This eliminates the data copy. The I/O page faulting feature is a requirement for shared virtual memory: because we allow I/O devices to access system memory directly, the page fault service must support them as well. Next, user-level queueing. Before HSA, tasks could only be dispatched to the GPU by the OS, that is, by the GPU driver. With HSA, the GPU is able to see all the user-level queues, so dispatching a job no longer needs to trap into the GPU driver. This design reduces job dispatch latency. Finally, memory-based signaling is also designed to reduce the latency of OS intervention. Prior to HSA, once the GPU finished its task, it issued an interrupt to the CPU and let the CPU notify the user-space program. This path incurs OS intervention overhead. So HSA lets the GPU access a particular memory address to signal that a job has finished. That memory address is assigned by the application when it dispatches the job. Of these four features, memory-based signaling comes for free once the GPU is able to access the process address space. So we actually only need to take care of virtualizing the first three features.
  5. Well, in the following pages I will introduce AMD’s implementation of the HSA features. First, shared virtual memory. AMD implements an IOMMU so that the GPU and other HSA-aware devices can translate virtual addresses to physical addresses. And since the CPU and GPU see the same process address space, the page table used by the IOMMU should be the same one that the CPU’s MMU uses. So by setting the page table properly, the shared virtual memory feature can be achieved.
  6. For I/O page faulting, AMD designed a mechanism called PPR, peripheral page service request. This request is issued by the IOMMU as an interrupt to the CPU once a failure occurs in address translation, for example when the page doesn’t exist or there is insufficient permission to access it. The IOMMU also writes a log containing the faulting process ID and the faulting address. With this information, the Linux API get_user_pages can be used to fix the I/O page fault. Here is a brief flow of the I/O page fault handling.
  7. As for the user-level queueing feature, the key question is how to let the GPU know the addresses of the user-level queues. AMD designed a driver called the kernel fusion driver, or KFD, for this purpose. During user program initialization, the CREATE_QUEUE API sends the address of the user-level queue to the KFD, and the KFD sets this address in the GPU. After this setup, the driver no longer needs to be involved. The driver is only used during initialization; during computation, the GPU and the user program work together directly.
  8. Good? In the previous slides, I described what we need to virtualize. From now on, I will explain how we virtualize these HSA features. You can see the overview on this page, and I will elaborate on the following pages. One thing I need to mention is that we use a shadow page table to virtualize shared virtual memory. I know it may seem strange that a shadow page table (SPT) is adopted rather than nested paging. This is due to a constraint of AMD’s IOMMU, and it’s a little complicated, so I will not describe it in this talk. But you can find the explanation in the proceedings and the paper.
  9. As I described previously, the key to supporting user-level queueing is to let the GPU know the address of the user-level queue. So we implemented VirtIO-KFD, as you can see on the slide. VirtIO-KFD helps the guest application pass the address of its queue through to the real KFD, and the KFD sets it in the GPU. This way, the GPU knows the address of the guest application’s queue.
  10. And then shared virtual memory. As we know, the shadow page table guides the MMU in translating guest virtual addresses to machine physical addresses. So in our work, we just need to find the address of the shadow page table and set it in the IOMMU when a guest application tries to use the GPU.
  11. This is a snapshot of the state while the GPU is executing. The IOMMU maintains a table that maps each process address space ID to the corresponding page table address. In this scenario, there are two processes using the GPU. For native execution, the GPU runs a program dispatched by a host application, and the IOMMU knows where to find the host application’s page table. For guest execution, the GPU runs a program dispatched by a guest application, and this program uses guest virtual addresses. So the IOMMU finds the corresponding SPT to translate the GVA to an MPA. As you would expect, this table can be extended. So in our design, multiple processes from different guest OSes, or even from the host OS, can share the GPU. In effect, our work achieves GPU sharing.
  12. The final one is I/O page faulting. One challenge in virtualizing this feature is that the PPR log region, which stores the page fault information, is inside a special I/O region. Usually, a guest system is not allowed to access this region. So we implemented a module called shadow PPR. This module stores the information about page faults caused by guest GPU programs. Once a PPR occurs, the PPR handler decides whether it was caused by a guest program. If so, it stores the information into shadow PPR. Shadow PPR then kicks KVM, which sends a virtual interrupt into the guest OS. Inside the guest OS, we implemented VirtIO-IOMMU to handle the I/O page fault. It gets the page fault information from shadow PPR and fixes the page fault. So this is how we virtualize I/O page faulting.
  13. This is the whole system architecture: VirtIO-KFD for user-level queueing, the SPT for shared virtual memory, and VirtIO-IOMMU for I/O page faulting.
  14. About the experiment. We use the AMD SDK as our benchmark. We report initialization time and execution time to evaluate our design.
  15. The data is normalized against the native scenario. There is about a 30% performance drop. This drop is mainly caused by the propagation from VirtIO-KFD to the real KFD, since there is world-switch overhead on this path. But usually an application only does this initialization once, so this performance drop is not a great concern.
  16. Now GPU execution time. The major cause of the performance drop in GPU execution time is I/O page fault handling. But as you can see, our design gets a good result, around 95% of native performance in most cases. There are two poor cases, FastWalshTransform (FWT) and BitonicSort (BS), which perform a little worse. To see the reason, let’s look at this figure. It shows the flow of an application dispatching jobs, waiting for the signal, and getting the notification when the GPU finishes the job. It is possible that while the guest application waits for the signal, the CPU switches to another process. So at some particular moment the GPU finishes the job and sends the notification, but at that moment the CPU is owned by another process rather than the guest system, and the application gets the signal late. The red arrows show this delay. Why do only these two benchmarks suffer from it? We can see the raw data here. They are small benchmarks, with only about 10 ms of GPU execution time. For the long benchmarks, this signal delay is amortized. The other reason is that these two benchmarks enqueue many times, so they stay inside this loop and the overhead becomes large. BinarySearch is also a small benchmark, but it only enqueues once, so the overhead is invisible.
  17. To conclude our work: we implemented a hypervisor that lets guest systems get the benefits of HSA as well. Furthermore, we also achieved GPU sharing.