Virtualization is an old concept that provides abstraction and security through CPU execution modes and virtual memory management. Virtualization faces two main challenges: handling privileges and translating virtual memory addresses. Hardware virtualization addresses these by introducing a hypervisor mode and advanced page-table structures, improving performance over older binary-translation methods. Database performance in virtualized environments suffers from the overhead of interrupts, system calls, and memory management, so optimization focuses on reducing these costs through configuration tuning, hardware choices, and I/O isolation strategies.
1. VIRTUALIZED DATABASES?
Approach: mechanics of virtualization
"certain big players" will not be mentioned
Talk is general, mostly about hardware issues which are the same for any platform
2. ME
• Liz van Dijk (@lizztheblizz)
• Working at Sizing Servers Research Lab
• First-timer at FOSDEM!
• Not really a developer, not really a sysadmin, not really a DBA
• I just like knowing how stuff works.
3. SO... VIRTUALIZATION, HUH.
• It’s far too broad a term
• It’s a pretty old concept. (about half a century, actually)
• Its main purposes are abstraction and security
• Making use of the correct CPU execution mode
• Managing Virtual Memory
History!
Broad term, 100 different meanings
Full-system virtualization on mainframes in the '60s
IBM M44/44X: trap and emulate
Recently:
* x86 did not support full virtualization; trap and emulate did not work
* multicore hardware running single-threaded software: inefficient datacenters
Full Virtualization is not the only virtualization
combination of different methods
Who uses RAID?
Who uses Virtual Memory?
2 big issues that all solutions try to work around
Focus on these, the next steps should be more or less logical
Problem 1: a matter of privileges
kernels assume full control over the hardware
how does the hardware deal with this?
a layer-based security system (an onion)
a 2-bit privilege code accompanies code in memory; the CPU verifies it and either executes or refuses the instruction
x86: 4 layers (rings)
code 00: supervisor mode
code 11: user mode
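The privilege check and the trap-and-emulate idea above can be sketched in a few lines of Python. This is a toy model only: the instruction names and the Hypervisor class are invented for illustration, and real CPUs do this check in hardware.

```python
# Toy model of ring-based privilege checking and trap-and-emulate.
# Instruction names and the Hypervisor class are illustrative only.

SUPERVISOR, USER = 0b00, 0b11          # x86 ring 0 and ring 3

PRIVILEGED = {"hlt", "lidt", "mov_cr3"}  # a few privileged instructions

class Hypervisor:
    def __init__(self):
        self.traps = 0

    def emulate(self, instr):
        # On a trap, the hypervisor performs the operation safely
        # on the guest's behalf.
        self.traps += 1
        return f"emulated:{instr}"

def execute(instr, ring, hypervisor):
    """The CPU checks the 2-bit privilege level before running an instruction."""
    if instr in PRIVILEGED and ring != SUPERVISOR:
        # A deprivileged guest kernel tried a privileged op: trap.
        return hypervisor.emulate(instr)
    return f"ran:{instr}"

hv = Hypervisor()
print(execute("add", USER, hv))      # harmless op runs directly
print(execute("mov_cr3", USER, hv))  # privileged op traps to the hypervisor
```

The extra roundtrip through `emulate` is exactly the overhead the notes above describe: each trap costs far more than running the instruction natively.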
6. X86 VIRTUALIZATION
• Binary Translation, aka “faking it”
• Applies ring deprivileging, and translates “bad calls” on the fly
• “Full” Hardware Virtualization
• Introduced Ring -1: Hypervisor mode
• Only intervenes when absolutely necessary
BT: old but awesome, employed by QEMU and Wine
Less relevant now for full virtualization
ring deprivileging: look it up!
Intel/AMD caught up and implemented VT-x and AMD-V
ring -1: hypervisor
Let OSes do whatever they want, but use trap and emulate
extra roundtrip, extra overhead
the CPU has more tasks to perform, and they also take longer
newer CPUs handle this better
9. VIRTUAL MEMORY
[Diagram: the OS maps virtual memory pages 1-12 through a page table to physical frames 0xA-0xH; a TLB between CPU and memory caches recent translations. Pages are managed by software; frames are the actual hardware.]
Problem 2: virtual memory
physical memory consists of 4 KB segments with physical addresses
software sees pages instead
very easy to manage in the OS: all software gets a contiguous block
the page table keeps track of the virtual-to-physical mapping
the TLB caches these mappings and is very fast
but it needs to be flushed on every context switch
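The page table, the TLB, and the flush-on-context-switch behavior can be modeled in a short Python sketch. The page numbers and frame addresses are invented for illustration.

```python
# Minimal model of virtual-to-physical translation with a TLB.
# Page numbers and frame addresses are made up for illustration.

page_table = {1: "0xD", 2: "0xC", 3: "0xF", 4: "0xA", 5: "0xH"}

tlb = {}          # cache of recent translations
tlb_hits = 0
tlb_misses = 0

def translate(page):
    """Look up a virtual page: TLB first, then the full page table."""
    global tlb_hits, tlb_misses
    if page in tlb:           # fast path
        tlb_hits += 1
        return tlb[page]
    tlb_misses += 1           # slow path: walk the page table
    frame = page_table[page]
    tlb[page] = frame         # cache the mapping for next time
    return frame

def context_switch():
    """A classic TLB is flushed on every context switch."""
    tlb.clear()

translate(1); translate(1)    # miss, then hit
context_switch()
translate(1)                  # miss again after the flush
print(tlb_hits, tlb_misses)   # 1 hit, 2 misses
```

The flush is the expensive part: after every context switch, the first access to each page pays the full page-table walk again.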
13. SPT VS HAP
[Diagram: VM A and VM B each maintain their own “read-only” page tables, managed by the VM OS; the hypervisor maintains matching shadow page tables (SPT) or a large VM-aware TLB (HAP) that map guest pages onto physical frames, the actual hardware.]
2 methods:
SPT: the guest's page table is locked (write-protected); updates generate a trap, and the VMM maintains the shadow table used for the real memory access
much slower memory access
EPT/RVI/HAP: make the TLB much bigger and smarter: VM-aware
much more complex to fill up, though, so initial memory accesses are slow
a filled TLB is very fast, though
16. WHAT DOES THIS TEACH US?
• All “kernel” activity is a lot more costly:
• Interrupts
• System Calls (I/O)
• Memory page management
so, three actions are slower under virtualization:
Interrupts: hardware asking for attention
System Calls: software asking for kernel attention
Page Management: memory access
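The cost of crossing into the kernel is easy to observe even without a hypervisor; a quick sketch comparing a trivial system call with a plain user-space call (absolute timings vary per machine, and on some platforms libc may cache `getpid`):

```python
# Compare the cost of entering the kernel with staying in user space.
# Under virtualization, the kernel crossing can additionally exit to
# the hypervisor, widening the gap further.
import os
import timeit

n = 100_000
syscall_time = timeit.timeit(os.getpid, number=n)    # crosses into the kernel
usercall_time = timeit.timeit(lambda: 42, number=n)  # stays in user space

print(f"getpid x{n}: {syscall_time:.4f}s")
print(f"plain call x{n}: {usercall_time:.4f}s")
```

Run natively and then inside a VM, the same measurement gives a feel for how much the virtualization layer adds to each kernel crossing.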
17. IN THE WILD...
• From best to worst case scenario...
• Bare-metal (Xen, KVM, ESX, Hyper-V)
• Host-based (VirtualBox, VMware Workstation, etc.)
• Cloud-based (Amazon, Terremark, etc.)
18. BARE-METAL OPTIONS
• Know your my.cnf inside out
• Use hardware-assisted paging + Large Pages! (InnoDB: large-pages)
• Make use of paravirtualized HW options
• Take care of all your caching levels
• Use DirectIO (innodb_flush_method=O_DIRECT)
small mistakes in a native environment get bigger in a virtual one
memory allocations are expensive
optimize your my.cnf!!!
tools.percona.com good starting point
connection-specific buffers (join-buffer, sort-buffer, etc)
sweet spot = test!!
SWAPPING = EVIL
swappiness
Large Pages
DirectIO
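As an illustration, the options above might look like this in my.cnf. The sizes here are placeholders only; the right values depend entirely on your workload and must be benchmarked (sweet spot = test!).

```ini
[mysqld]
# Cache as much as possible in the buffer pool; size it to your data set
innodb_buffer_pool_size = 4G
# DirectIO: bypass the OS page cache and avoid double caching
innodb_flush_method = O_DIRECT
# Back the buffer pool with large pages (requires OS-level HugePages setup)
large-pages
# Per-connection buffers are allocated per session: keep them small
sort_buffer_size = 256K
join_buffer_size = 256K
```

On the OS side, lowering `vm.swappiness` (e.g. `sysctl vm.swappiness=1`) helps keep the buffer pool from being swapped out; swapping = evil, as noted above.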
21. HARDWARE CHOICES
• Choosing the right CPUs
• Intel 5500/7500 and later types (Nehalem) / all AMD quad-core Opterons (HW-assisted/MMU virtualization)
• Choosing the right NICs (VMDQ)
• Choosing the right storage system (iSCSI vs FC SAN)
the CPUs listed here support both HW-assist and HAP
VMDQ: Virtual Machine Device Queueing
22. HOST-BASED
• All of the above, if possible :)
• IO becomes the bigger issue on standard client hardware
• Focus on moving database IO away from the same disk you run the host and guest OS on.
• Consider installing an SSD :)
Keep in mind all of the previous points
IO is a bigger issue
two OSes + a DB running on the same disk is always a problem
use a separate disk, maybe an iSCSI LUN?
buy an SSD!
23. CLOUD-BASED
• No control whatsoever over host-system :(
• Sometimes unreliable IO
• Change strategy! Design for easy sharding and replication!
• Caching caching caching!
• Consider RDS to reduce operational overhead?
You can't escape the hurt
unreliable disk IO
CACHING
sharding/replication to spread write/read load
very write-heavy workloads may be more trouble than they're worth
asynchronous writes? not very durable
use RDS to cut back operational cost
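The "design for easy sharding" advice boils down to routing every record by a stable key, so any node can compute the route locally. A minimal sketch; the shard names and the choice of MD5 are assumptions for illustration, not a recommendation of a specific scheme:

```python
# Route records to shards by hashing a stable key, so reads and writes
# spread across instances and every node computes the same route.
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]  # hypothetical names

def shard_for(key):
    """Deterministically pick a shard for a given key."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always routes to the same shard:
print(shard_for("user:42"))
print(shard_for("user:42") == shard_for("user:42"))  # True
```

Note that plain modulo hashing reshuffles most keys when the shard count changes; consistent hashing is the usual refinement when shards are added or removed often.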