A day in the life of a VSAN I/O - STO7875
1. A day in the life of a VSAN I/O
Duncan Epping (@DuncanYB)
John Nicholson (@lost_signal)
Diving into the I/O flow of Virtual SAN
VMworld session: #SDDC7875
2. Agenda
1 Introduction (Duncan)
2 Virtual SAN, what is it? (Duncan)
3 Virtual SAN, a bit of a deeper dive (Duncan)
4 What about failures? (John)
5 IO Deep Dive (John)
6 Wrapping up (John)
3. The Software Defined Data Center
Compute Networking Storage
Management
• All infrastructure services virtualized:
compute, networking, storage
• Underlying hardware abstracted,
resources are pooled
• Control of data center automated by
software (management, security)
• Virtual Machines are first class citizens
of the SDDC
• Today’s session will focus on one
aspect of the SDDC - storage
10. Virtual SAN, what is it?
Hyper-Converged Infrastructure
Distributed, Scale-out Architecture
Integrated with vSphere platform
Ready for today’s vSphere use cases
Software-Defined Storage
vSphere & Virtual SAN
11. But what does that really mean?
VSAN network
Generic x86 hardware
VMware vSphere & Virtual SAN
Integrated with your hypervisor
Leveraging local storage resources
Exposing a single shared datastore
Virtual SAN
12. VSAN is the Most Widely Adopted HCI Product
Simplicity is key, on an oil
platform there are no
virtualization, storage or network
admins. The infrastructure is
managed over a satellite link via
a centralized vCenter Server.
Reliability, availability and
predictability are key.
13. Virtual SAN Use Cases
VMware vSphere + Virtual SAN
End User Computing
Test/Dev
ROBO
Staging
Management
DMZ
Business Critical Apps
DR / DA
14. Tiered Hybrid vs All-Flash
Hybrid
• 40K IOPS per Host
• Caching: flash device serves as read and write cache
• Capacity Tier: SAS / NL-SAS / SATA magnetic disks
All-Flash
• 100K IOPS per Host + sub-millisecond latency
• Caching: writes cached first, reads go directly to capacity tier
• Capacity Tier: flash devices (SSD / PCIe / NVMe)
15. Flash Devices
All writes and the vast majority of reads are served by flash storage
1. Write-back Buffer (30%) (or 100% in all-flash)
– Writes acknowledged as soon as they are persisted on flash (on all replicas)
2. Read Cache (70%)
– Active data set always in flash; hot data replaces cold data
– Cache miss – read data from HDD and put in cache
A performance tier tuned for virtualized workloads
– High IOPS, low $/IOPS
– Low, predictable latency
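The 30%/70% split above lends itself to a quick sizing sketch. This is a hypothetical helper, not a VSAN API; the ratios come from the slide, and the 600 GB example device is an assumption:

```python
def hybrid_cache_split(cache_device_gb):
    """Split a hybrid VSAN cache device into write buffer and read cache.

    Hybrid clusters use 30% of the flash device as write-back buffer
    and 70% as read cache; all-flash uses 100% for writes.
    """
    write_buffer = cache_device_gb * 0.30
    read_cache = cache_device_gb * 0.70
    return write_buffer, read_cache

# Example: a 600 GB SSD yields a 180 GB write buffer and a 420 GB read cache.
wb, rc = hybrid_cache_split(600)
print(wb, rc)
```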
17. Virtual Machine as a set of Objects on VSAN
• VM Home Namespace
• VM Swap Object
• Virtual Disk (VMDK) Object
• Snapshot (delta) Object
• Snapshot (delta) Memory Object
VM Home
VM Swap
VMDK
Snap delta
Snap memory
Snapshot
18. Define a policy first…
Virtual SAN currently surfaces multiple storage capabilities to vCenter Server
Determines
layout of
components!
19. Virtual SAN Objects and Components
VSAN is an object store!
• Object Tree with Branches
• Each Object has multiple Components
– This allows you to meet availability and
performance requirements
• Here is one example of “Distributed RAID” using
2 techniques:
– Striping (RAID-0)
– Mirroring (RAID-1)
• Data is distributed based on VM Storage Policy
[Diagram: a VMDK object built as RAID-1 across ESXi hosts; two mirror copies, each a RAID-0 of stripe components (stripe-1a/1b, stripe-2a/2b), plus a witness on a third host]
20. Number of Failures to Tolerate (How many copies of your data?)
• Defines the number of hosts, disk or network failures a storage object can tolerate.
• RAID-1 Mirroring used when Failure Tolerance Method set to Performance (default).
• For “n” failures tolerated, “n+1” copies of the object are created and “2n+1” hosts contributing
storage are required!
esxi-01 esxi-02 esxi-03 esxi-04
Virtual SAN Policy: “Number of failures to tolerate = 1”
vmdk
~50% of I/O
vmdk witness
~50% of I/O
RAID-1
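The n+1 / 2n+1 arithmetic above can be sketched as a small helper. The function name is hypothetical, and treating the extra components purely as witnesses is a simplification of the real quorum rules:

```python
def ftt_requirements(failures_to_tolerate):
    """Return (replicas, witnesses, minimum hosts) for RAID-1 mirroring.

    For "n" failures to tolerate, VSAN creates n+1 copies of the object
    and needs 2n+1 hosts contributing storage; in this simple model the
    remaining components are witnesses used to maintain quorum.
    """
    n = failures_to_tolerate
    replicas = n + 1
    min_hosts = 2 * n + 1
    witnesses = min_hosts - replicas
    return replicas, witnesses, min_hosts

print(ftt_requirements(1))  # (2, 1, 3): two mirror copies, one witness, three hosts
```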
21. Number of Disk Stripes Per Object (on how many devices?)
• Number of disk stripes per object
– The number of HDDs across which each replica of a storage object is distributed. Higher values
may result in better performance.
esxi-01 esxi-02 esxi-03 esxi-04
Virtual SAN Policy: “Number of failures to tolerate = 1”
vmdk vmdk witness
RAID-1
vmdk
vmdk
RAID-0 RAID-0
22. Fault Domains, increasing availability through rack awareness
• Create fault domains to increase availability
• 8 node cluster with 4 defined fault domains (2 nodes in each)
FD1 = esxi-01, esxi-02 FD3 = esxi-05, esxi-06
FD2 = esxi-03, esxi-04 FD4 = esxi-07, esxi-08
• To protect against the failure of one rack, only 2 replicas plus a witness are required, spread across 3 fault domains!
[Diagram: eight hosts (esxi-01 through esxi-08) in four fault domains FD1 to FD4, two hosts each; the two vmdk replicas and the witness are placed in three different fault domains, RAID-1]
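As a rough illustration of the placement rule, here is a sketch that spreads replicas and a witness over distinct fault domains. The function is hypothetical; real VSAN placement also weighs capacity and load, which this ignores:

```python
def place_components(fault_domains, ftt=1):
    """Assign replicas and witnesses to distinct fault domains.

    For FTT=1 this spreads 2 replicas + 1 witness over 3 fault domains,
    so losing any single rack still leaves a majority of components.
    """
    replicas = ftt + 1
    needed = 2 * ftt + 1
    if len(fault_domains) < needed:
        raise ValueError("need at least %d fault domains" % needed)
    placement = {}
    for i, fd in enumerate(fault_domains[:needed]):
        placement[fd] = "replica" if i < replicas else "witness"
    return placement

print(place_components(["FD1", "FD2", "FD3", "FD4"]))
# {'FD1': 'replica', 'FD2': 'replica', 'FD3': 'witness'}
```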
24. VSAN 1 host isolated – HA restart
• HA detects an isolation
– ESXi-01 cannot ping master
– Master receives no pings
– ESXi-01 cannot ping Gateway
– Isolation declared!
• HA kills VM on ESXi-01
– Note that the Isolation Response
needs to be configured!
– Shutdown / Power Off / Disabled
• VM can now be restarted on any of
the remaining hosts
Isolated!
esxi-01 esxi-03 esxi-05 esxi-07
vmdk vmdk witness
RAID-1
HA restart
25. VSAN 2 hosts partitioned – HA restart
• This is not an isolation, but rather a
partition
• ESXi-01 can ping ESXi-02
• ESXi-01 cannot ping the rest of the
cluster
• VSAN kills VM on ESXi-01
– It does this as all components are inaccessible
– AutoTerminateGhostVm
• HA detects that VM is missing
• HA sees that no host is accessing the components
• HA restarts the VM!
Partitioned
esxi-01 esxi-03 esxi-05 esxi-07
vmdk vmdk witness
RAID-1
HA restart
esxi-02 esxi-04 esxi-06 esxi-08
FD1 FD2 FD3 FD4
26. VSAN 4 hosts partitioned – HA restart
• Double partition scenario!
• Again, VSAN kills VM on ESXi-01
– AutoTerminateGhostVm
• HA detects that VM is missing
• HA sees that no host is accessing the components
• HA restarts the VM in either FD2 or
FD3!
– They have majority
Partitioned
esxi-01 esxi-03 esxi-05 esxi-07
vmdk vmdk witness
RAID-1
HA restart
esxi-02 esxi-04 esxi-06 esxi-08
FD1 FD2 FD3 FD4
Partitioned
27. VSAN 4 hosts partitioned – HA restart
• Double partition scenario!
• Note that VM remains running in FD1
• VM runs headless, cannot write to
disk!
• HA sees that access to storage is
lost
• HA restarts the VM in either FD3 or
FD4!
– They have majority
• As soon as partition is lifted VM is
killed in FD1 as it lost its lock!
Partitioned
esxi-01 esxi-03 esxi-05 esxi-07
vmdk vmdk witness
RAID-1
HA restart
esxi-02 esxi-04 esxi-06 esxi-08
FD1 FD2 FD3 FD4
Partitioned
29. VSAN IO flow – Write Acknowledgement
• VSAN mirrors write IOs to all active
mirrors
• These are acknowledged when they
hit the write buffer!
• The write buffer is flash based,
persistent to avoid data loss
• Writes will be de-staged to the
capacity tier
– VSAN takes locality into account
when destaging to spindles
– Optimizes IO pattern
esxi-01 esxi-02 esxi-03 esxi-04
vmdk vmdk witness
RAID-1
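The acknowledgement rule above can be modeled with a toy in-memory sketch. All class and function names are assumptions, not VSAN internals: the ACK is returned once every active mirror has the write in its flash-backed buffer, and destaging to the capacity tier happens later.

```python
class Replica:
    """Toy replica: a persistent flash write buffer and a capacity tier."""
    def __init__(self):
        self.write_buffer = {}   # flash-backed, survives power loss
        self.capacity = {}       # HDD or capacity flash

    def buffer_write(self, block, data):
        self.write_buffer[block] = data  # ack point: data is on flash

    def destage(self):
        # Real destaging would order writes by disk offset to optimize the
        # IO pattern for spindles; this sketch just moves everything down.
        self.capacity.update(self.write_buffer)
        self.write_buffer.clear()

def mirrored_write(replicas, block, data):
    """Acknowledge only after the write hit the buffer on ALL active mirrors."""
    for r in replicas:
        r.buffer_write(block, data)
    return "ACK"

mirrors = [Replica(), Replica()]
print(mirrored_write(mirrors, 0, b"payload"))  # ACK returned before destaging
```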
30. vSphere & Virtual SAN
Anatomy of a Hybrid Read
1. Guest OS issues a read on virtual disk
2. Owner chooses replica to read from
• Load balance across replicas
• Not necessarily local replica (if one)
• A block always reads from same replica
3. At chosen replica (esxi-03): read data from flash
Read Cache or client cache, if present
4. Otherwise, read from HDD and place data in flash
Read Cache
• Replace ‘cold’ data
5. Return data to owner
6. Complete read and return data to VM
[Diagram: read-path steps 1 through 6 across esxi-01, esxi-02 and esxi-03, with the two vmdk replicas]
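The "a block always reads from the same replica" rule can be illustrated with a deterministic mapping. The 1 MB region size comes from the speaker notes on spatial locality; the modulo scheme itself is an illustrative assumption, since VSAN's real mapping is internal:

```python
BLOCK_REGION = 1 << 20  # 1 MB region, per the spatial-locality notes

def replica_for_offset(offset, num_replicas):
    """Deterministically map a read offset to one replica.

    A given block is always read from the same replica, so each block is
    cached on exactly one host; different 1 MB regions can land on
    different replicas, which spreads the read load.
    """
    region = offset // BLOCK_REGION
    return region % num_replicas

# The same offset always picks the same replica...
assert replica_for_offset(4096, 2) == replica_for_offset(4096, 2)
# ...while a 32 MB sequential read is served by both replicas.
served = {replica_for_offset(off, 2)
          for off in range(0, 32 * BLOCK_REGION, BLOCK_REGION)}
print(served)  # {0, 1}
```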
31. vSphere & Virtual SAN
Anatomy of an All-Flash Read
1. Guest OS issues a read on virtual disk
2. Owner chooses replica to read from
– Load balance across replicas
– Not necessarily local replica (if one)
– A block always read from same replica
3. At chosen replica (esxi-03): read data from
(write) Flash Cache or client cache, if present
4. Otherwise, read from capacity flash device
5. Return data to owner
6. Complete read and return data to VM
[Diagram: read-path steps 1 through 6 across esxi-01, esxi-02 and esxi-03, with the two vmdk replicas]
32. Client Cache
[Diagram: vmdk replicas on esxi-01 and esxi-02 plus a witness]
• Always Local
• Up to 1GB of memory per Host
• Memory Latency < Network Latency
• Horizon 7 Testing - 75% fewer Read
IOPS, 25% better latency.
• Complements CBRC
• Enabled by default in 6.2
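The client cache behaves like a small per-host read cache capped in size. A minimal sketch follows; the class name is hypothetical and simple LRU eviction is an assumed approximation of the real policy:

```python
from collections import OrderedDict

class ClientCache:
    """Tiny LRU sketch of the in-memory client cache (capped, e.g. 1 GB per host)."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = OrderedDict()  # block id -> data, coldest first

    def get(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)  # refresh recency on hit
            return self.blocks[block]
        return None  # miss: caller falls through to flash cache / network

    def put(self, block, data):
        if block in self.blocks:
            self.used -= len(self.blocks.pop(block))
        while self.blocks and self.used + len(data) > self.capacity:
            _, evicted = self.blocks.popitem(last=False)  # evict cold data
            self.used -= len(evicted)
        self.blocks[block] = data
        self.used += len(data)

cache = ClientCache(capacity_bytes=8)
cache.put("a", b"1234")
cache.put("b", b"5678")
cache.put("c", b"9999")   # evicts "a", the coldest block
print(cache.get("a"))     # None
print(cache.get("c"))     # b'9999'
```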
33. vSphere & Virtual SAN
Anatomy of Checksum
1. Guest OS issues a write on virtual disk
2. Host generates Checksum before it leaves host
3. Transferred over network
4. Checksum verified on host where it will write to disk.
5. ACK is returned to the virtual machine
6. On read, the checksum is verified by the host running
the VM. If any component fails verification it is repaired
from the other copy or parity.
7. Scrubs of cold data are performed
[Diagram: checksum steps 1 through 7 across esxi-01, esxi-02 and esxi-03, with the two vmdk replicas]
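The generate-early / verify-late idea can be sketched as follows, using CRC32 as the speaker notes mention. The function names are hypothetical:

```python
import zlib

def write_block(data):
    """Generate the checksum on the host where the VM runs,
    before the data leaves that host."""
    return data, zlib.crc32(data)

def read_block(stored):
    """Verify as late as possible, on read; on mismatch the component
    would be repaired from the other mirror copy or from parity."""
    data, checksum = stored
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch: repair from replica/parity")
    return data

stored = write_block(b"guest OS write")
assert read_block(stored) == b"guest OS write"

# Simulate at-rest corruption: verification now fails.
corrupted = (b"guest OS wrIte", stored[1])
try:
    read_block(corrupted)
except IOError as e:
    print(e)
```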
34. Deduplication and Compression for Space Efficiency
• Deduplication and compression at the disk group level
– Enabled on a cluster level
– Fixed block length deduplication (4KB Blocks)
• Compression after deduplication
– LZ4 is used, low CPU!
– Single feature, no schedules required!
– File System stripes all IO across disk group
Beta
esxi-01 esxi-02 esxi-03
vmdk vmdk
vSphere & Virtual SAN
vmdk
All-Flash Only
35. Deduplication and Compression Disk Group Stripes
• Deduplication and compression
at the disk group level
– Data stripes across the disk group
• Fault domain isolated to disk group
– Fault of device leads to rebuild of
disk group
– Stripes reduce hotspots
– Endurance/Throughput Impact
Beta
36. Costs of Deduplication (nothing is free)
• CPU overhead
• Metadata and Memory overhead
– Overhead for Metadata?
• IO Overhead (metadata lookup)
• IO Overhead (Data movement from WB)
• IO Overhead (Fragmentation)
• Endurance Overhead
37. Costs of Compression (nothing is free)
• CPU overhead
• Capacity overhead
• Memory overhead
• IO overhead
38. Deduplication and Compression (I/O Path)
• Avoids inline and post-process downsides
• Performed at disk group level
• 4KB fixed block
• LZ4 compression after deduplication
All-Flash Only
1. VM issues write
2. Write acknowledged by cache
3. Cold data to memory
4. Deduplication
5. Compression
6. Data written to capacity
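Steps 3 to 6 above can be sketched as a destage-time pipeline. This is illustrative only: zlib stands in for LZ4, the SHA-1 hashing and the 2 KB compression cut-off come from the speaker notes, and all names are assumptions:

```python
import hashlib
import zlib

BLOCK = 4096            # fixed 4 KB deduplication block
COMPRESS_LIMIT = 2048   # store compressed only if the result is <= 2 KB

def destage(blocks, store):
    """Sketch of the destage-time pipeline: dedupe first, then compress.

    `store` maps SHA-1 digest -> (bytes, compressed?) and models one disk
    group's capacity tier; a block seen before is not stored again.
    """
    for block in blocks:
        assert len(block) == BLOCK
        digest = hashlib.sha1(block).digest()   # per-block fingerprint
        if digest in store:
            continue                            # duplicate: metadata only
        packed = zlib.compress(block)           # LZ4 stand-in
        if len(packed) <= COMPRESS_LIMIT:
            store[digest] = (packed, True)
        else:
            store[digest] = (block, False)      # poor ratio: keep uncompressed

store = {}
zeros = bytes(BLOCK)                 # compresses extremely well
destage([zeros, zeros, zeros], store)
print(len(store))                    # 1: the two duplicates were deduplicated
```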
39. RAID 5/6
• All Flash enabled RAID 5 and RAID 6.
• SPBM Policy – Set per Object
esxi-01 esxi-02 esxi-03 esxi-04
Virtual SAN Policy: “Number of failures to tolerate = 1”
vmdk vmdk vmdk
RAID-5
vmdk
All-Flash Only
40. RAID-5 Inline Erasure Coding
• When Number of Failures to Tolerate = 1 and Failure Tolerance Method = Capacity, RAID-5 is used
– 3+1 (4 host minimum)
– 1.33x instead of 2x overhead
• 20GB disk consumes 40GB with RAID-1, now consumes ~27GB with RAID-5
[Diagram: RAID-5 (3+1) across four ESXi hosts; each host holds three data components and one parity component, with parity rotated across the hosts]
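The capacity arithmetic behind both erasure-coded layouts is simple enough to verify. These helper functions are hypothetical:

```python
def erasure_overhead(data_blocks, parity_blocks):
    """Capacity multiplier for an erasure-coded layout."""
    return (data_blocks + parity_blocks) / data_blocks

def consumed_gb(vmdk_gb, data_blocks, parity_blocks):
    """Raw capacity consumed by a vmdk under a given data+parity scheme."""
    return vmdk_gb * erasure_overhead(data_blocks, parity_blocks)

# RAID-1 with FTT=1 stores 2 full copies; RAID-5 (3+1) needs only 1.33x.
print(consumed_gb(20, 3, 1))   # ~26.7 GB instead of 40 GB
# RAID-6 (4+2) needs 1.5x instead of the 3x of RAID-1 with FTT=2.
print(consumed_gb(20, 4, 2))   # 30.0 GB instead of 60 GB
```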
41. RAID-6 Inline Erasure Coding
• When Number of Failures to Tolerate = 2 and Failure Tolerance Method = Capacity, RAID-6 is used
– 4+2 (6 host minimum)
– 1.5x instead of 3x overhead
• 20GB disk consumes 60GB with RAID-1, now consumes ~30GB with RAID-6
All-Flash Only
[Diagram: RAID-6 (4+2) across six ESXi hosts; four data and two parity components per stripe, with parity distributed across the hosts]
42. Swap Placement?
Sparse Swap
• Reclaim space used by the memory swap object
• A host advanced option enables this setting
• How to check it?
esxcfg-advcfg -g /VSAN/SwapThickProvisionDisabled
https://github.com/jasemccarty/SparseSwap
43. Snapshots for VSAN
• Not using VMFS Redo Logs
• Writes allocated into 4MB allocations
• Snapshot metadata cache (avoids read amplification)
• Performs pre-fetch of the metadata cache
• Maximum of 31 snapshots
45. Three Ways to Get Started with Virtual SAN Today
1 Online Hands-on Lab
• Test-drive Virtual SAN right from your browser with an instant Hands-on Lab
• Register and your free, self-paced lab is up and running in minutes
2 Download Evaluation
• 60-day free Virtual SAN Evaluation
• VMUG members get a 6-month EVAL or 1-year EVALExperience for $200
• vmware.com/go/try-vsan-en
3 VSAN Assessment
• Reach out to your VMware Partner, SEs or Rep for a FREE VSAN Assessment
• Results in just 1 week!
• The VSAN Assessment tool collects and analyzes data from your vSphere storage environment and provides technical and business recommendations
Learn more: vmware.com/go/virtual-san
• Virtual SAN Product Overview Video
• Virtual SAN Datasheet
• Virtual SAN Customer References
• Virtual SAN Assessment
• VMware Storage Blog
• @vmwarevsan
Editor's notes
The Software Defined Data Center
In SDDC, all three core infrastructure components, compute, storage and networking are virtualized.
Virtualization software abstracts underlying hardware, while pooling compute, network and storage resources to deliver better utilization, faster provisioning and simpler operations.
The VM becomes the centerpiece of the operational model, providing automation and agility to repurpose infrastructure according to business needs.
Today we will focus on Storage, which has been growing at an extremely rapid pace and is a fast changing aspect of the datacenter!
When it comes to “software defined storage” people typically talk about “software”. Hardware is an often overlooked factor, but it is fair to say that the hardware evolution started this new infrastructure revolution. The changes we are seeing today, with hyper-converged, new all-flash storage systems or even hybrid systems, are mostly made possible by hardware evolving. Where in the past we needed dozens of spindles to get 1000s of IOPS, we now only need 1 flash device to provide 10s if not 100s of thousands of IOPS, and not just extremely high IOPS but also ultra-low latency.
SSDs, PCIe based flash, NVMe and NVDIMM deliver much more than just performance. They allowed storage and infrastructure vendors to revolutionize their platforms by removing a lot of the complexity needed to provide the availability and performance required by virtualized workloads. RAID groups, disk groups, wide striping… none of them are needed any longer to give your business critical applications what they need.
What we are trying to achieve is to simplify datacenter operations, and our primary focus will be storage and availability. Storage, as we all know, has traditionally been a pain point in many data centers: high cost, and usually not the performance and scalability one would want. By offering our customers choice we aim to change the world of IT and start a new revolution. But we cannot do this by ourselves; we need the help of you, the consultant / admin / architect.
vSphere is perfectly positioned for this as it abstracts physical resources and can provide them as a shared pooled construct to the administrator.
Because it sits directly in the I/O path, the hypervisor (through the notion of policies associated with virtual machines) has the unique ability to make optimal decisions around matching the demands of virtualized applications with the supply of underlying physical infrastructure.
On top of that the platform provides you the ability to assign service level agreements to workloads which will reduce the operational complexity and as such significantly reduces the chances of making mistakes.
This is where it all starts; without Storage Policy Based Management many of the products and features we are about to talk about would not be possible! If there is one thing you need to remember when you walk away today, it is Storage Policy Based Management. It is the key enabler for Software Defined Storage and Availability!
Storage Policy Based Management is composed of the following:
Common Policy framework Across Virtual Volumes, Virtual SAN and VMFS-based Storage
Common API Layer for Cloud Management Frameworks (vRealize Automation, OpenStack), Scripting users (PowerShell, JavaScript, Python, etc.) and Orchestration Platforms (vCO)
Represents Application and VM Level Requirements
Consumes Capabilities Published via VASA
SPBM provides the following benefits for customers:
Stable, Robust Automation Platform
Intelligent placement and fine control of services at the VM level
Shields Automation and Orchestration Platforms from infrastructure changes by abstracting the Underlying Storage Implementation
What is VSAN in a nutshell…
So, it follows a hyper-converged architecture for easy, streamlined management and scaling of both compute and storage. Hyper-converged represents a system architecture – one where compute and persistence are co-located. This system architecture is enabled by software.
It is a SDS product. A layer of software that runs on every ESXi host. It aggregates the local storage devices on ESX hosts (SSD and magnetic disks) and makes them look like a single pool of shared storage across all the hosts.
VSAN has a distributed architecture with no single point of failure.
VSAN goes a step further than other HCI products – VMware owns the most popular hypervisor in the industry. Strong integration of VSAN in the hypervisor means that we can optimize the data path and we ensure optimal resource scheduling (compute, network, storage) according to the needs of each application. At the end, better resource utilization means better consolidation ratios, more bang for your buck! Resource utilization is one part of the story. The other part is the Operational aspects of the product.
VSAN has been designed as a storage product to be used primarily by vSphere admins. So, we put a lot of effort in packaging the product in a way that is ideal for today’s use cases of virtualized environments. Specifically, the VSAN configuration and management workflows have been designed as extensions of the existing host and cluster management features of vSphere. That means easy, intuitive operational experience for vSphere admins. It also means native integration with key vSphere features unlike any other storage product out there, HCI or not.
VSAN is widely adopted, with over 3000 customers since launch and some very interesting use cases ranging from oil platforms to trains, and now being planned for deployment on submarines and mobile deployment units out in the field.
The Oil Platform scenario is a “robo” deployment managed through a central vCenter Server leveraging a satellite connection.
As for the submarines and mobile deployment units, I can’t reveal who this is, but it is very real. Dual datacenter setups in a ship are not uncommon and Virtual SAN is a natural fit here.
We were very conservative when we initially launched VSAN – after all, this was customers data we were talking about.
However, even though we were conservative, our customers were not.
There are plenty of other use cases. The ones listed on the slide are the most commonly used. It is fair to say that Virtual SAN fits in most scenarios:
Of course customers started with the test/dev workloads, just like they did when virtualization was first introduced
Business Critical Apps – We have customers running Exchange / SQL / SAP and billing systems on Virtual SAN
Virtual SAN is included in the Horizon Suite Advanced and Enterprise, so VDI/EUC is a natural fit.
As a DR destination VSAN is also commonly used as you can scale out and the cost is relatively low compared to a traditional storage system
Isolation workloads also something that VSAN is often used for, both DMZ and Management clusters fit this bill
Of course there is also ROBO. VSAN can start small and grow when desired, both scale-out and scale-up, and with 6.1 we even made things better by introducing a 2-node configuration, but we will get back to that!
Virtual SAN enables both hybrid and all-flash architectures.
Irrespective of the architecture, there is a flash-based caching tier which can be configured out of flash devices like SSDs, PCIe cards, Ultra DIMMs etc. The flash caching tier acts as the read cache/write buffer that dramatically improves the performance of storage operations.
In the hybrid architecture, server-attached magnetic disks are pooled to create a distributed shared datastore that persists the data. In this type of architecture, you can get up to 40K IOPS per server host.
In the all-flash architecture, the flash-based caching tier is intelligently used as a write buffer only, while another set of SSDs forms the persistence tier to store data. Since this architecture utilizes only flash devices, it delivers extremely high IOPS of up to 90K per host, with predictable low latencies.
Each flash device is configured with two partitions: 30% used as write-back buffer and 70% as read cache. All writes go to the flash device and over time will be destaged to disk. When it comes to reads, the active data set of the aggregate workload reaching the disk group lives in the cache. With a 1:10 ratio of flash to HDD, and for all realistic workloads, more than 90% of reads are served from the flash. With the majority of our customers this percentage is even higher; 98 / 99% is not uncommon.
As you realize virtually all IOs are served from the flash and that can be achieved with a modest flash capacity, because in most practical cases the active data set is a fraction of the total stored volume of data!
Objects are divided and distributed into components based on policies. Components and policies will be covered shortly. VMs are no longer based on a set of files, like we have on traditional storage.
First thing you do before you deploy a VM is define a policy. VSAN has what-if APIs, so it will show what the “result” would be of having such a policy applied to a VM of a certain size. Very useful, as it gives you an idea of what the “cost” is of certain attributes.
Also note that a number of new capabilities were introduced in VSAN 6.2; these will be discussed in more detail later on.
RAID-0 and RAID-1 were the only distributed RAID options up to and including version 6.1.
New techniques introduced in VSAN 6.2 will be discussed shortly.
RAID-5/6 used when Fault Tolerance Method set to Capacity
Note that in order to protect against a rack failure the minimum required number of failure domains is 3, this is similar to protecting against a host failure using FTT=1 where the minimum number of hosts is 3.
It should be noted that in an all-flash environment the cache tier is 100% devoted to writes; in a hybrid it is only 30%.
Stress the point here that when data is read from the replica it is placed in the flash cache. We also use “spatial data locality”, which means that we will place a full 1MB block in cache instead of just the 4KB or 8KB which is requested, as it is very likely that blocks from the same region will be read. Also note that reads for a given block will always be served from the SAME host. This is to ensure we only have 1 copy in cache and we always read from the same host for that block, to optimize for performance and cost!
To be clear, a 4KB read will come from 1 host, per 1MB block a single host will serve the read. If you read 32MB in total then this can come from different hosts, depending on who serves those specific blocks.
More in-depth about spatial and temporal locality: https://www.vmware.com/files/pdf/products/vsan/VMware-Virtual-SAN-Data-Locality.pdf
Now that we saw how writes work, let’s take a look at reads in an all-flash.
Same example here: a VM with a vmdk that has two replicas on H1 and H2.
Stress: Major difference – read cache misses do not cause any serious performance degradation. Reads from flash capacity device should be almost as quick as reading from capacity. Another major difference is that there is no need to move the block from capacity layer to cache layer, which is what we do in hybrid configurations.
Client cache is a memory cache that uses 0.4% of host memory, up to 1GB of RAM. Note this is per host, not per virtual machine.
In Horizon 7 testing it reduced read IOPS by 75% and improved latency by 25%.
It will follow the virtual machine and stay local because the overhead to re-sync it is low (it’s small) and memory latency is lower than network latency (unlike disk latency).
Extends CBRC caching benefits to:
Linked Clones
App volumes
Non-replica components
VSAN checksums are performed in such a way as to identify and repair corruption of data at rest, as well as in flight. The checksum is generated as soon as possible, and verified as late as possible, to remove any opportunities for corruption.
A host side checksum cache assists with reducing IO amplification
CRC32 is used as modern processors have offload functions to reduce CPU overhead.
Scrubs of non-read data are performed yearly (But this is adjustable).
Note this requires VSAN FSv3 or higher (Introduced in 6.2, and supports hybrid)
All Flash Only.
“High level description”
Dedupe and compression happen during destaging from the caching tier to the capacity tier. You enable it on a cluster level, and deduplication/compression happens on a per disk group basis. Bigger disk groups will result in a higher deduplication ratio. After the blocks are deduplicated they will be compressed. A significant saving already; combined with deduplication the results achieved can be up to 7x space reduction, of course fully dependent on the workload and type of VMs.
“Lower level description”
Compression (LZ4) would be performed during destaging from the caching tier to the capacity tier. 4KB is the block size for deduplication. For each unique 4k block compression would be performed and if the output block size is less than or equal to 2KB, a compressed block would be saved in place of the 4K block. If the output block size is greater than 2KB, the block would be written uncompressed and tracked as such. The reason is to avoid block alignment issues, as well as reduce the CPU hit for decompressing the data which is greater than compression for data with low compression ratios. All of this data reduction is after the write acknowledgement.
Deduplication domains are within each disk group. This avoids needing a global lookup table (significant resource overhead), and allows us to put those resources towards tracking a smaller and more meaningful block size. We purposefully avoid dedupe of “write hot data” In the cache, or decompressing uncompressible data significant CPU/memory resources can avoid being wasted.
Note: Feature is supported with stretch clusters, ROBO edition
So historically in VSAN you’ve known exactly where data is.
In 6.2 with deduplication and compression, you know which disk group, but you do lose some granularity, as you cannot control where a deduplication hash will “match”.
CPU overhead is low. SHA-1 is practically free as Intel has really fast offload.
Memory overhead is a big deal with deduplication. Most vendors get stuck with a bad choice.
Aim for memory efficiency and use a REALLY large block size (8KB, 16KB Fixed, or worse and switch to variable block which has all kinds of terrible secondary impacts)
Downsides are obvious: you get awful dedupe. It’s a feature checkbox, and when you read their guidance it’s weird unicorn cases they recommend it for (full-clone VDI, when the moon is red).
Still burns CPU and Memory with little benefit
Aim for capacity efficiency and use 4KB Fixed block and stick everything in memory. This has some nasty downsides and it hasn’t really been done before in HCI but we see these problems in scale out and flash storage arrays.
You can’t do HCI. Seriously the one vendor who comes to mind who does 4KB Fixed block dedupe in scale out spends all their memory on lookup tables
You get stuck with small SSDs as memory MUST scale linearly. One AFA vendor even forced customers to evacuate and do a full data-destructive swing migration to move to 8KB because they couldn’t scale memory and CPU to handle it!
Memory Overhead is low as this data already had to be in memory for the destage.
Also we don’t run an in memory database for hash lookups (just a smart compressed cache)
IO overheads
Metadata is stored on disk, but we do deploy memory caches, and a DRAM cache (Client Cache) to keep the need for “double reads” to a minimum.
This is not post-process, so we don’t have the overhead of having to read or write data that was not already “in flight” aging out of the write buffer.
This is not a post process deduplication so we don’t have issues of “holes in the layout” and fragmentation as the dedupe engine crawls the on disk data and finds duplicates.
We avoid a LOT of writes to the capacity tier, and this extends the lifespan of the capacity tier drives enabling us to use lower cost TLC drives.
CPU overhead is very low
LZ4 is insanely fast and optimized for CPU utilization.
We don’t compress duplicate data
We don’t store compressed if it compresses poorly. This leads to not wasting CPU on re-hydration of blocks that only compressed from 4K to 3K.
Capacity overhead is great as LZ4 is a very efficient compression system.
Memory overhead is low, as this is data that was already in memory (for de-stage and dedupe)
IO overhead is low as we use a fixed block (no fragmentation, defrag process)
“Near-line”, not in-line - Avoids performance hit
Performed at disk group level - Balances efficiency and utilization
4KB fixed block - More granular than many competitors
LZ4 compression after deduplication - Only if the result is 2KB or less
RAID 5/6 happens in the write buffer tier and migrates down to the capacity layer.
So you will have a read of the copy of the data and the parity, a calculation made, and an update of both.
This read-then-write adds marginal latency to the IO path and IOPS, but modern 10Gbps networks with very low port-to-port latency, as well as incredibly fast flash devices (and NVMe devices that can run parallel IO), make this trade-off acceptable for many workloads.
You can set this per object, so an OLAP database could have RAID-5 for the data volume and RAID-1 for the log that is write heavy.
Sometimes RAID 5 and RAID 6 over the network is also referred as erasure coding. This is done inline; there is no post-processing required.
Since VMware has a design goal of not relying on data locality, this implementation of erasure coding does not bring any negative results by distributing the RAID-5/6 stripe across multiple hosts.
In this case RAID-5 requires 4 hosts at a minimum as it uses a 3+1 logic. With 4 hosts 1 can fail without data loss. This results in a significant reduction of required disk capacity. Normally a 20GB disk would require 40GB of disk capacity, but in the case of RAID-5 over the network the requirement is only ~27GB. There is another option if higher availability is desired
Use case Information:
Erasure codes offer guaranteed capacity reduction, unlike deduplication and compression. For customers who have “no thin provisioning” policies, have data that is already compressed and deduplicated, or have encrypted data, this offers “known/fixed” capacity gains.
This can be applied on a granular basis (Per VMDK) using the Storage Policy Based Management system.
30% Savings.
Note: All Flash VSAN only.
Note: Not supported with stretched clusters
Note: this does not require the cluster size to be a multiple of 4, just 4 or more.
With RAID-6 two host failures can be tolerated, similar to FTT=2 using RAID-1.
In the traditional scenario for a 20GB disk the required disk capacity would be 60GB, but with RAID-6 over the network this is just 30GB.
Note that the parity is distributed across all hosts and there is no dedicated parity host or anything like that.
Since VMware has a design goal of not relying on data locality, this implementation of erasure coding does not bring any negative results by distributing the RAID-5/6 stripe across multiple hosts.
Again, this is sometimes by others referred to as erasure coding. In this case a 4+2 configuration is used, which means that 6 hosts is the minimum to be able to use this configuration.
Use case Information:
Erasure codes offer guaranteed capacity reduction, unlike deduplication and compression. For customers who have “no thin provisioning” policies, have data that is already compressed and deduplicated, or have encrypted data, this offers “known/fixed” capacity gains.
This can be applied on a granular basis (Per VMDK) using the Storage Policy Based Management system.
50% savings
Note: All Flash VSAN only
Note: this does not require the cluster size be a multiple of six, just six or more.
Not supported with stretched clusters
Swap by default has a policy to be thick and reserve space. In order to disable this, there is an advanced setting to set, and this will impact the next time a virtual machine is restarted.
Helpful in memory dense environments without significant over commitment.
Jase McCarty’s GitHub has a handy PowerCLI script for setting this cluster-wide.
The VSAN snapshot system is a significant improvement on traditional redo logs. By leveraging the sparse file system, new snapshot objects are created for writes to redirect into. Writes are directed into 4MB allocations.
Writes always go into the newest snapshot object (no write amplification); reads must read through the chain, but a small in-memory cache tracking the most current metadata significantly reduces this.
In real-world workloads a 5% overhead isn’t uncommon with several snapshots open. Traditional VMFS redo logs have significantly higher overhead.
With that I would (click) like to thank you and open the floor for questions