VMFS Introduction


Bergwolf@linuxfb.org
Agenda

ESX Introduction
VMFS Design Goals
VMFS Architecture
SAN Impact
Conclusion
ESX System Setup
Guest Memory Layers


Shadow page tables (VA-MA).
Page sharing (BA-MA).
ESX IO Stack

An average IO request involves only offset remapping.
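
The offset remapping above can be pictured with a small sketch. It assumes a virtual disk file whose blocks map to volume blocks through a simple table. The block size and the names `BLOCK_SIZE`, `block_map` and `remap` are hypothetical and are not taken from ESX.

```python
BLOCK_SIZE = 1 << 20  # assumed 1 MiB file block, for illustration only


def remap(block_map, guest_offset):
    """Translate a byte offset in a virtual disk to a byte offset on the volume."""
    index, within = divmod(guest_offset, BLOCK_SIZE)
    return block_map[index] * BLOCK_SIZE + within


# Hypothetical mapping: virtual disk block i lives in volume block block_map[i].
block_map = [7, 8, 42]
volume_offset = remap(block_map, BLOCK_SIZE + 4096)
```

In this picture the common IO path is one table lookup plus an addition, with no per-request metadata update.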
Use Case

Small number of files (30~100 per VM)
Files either very small (~a few KBs), or very
large (many GBs)
SAN storage is the underlying substrate.
All storage exported by these storage systems
is shared among all ESX servers
Design Goals

Metadata overhead should be very low.
VM IO throughput and latency should be as good as on a directly attached raw device.
A clustered lock manager moderates access to files among ESX servers.
Help VMs react deterministically to transient and non-transient SAN events and error conditions.
VMFS Architecture
A volume is an aggregation of resources and on-disk locks.
A resource is either an inode, a file block, a sub-block or an indirect block.
Each lock moderates access to a subset of resources.
Hosts negotiate access to resources by acquiring the relevant locks.
VMFS = a clustered lock manager + a resource manager + a journaling module + a data mover + a VM IO manager + a POSIX system call front end
VMkernel Logical Volume

VMFS volumes are created inside VMkernel logical volumes by default.
A VMkernel logical volume can span multiple devices.
VMFS On-disk Layout
Four Resources

  file blocks
  sub-blocks
  pointer blocks
  file descriptors

Resources are grouped into collections called clusters, and clusters are further grouped into cluster groups.
Block Mapping

 Packed inside inode
 Sub block addressing
 File block addressing
 Pointer block addressing

The addressing mode of a file can be upgraded automatically.
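
One way to read the four addressing modes is as a ladder that a file climbs as it needs more space. The sketch below assumes that reading. Every size limit in it is an illustrative placeholder, not a VMFS constant, and `addressing_mode` is a hypothetical helper.

```python
# All sizes below are illustrative assumptions, not VMFS constants.
INODE_PACK_LIMIT = 1024          # data small enough to live inside the inode
SUB_BLOCK_SIZE = 64 * 1024       # one sub-block
FILE_BLOCK_SIZE = 1024 * 1024    # one file block
DIRECT_SLOTS = 256               # block addresses an inode can hold directly


def addressing_mode(file_size):
    """Pick the cheapest addressing mode that can hold file_size bytes."""
    if file_size <= INODE_PACK_LIMIT:
        return "packed-in-inode"
    if file_size <= SUB_BLOCK_SIZE:
        return "sub-block"
    if file_size <= DIRECT_SLOTS * FILE_BLOCK_SIZE:
        return "file-block"
    return "pointer-block"


modes = [addressing_mode(n) for n in (200, 8 * 1024, 10 * FILE_BLOCK_SIZE, 10**12)]
```

Under these assumptions, a file of a few KB never touches a file block or a pointer block, which fits the goal of low metadata overhead for a workload of very small and very large files.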
System Files

System files are created at file system format time, and each manages one type of resource.
System Files

System files use file blocks.
They are read and written the same way as regular files.
Checking file data consistency therefore essentially provides metadata consistency.
Cluster Groups
Cluster groups are repeated to create a file system.
An existing VMFS volume grows over unused space on the disk, or spans new disks, by laying out new cluster groups that refer to the newly added space.
The VMFS resource manager makes hosts operate on different and distant cluster groups within a system file. This reduces the possibility of multiple hosts contending on the same lock(s) and increases the efficiency of the clustered lock manager.
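
The slide does not say how hosts are steered to distant cluster groups. As a rough illustration only, the sketch below assumes a hash of the host identifier picks a starting group, and the host then walks the groups in order. `preferred_cluster_group` and `allocation_order` are hypothetical names, and this is not the VMFS resource manager's actual policy.

```python
import hashlib


def preferred_cluster_group(host_id, num_groups):
    """Map a host to a starting cluster group so hosts tend to start apart."""
    digest = hashlib.sha256(host_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_groups


def allocation_order(host_id, num_groups):
    """Cluster groups in the order this host would try them."""
    start = preferred_cluster_group(host_id, num_groups)
    return [(start + i) % num_groups for i in range(num_groups)]


order = allocation_order("esx-01", 8)
```

Each host still reaches every cluster group eventually, but two hosts allocating at the same time will usually be working under different locks.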
On-disk Lock

A single-sector data structure.
Locking is lease-based.
Atomic disk operations (SCSI reserve, read-modify-write, SCSI release).
On-disk Lock Data Structure
HostID: A 128-bit unique identifier for the ESX host that owns the lock at a given point in time. All zeros means no owner.
Mode: A set of non-zero values that indicate whether a lock is free, held exclusively, held by multiple hosts for shared read access, or held by multiple hosts for shared read and write access.
Generation: A monotonically increasing counter, updated every time a lock is acquired, released or broken. While the hostID field sufficiently disambiguates operations on a lock from different hosts, this field disambiguates multiple operations on a lock by the same host.
HBregion: For each valid hostID (if any) currently using the lock, a pointer to the on-disk heartbeat region of the host.
HBgen: A generation number to validate the HBregion reference as being current or stale. It disambiguates locks held by a given host before and after a host crash, and before and after a storage outage.
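
To make the record concrete, here is a minimal sketch of such a lock packed into one sector. Only the 128-bit hostID width comes from the slide. The 512-byte sector, the other field widths, the field order and the mode encoding are assumptions made for illustration, and the real record holds one HBregion per host sharing the lock rather than the single one shown here.

```python
import struct
from dataclasses import dataclass

SECTOR_SIZE = 512            # assumed sector size
# Assumed layout: 16-byte hostID, then mode, generation, HBregion, HBgen.
LOCK_FORMAT = "<16sIQQQ"


@dataclass
class OnDiskLock:
    host_id: bytes = bytes(16)   # all zeros means no owner
    mode: int = 1                # assumed encoding: 1 = free
    generation: int = 0
    hb_region: int = 0           # offset of the owner's heartbeat slot
    hb_gen: int = 0

    def pack(self):
        """Serialize the lock and pad it to exactly one sector."""
        raw = struct.pack(LOCK_FORMAT, self.host_id, self.mode,
                          self.generation, self.hb_region, self.hb_gen)
        return raw.ljust(SECTOR_SIZE, b"\0")

    @classmethod
    def unpack(cls, sector):
        """Rebuild a lock from the first bytes of a sector."""
        size = struct.calcsize(LOCK_FORMAT)
        return cls(*struct.unpack(LOCK_FORMAT, sector[:size]))

    def is_free(self):
        return self.host_id == bytes(16)


lock = OnDiskLock(host_id=b"\x01" * 16, mode=2, generation=5,
                  hb_region=4096, hb_gen=3)
sector = lock.pack()
```

Keeping the whole record inside one sector is what lets a single disk write replace it as a unit.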
On-disk Heartbeat

A single-sector data structure.
Every host accessing a VMFS volume acquires a heartbeat on disk to declare liveness to other hosts.
Heartbeats are allocated from a 1 MB reserved region of the volume, which allows up to 2048 hosts to access the volume concurrently.
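
The 2048 figure follows from the region size if each heartbeat occupies one 512-byte sector. The sector size is an assumption here, and `hb_slot_offset` is a hypothetical helper.

```python
HB_REGION_SIZE = 1024 * 1024   # 1 MB reserved region, from the slide
SECTOR_SIZE = 512              # assumed sector size

MAX_HOSTS = HB_REGION_SIZE // SECTOR_SIZE


def hb_slot_offset(region_start, slot):
    """Byte offset of heartbeat slot number `slot` within the volume."""
    if not 0 <= slot < MAX_HOSTS:
        raise ValueError("slot out of range")
    return region_start + slot * SECTOR_SIZE
```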
HB Failure Handling

Hosts are free to break a lock if the owner's heartbeat timestamp does not change for 20 seconds. A host should replay the journal when taking a stale lock.
If a host fails to update its heartbeat timestamp within five HB periods (about 15 sec and 40 HB IO tries), it fences itself and aborts all in-flight IOs.
The lock manager tries to rejoin the cluster if the IO error is not permanent, and reclaims its HB slot.
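
The lock-breaking rule can be sketched as a small watcher that a host keeps for a remote heartbeat. The 20-second threshold is from the slide. The class and method names are hypothetical, and time is passed in explicitly so the logic is easy to follow.

```python
LOCK_BREAK_TIMEOUT = 20.0   # seconds without a heartbeat change, from the slide


class HeartbeatWatch:
    """Tracks when a remote host's heartbeat timestamp was last seen to change."""

    def __init__(self, hb_timestamp, now):
        self.last_value = hb_timestamp
        self.last_change = now

    def observe(self, hb_timestamp, now):
        """Record a fresh read of the heartbeat; return True if the lock may be broken."""
        if hb_timestamp != self.last_value:
            self.last_value = hb_timestamp
            self.last_change = now
        return now - self.last_change >= LOCK_BREAK_TIMEOUT
```

The observer compares successive reads of the remote timestamp against its own clock, so in this sketch the hosts do not need synchronized clocks.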
On-disk Lock & HB

Each host can join a cluster by acquiring an on-disk HB.
It can also hold thousands of on-disk locks.
Journaling

Each host maintains its own journal on the
volume.
The host's HB region on disk stores the journal location.
Transaction State Machine
Optimistic Locking

All hosts in a VMFS cluster generally operate on mutually exclusive subsets of locks on the volume.
A host that is interested in acquiring a given lock will typically find it to be free on disk.
Instead of acquiring all locks up front, a host first reads all the locks it needs. If they are free, it modifies the metadata in memory, then upgrades the locks and commits.
Transaction State Machine w/ op lock
Upgrade Lock
1: reserve disk;
2: issue asynchronous (async) reads of all
required locks;
3: if any lock is acquired by remote host,
abort and fall back to normal TSM;
4: issue async writes of all required locks;
5: wait for all async writes to complete;
6: release disk;
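
The six upgrade steps can be walked through against an in-memory stand-in for the shared disk. This is a sketch under simplifying assumptions: the asynchronous reads and writes are modeled as plain synchronous dictionary accesses, a lock record is reduced to its owner, and `Disk` and `upgrade_locks` are hypothetical names.

```python
class Disk:
    """In-memory stand-in for a shared LUN holding lock records."""

    def __init__(self, locks):
        self.locks = dict(locks)     # lock id -> owning host, or None if free
        self.reserved_by = None

    def reserve(self, host):
        if self.reserved_by is not None:
            raise RuntimeError("disk already reserved")
        self.reserved_by = host

    def release(self, host):
        assert self.reserved_by == host
        self.reserved_by = None


def upgrade_locks(disk, host, wanted):
    """Return True if every wanted lock was upgraded, False to fall back."""
    disk.reserve(host)                                        # step 1
    try:
        owners = {lock_id: disk.locks[lock_id] for lock_id in wanted}  # step 2
        if any(owner not in (None, host) for owner in owners.values()):
            return False                                      # step 3
        for lock_id in wanted:                                # steps 4 and 5
            disk.locks[lock_id] = host
        return True
    finally:
        disk.release(host)                                    # step 6
```

A `False` result corresponds to step 3: nothing was written, and the caller falls back to the normal transaction state machine.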
Adaptive SAN-aware retries

For some SAN errors, instead of letting the guest OS retry the IO, the VMkernel retries it after an appropriate delay.
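
The slide does not say how the retry delay is chosen. The sketch below assumes the error itself carries a suggested wait, which is purely an illustrative choice. `TransientSANError` and `issue_with_retries` are hypothetical names.

```python
import time


class TransientSANError(Exception):
    """Hypothetical error carrying a suggested wait before retrying."""

    def __init__(self, retry_after):
        super().__init__("transient SAN error")
        self.retry_after = retry_after


def issue_with_retries(io, max_tries=5, sleep=time.sleep):
    """Call io() and retry transient failures after the delay they suggest."""
    for attempt in range(1, max_tries + 1):
        try:
            return io()
        except TransientSANError as err:
            if attempt == max_tries:
                raise          # give up and surface the error to the caller
            sleep(err.retry_after)
```

Retrying below the guest keeps short SAN hiccups from surfacing inside the VM as IO errors. Only the final failure is propagated.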
Data Mover

clone(srcFileHandle, srcFileOffset,
dstFileHandle, dstFileOffset, length, policies)
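
As a rough picture of what a call with this signature does, here is a sketch over ordinary Python file objects. The meaning of `policies` is not given on the slide, so it is treated here as an optional dictionary with a hypothetical `chunk_size` key. This is not the data mover's real interface.

```python
import io


def clone(src, src_offset, dst, dst_offset, length, policies=None):
    """Copy length bytes from src to dst in chunks; return bytes copied."""
    chunk = (policies or {}).get("chunk_size", 64 * 1024)
    src.seek(src_offset)
    dst.seek(dst_offset)
    copied = 0
    while copied < length:
        data = src.read(min(chunk, length - copied))
        if not data:
            break  # source ended early
        dst.write(data)
        copied += len(data)
    return copied


src = io.BytesIO(b"0123456789")
dst = io.BytesIO(b"x" * 10)
copied = clone(src, 2, dst, 4, 5, {"chunk_size": 2})
```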
Directive SCSI CMD

operator(VMID, source_blocklist,
destination_blocklist)
Operators: zero, clone, delete.
Directive SCSI CMD

atomic_test_and_set(block_number, old_image,
new_image)
New lock algorithm for the VMFS lock manager: read a lock image from disk and, if the lock is free, issue an atomic_test_and_set with a new_image containing the host-specific hostID, generation and heartbeat information.
Acquiring a lock drops from 4 IOs to 2 IOs.
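
The read-then-test-and-set sequence can be traced against an in-memory stand-in for the disk that counts IOs. This is a sketch: the lock image is reduced to a byte string, `b"free"` is an assumed encoding of a free lock, and `Disk` and `acquire` are hypothetical names.

```python
class Disk:
    """In-memory stand-in for a LUN; counts IOs issued against it."""

    def __init__(self):
        self.blocks = {}
        self.ios = 0

    def read(self, block_number):
        self.ios += 1
        return self.blocks.get(block_number, b"free")

    def atomic_test_and_set(self, block_number, old_image, new_image):
        """Write new_image only if the block still holds old_image."""
        self.ios += 1
        if self.blocks.get(block_number, b"free") != old_image:
            return False
        self.blocks[block_number] = new_image
        return True


def acquire(disk, block_number, host_image):
    """Try to take the lock stored in block_number; return True on success."""
    image = disk.read(block_number)                  # IO 1: read the lock image
    if image != b"free":
        return False
    return disk.atomic_test_and_set(block_number, image, host_image)  # IO 2
```

If another host takes the lock between the read and the test-and-set, the comparison fails and nothing is written, so the disk no longer has to be reserved around the update.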
Performance
