VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster based infrastructure
1. Operating and Architecting a vSphere Metro Storage Cluster based infrastructure
Lee Dilworth, VMware
Duncan Epping, VMware
BCO4872
#BCO4872
2. 2
Interact!
If you use Twitter, feel free to tweet about this session and use
hashtag #BCO4872
Feel free to take pictures, shoot video, and share it on Twitter / Facebook
Blog about it
• We would love to read your thoughts, your opinion, design decisions!
3. 3
Agenda for Today
Availability Basics
vSphere Metro Storage Cluster Basics
Architecting and Operating
Failure Scenarios
Wrapping up
5. 5
Disaster Avoidance
Avoidance NOT Recovery
• Two sites, One vSphere Cluster
• One vCenter manages BOTH sites
• One site effectively put into maintenance mode
• Hot VM Mobility solution
Intra-cluster vMotion
6. 6
Disaster Recovery
Replication
Recovery NOT avoidance
• Two sites, typically two vSphere Clusters
• Each site usually managed by its own vCenter
• vMSC solutions CAN support disaster recovery via HA restarts
• Cold VM Mobility Solutions (SRM or vMSC “Federated HA”)
7. 7
vSphere High Availability – Setting the Baseline
vSphere HA minimizes unplanned downtime
Provides automatic VM recovery in minutes
Protects against various types of failures
• Host failure
• Host network isolation
• Permanent loss of datastore
• VM crashes (including VMX)
• Guest OS / Application crashes / hangs
Does not require complex configuration changes
Is Operating System and application-independent
8. 8
vSphere 5.0+ Architecture
HA Agent
• Called the Fault Domain Manager (FDM)
• Provides all the HA on-host functionality
Operation
• vCenter Server manages the cluster
• Failover is not dependent on vCenter
Communicates over
• Management Network
• Datastores
9. 9
Master and Slave Roles
Any host can be Master, selected by election
• All others assume the role of Slaves
The Master
• Monitors hosts and VMs
• Manages VM restarts after failures
• Reports cluster state to vCenter Server
The Slave
• Forwards critical state changes to the Master
• Restarts VMs when directed by the Master
• Takes part in electing a new Master
10. 10
Network Used for Communication
Network is default communication method
• Used for selecting a Master
• Used for heartbeating
• Used for reporting state to vCenter Server
Network Heartbeating
• Used by the Master to monitor the state of a Slave
• When the Master receives no heartbeats, it will ping the Slave
• When a Slave receives no heartbeats from the Master, it will ping the isolation address
11. 11
Datastores Used for Communication
Datastores are used when the management network is not available
• Used to determine state (isolated vs. failed)
• Only when a failure has occurred!
• vCenter selects two for each host
Files used on datastores
• host-<id>-hb
• Heartbeat file!
• host-<id>-poweron
• Contains the power state of VMs and is used to communicate isolation
• First line is either a “0” or a “1”, where “1” means isolated
• protectedlist
• Owned by the Master; its view of the world
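As a hedged illustration of that flag, the first line of the poweron file can be read directly from the datastore; the path below is hypothetical (FDM files sit in a hidden .vSphere-HA folder on the heartbeat datastore):

```python
# Minimal sketch: check a host's isolation flag from its poweron file.
def host_reports_isolated(poweron_path: str) -> bool:
    with open(poweron_path) as f:
        first_line = f.readline().strip()
    return first_line == "1"  # "1" means the host considers itself isolated

# Hypothetical usage (mount point, FDM folder, and host id are made up):
# host_reports_isolated("/vmfs/volumes/ds01/.vSphere-HA/FDM-<uuid>/host-123-poweron")
```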
13. 13
What is a vSphere Metro Storage Cluster
Stretched cluster solution, not a feature!
Requires:
• storage system that “stretches” across sites
• stretched network across sites
Hardware Compatibility List (HCL) – Certified vMSC
• “iSCSI Metro Cluster Storage”
• “FC Metro Cluster Storage”
• “NFS Metro Cluster Storage”
16. 16
Latency Support Requirements
ESXi management network: max supported latency 10 milliseconds Round Trip Time (RTT)
• Note: 10 ms supported with Enterprise+ licenses only (Metro vMotion); the default is 5 ms
Synchronous storage replication link: 5 milliseconds RTT
• Note: some storage vendors have different support requirements!
17. 17
When to Use Stretched vSphere Clusters?
Campus / nearby sites
• Sites within synchronous distance
• Two buildings on a common campus
• Two datacenters within a city
Planned migration important
• Long-distance vMotion for planned maintenance, disaster avoidance, or load balancing
DR features less critical
• No testing, orchestration, or automation
• VMware HA typically not sufficient for automation: requires scripting / manual process due to VM placement with primary / secondary arrays
• RTOs typically longer
18. 18
Two Architectures: Uniform Host Access Configuration (1/2)
[Diagram: stretched cluster; hosts at Site A and Site B access both arrays across the FC/IP fabric; Storage A presents the LUN read/write (R/W), Storage B read-only (R/O)]
19. 19
Two Architectures: Non-Uniform Host Access Configuration (2/2)
[Diagram: stretched cluster; hosts access only their local array; a distributed LUN is presented read/write (R/W) at both Site A (Storage A) and Site B (Storage B) across the FC/IP fabric]
20. 20
Defining Some Failure Terminology
All Paths Down (APD) – Aaahhhh, where has that device gone?
• Incorrect storage removal, i.e. yanked!
• Sudden storage failure
• No time for storage to tell us anything
Permanent Device Loss (PDL) – Aaahhhh, the device has gone; OK, I understand
• Much nicer than APD: graceful handling of the state change
• Storage notifies of the device state change via a SCSI sense code
• Allows HA to fail over VMs
Split Brain – Hmmm, the other half has disappeared, now what?
• Election of a second HA master
• Check heartbeat datastore region
• Restart VMs (if needed)
22. 22
Will Use Our Environment to Illustrate…
Two sites
Four hosts in total
Stretched network
Stretched storage
One vCenter Server
One vSphere HA Cluster
[Diagram: Site A and Site B, two hosts per site; Storage A and Storage B each present the distributed LUN (R/W) over a stretched FC/IP fabric and management network]
23. 23
HA & DRS – Site Awareness
[Diagram: what HA and DRS think they manage (one flat cluster on one network) versus what you’ve actually got (two separate sites)]
24. 24
Why Should I Care About Site Awareness?
Operational Simplicity
• Group dependent workloads
• Increase HA predictability
• Reduce impact of a full cluster partition
• Orchestrate allocation of workloads to “sites”
• Even distribution & consumption of cluster resources
Alignment with Storage
• Locate VMs above the read/write device
• Remove unnecessary east/west IO traffic
• For access-anywhere devices, align with the partition winner per device
25. 25
DRS Design Considerations – Affinity Rules (1/2)
DRS Host Group Per Site
DRS VM Group Per Site
Align Dependent VM Workloads
26. 26
DRS Design Considerations – Affinity Rules (2/2)
Use the “should” rules
• HA will not violate “must” rules, which can block cross-site restarts, so avoid them in these configurations
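Below is a minimal pyVmomi sketch of creating such a “should” rule; the group and rule names are made up, and `cluster`, `site_hosts`, and `site_vms` are assumed to already hold the relevant managed objects:

```python
# Hedged sketch: create a DRS host group, VM group, and a "should"
# VM-to-host affinity rule for one site (names are hypothetical).
from pyVmomi import vim

def add_site_affinity_rule(cluster, site_hosts, site_vms, site_name):
    spec = vim.cluster.ConfigSpecEx(
        groupSpec=[
            vim.cluster.GroupSpec(operation='add', info=vim.cluster.HostGroup(
                name=f'{site_name}-hosts', host=site_hosts)),
            vim.cluster.GroupSpec(operation='add', info=vim.cluster.VmGroup(
                name=f'{site_name}-vms', vm=site_vms)),
        ],
        rulesSpec=[
            vim.cluster.RuleSpec(operation='add', info=vim.cluster.VmHostRuleInfo(
                name=f'{site_name}-should-run-here',
                enabled=True,
                mandatory=False,  # False = "should" rule; HA may restart elsewhere
                vmGroupName=f'{site_name}-vms',
                affineHostGroupName=f'{site_name}-hosts')),
        ])
    return cluster.ReconfigureComputeResource_Task(spec, modify=True)
```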
27. 27
Storage DRS Design Considerations
Cluster datastores based on “site affinity”
Avoid unnecessary site-to-site migrations
Set Storage DRS to “Manual” and take control: an automatic migration *could* impact availability
Align VMs with the storage / site boundary
Group *similar* devices!
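If you drive this programmatically, a hedged pyVmomi sketch looks like the following; `si` (the service instance) and `pod` (the datastore cluster) are assumed to exist already:

```python
# Hedged sketch: set a datastore cluster's Storage DRS automation level
# to manual, so site-to-site migrations require explicit approval.
from pyVmomi import vim

spec = vim.storageDrs.ConfigSpec(
    podConfigSpec=vim.storageDrs.PodConfigSpec(
        defaultVmBehavior='manual'))  # recommendations only, no automatic moves
task = si.content.storageResourceManager.ConfigureStorageDrsForPod_Task(
    pod=pod, spec=spec, modify=True)
```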
28. 28
Network Design Considerations
Network teams usually don’t like the words “Stretch” and “Cluster”
Site-to-Site vMotion – handle carefully
Ingress point to the network? Load balanced / redundant?
Consider application users – site affinity affects data flow too!
Network options are changing (OTV, EoMPLS)
L3 routing impacts (and options – LISP?)
Co-locate multi-VM applications
Consider east-west traffic
29. 29
HA Design Considerations – Admission Control
What about Admission Control?
• We typically recommend setting it to 50%, to allow a full site failover
• Admission control is not a resource management tool
• It only guarantees power-on
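A hedged pyVmomi sketch of the 50% recommendation (assuming a connected session and a `cluster` object):

```python
# Hedged sketch: reserve 50% of CPU and memory as failover capacity,
# so HA admission control can tolerate the loss of a full site.
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        admissionControlEnabled=True,
        admissionControlPolicy=vim.cluster.FailoverResourcesAdmissionControlPolicy(
            cpuFailoverResourcesPercent=50,
            memoryFailoverResourcesPercent=50)))
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```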
30. 30
HA Design Considerations – Isolation Response
Isolation response
• Configure it based on your infrastructure!
• We cannot make this decision for you, however…
31. 31
HA Design Considerations – Isolation Addresses
Isolation addresses
• Specify two, one at each site, using the advanced setting “das.isolationaddress”
• Note that the “default gateway” is an isolation address already!
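A hedged sketch of setting both addresses as HA advanced options with pyVmomi; the IPs are placeholders for a pingable address at each site:

```python
# Hedged sketch: define one isolation address per site via the
# das.isolationaddressX advanced options (IPs below are placeholders).
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(option=[
        vim.OptionValue(key='das.isolationaddress0', value='192.0.2.1'),     # Site A
        vim.OptionValue(key='das.isolationaddress1', value='198.51.100.1'),  # Site B
    ]))
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```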
32. 32
HA Design Considerations – Heartbeat Datastores
Each site needs a heartbeat datastore defined, to ensure each site can update the heartbeat region on storage local to that site
With multiple storage systems, consider increasing the default from 2 to 4 => 2 per site
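A hedged pyVmomi sketch of both changes; `hb_datastores` is assumed to be a list of four vim.Datastore references, two per site:

```python
# Hedged sketch: raise heartbeat datastores per host from 2 to 4 and
# prefer the user-selected datastores (two per site, chosen by you).
from pyVmomi import vim

hb_datastores = [...]  # four vim.Datastore refs, two per site (assumed)
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        option=[vim.OptionValue(key='das.heartbeatDsPerHost', value='4')],
        heartbeatDatastore=hb_datastores,
        hBDatastoreCandidatePolicy='allFeasibleDsWithUserPreference'))
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```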
33. 33
HA Design Consideration – Restart Order
You can use “restart priority” to determine the restart order
This applies even when there is no contention
It only controls the order in which restarts are initiated, not when each VM finishes booting
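Restart priority can be set per VM through the cluster’s HA overrides; a hedged pyVmomi sketch (assuming `cluster` and a `vm` reference):

```python
# Hedged sketch: give one VM a high HA restart priority.
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(dasVmConfigSpec=[
    vim.cluster.DasVmConfigSpec(
        operation='add',  # use 'edit' if an override already exists for this VM
        info=vim.cluster.DasVmConfigInfo(
            key=vm,  # a vim.VirtualMachine reference
            dasSettings=vim.cluster.DasVmSettings(restartPriority='high')))])
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```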
34. 34
Operations - Maintaining the Configuration
Maintain the Storage Device <-> DRS Affinity Group mappings
Validate DRS affinity regularly
Are there VM dependencies? Co-locate!
Remember HA doesn’t speak vApp (won’t respect restart order)
…automate if you can (see the sketch below)!
Some vendors offer tools
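As one hedged example of such automation, the sketch below flags VMs running outside their site’s host group; it assumes the hypothetical ‘<site>-vms’ / ‘<site>-hosts’ naming convention used earlier and an existing `cluster` object:

```python
# Hedged sketch: report VMs that are running outside the host group
# paired with their DRS VM group (naming convention is hypothetical).
from pyVmomi import vim

groups = {g.name: g for g in cluster.configurationEx.group}
for name, vm_group in groups.items():
    if not isinstance(vm_group, vim.cluster.VmGroup):
        continue
    host_group = groups.get(name.replace('-vms', '-hosts'))
    if host_group is None:
        continue
    for vm in vm_group.vm:
        if vm.runtime.host not in host_group.host:
            print(f'{vm.name} runs on {vm.runtime.host.name}, outside its site')
```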
36. 36
Face Your Fears!
Understand the possibilities
Test them
Test them again, and keep going until they feel normal!
[Diagram: VM mobility across a site partition]
37. 37
Scenario - Single Host Failure (Non-Uniform)
A normal HA event
No network or datastore heartbeats
Host will be declared dead
All VMs will be restarted
Could violate affinity rules
[Diagram: one host at Site A has failed (X); Storage A and Storage B each present the distributed LUN (R/W) across the FC/IP fabric and management network]
38. 38
Scenario - Full Compute Failure in One Site (Non-Uniform)
Normal HA event
No datastore or network heartbeats
All virtual machines will be restarted
Note: max 32 concurrent restarts per host
“Sequencing” the start-up order!
Will violate affinity rules! (should rule)
[Diagram: both hosts at Site A have failed (X X); the distributed LUN (R/W) remains available at Site B across the FC/IP fabric and management network]
39. 39
Scenario - Storage Partition (Uniform)
Virtual machines remained running with no impact!
Will virtual machines be restarted on the other site?
• No – network heartbeats are still being received!
[Diagram: uniform stretched cluster; the inter-site storage link has failed (X); Storage A LUN (R/W), Storage B LUN (R/O); management network intact]
40. 40
Scenario - Storage Partition (Non-uniform)
Virtual machines at the preferred site remained running with no impact!
Will virtual machines be restarted on the other site?
• Yes – a PDL sense code is issued
• The VM will be killed
• HA will detect this and restart the VM!
[Diagram: non-uniform stretched cluster; the inter-site storage link has failed (X); the distributed LUN stays read/write at the “preferred” site, and the other site sees PDL]
41. 41
Permanent Device Loss (PDL) Requirements (1/2)
Ensure the PDL enhancements are configured
• Cluster advanced option
• Set “das.maskCleanShutdownEnabled” to “true” in the advanced settings
• Set to “false” by default in 5.0 – change it!
• Set to “true” by default in 5.1 and up
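A hedged pyVmomi sketch of setting that cluster advanced option on 5.0 (assuming a `cluster` object):

```python
# Hedged sketch: enable das.maskCleanShutdownEnabled (needed on 5.0;
# already the default on 5.1 and later).
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(option=[
        vim.OptionValue(key='das.maskCleanShutdownEnabled', value='true')]))
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```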
42. 42
Permanent Device Loss (PDL) Requirements (2/2)
Ensure the PDL enhancements are configured
• ESXi host-level changes
• 5.1 and earlier: set “disk.terminateVMonPDLDefault” to “true” in “/etc/vmware/settings”
• 5.5 and up: set the advanced setting “VMkernel.Boot.terminateVMOnPDL”
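For 5.5 and up, a hedged pyVmomi sketch of setting the host option; the value type the host expects (boolean vs. integer) can vary by build, so adjust if the call is rejected:

```python
# Hedged sketch: set VMkernel.Boot.terminateVMOnPDL on one ESXi host.
from pyVmomi import vim

opt_mgr = host.configManager.advancedOption  # `host` is a vim.HostSystem
opt_mgr.UpdateOptions(changedValue=[
    vim.OptionValue(key='VMkernel.Boot.terminateVMOnPDL', value=True)])
```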
43. 43
Scenario - Datacenter Partition (Uniform) (1/3)
Virtual machines remained running with no impact!
Remember the affinity rules
Without affinity rules this would result in an APD condition…
[Diagram: uniform stretched cluster; management network and fabric fully partitioned between sites (X X X); Storage A LUN (R/W), Storage B LUN (R/O)]
44. 44
Scenario - Datacenter Partition (Uniform) (2/3)
Affinity rule was violated
The same VM was restarted in Site A
Results in APD for Site B
Same VM, same IP address, same name
Yes, this could result in weird behavior!
[Diagram: uniform stretched cluster, still partitioned; the VM now exists in both sites; Storage A LUN (R/W), Storage B LUN (R/O)]
45. 45
Scenario - Datacenter Partition (Uniform) (3/3)
• VM restarted in the site with “storage site-affinity”
• Now you have two active instances of the same VM!
• When the partition is lifted, the VM will be killed!
46. 46
Scenario - Loss of full datacenter (Non-Uniform)
All virtual machines will be restarted
Note: in many cases this requires manual intervention from a storage perspective!
HA will retry 5 times and has a compatibility list
Run DRS when the site returns, to apply affinity rules and balance load!
[Diagram: Site A is lost entirely; the distributed LUN (R/W) remains available to the Site B hosts across the FC/IP fabric]
48. 48
Key Takeaways
Design a cluster that meets your needs, and don’t forget operations!
Understand that HA / DRS play a key part in your vMSC success
Testing is critical; don’t just test the easy stuff!
Document process changes and gain operational acceptance
Do not assume it is “Next > Next > Finish”
Ongoing maintenance / checks will be required
Automate as much as you can!