VMworld 2013: Operating and Architecting a vSphere Metro Storage Cluster based infrastructure
1. Operating and Architecting a vSphere Metro Storage Cluster based infrastructure
Lee Dilworth, VMware
Duncan Epping, VMware
BCO4872
#BCO4872
2. 2
Interact!
If you use Twitter, feel free to tweet about this session and use
hashtag #BCO4872
Feel free to take pictures, shoot video, and share it on Twitter / Facebook
Blog about it
• We would love to read your thoughts, your opinion, design decisions!
3. 3
Agenda for Today
Availability Basics
vSphere Metro Storage Cluster Basics
Architecting and Operating
Failure Scenarios
Wrapping up
5. 5
Disaster Avoidance
Avoidance NOT Recovery
• Two sites, One vSphere Cluster
• One vCenter manages BOTH sites
• One site effectively put into maintenance mode
• Hot VM Mobility solution
Intra-cluster vMotion
6. 6
Disaster Recovery
Replication
Recovery NOT avoidance
• Two sites, typically two vSphere Clusters
• Each site usually managed by its own vCenter
• vMSC solutions CAN support disaster recovery via HA restarts
• Cold VM Mobility Solutions (SRM or vMSC “Federated HA”)
7. 7
vSphere High Availability – Setting the Baseline
vSphere HA minimizes unplanned downtime
Provides automatic VM recovery in minutes
Protects against various types of failures
• Host failure
• Host network isolation
• Permanent loss of datastore
• VM crashes (including VMX)
• Guest OS / Application crashes / hangs
Does not require complex configuration changes
Is Operating System and application-independent
8. 8
vSphere 5.0+ Architecture
HA Agent
• Called the Fault Domain Manager (FDM)
• Provides all the HA on-host functionality
Operation
• vCenter Server manages the cluster
• Failover is not dependent on vCenter
Communicates over
• Management Network
• Datastores
9. 9
Master and Slave Roles
Any host can be Master, selected by election
• All others assume the role of Slaves
The Master
• Monitors hosts and VMs
• Manages VM restarts after failures
• Reports cluster state to vCenter Server
The Slave
• Forwards critical state changes to the Master
• Restarts VMs when directed by the Master
• Takes part in electing a new Master
10. 10
Network Used for Communication
Network is default communication method
• Used for selecting a Master
• Used for heartbeating
• Used for reporting state to vCenter Server
Network Heartbeating
• Used by the Master to monitor the state of a Slave
• When the Master receives no heartbeats, it will ping the Slave
• When a Slave receives no heartbeats from the Master, it will ping the isolation address
11. 11
Datastores Used for Communication
Datastores are used when the management network is not available
• Used to determine state (isolated vs. failed)
• Only when a failure has occurred!
• vCenter selects two for each host
Files used on datastores
• host-<id>-hb
• Heartbeat file!
• host-<id>-poweron
• Contains the power state of VMs and is used to communicate isolation
• First line is either a “0” or a “1”, where “1” means isolated
• protectedlist
• Owned by the Master; its view of the world
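As a hedged illustration of that flag, the first line of the poweron file can be read directly from the datastore; the path below is hypothetical (FDM files sit in a hidden .vSphere-HA folder on the heartbeat datastore):

```python
# Minimal sketch: check a host's isolation flag from its poweron file.
def host_reports_isolated(poweron_path: str) -> bool:
    with open(poweron_path) as f:
        first_line = f.readline().strip()
    return first_line == "1"  # "1" means the host considers itself isolated

# Hypothetical usage (mount point, FDM folder, and host id are made up):
# host_reports_isolated("/vmfs/volumes/ds01/.vSphere-HA/FDM-<uuid>/host-123-poweron")
```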
13. 13
What is a vSphere Metro Storage Cluster
Stretched cluster solution, not a feature!
Requires:
• storage system that “stretches” across sites
• stretched network across sites
Hardware Compatibility List (HCL) – Certified vMSC
• “iSCSI Metro Cluster Storage”
• “FC Metro Cluster Storage”
• “NFS Metro Cluster Storage”
16. 16
Latency Support Requirements
ESXi management network: max supported latency 10 milliseconds Round Trip Time (RTT)
• Note: 10 ms supported with Enterprise+ licenses only (Metro vMotion); the default is 5 ms
Synchronous storage replication link: 5 milliseconds RTT
• Note: some storage vendors have different support requirements!
17. 17
When to Use Stretched vSphere Clusters?
Campus / nearby sites
• Sites within synchronous distance
• Two buildings on a common campus
• Two datacenters within a city
Planned migration important
• Long-distance vMotion for planned maintenance, disaster avoidance, or load balancing
DR features less critical
• No testing, orchestration, or automation
• VMware HA typically not sufficient for automation: requires scripting / manual process due to VM placement with primary / secondary arrays
• RTOs typically longer
18. 18
Two Architectures: Uniform Host Access Configuration (1/2)
[Diagram: stretched cluster; hosts at Site A and Site B access both arrays across the FC/IP fabric; Storage A presents the LUN read/write (R/W), Storage B read-only (R/O)]
19. 19
Two Architectures: Non-Uniform Host Access Configuration (2/2)
[Diagram: stretched cluster; hosts access only their local array; a distributed LUN is presented read/write (R/W) at both Site A (Storage A) and Site B (Storage B) across the FC/IP fabric]
20. 20
Defining Some Failure Terminology
All Paths Down (APD) – Aaahhhh, where has that device gone?
• Incorrect storage removal, i.e. yanked!
• Sudden storage failure
• No time for storage to tell us anything
Permanent Device Loss (PDL) – Aaahhhh, the device has gone; OK, I understand
• Much nicer than APD: graceful handling of the state change
• Storage notifies of the device state change via a SCSI sense code
• Allows HA to fail over VMs
Split Brain – Hmmm, the other half has disappeared, now what?
• Election of a second HA master
• Check heartbeat datastore region
• Restart VMs (if needed)
22. 22
Will Use Our Environment to Illustrate…
Two sites
Four hosts in total
Stretched network
Stretched storage
One vCenter Server
One vSphere HA Cluster
[Diagram: Site A and Site B, two hosts per site; Storage A and Storage B each present the distributed LUN (R/W) over a stretched FC/IP fabric and management network]
23. 23
HA & DRS – Site Awareness
[Diagram: what HA and DRS think they manage (one flat cluster on one network) versus what you’ve actually got (two separate sites)]
24. 24
Why Should I Care About Site Awareness?
Operational Simplicity
• Group dependent workloads
• Increase HA predictability
• Reduce impact of a full cluster partition
• Orchestrate allocation of workloads to “sites”
• Even distribution & consumption of cluster resources
Alignment with Storage
• Locate VMs above the read/write device
• Remove unnecessary east/west IO traffic
• For access-anywhere devices, align with the partition winner per device
25. 25
DRS Design Considerations – Affinity Rules (1/2)
DRS Host Group Per Site
DRS VM Group Per Site
Align Dependent VM Workloads
26. 26
DRS Design Considerations – Affinity Rules (2/2)
Use the “should” rules
• HA will not violate “must” rules, which can block cross-site restarts, so avoid them in these configurations
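Below is a minimal pyVmomi sketch of creating such a “should” rule; the group and rule names are made up, and `cluster`, `site_hosts`, and `site_vms` are assumed to already hold the relevant managed objects:

```python
# Hedged sketch: create a DRS host group, VM group, and a "should"
# VM-to-host affinity rule for one site (names are hypothetical).
from pyVmomi import vim

def add_site_affinity_rule(cluster, site_hosts, site_vms, site_name):
    spec = vim.cluster.ConfigSpecEx(
        groupSpec=[
            vim.cluster.GroupSpec(operation='add', info=vim.cluster.HostGroup(
                name=f'{site_name}-hosts', host=site_hosts)),
            vim.cluster.GroupSpec(operation='add', info=vim.cluster.VmGroup(
                name=f'{site_name}-vms', vm=site_vms)),
        ],
        rulesSpec=[
            vim.cluster.RuleSpec(operation='add', info=vim.cluster.VmHostRuleInfo(
                name=f'{site_name}-should-run-here',
                enabled=True,
                mandatory=False,  # False = "should" rule; HA may restart elsewhere
                vmGroupName=f'{site_name}-vms',
                affineHostGroupName=f'{site_name}-hosts')),
        ])
    return cluster.ReconfigureComputeResource_Task(spec, modify=True)
```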
27. 27
Storage DRS Design Considerations
Cluster datastores based on “site affinity”
Avoid unnecessary site-to-site migrations
Set Storage DRS to “Manual” and take control: an automatic migration *could* impact availability
Align VMs with the storage / site boundary
Group *similar* devices!
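If you drive this programmatically, a hedged pyVmomi sketch looks like the following; `si` (the service instance) and `pod` (the datastore cluster) are assumed to exist already:

```python
# Hedged sketch: set a datastore cluster's Storage DRS automation level
# to manual, so site-to-site migrations require explicit approval.
from pyVmomi import vim

spec = vim.storageDrs.ConfigSpec(
    podConfigSpec=vim.storageDrs.PodConfigSpec(
        defaultVmBehavior='manual'))  # recommendations only, no automatic moves
task = si.content.storageResourceManager.ConfigureStorageDrsForPod_Task(
    pod=pod, spec=spec, modify=True)
```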
28. 28
Network Design Considerations
Network teams usually don’t like the words “Stretch” and “Cluster”
Site-to-Site vMotion – handle carefully
Ingress point to the network? Load balanced / redundant?
Consider application users – site affinity affects data flow too!
Network options are changing (OTV, EoMPLS)
L3 routing impacts (and options – LISP?)
Co-locate multi-VM applications
Consider east-west traffic
29. 29
HA Design Considerations – Admission Control
What about Admission Control?
• We typically recommend setting it to 50%, to allow a full site failover
• Admission control is not a resource management tool
• It only guarantees power-on
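A hedged pyVmomi sketch of the 50% recommendation (assuming a connected session and a `cluster` object):

```python
# Hedged sketch: reserve 50% of CPU and memory as failover capacity,
# so HA admission control can tolerate the loss of a full site.
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        admissionControlEnabled=True,
        admissionControlPolicy=vim.cluster.FailoverResourcesAdmissionControlPolicy(
            cpuFailoverResourcesPercent=50,
            memoryFailoverResourcesPercent=50)))
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```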
30. 30
HA Design Considerations – Isolation Response
Isolation response
• Configure it based on your infrastructure!
• We cannot make this decision for you, however…
31. 31
HA Design Considerations – Isolation Addresses
Isolation addresses
• Specify two, one at each site, using the advanced setting “das.isolationaddress”
• Note that the “default gateway” is an isolation address already!
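A hedged sketch of setting both addresses as HA advanced options with pyVmomi; the IPs are placeholders for a pingable address at each site:

```python
# Hedged sketch: define one isolation address per site via the
# das.isolationaddressX advanced options (IPs below are placeholders).
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(option=[
        vim.OptionValue(key='das.isolationaddress0', value='192.0.2.1'),     # Site A
        vim.OptionValue(key='das.isolationaddress1', value='198.51.100.1'),  # Site B
    ]))
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```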
32. 32
HA Design Considerations – Heartbeat Datastores
Each site needs a heartbeat datastore defined, to ensure each site can update the heartbeat region on storage local to that site
With multiple storage systems, consider increasing the default from 2 to 4 => 2 per site
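A hedged pyVmomi sketch of both changes; `hb_datastores` is assumed to be a list of four vim.Datastore references, two per site:

```python
# Hedged sketch: raise heartbeat datastores per host from 2 to 4 and
# prefer the user-selected datastores (two per site, chosen by you).
from pyVmomi import vim

hb_datastores = [...]  # four vim.Datastore refs, two per site (assumed)
spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        option=[vim.OptionValue(key='das.heartbeatDsPerHost', value='4')],
        heartbeatDatastore=hb_datastores,
        hBDatastoreCandidatePolicy='allFeasibleDsWithUserPreference'))
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```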
33. 33
HA Design Consideration – Restart Order
You can use “restart priority” to determine the restart order
This applies even when there is no contention
It only controls the order in which restarts are initiated, not when each VM finishes booting
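Restart priority can be set per VM through the cluster’s HA overrides; a hedged pyVmomi sketch (assuming `cluster` and a `vm` reference):

```python
# Hedged sketch: give one VM a high HA restart priority.
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(dasVmConfigSpec=[
    vim.cluster.DasVmConfigSpec(
        operation='add',  # use 'edit' if an override already exists for this VM
        info=vim.cluster.DasVmConfigInfo(
            key=vm,  # a vim.VirtualMachine reference
            dasSettings=vim.cluster.DasVmSettings(restartPriority='high')))])
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```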
34. 34
Operations - Maintaining the Configuration
Maintain the Storage Device <-> DRS Affinity Group mappings
Validate DRS affinity regularly
Are there VM dependencies? Co-locate!
Remember HA doesn’t speak vApp (won’t respect restart order)
…automate if you can (see the sketch below)!
Some vendors offer tools
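As one hedged example of such automation, the sketch below flags VMs running outside their site’s host group; it assumes the hypothetical ‘<site>-vms’ / ‘<site>-hosts’ naming convention used earlier and an existing `cluster` object:

```python
# Hedged sketch: report VMs that are running outside the host group
# paired with their DRS VM group (naming convention is hypothetical).
from pyVmomi import vim

groups = {g.name: g for g in cluster.configurationEx.group}
for name, vm_group in groups.items():
    if not isinstance(vm_group, vim.cluster.VmGroup):
        continue
    host_group = groups.get(name.replace('-vms', '-hosts'))
    if host_group is None:
        continue
    for vm in vm_group.vm:
        if vm.runtime.host not in host_group.host:
            print(f'{vm.name} runs on {vm.runtime.host.name}, outside its site')
```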
36. 36
Face Your Fears!
Understand the possibilities
Test them
Test them again, and keep going until they feel normal!
[Diagram: VM mobility across a site partition]
37. 37
Scenario - Single Host Failure (Non-Uniform)
A normal HA event
No network or datastore heartbeats
Host will be declared dead
All VMs will be restarted
Could violate affinity rules
[Diagram: one host at Site A has failed (X); Storage A and Storage B each present the distributed LUN (R/W) across the FC/IP fabric and management network]
38. 38
Scenario - Full Compute Failure in One Site (Non-Uniform)
Normal HA event
No datastore or network heartbeats
All virtual machines will be restarted
Note: max 32 concurrent restarts per host
“Sequencing” the start-up order!
Will violate affinity rules! (should rule)
[Diagram: both hosts at Site A have failed (X X); the distributed LUN (R/W) remains available at Site B across the FC/IP fabric and management network]
39. 39
Scenario - Storage Partition (Uniform)
Virtual machines remained running with no impact!
Will virtual machines be restarted on the other site?
• No – network heartbeats are still being received!
[Diagram: uniform stretched cluster; the inter-site storage link has failed (X); Storage A LUN (R/W), Storage B LUN (R/O); management network intact]
40. 40
Scenario - Storage Partition (Non-uniform)
Virtual machines at the preferred site remained running with no impact!
Will virtual machines be restarted on the other site?
• Yes – a PDL sense code is issued
• The VM will be killed
• HA will detect this and restart the VM!
[Diagram: non-uniform stretched cluster; the inter-site storage link has failed (X); the distributed LUN stays read/write at the “preferred” site, and the other site sees PDL]
41. 41
Permanent Device Loss (PDL) Requirements (1/2)
Ensure the PDL enhancements are configured
• Cluster advanced option
• Set “das.maskCleanShutdownEnabled” to “true” in the advanced settings
• Set to “false” by default in 5.0 – change it!
• Set to “true” by default in 5.1 and up
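A hedged pyVmomi sketch of setting that cluster advanced option on 5.0 (assuming a `cluster` object):

```python
# Hedged sketch: enable das.maskCleanShutdownEnabled (needed on 5.0;
# already the default on 5.1 and later).
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(option=[
        vim.OptionValue(key='das.maskCleanShutdownEnabled', value='true')]))
task = cluster.ReconfigureComputeResource_Task(spec, modify=True)
```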
42. 42
Permanent Device Loss (PDL) Requirements (2/2)
Ensure the PDL enhancements are configured
• ESXi host-level changes
• 5.1 and earlier: set “disk.terminateVMonPDLDefault” to “true” in “/etc/vmware/settings”
• 5.5 and up: set the advanced setting “VMkernel.Boot.terminateVMOnPDL”
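For 5.5 and up, a hedged pyVmomi sketch of setting the host option; the value type the host expects (boolean vs. integer) can vary by build, so adjust if the call is rejected:

```python
# Hedged sketch: set VMkernel.Boot.terminateVMOnPDL on one ESXi host.
from pyVmomi import vim

opt_mgr = host.configManager.advancedOption  # `host` is a vim.HostSystem
opt_mgr.UpdateOptions(changedValue=[
    vim.OptionValue(key='VMkernel.Boot.terminateVMOnPDL', value=True)])
```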
43. 43
Scenario - Datacenter Partition (Uniform) (1/3)
Virtual machines remained running with no impact!
Remember the affinity rules
Without affinity rules this would result in an APD condition…
[Diagram: uniform stretched cluster; management network and fabric fully partitioned between sites (X X X); Storage A LUN (R/W), Storage B LUN (R/O)]
44. 44
Scenario - Datacenter Partition (Uniform) (2/3)
Affinity rule was violated
The same VM was restarted in Site A
Results in APD for Site B
Same VM, same IP address, same name
Yes, this could result in weird behavior!
[Diagram: uniform stretched cluster, still partitioned; the VM now exists in both sites; Storage A LUN (R/W), Storage B LUN (R/O)]
45. 45
Scenario - Datacenter Partition (Uniform) (3/3)
• VM restarted in the site with “storage site-affinity”
• Now you have two active instances of the same VM!
• When the partition is lifted, the VM will be killed!
46. 46
Scenario - Loss of full datacenter (Non-Uniform)
All virtual machines will be restarted
Note: in many cases this requires manual intervention from a storage perspective!
HA will retry 5 times and has a compatibility list
Run DRS when the site returns, to apply affinity rules and balance load!
[Diagram: Site A is lost entirely; the distributed LUN (R/W) remains available to the Site B hosts across the FC/IP fabric]
48. 48
Key Takeaways
Design a cluster that meets your needs, and don’t forget operations!
Understand that HA / DRS play a key part in your vMSC success
Testing is critical; don’t just test the easy stuff!
Document process changes and gain operational acceptance
Do not assume it is “Next > Next > Finish”
Ongoing maintenance / checks will be required
Automate as much as you can!