This document discusses disaster recovery strategies and approaches. It begins by defining what constitutes a disaster and differentiating between disaster recovery and operational recovery. It then examines common disaster risks and threats. The document outlines a disaster recovery approach that includes business impact analysis, identifying risks and gaps, establishing recovery strategies and objectives, implementing capabilities, and developing documentation and test procedures. Key measures for disaster recovery like recovery point objectives and recovery time objectives are explained. Various disaster recovery strategies are presented based on priority tiers. Finally, the document discusses technologies that can be used to enable disaster recovery capabilities like replication, remote sites, and recovery management tools.
Disaster Recovery Strategies for Business Continuity
1. Disaster Recovery
Business & Technology
Varrow Madness
March 20, 2014
Andrew Miller
Managing Systems Architect
vExpert, VCP 3/4/5, EMC Unified/Symmetrix TA
t: @andriven w: www.thinkmeta.net
2. • If tweeting, include #VM14 hashtag.
• Feel free to send me commentary at @andriven
• Hours of material packed into one hour, so…
• No shame about content source.
Housekeeping
3. 1. One Big Reason
2. Business Discussion
3. Technology Overview
• Who is this guy?
Agenda
4. One Big Reason to Do This
Expectations for Disaster
Recovery
IT Capabilities
for Disaster Recovery
≠
5. What is a Disaster?
• Disaster: An event that affects a service or system such
that significant effort is required to restore the original
performance level.
» IT Service Management Forum
But what does that look like IN
OUR ENVIRONMENT?
What disaster and recovery
scenarios should we plan for?
Where do we begin?
How do we do it?
7. Disaster Recovery vs. Operational Recovery
• Disaster Recovery
– To cope with & recover from an IT crisis that moves work to an
alternative system in a non-routine way.
– A real “disaster” is large in scope and impact
– DR typically implies failure of the primary data center and recovery to an
alternate site
• Operational Recovery
– Addresses more “routine” types of failures (server, network, storage,
etc.)
– Events are smaller in scope and impact than a full “disaster”
– Typically implies recovering to alternate equipment within the primary
data center
• Business expectations for recovery timeframe are typically
shorter for “operational recovery” issues than for a true “disaster”
• Each should have its own clearly defined objectives
8. Risks, Threats and Vulnerabilities
Risk is a function of the likelihood of a given threat
acting upon a particular potential vulnerability,
and the resulting impact of that adverse event on
the organization.
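One common simplification of this definition, assumed here rather than taken from the slides, scores risk as likelihood multiplied by impact. The sketch below ranks a few example threats by that score. The 1-5 scales, the `Threat` type, and every number are hypothetical placeholders, not assessments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    name: str
    likelihood: int  # hypothetical 1-5 scale: chance the threat acts on a vulnerability
    impact: int      # hypothetical 1-5 scale: resulting impact on the organization


def risk_score(threat: Threat) -> int:
    # Assumed simplification: risk = likelihood x impact.
    return threat.likelihood * threat.impact


def rank_by_risk(threats):
    # Highest score first, so the riskiest scenarios are planned for first.
    return sorted(threats, key=risk_score, reverse=True)


# Illustrative values only.
threats = [
    Threat("Human error", likelihood=4, impact=2),
    Threat("Extended power outage", likelihood=3, impact=4),
    Threat("Fire in the facility", likelihood=1, impact=5),
]

for t in rank_by_risk(threats):
    print(f"{t.name}: {risk_score(t)}")
```

A real assessment would replace the multiplication with whatever scoring model the organization already uses; the point is only that likelihood and impact are weighed together.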
9. Some threats that can cause Disasters…
• Human Error
• Localized IT systems /
network failure
• Extended power outage
• Telecommunications outage
• Storm / Weather damage
• Earthquake / Volcano
• Fire in the facility
• Facility flooding
• Local evacuation
• Cyber attack
• Sabotage
10. (Varrow) Disaster Recovery Approach
• Interviews with key personnel to understand Business Process priorities
and establish Business Impact Analysis (BIA).
• Review existing IT production infrastructure, including applications,
servers, storage, network, and external connectivity. Identify Risks and
Gaps.
• Establish Disaster Impact Scenarios and Disaster Recovery strategies to
meet requirements.
• Recommend Roadmap for establishing recovery capabilities and
documenting plans.
• Implement required recovery capabilities.
• Develop framework and content for IT DR Plan.
• Develop maintenance and test procedures for IT DR Plan.
• Address Business Continuity requirements and planning as appropriate.
11. What is the Business Impact Analysis?
• A conversation between IT and key stakeholders to
understand:
– What are the most time-critical and information-critical
business processes?
– How does the business REALLY rely upon IT Service and
Application availability?
– What are the Student, Financial, Regulatory, Reputational, and other
impacts of IT Service and Application unavailability?
– What availability or recoverability capabilities are justifiable
based on these requirements, potential impact, and costs?
12. [Figure: timeline from 5 a.m. to 7 p.m. with “DECLARE DISASTER” marked at 10 a.m., labeled with Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO)]
RPO: Amount of data lost from failure, measured as the amount
of time from a disaster event
RTO: Targeted amount of time to restart a business service
after a disaster event
Disaster Recovery: Key Measures
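As a rough illustration of the two measures, the sketch below compares an actual data-loss window and an actual downtime against their objectives. The timestamps, the 4-hour objectives, and the function names are hypothetical; only the 10 a.m. disaster time echoes the slide.

```python
from datetime import datetime, timedelta


def data_loss_window(last_recovery_point: datetime, disaster: datetime) -> timedelta:
    # Data written after the last usable copy and before the event is lost.
    return disaster - last_recovery_point


def downtime(disaster: datetime, service_restored: datetime) -> timedelta:
    # Time from the disaster event until the business service is running again.
    return service_restored - disaster


def meets_objective(actual: timedelta, objective: timedelta) -> bool:
    return actual <= objective


# Hypothetical scenario: disaster declared at 10 a.m.
disaster = datetime(2014, 3, 20, 10, 0)
last_copy = datetime(2014, 3, 20, 6, 0)   # last replicated or backed-up copy
restored = datetime(2014, 3, 20, 15, 0)   # service back online

rpo = timedelta(hours=4)  # assumed objective
rto = timedelta(hours=4)  # assumed objective

print(meets_objective(data_loss_window(last_copy, disaster), rpo))  # 4h lost vs 4h RPO
print(meets_objective(downtime(disaster, restored), rto))           # 5h down vs 4h RTO
```

In this scenario the RPO is met exactly while the RTO is missed by an hour, which is the kind of gap a DR test is meant to expose.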
13. [Figure: Cost plotted against Recovery Point and Recovery Time, each scaled in Weeks, Days, Hours, Minutes, Seconds around Real Time]
Disaster Recovery: Key Measures
14. BIA - Example Priority Tiers
Priority Tier (recovery target): Description
• Priority 1 (High Availability / Immediate Recovery): Services whose unavailability for more than a brief period can have a severe impact on customers or time-critical business operations.
• Priority 2 (1-2 day recovery): Services whose unavailability significantly impacts customers or business operations.
• Priority 3 (3-5 day recovery): Services which can tolerate up to five days of disruption in a disaster.
• Priority 4 (6-10 day recovery): Services which can tolerate up to ten days of disruption in a disaster.
– Priority 3 and 4 systems may be restored in less time, depending on the situation. However, higher priority functions will be restored first.
• Priority 5 (“Best effort” recovery): Non-critical services which can tolerate two weeks or more of disruption in a disaster. These systems will be restored on a best-effort basis, after other more critical systems have been restored and ongoing operations have resumed.
– Priority 5 systems may be restored in less time, depending on the situation. However, higher priority functions will be restored first. In some cases, systems deemed to not be required for continued operations may not be restored.
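The rule that higher priority functions are restored first can be sketched as a simple sort over a service inventory. The `Service` type, the `required` flag, and the service names below are hypothetical; the tier numbers follow the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    tier: int              # 1 (highest priority) through 5 ("best effort")
    required: bool = True  # False: not required for continued operations


def restore_order(services):
    # Systems not required for continued operations may not be restored at all.
    needed = [s for s in services if s.required]
    # Lower tier number means higher priority; sorted() is stable,
    # so services in the same tier keep their inventory order.
    return sorted(needed, key=lambda s: s.tier)


# Hypothetical inventory.
inventory = [
    Service("Email", tier=2),
    Service("Payroll", tier=3),
    Service("Legacy archive", tier=5, required=False),
    Service("Customer portal", tier=1),
]

for s in restore_order(inventory):
    print(f"Priority {s.tier}: {s.name}")
```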
15. What does it take to RECOVER
from an IT Disaster?
• Data Protection
– Backups, Replication
• Recovery Facility
– Location to rebuild IT infrastructure or provision services
• Data Recovery & Storage
– Get Data into a form that is usable
• Servers / Compute Capacity
– Sufficient servers or virtual compute capacity to actually run the applications
• Network, Voice, and Data Communications
– Connect servers, storage and workers
– Connect the recovery site to work sites
– Communicate with customers
– Includes network, telecom, demarcation equipment; cabling; telecom provisioning
• DR Plan
– Documented and tested procedures for what to do, and how to do it
• People
17. Example Disaster Recovery Strategies
• Priority 1 (4 hour RTO or less)
– Disaster Recovery Strategy: Establish hot site for systems and data in a secondary data center at a remote location that is unlikely to be impacted by a local or regional event.
– Data Protection Approach: Replicate / remote mirror / short interval remote disk-to-disk backup
• Priority 2 (24-48 hour RTO)
– Disaster Recovery Strategy: Maintain sufficient remote physical or virtual infrastructure for restoration. Ensure sufficient space/power in recovery facility.
– Data Protection Approach: Remote disk-to-disk backup
• Priority 3 (72 hour RTO)
– Disaster Recovery Strategy: Ensure ability to quickly acquire infrastructure for restoration. Ensure sufficient space/power in recovery facility.
– Data Protection Approach: Tape (with sufficient off-site rotation) or remote disk-to-disk backup
• Priority 4 (1-2 week RTO)
– Disaster Recovery Strategy: Ensure ability to quickly acquire infrastructure for restoration. Ensure sufficient space/power in recovery facility.
– Data Protection Approach: Tape (with sufficient off-site rotation) or remote disk-to-disk backup
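One way to read this table is as a lookup: given the RTO a service requires, pick the least demanding strategy that still meets it. The sketch below assumes the upper bound of each RTO range (4, 48, 72, and 336 hours) as the tier limit; the `Strategy` type and `strategy_for` function are hypothetical, and the strategy text is abbreviated from the table.

```python
from typing import NamedTuple


class Strategy(NamedTuple):
    priority: int
    max_rto_hours: int  # assumed: upper bound of the RTO range in the table
    recovery: str
    data_protection: str


STRATEGIES = [
    Strategy(1, 4, "Hot site in a remote secondary data center",
             "Replicate / remote mirror / short interval remote disk-to-disk backup"),
    Strategy(2, 48, "Maintain sufficient remote physical or virtual infrastructure",
             "Remote disk-to-disk backup"),
    Strategy(3, 72, "Ensure ability to quickly acquire infrastructure",
             "Tape (off-site rotation) or remote disk-to-disk backup"),
    Strategy(4, 336, "Ensure ability to quickly acquire infrastructure",
             "Tape (off-site rotation) or remote disk-to-disk backup"),
]


def strategy_for(required_rto_hours: float) -> Strategy:
    # Walk from the least to the most demanding tier and stop at the first
    # one whose RTO bound fits within what the service requires.
    for s in reversed(STRATEGIES):
        if s.max_rto_hours <= required_rto_hours:
            return s
    # Anything tighter than 4 hours still lands in Priority 1 ("4 hour RTO or less").
    return STRATEGIES[0]


choice = strategy_for(60)  # a service that must be back within 60 hours
print(choice.priority, choice.recovery, "|", choice.data_protection)
```

Choosing the least demanding tier that fits reflects the cost trade-off in the deck: tighter recovery objectives cost more, so a service should not be placed in a higher tier than its requirement justifies.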
18. [Figure: RecoverPoint bi-directional replication/recovery between a production site and an optional disaster recovery site, linked over Fibre Channel/WAN. Production site: application servers, SAN, RecoverPoint appliance, storage arrays holding Prod LUNs, local copy, and production and local journals. Disaster recovery site: standby servers, SAN, RecoverPoint appliance, storage arrays holding remote copy and remote journal. Write splitter options: host-based; fabric-based; Symmetrix VMAXe, VNX-, and CLARiiON-based.]
Storage Arrays + Replication
19. • vSphere Replication
– Simple, cost-efficient replication for Tier 2 applications and smaller sites
• Storage-based Replication
– High-performance replication for business-critical applications in larger sites
[Figure: Site A (Primary) and Site B (Recovery), each with vCenter Server, Site Recovery Manager, and vSphere, connected by vSphere Replication and storage-based replication]
20. 1. One Big Reason – Expectation Alignment
2. Business DR Perspectives
3. Technology Underneath
Summary