3. Project Initiation and Management
● First steps to building a BCP
  – Obtain senior management support to go forward with the project
  – Define a project scope, the objectives to be achieved, and the planning assumptions
  – Estimate the project resources needed to be successful, both human resources and financial resources
  – Define a timeline and major deliverables of the project
● Assign a project manager for the initial creation of a BCP and DRP
4. Senior Leadership Support
● Senior Leadership's major goals:
  – Execute the mission
  – Protect the organization
● Risks you can point out to get buy-in:
  – Financial
  – Reputational
  – Regulatory (lawsuits)
  – Senior management could be held liable for not using due care to protect the corporation.
● BCP and DRP plans can take a year or more to complete, so management support is critical to keep the process from being postponed halfway through.
5. Financial Risks
● Can be quantified
● Determines amount to spend on the recovery program
● P * M = C
  – Probability of harm (P)
    ● How likely is a damaging event to occur?
  – Magnitude of harm (M)
    ● What is the financial damage for a single event?
  – Cost of prevention (C)
    ● The cost of putting in place a countermeasure. The cost of the countermeasure should not be more than the cost of the event.
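A minimal sketch of the P * M = C rule, reading C as the ceiling on what is worth spending on prevention. The function names and the dollar figures are hypothetical, chosen only for illustration.

```python
def max_countermeasure_budget(probability: float, magnitude: float) -> float:
    """Return P * M, the most it makes sense to spend on prevention (C)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if magnitude < 0:
        raise ValueError("magnitude must be non-negative")
    return probability * magnitude


def is_worth_buying(cost: float, probability: float, magnitude: float) -> bool:
    """A countermeasure is justified only if it costs no more than P * M."""
    return cost <= max_countermeasure_budget(probability, magnitude)


# Hypothetical figures: a 25% chance of an event that would cost $200,000.
budget = max_countermeasure_budget(0.25, 200_000)
```

With these made-up numbers, a $40,000 countermeasure would pass the check and a $60,000 one would not.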
6. Additional Benefits of Planning
● Locating single points of failure (SPOF)
● Process improvements
● Dealing with technical incidents
7. Project Scope and Plan
● It's very important to gain firm agreement on the scope and goals of the DRP and BCP.
  – Technology only or include business processes?
  – Main office only or all offices?
  – Workforce impairment
    – Pandemic, labor strike, transportation issues
● Project manager must agree with leadership on scope, timeline, and deliverables.
8. Legal and Regulatory Requirements
● Many industries have applicable regulations.
● Recent regulations:
  – Implementing Recommendations of the 9/11 Commission Act of 2007 (Public Law 110-53)
    ● Recommends that private sector organizations validate their recovery readiness by comparing their programs to an unnamed standard (NFPA 1600 has been proposed)
    ● US Government endorsed but voluntary
  – British Standard BS25999
9. The Ten Professional Practice Areas
NFPA 1600
● Project Initiation and Management
● Risk Evaluation and Control
● Business Impact Analysis
● Developing Business Continuity Strategies
● Emergency Response and Operations
● Developing and Implementing Business Continuity Plans
● Awareness and Training Programs
● Maintaining and Exercising Business Continuity Plans
● Public Relations and Crisis Communications
● Coordination with Public Authorities
10. BS25999
● Extension of PAS56
● Intention is to create the ability to demonstrate compliance with the standard
● Stage 1: Audit including a desktop review
  – Must be completed before Stage 2
● Stage 2: Conformance and certification audit where the planner must demonstrate implementation
● If implementation fails, corrective action must be agreed upon.
● If both stages are completed, the organization can apply for BS25999 certification.
11. US Financial Regulations
● The Federal Financial Institutions Examination Council (FFIEC) specifies that BCP is about maintaining, resuming, and recovering the organization, not just the technology.
  – The planning process must be conducted enterprise-wide.
● BCP and test results should be independently audited and reviewed by the board of directors.
● The company should be aware of the BCP activities of its third-party providers, key suppliers, and organization partners.
● If processes are outsourced, the service provider's BCP must be reviewed to ensure critical services can be restored within acceptable timeframes.
● Additional Regulations:
  – National Association of Insurance Commissioners (NAIC)
  – National Futures Association Compliance Rule 2-38
  – Electronic Funds Transfer Act
  – Basel Committee
12. Other Regulations and Standards
● Australian Prudential Standard CPS 232 (July 2012)
  – An institution's BCM must include:
    ● BCM Policy
    ● Business Impact Analysis (BIA) including risk assessment
    ● Recovery objectives and strategies
    ● Business Continuity Plan (BCP) including crisis management and recovery
    ● Review and testing of the BCP
    ● Training and awareness
● Monetary Authority of Singapore (June 2003)
● Standard for Business Continuity/Disaster Recovery Service Providers (SS507)
  – Singapore
    ● Sets stringent standards for DR service providers
● HIPAA
  – Requires a data backup plan, a disaster recovery plan, and an emergency mode operations plan
13. Sarbanes Oxley Section 404
● Applies to issuers required to file an annual report under Section 13(a) or 15(d) of the Securities Exchange Act of 1934 (15 USC 78m or 78o(d))
● The report must:
  – State the responsibility of management for establishing and maintaining an adequate internal control structure and procedures for financial reporting
  – Contain an assessment, as of the end of the most recent fiscal year of the issuer, of the effectiveness of the internal control structure and procedures of the issuer
  – Internal Control Evaluation and Reporting
● BCP and contingency planning are not considered in scope
14. Legal Standards
● Blake vs Woodford Bank & Trust Co (1977)
  – Foreseeable workload: failure to prepare
● Sun Cattle Company, Inc vs Miners Bank (1974)
  – Computer system failure: foreseeable computer failure
● US vs Carroll Towing Company (1947)
  – Defined breach of duty of care where B < PL
    ● B = Burden (cost) of taking precautions
    ● P = Probability of loss
    ● L = Gravity of loss
    ● P * L must be greater than B to create a duty of due care for the defendant
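The B < PL test from Carroll Towing reduces to a one-line comparison. The sketch below uses a hypothetical function name and invented figures (a backup generator as the precaution); it illustrates the arithmetic only and is not legal guidance.

```python
def duty_of_care_breached(burden: float, probability: float, loss: float) -> bool:
    """Return True when B < P * L, i.e. skipping the precaution breaches the duty."""
    return burden < probability * loss


# Hypothetical figures: a $20,000 generator (B) against a 5% chance (P)
# of a $1,000,000 outage (L). P * L is about $50,000, which exceeds B.
generator_required = duty_of_care_breached(20_000, 0.05, 1_000_000)
```

If the same precaution cost $80,000 instead, B would exceed P * L and the formula would not create a duty to take it.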
15. Legal Standard Continued
● Negligent Standard to Plan or Prepare (pandemic), 2003
  – Canadian nurses filed suit saying the federal government was negligent in not preparing for the second wave after the disease was first identified.
16. Resource Requirements
● Requires a plan for both staff and finances
● Staff resources
  – Need staff from business operations and technology groups (IT).
  – Identify recovery priority
  – Identify required timeframes
    ● Once timeframes are identified, plan staffing to meet them (for example, if 24-hour recovery will be required)
  – The staff planning the recovery must be the same team that executes the recovery in the event of an incident.
17. Financial Resources
● Finances may be required to:
  – Hire outside contractors/consultants
  – Travel to offsite locations
  – Purchase hardware, software, etc.
18. Emergency Notification Lists
● The BCP/DRP planner should build a contact list of critical staff and leadership.
● The list should include at a minimum:
  – Title, name, home phone, work phone, mobile phone
  – Tim also recommends including home address
● Tim also recommends: Distribute the list and make sure everyone on the list has a physical copy offsite. Storing the list in a computer system housed onsite with no offline copies is stupid.
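One way to keep the list honest is to check each entry against the minimum fields above. This is a sketch only; the `Contact` class, `MINIMUM_FIELDS`, and `missing_fields` are hypothetical names, and the sample entry is invented.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    title: str
    name: str
    home_phone: str
    work_phone: str
    mobile_phone: str
    home_address: str = ""  # recommended, but not part of the stated minimum


# The minimum set of fields named on the slide.
MINIMUM_FIELDS = ("title", "name", "home_phone", "work_phone", "mobile_phone")


def missing_fields(contact: Contact) -> list[str]:
    """Return the names of minimum fields that are blank for this contact."""
    return [f for f in MINIMUM_FIELDS if not getattr(contact, f).strip()]


# Hypothetical entry with no home phone recorded.
entry = Contact("CIO", "Pat Example", "", "555-0100", "555-0101")
gaps = missing_fields(entry)
```

A check like this only validates completeness; the printed, offsite copy is still what matters when the onsite systems are down.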
19. Vital Records
● All vital records needed to rebuild the organization must be stored offsite in a secure location that can be accessed following a disaster.
● This includes electronic data backups as well as paper record backups.
20. Common Vital Records
● Anything with a signature
● Customer correspondence
● Customer conversations
● Accounting records
● Justification proposals/documents
● Transcripts/minutes of meetings with legal significance
● Paper with value (stock certificates, bonds, commercial paper)
● Legal documents (letters of incorporation, deeds, etc.)
21. Common Vital Records
● Databases and contact lists for employees, customers, vendors, partners, etc.
● Business unit contingency plans
● Procedure/application manuals
● Backup files from production servers/applications
● Reference documents used regularly
● Calendar files or printouts
● Source code
22. Risk and Business Analysis
● The planning team will make recommendations about which risks the organization should mitigate and which systems and processes the plan will recover and when.
23. Strategy Development
● The planner will review different strategies for business recovery based on the required SLA for critical systems.
● Cost/benefit analysis will be done to identify strategy viability.
24. Alternate Site Selection and Implementation
● The planner selects and builds out alternate sites used to recover the organization/technology.
● Shouldn't be susceptible to the same threats as the primary site.
  – Example: If Fargo is the main datacenter location, the backup site shouldn't be in Grand Forks. If one floods, the other is likely to flood at the same time.
● Good resources:
  – www.prep4agthreats.org
  – www.switchlv.com/wpcontent/uploads/disaster_avoidance_2013/disastermap.html
27. Documenting the Plan
● All of the information is compiled into a plan document.
● Procedures are designed for each site and for each technology and/or application to be recovered.
28. Testing, Maintenance, and Updating
● The plan must be validated by testing recovery.
● A maintenance schedule must be established so the plan doesn't become obsolete.
29. Business Impact Analysis
● The purpose of a BIA is to decide what needs to be recovered and how quickly.
● Priority:
  – Critical
  – Essential
  – Supporting
  – Non-Essential
● Must determine maximum tolerable downtime (MTD). The term is often used interchangeably with Recovery Time Objective (RTO), although strictly the RTO is the recovery target and must not exceed the MTD.
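The BIA output can be pictured as a list of systems ranked by priority tier and then by MTD. The sketch below assumes the tiers rank from Critical down to Non-Essential; the `System` class, the tier ordering, and the sample systems are all hypothetical.

```python
from dataclasses import dataclass

# Assumed ranking of the four priority tiers (lower number recovers first).
PRIORITY_ORDER = {"Critical": 0, "Essential": 1, "Supporting": 2, "Non-Essential": 3}


@dataclass
class System:
    name: str
    priority: str
    mtd_hours: float  # maximum tolerable downtime, in hours


def recovery_order(systems: list[System]) -> list[System]:
    """Sort by priority tier first, then by the shortest MTD within a tier."""
    return sorted(systems, key=lambda s: (PRIORITY_ORDER[s.priority], s.mtd_hours))


# Invented inventory for illustration.
inventory = [
    System("Intranet wiki", "Non-Essential", 168),
    System("Payroll", "Essential", 72),
    System("Order database", "Critical", 4),
    System("Email", "Critical", 8),
]
ordered_names = [s.name for s in recovery_order(inventory)]
```

Sorting on a tuple keeps the tier as the primary key, so a Supporting system with a short MTD never jumps ahead of a Critical one.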
30. Risk Assessments
● Three elements of risk:
  – Threats
  – Assets
  – Mitigating Factors
● Threats are measured as a probability (for example, may happen 1 in 10 years).
● Most common threat is power availability.
● Second most common is a water event.
  – Flooding, plumbing leak, broken pipe, leaky roof, water main break
● Other Common Threats:
  – Severe weather, cable cuts, fires, labor disputes, transportation mishaps, hardware failures.
31. Internal Threats
● Equipment fails prematurely:
  – Improper installation
  – Improper environment
● Equipment fails due to wear and tear:
  – Most equipment has a “mean time between failures” (MTBF) rating.
  – Running equipment beyond MTBF is risking failure.
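A simple inventory check can flag equipment that has run past its MTBF rating. This is a sketch under the slide's rule of thumb; the function name, the inventory layout, and the hour figures are hypothetical.

```python
def replacement_candidates(inventory: dict[str, tuple[float, float]]) -> list[str]:
    """Return names of devices whose hours in service exceed their MTBF rating.

    inventory maps a device name to (hours_in_service, mtbf_hours).
    """
    return sorted(
        name
        for name, (hours_in_service, mtbf_hours) in inventory.items()
        if hours_in_service > mtbf_hours
    )


# Invented figures for illustration.
devices = {
    "core-switch-1": (61_000, 50_000),
    "ups-a": (12_000, 100_000),
    "disk-array-2": (90_000, 80_000),
}
overdue = replacement_candidates(devices)
```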
32. Assets
● If the organization doesn't own anything then it won't be concerned about risks because it has little or nothing to lose. (Gotta love IT Security consulting!!!)
● Assets include:
  – Information
  – Financial
  – Physical
  – Human
33. Mitigating Factors
● Controls or safeguards that will be put in place to reduce the impact of a threat.
● Example: UPS devices can save production systems from hard crashes, which could lead to data loss and long recovery times.
● When a risk is identified, the planner must accept the risk, transfer the risk, avoid the risk, or mitigate the risk.
34. Mitigation Strategies
● Accept
  – The risk is so unlikely to occur or the impact is so small that it'd cost more to mitigate.
● Transfer
  – Insurance
● Avoidance
  – Have compensating controls so the risk is completely removed. Example: having 2 call centers in very different climates. In the event of inclement weather in one, the other is still operational.
● Mitigation
  – Controls implemented to avoid the risk or to lessen the impact.
37. Surviving Site Strategy
● A surviving site strategy is implemented so that while service levels may drop, a function never ceases to be performed because it operates in at least two geographically dispersed buildings that are fully equipped and staffed.
38. Self Service Strategy
● An organization can transfer work to another of its own locations, which has available facilities and/or staff to manage the time-sensitive workload until the interruption is over.
39. Internal Arrangement Strategy
● Training rooms, cafeterias, and conference rooms may be equipped to support organizational functions while staff from the impacted site travel to another site and resume those functions.
40. Mutual Aid Agreement Strategies
● Other similar organizations may be able to accommodate those affected.
41. Dedicated Alternate Site Strategy
● Built by the company to accommodate organization function or technology recovery.
43. External Supplier Strategy
● Pay an external company for disaster recovery.
● These companies provide data centers, alternate site spaces, mobile units, and temporary staff.
44. Backup Storage Strategy
● Data should be backed up at least once a day and a copy sent offsite.
● The offsite storage should be far enough away from your primary site to be safe and close enough to your recovery site to allow timely recovery operations to start.
● Systems should be prioritized to make sure resources are available for the most critical systems and data.
● A full backup is normally taken and then incremental backups occur every few hours or every day.
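With a full-plus-incremental scheme, a restore needs the most recent full backup and every incremental taken after it. The sketch below shows that selection; the `restore_chain` name, the tuple layout, and the sample schedule are hypothetical.

```python
def restore_chain(backups: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return the latest full backup plus all incrementals that follow it.

    backups is a list of (label, kind) tuples ordered oldest to newest,
    where kind is either "full" or "incremental".
    """
    last_full = None
    for index, (_, kind) in enumerate(backups):
        if kind == "full":
            last_full = index
    if last_full is None:
        raise ValueError("no full backup available to restore from")
    return backups[last_full:]


# Invented schedule: a nightly full with incrementals during the day.
history = [
    ("Mon 00:00", "full"),
    ("Mon 12:00", "incremental"),
    ("Tue 00:00", "full"),
    ("Tue 06:00", "incremental"),
    ("Tue 12:00", "incremental"),
]
to_restore = restore_chain(history)
```

Every set in the returned chain has to be retrievable from offsite storage, which is one reason the storage location matters for recovery time.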
46. Dual Data Center
● Applications are load balanced or hot swapped between two data centers so downtime is minimized.
● Each data center should be able to operate at full load.
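The full-load requirement can be checked directly: either data center alone must cover the combined peak. A minimal sketch, with a hypothetical function name and invented capacity units:

```python
def survives_single_site_loss(site_capacities: list[float], peak_load: float) -> bool:
    """Return True only if every site can carry the entire peak load alone."""
    return all(capacity >= peak_load for capacity in site_capacities)


# Invented figures, in requests per second.
balanced_pair_ok = survives_single_site_loss([1200, 1000], peak_load=1000)
undersized_pair_ok = survives_single_site_loss([1200, 800], peak_load=1000)
```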
47. Internal Hot Site
● Site is standby ready with all necessary technology and equipment already in place.
● Often used as dev/test until recovery is needed, at which time dev/test is removed and production is implemented.
● Should use exactly the same hardware, software, etc. as production.
48. External Hot Site
● Equipment is installed and waiting, but the environment must be rebuilt for recovery.
● Often contracted through a recovery service provider.
● Equipment and software should be kept as close to identical as possible to speed recovery.
49. Warm Site
● A leased or rented facility which is partially configured with some equipment, but not the actual computers.
● Generally has cooling, cabling, and networking in place.
● Servers are delivered to the site at the time of the disaster.
50. Cold Site
● Empty data center space with no technology.
● All technology must be acquired at the time of a disaster.
51. Mobile Sites
● A mobile home or sea cargo trailer with a data center in it, which can be dropped off and hooked up and is then ready to go.
53. Reciprocal Agreements
● Similar organizations can share the risk of an outage by hosting the data and processing of the other organization in the event of a disaster.
● Raises many contractual, legal, and compliance issues depending on what data you process.
55. Multiple Processing Sites
● Multiple sites inside the organization can be used for processing.
● Useful if the company is spread throughout the country or world.
● Runs into bandwidth and latency issues.
56. Disaster Recovery Process
● When things are going bad, people get stressed and make bad decisions
● Document the plan!
  – Clear instructions on who will do what and when
  – Consistent regardless of the event
  – Define communication strategy
  – Distribute to everyone who has a role in recovery
  – Test/verify the plan
57. Disaster Recovery Process
● Response
  – Assessment team: evaluates the event and escalates to the appropriate people if needed
  – Escalation team: contacted by assessment team
    ● Consists of event owner, responders, stakeholders
● Emergency notification lists
  – Response teams must be reachable 24x7
  – Must be reachable by everyone in the organization
  – Should be used for every event, from plumbing leaks to Godzilla attacks
58. Disaster Recovery Process
● Emergency Management Team
  – Provide management (short-term tactical command)
  – Assess damage, keep executives in the loop, initiate and organize response
● Executive Team
  – Senior executives
  – Respond to issues that need direction
  – Handle PR
  – Provide leadership (long-term strategic direction)
59. Disaster Recovery Process
● Emergency Response Team
  – Execute the recovery process
  – Retrieve recovery info (potentially offsite)
  – Communicate with command center
  – Work with alternate site personnel
  – Identify/install replacement equipment or software
● Command Center
  – Should have a copy of the plan so they can ensure it is being followed correctly
  – Should keep track of what's being done and costs
61. Communications
● Important to keep everyone informed
  – Emergency notification list
    ● Team members/managers who disseminate notifications
  – Contingency line (ex: phone number printed on badges)
    ● Single number to call to get the latest info
  – PR
    ● Important that everyone tells the same story
    ● Keep things short and honest
  – Multiple communication channels
    ● Could actually reduce confusion (techs on their own conference bridge since some jargon sounds scary)
62. Assessment
● Process to rate severity of events
● Tiered categories like:
  – Non-incident: limited or no disruption
  – Incident: causes downtime for a facility or service
    ● Trigger disaster recovery plan, report to senior management
  – Severe incident: significant destruction or disruption
    ● Trigger DR, contact senior management and crisis management
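The tiers above map naturally onto an enumeration with an action list per tier. The sketch below is one hypothetical encoding; the two boolean inputs to `classify` are assumptions standing in for whatever criteria the assessment team actually uses.

```python
from enum import Enum


class Severity(Enum):
    NON_INCIDENT = "non-incident"
    INCIDENT = "incident"
    SEVERE_INCIDENT = "severe incident"


# Actions per tier, taken from the categories listed above.
ACTIONS = {
    Severity.NON_INCIDENT: [],
    Severity.INCIDENT: ["trigger DR plan", "report to senior management"],
    Severity.SEVERE_INCIDENT: [
        "trigger DR plan",
        "contact senior management",
        "contact crisis management",
    ],
}


def classify(causes_downtime: bool, significant_destruction: bool) -> Severity:
    """Pick the highest tier whose condition is met."""
    if significant_destruction:
        return Severity.SEVERE_INCIDENT
    if causes_downtime:
        return Severity.INCIDENT
    return Severity.NON_INCIDENT


plumbing_leak = classify(causes_downtime=False, significant_destruction=False)
```

Writing the tiers down as data keeps the response consistent regardless of who is on call when the event is assessed.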
63. Restoration
● Planned event after recovery
  – Interim plans (example)
    ● Part of DR plan was to set up alternative site
    ● Work from alt site until original site is restored
    ● Slowly transition back to original site
    ● After everything is back at the original site, dismantle the alternate site
64. Training
● Awareness program
  – Make sure everyone knows the plan before they need to use it
  – Train all employees on how to raise issues to the evaluation team
  – Train stakeholders on their role in case of an event
● Conduct exercises to practice
  – Reassure customers that a plan is in place so the organization will always be there
  – New hire training
65. Exercises
● “Exercise” instead of “test”
  – “Test” makes people think it's pass/fail
● Call exercise: activate the call tree
  – Verify numbers are correct
  – What percentage were unavailable?
● Walkthrough exercise
  – Talk through a scenario with everyone
  – Make sure everyone has actually read the plan
  – Find weaknesses
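For the call exercise above, the "percentage unavailable" figure is a simple tally. A minimal sketch, assuming the results are recorded as a mapping from a (made-up) person label to whether they were reached:

```python
def unavailable_percentage(call_results: dict[str, bool]) -> float:
    """Return the share of people on the call tree who could not be reached."""
    if not call_results:
        raise ValueError("no calls were attempted")
    unreachable = sum(1 for reached in call_results.values() if not reached)
    return 100.0 * unreachable / len(call_results)


# Invented results from one call exercise.
results = {
    "assessment lead": True,
    "network on-call": False,
    "facilities manager": True,
    "PR contact": True,
}
percent_unavailable = unavailable_percentage(results)
```

Tracking this number across exercises shows whether the notification list is being kept up to date.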
66. Exercises
● Simulation
  – Never create a disaster by testing for one
  – Validate alternative site readiness
  – Considered successful if everything worked out to get the resources needed to recover
  – Also successful if it didn't, since you learn what to fix
● Compact exercise
  – Start with call exercise and run right into a simulation
  – Fake injuries, pretend reporters, fire drills
67. Maintaining the Plan
● Should be reviewed regularly and updated
  – Review every 3 months
  – Formal audit yearly
● Version control
  – Ensures everyone is using the latest version
  – Keeps a history of what changed and why
● Store the latest plan offsite so it's available in a real disaster
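The review cadence above (every 3 months, with a yearly audit) can be turned into due dates with a small helper. This sketch uses only the standard library; the function names are hypothetical, and clamping to the last day of a shorter month is a design assumption.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Add whole months to a date, clamping the day to the target month's length."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def next_review(last_review: date) -> date:
    """Quarterly review: due 3 months after the previous one."""
    return add_months(last_review, 3)


def next_audit(last_audit: date) -> date:
    """Formal audit: due 12 months after the previous one."""
    return add_months(last_audit, 12)
```

For example, a review completed on 30 November would next fall due on the last day of February, since February has no 30th.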
68. Disaster Recovery Program
● Probably will start as a project
  – Projects have an end; DR must be ongoing
● Transition into an ongoing process
  – Repeat the steps regularly
  – Use the program to spin off smaller projects like yearly audits and quarterly reviews
● Emergency Management Organization (EMO)
  – Department or group responsible
● Emergency Operations Center (EOC)
  – Provides a location and resources for recovery
69. Other Risk Areas
● Business continuity is closely related to other areas of risk
  – A good DR plan doesn't matter if records management policy is so poor that offsite backups don't exist or aren't maintained
  – Good firewall policy doesn't help if the alternate site has so little physical security that people could enter it and access the data directly
● Need to address all risk areas for complete coverage