1. Regularly testing disaster recovery is important to ensure it works as expected when needed.
2. Testing DR in the cloud is easy and cost-effective because resources need to be provisioned only for the duration of the test.
3. Different aspects of the DR plan like data transfer speeds and restoration times should be validated during testing.
STG202 Parmigiano, a Monastery, Love and Faith: Technical Lessons on how to do Backup and Disaster Recovery in the Cloud - AWS re:Invent 2012
2. "The mind is not a vessel to be filled,
but a fire to be ignited."
- Plutarch
3. Agenda
I. Prologue: The story of Monte Cassino
II. Lessons: Backup
III. Customer Story: Shaw Media
IV. Earthquake: What happened to my Parmigiano?
V. Lessons: Disaster Recovery
VI. Conclusions: ... And a little surprise!
9. 800 papal documents
20,500 volumes in the Old Library
60,000 in the New Library
200 manuscripts on parchment
100,000 prints and paintings (including 11 Titians)
500 incunabula
Notes: Titian was one of the most influential painters ever. An incunabulum is a book printed before 1501 C.E.; Gutenberg’s Bible was printed in 1455 C.E.
[ The Treasure of Monte Cassino ]
10. High availability, Backup storage, Disaster recovery
[ Business continuity continuum ]
13. High Availability :
Keeping services alive.
Backing up :
Process of copying and archiving of data so it may be used to
restore the original after a data loss event.
Disaster recovery :
Recovery of technology infrastructure critical to an
organization after a natural or human-induced disaster.
[ Business continuity continuum ]
14. Monastery :
Brilliant, scalable, low-cost, highly durable backup system
Origin of Universities (Charlemagne, 814 C.E.)
The Empire needs educated people. Let’s ask the Church!
Edict: Free education in cathedrals and monasteries
Lots of books (and backups)
[ Origin of Backup ]
15. Monastery :
Brilliant, scalable, low-cost, highly durable backup system.
Origin of Universities (Charlemagne, 814 C.E.)
Barbarians, pestilences, fires, invasions, wars, famines, revolts, etc.
Indoctrination :
One of the first critical functions within an organization (Catholic Church) that needed continuation after any natural or human-induced disaster.
It needed backup of books (Bibles, etc.) in order to function.
[ Origin ]
18. Dec 1942: Many “treasures” are transported from Rome and other places to Monte Cassino, for safety
[ The Treasure of Monte Cassino ]
19. Intercepted German message:
“Ist der Abt noch im Kloster?” (“Is the Abt still in the monastery?”)
“Ja.” (“Yes.”)
The word “Abt” means “Military Division” (abbreviated). It also means “Abbot” (abbreviated).
[ Lost in translation ]
37. 2. My backup should be able to scale
• “Infinite” scale with Amazon S3 and Amazon Glacier
• Scale to multiple regions
• Seamless
• No need to provision
• Cost tiers (cheaper at scale)
[ Lessons from Monte Cassino ]
38. Regions (8) GovCloud Regions (1)
[ Global AWS Infrastructure ]
(as of Nov 27th, 2012)
40. Edge Locations (38):
Seattle, South Bend, New York (2), London, Amsterdam (2), Newark, Dublin, Stockholm, Palo Alto, Tokyo, San Jose, Paris, Frankfurt (2), Ashburn (2), Milan, Los Angeles (2), Jacksonville, Madrid, Osaka, Dallas (2), Hong Kong, St. Louis, Miami, Singapore (2), Sydney, São Paulo
[ Global AWS Infrastructure ]
(as of Nov 27th, 2012)
42. 3. My backup should be safe
• SSL Endpoints (Amazon S3 and Amazon Glacier)
• Signed API calls
• Store encrypted files
• Server-side encryption
• Durability: multiple copies across different data centers
• Local/cloud with AWS Storage Gateway
[ Lessons from Monte Cassino ]
45. 4. My backup should work with a DR policy
• Easy to integrate within AWS or Hybrid
• AWS Storage Gateway: Run services on Amazon EC2 (DR)
• Clear costs
• Reduced costs
• I decide redundancy/availability in relation to costs
[ Lessons from Monte Cassino ]
47. 5. Someone should care about it
• Clear ownership
• Permissions with IAM: Users, groups -> roles
• Logs
• AWS support
[ Lessons from Monte Cassino ]
48. 1. My backup should be accessible
2. My backup should be able to scale
3. My backup should be safe
4. My backup should work with a DR policy
5. Someone should care about it
[ Lessons from Monte Cassino ]
52. [ Who we are ]
• Shaw Media: Division of Shaw Communications Inc.
• It reaches almost 100% of Canadians; 18 specialty channels
• Global national newscast: 1+ million viewers every weekday
• Access to full episodes: 20 websites, 4 video-on-demand
• It engages with 25+ million Canadians per week
53. [ Before AWS ]
• Data centers in Winnipeg and Toronto
• Challenge to manage, frequent power outages, downtime
• Expensive hosting fees inherited from parent company
• Technology was old and in disarray (total revamp needed)
57. Amazon EC2 Amazon SQS
Amazon EMR Amazon SNS
Auto Scaling Amazon SES
Elastic Load Balancing
AWS Marketplace
Amazon CloudFront Amazon FPS
Amazon RDS Amazon DevPay
Amazon DynamoDB Amazon Mechanical Turk
Amazon SimpleDB
Amazon ElastiCache Amazon Route 53
Amazon VPC
Amazon IAM Amazon Direct Connect
Amazon CloudWatch
Amazon Elastic Beanstalk Amazon S3
Amazon CloudFormation Amazon Glacier
Amazon EBS
Amazon CloudSearch AWS Import/Export
Amazon SWF AWS Storage Gateway
Alexa WIS and Alexa Top Sites AWS Support
58. [ Phase One ]
• Fast deployment of servers, network rules, load balancers
• First site under new CMS: Live in 4 weeks from scratch
• Full migration of 29 sites from a physical DC in 9 months
59. [ Phase Two ]
• Full migration of 6 other websites and web services
• From 2nd physical DC into AWS in 2 months
• Migration: Windows ‘03/SQL ‘05 -> Windows ‘08/SQL ’08
• Creating new web farms takes 1 to 5 days (versus months)
• Takes longer to procure licenses than the infrastructure
• Ability to scale and automate
60. [ Benefits of Using AWS ]
• Increased uptime from 98.8% to 99.99%
• Scale to success, quicker response to business needs
• $1M+ saved in capital and operational costs
• No physical investment, smaller teams
• Allowed the use of third-party service management companies
• Easy backup on AWS -> 3 years retention (tax credits)
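To put the uptime figures on this slide into perspective, the percentages can be converted into hours of downtime per year. The following is a minimal sketch, assuming a 365-day year (8,760 hours); the function name `downtime_hours_per_year` is made up for illustration and is not part of the original deck.

```shell
#!/bin/sh
# Convert an uptime percentage into expected downtime hours per year.
# Assumption: a 365-day year (8760 hours). Illustration only, not an SLA formula.
downtime_hours_per_year() {
    awk -v uptime="$1" 'BEGIN { printf "%.1f\n", (100 - uptime) / 100 * 8760 }'
}

downtime_hours_per_year 98.8    # uptime before AWS, from the slide
downtime_hours_per_year 99.99   # uptime after AWS, from the slide
```

Under that assumption, 98.8% uptime corresponds to roughly 105 hours of downtime a year, while 99.99% corresponds to less than one hour.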
62. [ Some Numbers ]
• 50+ EC2 instances (various sizes)
• 25+ TB traffic/month
• 40M+ Route53 queries
• 10+ TB backup on Amazon S3
... And growing!
63. [ Lessons Learned ]
• Architect with AWS in mind from the start
• Use all Availability Zones in the region you choose to host in; spread resources across all of them
• Plan for failures: Be crazy about it (things fail)
• Backup backup backup
• Monthly AMI
• Windows/SQL Server workarounds (failover cluster, AD, etc.)
• Engage with AWS Solutions Architects early
64. [ Disaster Recovery ]
• Learn from outages all the time
• Implement changes to prevent failures at cloud level
• Document how you recover from failures
• Single component may fail; architecture shouldn’t
65. [ Backup ]
• Daily snapshots of all volumes automatically
• VIP volumes: snapshots every 4 hours
• Keep the last 10 snapshots
• Dell Replay: backs up file system files every hour
• Volumes replicated to Amazon S3 (Oregon) every 2 hours
• SQL Server backup every 30 minutes
• SQL Server backup volumes moved to Amazon S3 every 2 hours
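The "keep the last 10 snapshots" rule can be sketched as a small filter that decides which snapshots are eligible for deletion. This is only an illustration: the function name `snapshots_to_prune`, the input format ("timestamp snapshot-id" per line) and the sample inventory are assumptions, and the sketch prints candidates rather than deleting anything.

```shell
#!/bin/sh
# Read "timestamp snapshot-id" lines on stdin and print the IDs of every
# snapshot except the newest $1 (default 10). Timestamps are assumed to be
# ISO 8601 strings, which sort chronologically as plain text.
snapshots_to_prune() {
    keep="${1:-10}"
    sort -r | awk -v keep="$keep" 'NR > keep { print $2 }'
}

# Hypothetical inventory of 12 daily snapshots; the two oldest are listed.
i=1
while [ "$i" -le 12 ]; do
    printf '2012-11-%02dT00:00:00Z snap-%02d\n' "$i" "$i"
    i=$((i + 1))
done | snapshots_to_prune 10
```

Printing the candidates first, and deleting in a separate step, keeps the retention rule easy to review before anything is removed.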
66. [ Future ]
• Move from public cloud to VPC
• Auto Scaling on Amazon EC2
• Amazon S3 as image repository for all sites
• Second cloud vendor as DR (instead of in-house)
• Amazon ElastiCache for central caching for ASP.net apps
81. Business Impact Analysis (RTO, RPO)
• RTO (Recovery Time Objective):
1) Time for trying to fix the problem
2) The recovery itself
3) Testing
4) Tell users
• RPO (Recovery Point Objective): how much data I can lose
[ Lessons from an Earthquake ]
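The RTO breakdown above is a sum of four phases, and the RPO can be checked against a backup interval. The following is a rough sketch with made-up numbers; the function names `rto_minutes` and `meets_rpo` are hypothetical, and it assumes the simplification that worst-case data loss is about one backup interval (ignoring the time a backup takes to complete).

```shell
#!/bin/sh
# RTO as the sum of the four phases listed on the slide, all in minutes.
rto_minutes() {
    # $1 = trying to fix the problem, $2 = recovery, $3 = testing, $4 = telling users
    echo $(( $1 + $2 + $3 + $4 ))
}

# Simplifying assumption: worst-case data loss is about one backup interval,
# so a schedule meets an RPO only if the interval is no longer than the RPO.
meets_rpo() {
    # $1 = backup interval in minutes, $2 = RPO target in minutes
    [ "$1" -le "$2" ]
}

# Hypothetical numbers, for illustration only.
echo "RTO: $(rto_minutes 30 90 45 15) minutes"
if meets_rpo 30 60; then
    echo "30-minute backups meet a 60-minute RPO"
fi
```

With these sample inputs the estimated RTO is 180 minutes, and a 30-minute backup interval satisfies a 60-minute RPO.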
82. Different Types of DR Architecture
1) Backup and Restore
2) “Pilot light” for quick recovery into AWS (Cold standby)
3) Warm standby solution on AWS
4) Multi-site hybrid solution (AWS + on premises)
[ Lessons from an Earthquake ]
85. 2. Testing your DR
• Dev/test in the cloud is super easy
• Spin up capacity only for the test
• Regularly test your DR
• Cost is minimal
• What about data transfer speed?
[ Lessons from an Earthquake ]
86. s3cmd ls --recursive s3://datasets.elasticmapreduce/ngrams/books/ \
  | awk '{print $4; sub(/s3:\/\/datasets.elasticmapreduce/, "/array", $4); print $4}' \
  | parallel -j0 -N2 --progress /usr/bin/s3cmd --no-progress get {1} {2}
Special thanks to Craig Carl, AWS Solutions Architect
87. s3cmd ls --recursive s3://datasets.elasticmapreduce/ngrams/books/
Lists every object in the bucket
88. awk '{print $4; sub(/s3:\/\/datasets.elasticmapreduce/, "/array", $4); print $4}'
Gets the path to the Amazon S3 object and the local destination path
89. parallel -j0 -N2 --progress
Runs GNU Parallel with as many simultaneous jobs as possible; '-N2' tells parallel to read two arguments at a time from stdin and assign them to {1} and {2}
90. /usr/bin/s3cmd --no-progress get {1} {2}
It’s the command that GNU Parallel will run: '{1}' is substituted with the Amazon S3 object path, '{2}' is substituted with the local destination path
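The awk stage of the pipeline can be tried on its own, without touching Amazon S3, by feeding it a line shaped like s3cmd listing output (date, time, size, path). This is a sketch under assumptions: the sample line is made up, and `to_transfer_pairs` is a hypothetical wrapper name; only the bucket prefix and the /array destination come from the slides.

```shell
#!/bin/sh
# Turn "date time size s3-path" lines into alternating source/destination
# lines, mirroring the awk stage of the slide's pipeline.
to_transfer_pairs() {
    awk '{ print $4; sub(/s3:\/\/datasets.elasticmapreduce/, "/array", $4); print $4 }'
}

# Made-up listing line, for illustration only.
printf '%s\n' '2012-01-01 00:00 1024 s3://datasets.elasticmapreduce/ngrams/books/sample-file' |
    to_transfer_pairs
```

Each input line yields two output lines, the S3 path followed by the local path, which is the pairing that '-N2' hands to GNU Parallel as {1} and {2}.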
91. Copying 2.4 TB: down from 48 hours to 9 hours (about 5x faster)
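The two durations on this slide translate into average throughput figures. The following is a back-of-the-envelope sketch, assuming decimal units (1 TB = 1,000,000 MB); the function name `throughput_mb_per_s` is made up for illustration.

```shell
#!/bin/sh
# Average throughput in MB/s for a transfer of $1 terabytes in $2 hours.
# Assumption: decimal units (1 TB = 1,000,000 MB).
throughput_mb_per_s() {
    awk -v tb="$1" -v hours="$2" 'BEGIN { printf "%.1f\n", tb * 1000000 / (hours * 3600) }'
}

throughput_mb_per_s 2.4 48   # 48-hour copy from the slide
throughput_mb_per_s 2.4 9    # 9-hour copy from the slide
```

Under that assumption the 48-hour copy averages about 14 MB/s and the 9-hour copy about 74 MB/s, which matches the roughly fivefold speedup quoted on the slide.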
96. 4. You can have different DR solutions
• Easy to integrate existing vendors with DR on AWS
• Approach: One vendor/hybrid/multiple vendors
• One region/multi-regions (if you need geodiversity)
[ Lessons from an Earthquake ]
97. 1. You NEED a DR in place!
2. Testing your DR
3. Reducing costs
4. You can have different DR solutions
[ Lessons from an Earthquake ]
100. Action items
Backups
Disaster Recovery
Agility, cost savings, control
101. Parmigiano, a Monastery,
Love and Faith
Technical lessons on how to do
Backup and Disaster Recovery in the Cloud
Simone Brunozzi
Senior Technology Evangelist, Amazon Web Services
@simon
102. We are sincerely eager to hear your feedback on this presentation and on re:Invent.
Please fill out an evaluation form when you have a chance.