Infrastructure Migration
1. Infrastructure
Migrations
How many infrastructure migrations have I done? I’m
not sure. I stopped counting around 5.
One of the benefits of working for a small company
that’s growing quickly is that you get to experience a lot
of new things...and moving production and office
environments is one of them.
Thursday, August 2, 12
2. I am: Matt Simmons
• 10+ year sysadmin
• Small infrastructures
• 6+ infrastructure migrations
• http://www.standalone-sysadmin.com
You probably know this...
4. 10,000ft view
• Pre-Planning
• Execution
• Post-Mortem
Like most things, 90% of the work is planning.
The other 90% is lifting heavy things.
There’s another 10-25% reserved for figuring out what went
wrong, and determining how to make it not happen again.
5. Considerations:
Types of Migrations
• Build in parallel
• Move Infrastructure
• Hybrid
You really, really want to build
in parallel. Sure it’s expensive,
but it means much, much
shorter periods of downtime.
Moving an infrastructure is hair-raising, because there are only a few
million things that can go wrong.
Most people will probably end up doing hybrid migrations,
where you build some of the new infrastructure, then
migrate some from the existing setup.
Watch out for things like IP addressing issues, and that
you’ve made the correct assumptions about rack space and
power requirements for the machines that are moving.
And you don’t know scary until
you’re driving a U-Haul full of
servers across the Pennsylvania
Turnpike in the middle of a
rainstorm.
6. Considerations:
• Downtime Limits
• Uptime Requirements
• Service Window Length
Strangely enough, downtime limits and uptime
requirements aren’t the same.
Figure out what your uptime limits are according
to your user base’s expectations, then figure out
how much infrastructure needs to be running in
order to accommodate that. Good luck.
You might have a
maintenance window, where
downtime is planned and
doesn’t count against your
SLAs. If your migration can
fit within this, awesome
(hint: it can’t.)
So you need to figure out
what kind of downtime you
can afford, and remember to
schedule notices to your
customers far enough in
advance so that they aren’t
taken by surprise.
7. Considerations:
Upstream Network Changes
I think I could do an entire
presentation where I just list all
of the problems that could happen
when network providers screw
things up.
Big ones to watch out for:
1. Is the test and turn-up date early
enough so that inevitable failures
don’t impact the go-live date?
2. Is the circuit exactly what
you ordered, and is what you
ordered exactly what you need?
3. Are cross-connects in the
datacenter ordered, and is the
datacenter networking team
working with the provider?
8. Considerations:
(Wo)man Power
You can’t lift all of the things you own.
You need friends to come help you move, right? And you
usually pay them beer and pizza for the effort.
Moving infrastructures is kind of like that, except
“money” typically substitutes for beer and pizza, and you
want to find people who are reasonably smart, because
you probably don’t own anything in your apartment that
costs as much as a high performance RAID array.
Figure out how many
people you need, then add
20% to cover the stuff
you didn’t think of.
Have another 10% at
home ready to come in if
the need arises.
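The headcount arithmetic above can be sketched in a few lines. This is only an illustration: the function name `staffing_plan` and the decision to round up (you cannot schedule a fraction of a person) are assumptions, not something stated in the talk.

```python
def staffing_plan(needed: int, buffer_pct: int = 20, standby_pct: int = 10) -> dict:
    """Return on-site and standby headcounts for a migration.

    Assumption: both percentages are taken from the base estimate and
    rounded up. Integer math avoids floating-point rounding surprises.
    """
    extra = (needed * buffer_pct + 99) // 100      # ceiling of needed * 20%
    standby = (needed * standby_pct + 99) // 100   # ceiling of needed * 10%
    return {"on_site": needed + extra, "standby": standby}


# Hypothetical example: the base estimate is 10 people.
plan = staffing_plan(10)
print(plan)
```

With a base estimate of 10, this suggests 12 people on site and 1 more on standby at home.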
9. Considerations:
How can we parallelize the work?
If you have teams, having them all work
independently but simultaneously is important,
so try not to have one team waiting around on
the result of another team. This is no different
than removing bottlenecks from a computing
infrastructure.
11. Build a checklist
Every good plan includes a checklist
• What needs to be done
• By whom?
• Where?
• In what order?
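If the checklist lives in a script or spreadsheet export rather than on paper, the four questions above map naturally onto four fields. A minimal sketch, assuming hypothetical field names and made-up tasks:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    order: int      # in what order?
    task: str       # what needs to be done
    owner: str      # by whom?
    location: str   # where?


# Hypothetical items, deliberately entered out of order.
items = [
    ChecklistItem(2, "Unrack database server", "Team A", "old datacenter"),
    ChecklistItem(1, "Shut down database server", "Team A", "old datacenter"),
    ChecklistItem(3, "Rack database server", "Team B", "new datacenter"),
]


def printable(items):
    """Return checklist lines sorted by execution order, ready to print."""
    return [
        f"[ ] {i.order}. {i.task} ({i.owner}, {i.location})"
        for i in sorted(items, key=lambda i: i.order)
    ]


print("\n".join(printable(items)))
```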
12. Build a checklist
Include all phases
• Off site prior
• On site prior
• On site during
• On site after
• Testing
• Signoff
Off site things before moves
are usually slow processes or
long-term changes that rely
on TTLs or human interaction
outside of your organization.
13. Build a checklist
Establish Dependencies
If item 23 relies on item 24 being done, then
it’s probably in the wrong place...
Figuring out all of these dependencies is like
untangling a knot. It’s slow, it’s difficult, and
when you’re done, no one seems to be as
appreciative of your hard work as you are.
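One way to take some of the pain out of the untangling is to write down what each item depends on and let a topological sort produce the order. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the task names and the `ordered_checklist` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical checklist: each item maps to the items it depends on.
depends_on = {
    "notify users": set(),
    "final backup": {"notify users"},
    "power down servers": {"notify users", "final backup"},
    "unrack servers": {"power down servers"},
    "load truck": {"unrack servers"},
}


def ordered_checklist(depends_on):
    """Return the items in an order where every dependency comes first."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # Two items that each wait on the other can never be scheduled.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


print(" -> ".join(ordered_checklist(depends_on)))
```

A circular dependency (item 23 waits on 24, which waits on 23) raises an error instead of silently producing a bad order.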
14. Build a checklist
Build in checkpoints
Checkpoints are a great place
to stop all the teams at the
same time and make sure that
everyone’s on the same page.
15. Build a checklist
Include communication up-stream
Overcommunicate.
Keep your boss informed.
Keep your stakeholders informed.
If you have the kind of work
environment where your users
care, keep them informed.
16. Build a checklist
Multiple Checklists
• Per team?
• Per location?
• Per person?
If you’ve got multiple teams,
you are likely to need multiple
checklists.
Ditto if your locations are
farther apart.
If each person’s tasks are
complicated, give each person
an individual checklist, too.
17. Build a checklist
Schedule Breaks
Breaks are SO important.
You can’t work for 8 hours
without stopping to rest,
physically or mentally. Put
these into the schedule.
18. Change Management
Techniques
Establish tests for complicated steps
(or groups)
Would you build a new server then put it into
production without testing it?
Of course not.
Build tests to see if your work so far is correct. It can be
as simple as “at this point, LEDs 7, 8, and 9 should be green,
and LED 10 should be amber”.
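A test like the LED example can be written down as data, so that whoever is standing at the rack only has to record what they see. A minimal sketch, assuming a hypothetical `check_leds` helper and a made-up observation:

```python
def check_leds(expected, observed):
    """Compare observed LED colors to the expected ones.

    Returns a list of human-readable mismatches; an empty list means
    the step passed.
    """
    problems = []
    for led, want in sorted(expected.items()):
        got = observed.get(led, "missing")
        if got != want:
            problems.append(f"LED {led}: expected {want}, saw {got}")
    return problems


# Expected state taken from the example in the text.
EXPECTED = {7: "green", 8: "green", 9: "green", 10: "amber"}

# Hypothetical observation where LED 9 is wrong.
observed = {7: "green", 8: "green", 9: "amber", 10: "amber"}

for problem in check_leds(EXPECTED, observed):
    print(problem)
```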
19. Change Management
Techniques
Establish roll-back procedures
Things happen. Stuff doesn’t
always go right.
Make sure your plan includes
when to roll-back and what
steps to take to do it.
20. Change Management
Techniques
Establish failure guidelines
Failures are inevitable.
Unhandled failures are
unnecessary though.
Know how to tell if
something has failed, and
know what to do about
it.
(What happens if...)
• ...a machine breaks?
• ...a router doesn’t boot?
• ...?
21. Identify Goods & Services
to be Purchased
These kinds of steps
require a lot of
planning, but more
planning just makes
the end result better.
• Cables of specific lengths, connectors, label
tape, velcro, rack shelves, etc
• Servers, routers, firmwares, licenses, etc
• Circuits, bandwidth, accounts, etc
22. Maintain
Communications
• Cellphones
• (at least one per team)
• 2-way radios
• (for lack of cellular service)
• Probably not IP phones
Cell reception in datacenters is
spotty. Using handheld 2-way
radios is much more reliable.
Don’t rely on your IP phone
infrastructure for critical
communications during
network outages.
Just don’t.
23. Find Warm Bodies
Figure out how many people you need.
Add 20% for good measure
Have 10% standing by
24. Establish Roles
• Zone
• Man to Man
• Point Guard
...and so on, as required by your migration.
Zone: “Your job is to stay at this rack,
pulling things out in the order
prescribed by the checklist, and to
load them on the cart once removed”
Man to Man: “Your job is to cart
these servers to the truck, and once
the number of servers in the truck
matches the number prescribed by the
checklist, to drive the truck to the
new datacenter, and assist in loading
the servers onto the cart for the next
zone man”
Point Guard: “Your job is to act as
the communications hub, the
person to verify that checkpoints
happen on schedule, and
that things are correct, as well
as to finalize sign-off and hand-off once we’re done”
25. Communicate
the plan
Default to being too communicative
Have your point guard
annoy people with the
number of email updates.
26. Communicate
the plan
Get clearance from the stake-holders
Before ever starting
work, make sure that
everyone is on board
with the migration plan,
and that everyone has
agreed and signed off.
27. Communicate
the plan
Alert users multiple times
• Well in advance
(so long term projects aren’t scheduled)
• A week before
(so short-term pushes aren’t interrupted)
• Immediately before
(so last minute issues don’t compound)
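The three notices above can be turned into calendar entries as soon as the migration start time is fixed. In this sketch only "a week before" comes from the slide; the 30-day and 1-hour values for the other two notices are assumptions, as is the `notice_schedule` name.

```python
from datetime import datetime, timedelta


def notice_schedule(start,
                    well_in_advance=timedelta(days=30),   # assumed lead time
                    immediately=timedelta(hours=1)):      # assumed lead time
    """Return (label, send_time) pairs for the three user notices."""
    return [
        ("well in advance", start - well_in_advance),
        ("a week before", start - timedelta(days=7)),
        ("immediately before", start - immediately),
    ]


# Hypothetical migration start: 8:00 on a Saturday morning.
start = datetime(2012, 9, 1, 8, 0)
for label, when in notice_schedule(start):
    print(f"{when:%Y-%m-%d %H:%M}  {label}")
```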
28. Communicate
the plan
Give everyone the information they need
• Checklists
• Plan document
• Contact Information
I actually got to the point where every person involved
in the migration got a personalized envelope.
The contents were the checklist relevant to their job,
the diagrams of what the rack looked like before, what
the new racks were supposed to look like, and the
contact information for all of the other team members.
...and has signed off on it
29. Executing the plan
I love it when a plan comes together...
30. Executing the plan
Verify all goods were purchased
Doing inventory sucks, but
not having enough ethernet
cables that reach to the
switch sucks more...
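Inventory is less painful when the purchase list and the count on hand are compared mechanically. A minimal sketch, assuming a hypothetical `shortages` helper and made-up quantities:

```python
def shortages(required, on_hand):
    """Return {item: quantity still missing} for everything that is short."""
    missing = {}
    for item, qty in required.items():
        short = qty - on_hand.get(item, 0)
        if short > 0:
            missing[item] = short
    return missing


# Hypothetical purchase list versus what was actually counted.
required = {"3m ethernet cable": 48, "rack shelf": 4, "velcro roll": 6}
on_hand = {"3m ethernet cable": 40, "rack shelf": 4}

print(shortages(required, on_hand))
```

Anything the helper reports is something to buy before moving day, not during it.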
31. Executing the plan
Clear personal schedules
“oh, that was this weekend? Crap, man,
I’m sorry. I have to go drink beer with
my other friends and have a good
weekend. Maybe next time, brah”
32. Executing the plan
Complete off-site checklist items
Verify that everyone at
both sites knows what’s
happening, when, and is
on board. Make sure the
datacenter has people on
hand to help who are
capable of helping.
33. Executing the plan
Show up early
...because something won’t be right.
34. Executing the plan
Verify assigned roles
Ask for questions
...and ask each person.
Make sure that they
know how to get ahold of
you and the point guard.
38. Executing the plan
Go have a beer.
Seriously, celebrate
completing the task with
the team. I didn’t always
get to do this, and I’m still
sorry about it today.
39. Executing the plan
Complete post-mortem according to schedule
During the next workweek, complete the post-mortem and identify
what went wrong as well
as what went right.
You can’t replicate success
and eliminate failure
unless you identify them.
41. Dealing with problems
Two big take-aways:
1) Problems are
inevitable because they
are a condition of the
infrastructure, and
they arise from its
inherent complexity.
2) It’s not possible to
eliminate all failures,
but it’s desirable to
minimize them, and to
try to eliminate
repeating the same
failure by improving the
process and design.
Problems are inevitable
(It’s not “if”, it’s “when”)
Read “The Field Guide to Understanding
Human Error” by Sidney Dekker
http://amzn.to/QFpcqY
During my talk, I gave
far more discussion on
this topic than I’m going
to give here.
42. Dealing with problems
• Identify & Acknowledge the problem
• Don’t punish the reporter
• Follow the failure guidelines
• Roll-back if necessary & reschedule
43. Post-mortem
• What went wrong?
• Why?
• The ‘Five Whys’
• What went right?
• What have we learned?
44. Thanks for your time.
I hope you were able to
get something out of it.
Infrastructure
Migrations
If you have questions,
feel free to contact me
@standaloneSA
standalone.sysadmin@gmail.com