Devoxx 2014 michael_neale
1. Changing wheels of a moving car
Replacing core technologies in a growing startup
Michael Neale
CloudBees
#DV14 @michaelneale
2. This talk
• lucky early decisions
• transitions and containers
• lessons learned on changing continuously
• finally: monitoring, alerting, health - ops for devs (rarely talked about)
3. About me
• Co-founder of CloudBees (the Jenkins company)
• Developer with an interest in Ops
• Built DEV@cloud and RUN@cloud
4. Working with Cloud Platforms
• Not as “friendly” as traditional hosting:
• Awesome power at your fingertips: try everything, try all the hardware
• Iterate rapidly
• But:
• The APIs have lower QoS than the hosts themselves
• Servers are cattle, not pets
• Jenkins (and others) still needs a filesystem (not always easy in the cloud)
• Multi-tenancy for scale/cost
5. Lucky decisions we made
• Isolate the EC2 APIs behind a fault-tolerant REST app for provisioning
• The API can behave strangely - backoff and retry, API rate limits and more
• Built a pathological API simulator for testing
• Enable replacement of servers via termination
• the “chaos monkey” approach
• Reality: I didn’t understand Chef, so we replaced an AMI by terminating; the newest AMI takes its place
• Done as a “hack”, but a core platform value today
• i.e. we are always changing, always replacing “naturally”
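The backoff-and-retry idea above can be sketched as a small wrapper around any flaky API call. A minimal sketch with illustrative names, not the actual CloudBees provisioning code:

```java
import java.util.concurrent.Callable;

/** Minimal exponential-backoff wrapper: a sketch of the
 *  "backoff and retry" idea for flaky cloud APIs. */
public class Retry {
    public static <T> T withBackoff(Callable<T> call, int maxAttempts,
                                    long initialDelayMs) throws Exception {
        long delay = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) throw e; // give up after maxAttempts
                Thread.sleep(delay);                 // wait before retrying
                delay *= 2;                          // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated flaky API: fails twice (e.g. rate limited), then succeeds.
        String result = withBackoff(() -> {
            if (++calls[0] < 3) throw new RuntimeException("API limit");
            return "instance-provisioned";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A real version would also add jitter and only retry errors that are actually transient.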
7. Chaos monkeying to upgrade
• OS change: new AMI == terminate, let the system replace it
• (in EC2: autoscale groups can do this for you)
• Security patch? == terminate
• Server a bit sick? TERMINATE
• (we actually use Chef for minor config changes and some app-level upgrades… relax…)
• If in doubt… you get the idea…
8. A bad year for security
• Heartbleed
• Shellshock
• POODLE
• XEN guest flaw, the AWS reboot-a-thon
9. But a great year for logos:
Xen
10. Upgrades…
• In place or… TERMINATE?
• Often easier and safer to swap out:
• e.g. the reverse-proxy (nginx) cluster replacement process:
• warm up the new server, cut over the IP and traffic, terminate the old
• No half-measures, no half-upgrades: a clean slate
• (an Elastic IP helped in this case)
11. More benefits of terminate…
• “Retirement notices” from AWS - a daily event!
• Even “new” servers - 3 days until “retire”
• No, you can’t visit the server in its retirement home.
• “Reboot at some vague time”? TERMINATE
• Encourages immutable servers
• predictable state
• security advantages of being “locked down” in the image
12. But what about data…
• Some say filesystem dependency is “legacy”
• I say “you aren’t trying hard enough”
• APIs such as EBS allow quick volume creation from snapshots:
• Continuous (delta) snapshotting of data
• Can quickly restore service in healthy data centers
• Faster time to recovery; route around failing zones
• Ideal: use distributed data in all its forms if you can!
14. Containment
• Apps (PaaS) can do anything
• Builds DO do anything
• Need a clean slate for users
• Process cleanup
• Jenkins masters have plugins
• Multi-tenancy: cost effective, higher density, better elasticity (fine-grained processes vs autoscale groups)
15. Containment Evolution
• Unix user isolation + cgroups
• LXC (builds on cgroups, namespaces)
• Docker (builds on cgroups and namespaces, NOT on LXC)
• The natural current end point - and so hot right now
17. Security benefits of containers?
• Not complete
• Not a replacement for current measures, but they help
• Lots of (rapidly changing) content online
• Next: Linux user namespaces for a “fake root” user
• “coming real soon now??” - already in LXC, not in Docker at this time
18. Transition of a build service
• Initial: discrete build nodes, “recycled” between uses
• Pools with “mark and sweep” garbage collection of unused build servers
• Unix user and cgroup/namespace isolation
• Attach build data from snapshots
19. Transition of a build service
• Next: use LXC for containment isolation
• Finally: use multi-tenant pools with full container isolation
• Pool disks for IO and EBS resilience (ZFS)
• Use larger, more economical servers (more burst power)
• Consistent hashing to pick a server with a warm “build cache”
• (sorry if your Maven build re-downloads the world - hopefully not all the time)
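The consistent-hashing step above can be sketched as a hash ring in plain Java. Server and job names are illustrative; this is a sketch of the technique, not the actual routing code:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

/** Consistent-hash ring: maps a build job to a pool server so that
 *  repeat builds of the same job land on the server with a warm cache,
 *  and adding/removing a server only remaps a fraction of jobs. */
public class BuildRouter {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public BuildRouter(String[] servers, int vnodesPerServer) throws Exception {
        for (String s : servers)
            for (int v = 0; v < vnodesPerServer; v++)
                ring.put(hash(s + "#" + v), s); // virtual nodes smooth the spread
    }

    public String serverFor(String jobId) throws Exception {
        // First ring entry at or after the job's hash; wrap around if past the end.
        SortedMap<Long, String> tail = ring.tailMap(hash(jobId));
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    private static long hash(String key) throws Exception {
        byte[] d = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
        return h;
    }

    public static void main(String[] args) throws Exception {
        BuildRouter router = new BuildRouter(
                new String[]{"pool-a", "pool-b", "pool-c"}, 100);
        // The same job always routes to the same server (warm cache):
        System.out.println(router.serverFor("acme/webapp")
                .equals(router.serverFor("acme/webapp")));
    }
}
```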
20. Transition of a build service
• Done continually over a year
• Limited user opt-in/out; the majority do not notice
• Strategy options:
• roll out to 10%, then 50%
• roll out to tiered users (i.e. do freemium users get new/unstable?)
• roll out to all - incremental uptake due to natural restarting/reprovisioning
• ALWAYS dog-food
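The percentage-rollout option above can be sketched as deterministic hash bucketing, so each user gets a stable answer across requests. A hypothetical sketch; names are illustrative:

```java
/** Deterministic percentage rollout: a user is in the rollout iff their
 *  stable hash bucket (0..99) falls below the rollout percentage. */
public class Rollout {
    public static boolean inRollout(String userId, int percent) {
        int bucket = Math.floorMod(userId.hashCode(), 100); // stable 0..99 bucket
        return bucket < percent; // percent=10 enrolls roughly 10% of users
    }

    public static void main(String[] args) {
        // The same user always gets the same answer (no flip-flopping):
        System.out.println(inRollout("user-42", 10) == inRollout("user-42", 10));
        System.out.println(inRollout("user-42", 100)); // 100% -> everyone is in
        System.out.println(inRollout("user-42", 0));   // 0%   -> nobody is in
    }
}
```

Bumping `percent` from 10 to 50 to 100 then walks the whole user base through the change without ever moving a user back.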
21. Dog food
• Always roll out to yourselves first
• (occasionally joyously discover a bootstrapping problem if it goes bad!)
• A true indicator of confidence
• We get used to change from the user’s point of view
22. How we apply Jenkins with CD:
(diagram: upstream change → Chef recipe on the master branch → test env → Chef recipe on the production branch → prod env, governed by a rollout strategy; any server can be terminated at any time)
24. Wide feedback
• Provide something the community wants to try:
• https://registry.hub.docker.com/_/jenkins/
• Helps them, helps us learn
25. Lessons on continual change
• Cost of change == F(gap between deployments)
• CD etc. etc. (you will hear a lot about this elsewhere)
• Keep MTTR (mean time to recovery) low
• If it’s short enough, people will blame their internet connection (ssshhhh)
26. Lessons on continual change
• Always be doing DR
• People ask about a “DR” strategy
• If you do DR often, then it isn’t really DR - just BAU*, TMA*?
• Normal service restoration and termination exercises your “backups”
27. Changes in a SaaS
• If people use a SaaS, upgrades/change are expected
• Communicate changes to users - let them know how much work you do for them! It isn’t easy!
• Some changes are visible, some are not (and some you thought invisible were visible) - let people know
• Even outages can create goodwill:
• Explanations and understanding == appreciation; it happens
• Proactive security patching this year
• “we don’t want to run this ourselves”
28. Monitoring and alerting
• Not often talked about in classic dev circles
• An increasing passion in “devops” circles (see Monitorama)
• Alerting is a staple of traditional ops and being “on call”
• These roles are now smearing out amongst all devs
29. Why monitoring?
• A SaaS is always changing
• The question:
• Are things better or worse than before?
• Did the change make things better or worse?
• Not so much:
• Is everything perfect? (it won’t be)
30. Monitoring and alerting
• Roughly split into:
• “check engines” (Nagios, Pingdom etc.)
• receive events, work out if a service is up/down
• “notifications” (PagerDuty, email, SMS)
• tell people about things
• analytics and monitoring (Librato, Boundary, New Relic and more)
• DASHBOARDS AND GRAPHS EVERYWHERE
32. All exist to inform you
• Graphic dashboards can overwhelm
• Some people treat them as the end goal
• Often too much information - are things OK, Y/N?
• The aim is to get insight when problems are happening (e.g. New Relic is like an online profiler)
• The aim is to tell people when problems are happening
• Reports/graphs can be useful, but not at the expense of “health” monitoring
33. If you must graph, a most important feature:
(graph annotated: “Deploy happened here!”)
34. Alert and information fatigue
• A real (world) problem:
• http://fractio.nl/2014/08/26/cardiac-alarms-and-ops/
• e.g. cardiac monitors:
• Thresholds adjusted until only life-critical alarms fire
• No “ACK”-ing of noisy alerts (no “WARNING” level)
• Increased urgency, but reduced volume
• Reduced noise, reduced fatigue - and fewer fatalities! (counterintuitive?)
35. Alert and information fatigue
• Avoid “warnings” that interrupt people
• (remember, each interruption really costs > 1 hour)
• Push messages to chat rooms: “chat ops”
• Allows people who are already distracted to act
• Alerts/info as “streams” people can dip into and help out
• Avoid escalation
• Follow-the-sun support! (if your team has it - great!)
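The routing policy above - page a human only for critical alerts, stream everything else into chat - can be sketched as follows. Names and severities are illustrative; this is the policy idea, not a real pager integration:

```java
import java.util.ArrayList;
import java.util.List;

/** Route alerts: only life-critical ones page a human; everything else
 *  flows into a chat stream people can dip into when they have time. */
public class AlertRouter {
    enum Severity { INFO, WARNING, CRITICAL }

    final List<String> pagerQueue = new ArrayList<>(); // interrupts someone
    final List<String> chatStream = new ArrayList<>(); // ambient, no interrupt

    void route(Severity sev, String message) {
        if (sev == Severity.CRITICAL) pagerQueue.add(message); // wake someone up
        else chatStream.add(message); // no "WARNING" pages: avoids alert fatigue
    }

    public static void main(String[] args) {
        AlertRouter r = new AlertRouter();
        r.route(Severity.WARNING, "disk 80% full on build-7");
        r.route(Severity.CRITICAL, "prod reverse proxy down");
        r.route(Severity.INFO, "deploy of rev 4f2a finished");
        System.out.println("pages=" + r.pagerQueue.size()
                + " chat=" + r.chatStream.size());
    }
}
```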
36. End-to-end test monitor
• Why save testing for dev time only?
• Apply a kind of integration test to production
• Can be a “synthetic transaction”
• e.g.: sign up, run some process, exit
• Run it continually
• Increases confidence
• “Out Of Band End To End Test”: “oobetet”
• technically monitoring, not testing!
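The continually-running “synthetic transaction” can be sketched as a scheduled check whose pass/fail result feeds the monitoring pipeline. The transaction body here is a stand-in for the real signup-and-build journey:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Out-of-band end-to-end test ("oobetet"): run a synthetic transaction
 *  against production continually and record pass/fail for monitoring. */
public class SyntheticMonitor {
    private final Supplier<Boolean> transaction; // e.g. sign up, run a build, exit
    volatile boolean lastRunOk = true;

    SyntheticMonitor(Supplier<Boolean> transaction) {
        this.transaction = transaction;
    }

    void start(ScheduledExecutorService scheduler, long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                lastRunOk = transaction.get(); // the whole user journey, end to end
            } catch (RuntimeException e) {
                lastRunOk = false;             // a crash counts as a failure too
            }
            // a real version would push lastRunOk to the alerting pipeline here
        }, 0, periodSeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        SyntheticMonitor m = new SyntheticMonitor(() -> true); // stand-in transaction
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        m.start(scheduler, 60);
        Thread.sleep(200); // let the first run fire
        scheduler.shutdownNow();
        System.out.println("lastRunOk=" + m.lastRunOk);
    }
}
```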
37. Codahale metrics
• https://dropwizard.github.io/metrics/3.1.0/
• Simple metrics for your app:
• Binary health checks: “foo.widget.thing is OK”
• Numerical metrics:
• Gauges, meters, histograms and more
• Lots of statistical goodness baked in (so you don’t have to)
• Expose via a /health URL as JSON, push to metrics services and more (can use a servlet):
39. Trace percentiles of time spent in…
// using com.codahale.metrics.MetricRegistry and Timer
private final MetricRegistry metrics = new MetricRegistry();
private final Timer responses = metrics.timer("important-thing");

public String handleRequest(Request request, Response response) {
    final Timer.Context context = responses.time();
    try {
        // do some work
        return "OK";
    } finally {
        context.stop(); // records the elapsed time into the timer's histogram
    }
}
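The “binary health checks” bullet from the Codahale slide can be sketched in plain Java - a minimal, self-contained stand-in for the library’s HealthCheck/HealthCheckRegistry, with hypothetical check names:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Minimal stand-in for a health-check registry: named boolean checks,
 *  rendered as the kind of JSON a /health URL would serve. */
public class HealthChecks {
    private final Map<String, Supplier<Boolean>> checks = new LinkedHashMap<>();

    void register(String name, Supplier<Boolean> check) {
        checks.put(name, check);
    }

    /** true only if every registered check passes - wire this to /health. */
    boolean allHealthy() {
        return checks.values().stream().allMatch(Supplier::get);
    }

    String toJson() {
        StringBuilder sb = new StringBuilder("{");
        checks.forEach((name, check) ->
                sb.append("\"").append(name).append("\":")
                  .append(check.get() ? "\"OK\"" : "\"FAIL\"").append(","));
        if (sb.charAt(sb.length() - 1) == ',') sb.setLength(sb.length() - 1);
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        HealthChecks health = new HealthChecks();
        health.register("foo.widget.thing", () -> true); // hypothetical checks
        health.register("build.pool.disk", () -> true);
        System.out.println(health.allHealthy() + " " + health.toJson());
    }
}
```

The real library adds error capture, messages, and servlet wiring; the point is that even this binary OK/FAIL surface is enough to hook into a load balancer or pager.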
40. Minimal points to take away
• Give the Codahale/Dropwizard stuff a good look!
• Instrument at least a /health check that can be wired in later
• *think* about monitoring
• Replace/restore as a matter of “routine”
• Change becomes the normal
• Terminate and restart are often an OK way to recover!