5. puppet @ eBay, quick facts
-> perhaps the largest Puppet deployment
-> definitely the most diverse
-> manages core security
-> trying to solve the "p100k" problems
6. #'s
• 100K+ agents
– Solaris, Linux, and Windows
– Production & QA
– Cloud (OpenStack & VMware) + bare metal
• 32 different OS versions, 43 hardware configurations
– Over 300 permutations in production
• Countless apps from C/C++ to Hadoop
– Some applications are 15+ years old
7. currently
• 3-4 puppet masters per data center
• foreman for ENC, statistics, and fact collection
• 150+ puppet runs per second
• separate git repos per environment, common core modules
– caching git daemon used by the PPMs
11. setup puppetmasters
set up the puppet master; it's the CA too.
sign and run 400 agents concurrently; that's less than half a percent of all the nodes you need to get through.
13. not exactly puppet issues
entropy unavailable
crypto is CPU heavy (heavier than you'd ever believe)
passenger children are all busy
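a quick way to spot the entropy starvation on a CA host (a diagnostic sketch; the value shown is illustrative):
# values near zero mean reads from /dev/random block and signing stalls
$ cat /proc/sys/kernel/random/entropy_avail
128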
15. multiple dedicated CAs
much better: distributed the CPU/IO load and helped the entropy problem.
the PPMs can handle actual puppet agent runs because they aren't tied up signing. Great!
16. wait, how do the CAs know about each other's certs?
some sort of network file system (NFS sounds okay).
17. shared storage for CA cluster
-> Get a list of pending signing requests (should be small!)
# puppet cert list
…
wait
…
wait
…
19. optimize CAs for a large # of certs
Traversing a large # of certs is too slow over NFS.
-> Profile
-> Implement optimization
-> Get patch accepted (PUP-1665, 8x improvement)
21. optimizing foreman
- read heavy is fine; DBs do it well.
- read heavy in a write-heavy environment is more challenging.
- foreman writes a lot of log, fact, and report data after each puppet run.
- the majority of requests are to get ENC data
- use makara with PG read slaves
(https://github.com/taskrabbit/makara) to scale ENC requests
- Needs updates to foreigner (gem)
- If ENC requests are slow, puppetmasters fall over.
22. optimizing foreman
ENC requests load balanced to read slaves
fact/report/host info write requests sent to master
makara knows how to arbitrate the connection (great job TaskRabbit team!)
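a minimal sketch of the makara wiring in foreman's database.yml (hostnames are placeholders; see the makara README for the full option set):
production:
  adapter: postgresql_makara
  database: foreman
  makara:
    sticky: true
    connections:
      - role: master
        host: db-master.example.com
      - role: slave
        host: db-slave1.example.com
      - role: slave
        host: db-slave2.example.com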
23. more optimizations
make sure the RoR cache is set to use dalli
(config.cache_store = :dalli_store), see the foreman wiki
fact collection optimization (already in upstream); without this, reporting facts back to foreman can kill a busy puppetmaster! (if you care:
https://github.com/theforeman/puppet-foreman/pull/145)
25. let's add more nodes
Adding another 30,000 nodes (that's 30% coverage).
Agent setup: pretty standard stuff, puppet agent as a service.
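roughly, per node (a sketch):
# keep the agent daemon running and enabled at boot
$ puppet resource service puppet ensure=running enable=true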
26. results
average puppet run: 29 seconds.
not horrible. but an average is a lie, because it usually means the arithmetic mean (sum of all values / N), which hides the spikes.
the actual puppet run graph looks more like…
27. curve impossible
No one in operations or infrastructure ever wants a service runtime graph like this.
[graph: spiky per-run latency against the mean average line]
28. PPM running @ medium load
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16765 puppet 20 0 341m 76m 3828 S 53.0 0.1 67:14.92 ruby
17197 puppet 20 0 343m 75m 3828 S 40.7 0.1 62:50.01 ruby
17174 puppet 20 0 353m 78m 3996 S 38.7 0.1 70:07.44 ruby
16330 puppet 20 0 338m 74m 3828 S 33.8 0.1 66:08.81 ruby
17231 puppet 20 0 344m 75m 3820 S 29.8 0.1 70:00.47 ruby
17238 puppet 20 0 353m 76m 3996 S 29.8 0.1 69:11.94 ruby
17187 puppet 20 0 343m 76m 3820 S 26.2 0.1 70:48.66 ruby
17156 puppet 20 0 353m 75m 3984 S 25.8 0.1 64:44.62 ruby
… system processes
29. 60 seconds later… idle
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17343 puppet 20 0 344m 77m 3828 S 11.6 0.1 74:47.23 ruby
31152 puppet 20 0 203m 9048 2568 S 11.3 0.0 0:03.67 httpd
29435 puppet 20 0 203m 9208 2668 S 10.9 0.0 0:05.46 httpd
16220 puppet 20 0 337m 74m 3828 S 10.3 0.1 70:07.42 ruby
16354 puppet 20 0 339m 75m 3816 S 10.3 0.1 62:11.71 ruby
… system processes
30. 120 seconds later… thrashing
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16765 puppet 20 0 341m 76m 3828 S 94.0 0.1 67:14.92 ruby
17197 puppet 20 0 343m 75m 3828 S 93.7 0.1 62:50.01 ruby
17174 puppet 20 0 353m 78m 3996 S 92.7 0.1 70:07.44 ruby
16330 puppet 20 0 338m 74m 3828 S 90.8 0.1 66:08.81 ruby
17231 puppet 20 0 344m 75m 3820 S 89.8 0.1 70:00.47 ruby
17238 puppet 20 0 353m 76m 3996 S 89.8 0.1 69:11.94 ruby
17187 puppet 20 0 343m 76m 3820 S 88.2 0.1 70:48.66 ruby
17156 puppet 20 0 353m 75m 3984 S 87.8 0.1 64:44.62 ruby
17152 puppet 20 0 353m 75m 3984 S 86.3 0.1 64:44.62 ruby
17153 puppet 20 0 353m 75m 3984 S 85.3 0.1 64:44.62 ruby
17151 puppet 20 0 353m 75m 3984 S 82.9 0.1 64:44.62 ruby
… more ruby processes
32. what we really want
A flat, consistent runtime curve; this is important for any production service.
Without predictability there is no reliability!
33. consistency @ medium load
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
16765 puppet 20 0 341m 76m 3828 S 53.0 0.1 67:14.92 ruby
17197 puppet 20 0 343m 75m 3828 S 40.7 0.1 62:50.01 ruby
17174 puppet 20 0 353m 78m 3996 S 38.7 0.1 70:07.44 ruby
16330 puppet 20 0 338m 74m 3828 S 33.8 0.1 66:08.81 ruby
17231 puppet 20 0 344m 75m 3820 S 29.8 0.1 70:00.47 ruby
17238 puppet 20 0 353m 76m 3996 S 29.8 0.1 69:11.94 ruby
17187 puppet 20 0 343m 76m 3820 S 26.2 0.1 70:48.66 ruby
17156 puppet 20 0 353m 75m 3984 S 25.8 0.1 64:44.62 ruby
… system processes
34. hurdle: runinterval
near impossible to get a flat curve because of uneven and chaotic agent run distribution.
runinterval is non-deterministic … even if you manage to sync up service times, the timing eventually becomes nebulous again.
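for reference, the relevant agent settings; note that splay only adds a random delay, it does not make run times deterministic:
[agent]
runinterval = 1800
splay = true
splaylimit = 1800   # random per-agent offset, up to this many seconds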
36. plan A: puppet via cron
generate the run time based on some deterministic agent data point (IP, MAC address, hostname, etc.).
i.e., if you wanted a puppet run every 30 minutes, your crontab might look like:
08 * * * * puppet agent -t
38 * * * * puppet agent -t
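a sketch of deriving those offsets deterministically from the hostname (hash it, mod by the interval):
# two minute offsets, 30 minutes apart, stable per host
$ base=$(( $(hostname -f | cksum | cut -d' ' -f1) % 30 ))
$ echo "$base * * * * puppet agent -t"
$ echo "$(( base + 30 )) * * * * puppet agent -t"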
38. Improved.
But it does not scale: cronjobs make run times deterministic, but they lack even distribution.
39. eliminate all masters? masterless puppet
kicking the can down the road; somewhere, infrastructure still has to serve the files and catalog to agents.
masterless puppet creates a whole host of other issues (file transfer channels, catalog compiler host).
40. eliminate all masters? masterless puppet
…and the same issues exist, albeit in different forms.
it shifts the problems to "compile interval" and "manifest/module push interval".
41. plan Z: increase your runinterval
Z, the zombie apocalypse plan (do not do this!).
delaying failure until you are no longer responsible for it (hopefully).
42. alternate setups
SSL termination on the load balancer – expensive
- LBs are difficult to deploy and cost more (you still need failover, otherwise it's a SPoF!)
caching – a cache is meant to make things faster, not required for things to work. if a cache is required to make a service functional, you're solving the wrong problem.
43. zen moment
maybe the issue isn't about timing the agent from the host.
maybe the issue is that the agent doesn't know when there's enough capacity to reliably and predictably run puppet.
44. enforcing states is delayed
runinterval/cronjob/masterless setups still render puppet a suboptimal solution in a state-sensitive environment (customer and financial data).
the problem is not unique to puppet. salt, CoreOS, et al. are susceptible.
45. security trivia
web service REST3DotOh just got compromised and
allows a sensitive file managed by puppet to be
manipulated.
Q: how/when does puppet set the proper state?
46. the how; sounds awesome
A: every puppet run ensures that the file is in its intended state, and records the previous state if it was not.
47. the when; sounds far from awesome
A: whenever puppet is scheduled to run next. that can be up to runinterval minutes from the compromise, or until the next masterless push or cronjob execution.
48. smaller intervals help but…
all the strategies share one common issue:
puppet masters do not scale with smaller intervals, and smaller intervals exacerbate spikes in the runtime curve.
50. pvc
"pvc" – an open source & lightweight process for a deterministic and evenly distributed puppet service curve…
…and reactive state-enforcement puppet runs.
51. pvc
a different approach: execute puppet runs based on available capacity and local state changes.
pings from the agent check whether it's time to run puppet.
file monitoring forces puppet runs when important files change outside of puppet (think /etc/shadow, /etc/sudoers); see the sketch below.
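the file-monitoring half of that, sketched with inotifywait (pvc uses inotify internally; this only illustrates the idea):
# force a puppet run whenever a watched file changes
$ inotifywait -m -e modify,attrib /etc/shadow /etc/sudoers |
while read path events file; do puppet agent -t; done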
52. pvc
basic concepts:
- Frequent pings to determine when to run puppet
- Tied into backend PPM health/capacity
- Frequent fact collection without needing to run puppet
- Sensitive files should be subject to monitoring
- on changes or updates outside of puppet, immediately run puppet!
- efficiency is an important factor.
53. pvc advantages
-> variable puppet agent run timing
- allows the flat and predictable service curve (what we want).
- more frequent puppet runs when capacity is available, less frequent runs when capacity is scarce.
54. pvc advantages
-> improves security (kind of a big deal these days)
- puppet runs when state changes rather than waiting to
run.
- efficient, uses inotify to monitor files.
- if a file being monitored is changed, a puppet run is
forced.
55. pvc advantages
- orchestration between foreman & puppet
- controlled rollout of changes
- upload facts between puppet runs into foreman
56. pvc – backend
3 endpoints – all take the ?fqdn=<certname> parameter
GET /host – should pvc run puppet or facter?
POST /report – raw puppet run output, and which monitored files were changed
POST /facts – facter output (puppet facts in JSON)
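an agent-side ping, roughly (the backend hostname is a placeholder):
# ask the backend whether this node should run puppet or facter now
$ curl "https://pvc-backend.example.com/host?fqdn=$(facter fqdn)"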
58. pvc – /facts
allows collecting facts outside of the normal puppet run, useful for monitoring.
set PVC_FACT_RUN to report facts back to the pvc backend.
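a sketch of the fact upload, assuming the backend accepts facter's JSON output as the POST body (hostname is a placeholder):
# ship facts without a full puppet run
$ facter --json | curl -X POST --data-binary @- \
"https://pvc-backend.example.com/facts?fqdn=$(facter fqdn)"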
59. pvc – git for auditing
push actual changes between runs into git
- branch per host; parentless branches & commits are cheap (see the sketch below).
- easy to audit fact changes (fact blacklist to prevent spam) and changes between puppet runs.
- keeping puppet reports between runs is not helpful.
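parentless branches are plain git orphan branches; per host, roughly (branch and file names are hypothetical):
$ git checkout --orphan web42.example.com
$ git add facts.json
$ git commit -m "state @ $(date -u +%FT%TZ)"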
60. pvc – incremental rollouts
select candidate hosts based on your criteria and set an environment variable via the /host endpoint output:
FACTER_UPDATE_FLAG=true
in your manifest, check:
if $::update_flag {
…
}
62. pvc – available on github
$ git clone https://github.com/johnj/pvc
make someone happy.
63. wishlist
stuff pvc should probably have:
• authentication of some sort
• a more general backend, currently tightly integrated into internal PPM infrastructure health
• whatever other users wish it had
64. misc. lessons learned
your ENC has to be fast, or your puppetmasters fail without ever doing anything.
upgrade ruby to 2.x for the performance improvements.
serve static module files with a caching http server (nginx); see the sketch below.
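the nginx idea, sketched – cache puppet's file_content URLs in front of the masters (upstream name and paths are illustrative, puppet 3 REST layout):
# http{} context: define the cache
proxy_cache_path /var/cache/nginx/puppet keys_zone=puppet:64m;
# inside the server{} block fronting the masters
location ~ ^/[^/]+/file_content/ {
proxy_cache puppet;
proxy_cache_valid 200 5m;
proxy_pass https://puppetmasters;
}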