Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2fmEofl.
James Wen tells the story of how Spotify’s infrastructure evolved from teams owning and doting on groups of long-running servers to a distinctive separation of business code and value from the underlying machines all of Spotify's services actually run on. He also examines how this evolution changed the way Spotify developers write code and vastly increased iteration and shipping speed. Filmed at qconnewyork.com.
James Wen is currently a Site Reliability Engineer at Spotify. He's on the ALF squad at Spotify, maintaining and developing the tooling for capacity management + provisioning and internal DNS for 150+ teams. He was formerly the Team Lead (Anchor) of the Cloud Foundry Buildpacks team at Pivotal and a core contributor and maintainer of Bundler.
Spotify Lessons: Learning to Let Go of Machines
1. Spotify Lessons: Learning to Let Go of Machines
James Wen, Site Reliability Engineer at Spotify
ALF Squad, Infrastructure & Operations (IO) Tribe
2. InfoQ.com: News & Community Site
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/spotify-infrastructure-deployment
3. Presented at QCon New York
www.qconnewyork.com
6. Takeaways
• Feature developers = happiest with feature work
• Find out developer machine concerns and mitigate
• Migrating to cloud or hybrid? Start embracing ephemeral service design and infrastructure
17. Dedicated Ops?
~2000 Services
74 Infrastructure and Operations Engineers
If all IO engineers → dedicated ops
27:1 service:engineer ratio
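The 27:1 figure follows directly from the two numbers on the slide. A quick check, using the slide's approximate service count (the variable names here are illustrative only):

```python
# Figures taken from the slide; the service count is approximate ("~2000").
services = 2000
io_engineers = 74

# Services each engineer would own if every IO engineer did dedicated ops.
ratio = services / io_engineers
print(f"{ratio:.1f} services per engineer")  # prints "27.0 services per engineer"
```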
18. Ops In Squads
Feature teams handle their own ops and provisioning
Using the services and tooling the Infrastructure and Operations tribe has written
19. We control the level of context feature teams need to operate their services.
25. Machine Context
- Packages (e.g. Unbound v1.6.3, OpenSSL v1.0.0f)
- Hostname (e.g. ash2-metadata-a.ash2.spotify.net)
- Machine specs (CPU, RAM, disk, etc.) (e.g. 2 cores, 8 GB RAM)
- Uptime and service duration (e.g. 3 years)
- Location (e.g. in Virginia)
- Local state (files on disk, info in memory) (e.g. tarred logs)
26. Feature Developer Concerns
- How to get?
- How many?
- Specs?
- How long?
- How to talk to it?
- Where?
- Up to date?
- How to track?
- What tools on it?
- Maintenance?
- What to put on it?
- Available?
Service + Business
30. Feature Developer Concerns
- How to get?
- How long?
- How to talk to it?
- What tools on it?
- Maintenance?
- What to put on it?
- Available?
Service + Business
- How to track?
- Where?
- Up to date?
- How many?
- Specs?
32. Feature Developer Concerns
- How long?
- How to talk to it?
- What tools on it?
- Maintenance?
- What to put on it?
- Available?
Service + Business
- How to track?
- How to get?
- Where?
- Up to date?
- How many?
- Specs?
49. Feature Developer Concerns
- How to talk to it?
- What tools on it?
- Maintenance?
- What to put on it?
- Available?
Service + Business
- How to track?
- How to get?
- Where?
- How long?
- Up to date?
- How many?
- Specs?
51. Feature Developer Concerns
- How long?
- How to talk to it?
- Maintenance?
- Available?
Service + Business
- How to track?
- How to get?
- Where?
- Up to date?
- What tools on it?
- What to put on it?
- How many?
- Specs?
52. Current: Feature Developer’s Context for Service’s Capacity
GCP - europe-west-1
Pool: 2 instances x (n1-standard-32)
Stockholm
Pool: 4 instances x (High Mem)
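One way to picture this context is as a small data structure: for each location, the feature developer still has to know how many instances there are and what kind of machine they run on. The sketch below is a hypothetical model built only from the values on the slide; the `Pool` class and its field names are assumptions for illustration, not Spotify's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """Hypothetical record of what a feature developer tracks per pool."""
    location: str      # where the pool lives, as written on the slide
    instances: int     # how many machines are in the pool
    machine_type: str  # machine spec label, as written on the slide


# Values copied from the slide.
current_context = [
    Pool("GCP - europe-west-1", 2, "n1-standard-32"),
    Pool("Stockholm", 4, "High Mem"),
]


def total_instances(pools):
    """Sum the instance counts across all pools."""
    return sum(pool.instances for pool in pools)


print(total_instances(current_context))  # prints 6
```

Even in this reduced form, location, count, and machine type are all still part of what the developer has to reason about.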
61. Future Feature Developer’s Context for Service’s Capacity
GCP - asia-east-1
Service Pool
GCP - europe-west-1
Service Pool
GCP - us-central-1
Service Pool
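Compared with the current context on slide 52, this slide lists only a region and a service pool, with no instance counts or machine types. A minimal sketch of that reduced context, assuming a hypothetical `ServicePool` record (the class and field name are illustrative, not an actual Spotify API):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServicePool:
    """Hypothetical record: the region is the only detail the slide shows."""
    region: str


# Regions copied from the slide, in the order they appear.
future_context = [
    ServicePool("GCP - asia-east-1"),
    ServicePool("GCP - europe-west-1"),
    ServicePool("GCP - us-central-1"),
]

regions = [pool.region for pool in future_context]
print(len(regions))  # prints 3
```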
66. Hybrid Learnings
- Replicate bare metal functionality, then iterate
- When in doubt, devs provision up and many
- Migration = great time to influence dev paradigms
- Don’t need to DIY
67. DevEx Learnings
- Feature devs need carrots, sledgehammers, and/or limos to change
- Edge Cases: REST API + CLI = provide enough for feature teams to handle the edge cases