Spotify Lessons: Learning to Let Go of Machines
IO Tribe
James Wen, Site Reliability Engineer at Spotify

ALF Squad, Infrastructure & Operations Tribe
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/spotify-infrastructure-deployment
Presented at QCon New York
www.qconnewyork.com
Let’s control how
feature developers think
about what their code is
actually running on.
Takeaways
• Feature developers = happiest with
feature work
• Find out developer machine
concerns and mitigate
• Migrating to cloud or hybrid? Start
embracing ephemeral service design
and infrastructure
Agenda
• Why?
• Journey
• Hybrid Cloud
• Ops in Squads
• Future
• Learnings
Why?
Why don’t we want feature devs to care too
much about infrastructure and machines?
Why?
Time taken on infrastructure tasks = time
taken away from feature work
Feature devs = focused on features
Spotify Scale Stats
- 140 Million+ Monthly Active Users
- 50 Million+ Subscribers
- 30 Million+ Songs
- 2 Billion+ Playlists
- Available in 60 markets
Spotify Dev Scale Stats
~900 Devs
~100 Tech Teams
~2000 Services
Spotify Machine Scale Stats
~10,000 Bare Metal Hosts

~13,000 Hosts on GCP

46 Hardware/VM Types
Example: Capacity Planning
[Figure labels: Avg # devs on a team; Capacity Planning]
Scale doesn’t really matter
- Smaller companies/teams = developer time is more valuable
- Larger companies/teams = wasted infra time scales as well
Other Infrastructure Tasks
- Machine provisioning

- Failure planning
- Security updates
- Machine maintenance
Dedicated Ops?
~2000 Services

74 Infrastructure and Operations
Engineers
If all IO engineers → dedicated ops

27:1 service:engineer ratio
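The 27:1 figure follows from the two approximate numbers on the slide. A minimal sketch of the arithmetic, where the function name is illustrative and not from the talk:

```python
def services_per_engineer(services: int, engineers: int) -> float:
    """Return how many services each engineer would own on average."""
    if engineers <= 0:
        raise ValueError("engineers must be positive")
    return services / engineers


# Approximate figures quoted on the slide: ~2000 services, 74 IO engineers.
ratio = services_per_engineer(services=2000, engineers=74)
print(f"{ratio:.0f}:1 service:engineer ratio")
```

The point of the slide is that even with every IO engineer doing dedicated ops, each would carry roughly 27 services.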
Ops In Squads
Feature teams handle their own ops and
provisioning



Using the services and tooling the
Infrastructure and Operations tribe has
written
We control the level of
context feature teams
need to operate their
services.
- Developer Happiness

- Developer
effectiveness and
context
Journey
- Ops in Squads
- Hybrid Cloud
(Ephemerality)
Starting Out
Historical: Feature Developer’s Context for Service’s Capacity
[Diagram: data centers (Stockholm, San Jose), racks (Rack 1, Rack 2), and individual hosts (lon-1-a through lon-1-f), annotated "keys" and "updated"]
Machine Context
- Packages
- Hostname
- Machine specs (CPU, RAM,
disk, etc.)
- Uptime and service duration
- Location
- Local state (files on disk, info in
memory)
Example:
- Packages: Unbound v1.6.3, OpenSSL v1.0.0f
- Hostname: ash2-metadata-a.ash2.spotify.net
- Machine specs: 2 cores, 8 GB RAM
- Uptime: 3 years
- Location: Virginia
- Local state: tarred logs
Feature Developer Concerns
How to get?
How many? Specs?
How long?
How to talk
to it?
Where? Up to date?
How to
track?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Business
ServerDB
Feature Developer Concerns
How to get?
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Business
How to
track?
Where? Up to date?
How many? Specs?
ProvGun/ProvCannon
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
How many? Specs?
DNS
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
How many? Specs?
Nameless
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
How many? Specs?
Cortana
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
How many? Specs?
Helios and Containers
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Business
Where? Up to date?
How many? Specs?
How to
track?
How to get?
Google Cloud Platform
Bare metal hostname: ash2-cortana-a1.ash2
- Parts: Zone, Service, Group, Sequential #

GCP hostname: gew1-cortana-a-l33t.gew1
- Parts: Zone, Service, Pool, Random 4 Chars
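The two naming schemes on this slide can be sketched as small formatting functions. This is an illustration only: the function names, the separators (inferred from the two example hostnames), and the suffix alphabet are assumptions, not Spotify's actual tooling.

```python
import secrets
import string
from typing import Optional

# Assumed alphabet for the random suffix; the slide only says "Random 4 Chars".
SUFFIX_ALPHABET = string.ascii_lowercase + string.digits


def bare_metal_hostname(zone: str, service: str, group: str, seq: int) -> str:
    """Pet-style name: zone, service, group, and a sequential number."""
    return f"{zone}-{service}-{group}{seq}.{zone}"


def pool_hostname(zone: str, service: str, pool: str,
                  suffix: Optional[str] = None) -> str:
    """Cattle-style name: zone, service, pool, and 4 random characters."""
    if suffix is None:
        suffix = "".join(secrets.choice(SUFFIX_ALPHABET) for _ in range(4))
    return f"{zone}-{service}-{pool}-{suffix}.{zone}"


print(bare_metal_hostname("ash2", "cortana", "a", 1))
print(pool_hostname("gew1", "cortana", "a", suffix="l33t"))
```

The random suffix means a replacement instance never reuses a predecessor's name, so nothing can reasonably depend on one specific host.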
Cortana Pool Manager
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
How many? Specs?
Regional Managed
Instance Groups
Feature Developer Concerns
How long?
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
How many? Specs?
MBMI: Minimal Base Machine Image
Feature Developer Concerns
How to talk
to it?
What tools
on it?
Maintenance?
What to put
on it?
Available?
Service + Business
How to
track?
How to get?
Where? How long? Up to date?
How many? Specs?
Phoenix
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
What tools
on it?
What to put
on it?
How many? Specs?
Current: Feature Developer’s Context for Service’s Capacity
GCP - europe-west1
Pool:

2 instances x (n1-standard-32)
Stockholm
Pool:

4 instances x (High Mem)
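To make the pool-level view concrete, here is a minimal sketch of capacity expressed as pools rather than individual hosts. The `Pool` class and its field names are hypothetical; only the locations, counts, and machine types come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """Hypothetical record of what a feature developer declares per pool."""
    location: str
    instance_count: int
    machine_type: str


# Values taken from the slide; GCP spells the region "europe-west1".
pools = [
    Pool(location="GCP europe-west1", instance_count=2, machine_type="n1-standard-32"),
    Pool(location="Stockholm", instance_count=4, machine_type="High Mem"),
]

total_instances = sum(pool.instance_count for pool in pools)
for pool in pools:
    print(f"{pool.location}: {pool.instance_count} x {pool.machine_type}")
print(f"Total instances: {total_instances}")
```

Note what is absent: no hostnames, racks, packages, or uptime. Under this model the developer states only where, how many, and what size.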
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
What tools
on it?
What to put
on it?
How many? Specs?
Future
Gordon (Cloud DNS)
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
What tools
on it?
What to put
on it?
How many? Specs?
Autoscaling
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
What tools
on it?
What to put
on it?
How many? Specs?
Right Sizing
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
What tools
on it?
What to put
on it?
How many? Specs?
Future Feature Developer’s Context for Service’s Capacity
GCP - asia-east1
Service Pool
GCP - europe-west1
Service Pool
GCP - us-central1
Service Pool
Feature Developer Concerns
How long?
How to talk
to it?
Maintenance?
Available?
Service + Business
How to
track?
How to get?
Where? Up to date?
What tools
on it?
What to put
on it?
How many? Specs?
Learnings
Why Pets to Cattle was Difficult:
- Manual/tedious setup

- Wait times for machine becoming ready
(packages, DNS)

- Non-automatic security updates
- A fixed, reliable hostname
- SSH Access
- Always up/present unless team tears down
Ephemerality Learnings
- Monitoring
- Logging
- Service Design
- Incidents
Hybrid Learnings
- Replicate bare metal functionality, then iterate
- When in doubt, devs over-provision: bigger machines and more of them
- Migration = great time to influence dev paradigms
- Don’t need to DIY
DevEx Learnings
- Feature devs need carrots, sledgehammers, and/or limos to change
- Edge Cases: REST API + CLI = provide enough for feature teams to handle the edge cases
Recap
- Decrease necessary infrastructure context
- Increase reliability
- Save $$$
- Increase dev happiness and productivity
Let’s strategically
control and limit how
feature developers
think about
infrastructure.
James Wen

Email: jameswen@spotify.com

Twitter/Github: @rochesterinnyc
LinkedIn: jamesrwen



Spotify is hiring! spotifyjobs.com
