Spotify had one of its most disruptive outages in recent history on the evening of 8 March 2022 (CET), resulting in over an hour of downtime and users being logged out. As luck would have it, I was on call for the very first time for the User Platform tribe. I helped where I could, but mostly watched in awe as a flurry of teams came online after hours to debug and mitigate the issue together. Here, I walk you through the storm of incident-20220308, including symptoms, root causes, aftereffects, and takeaways.
2. About Me
● 8+ years of experience as a backend software engineer
● Originally from NJ, lived in CA for 1 year, in Berlin for 6+ years
● Work stuff I like: distributed systems, and incidents!
● At Spotify since October 2021, on the User Platform tribe
14. Service Discovery @ Spotify
● Nameless, developed in-house
● Built on top of the DNS protocol, serves SRV records
● DNS propagation is naturally slow
● Client-heavy logic that does load balancing
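To make "client-heavy logic that does load balancing" concrete, here is a minimal sketch of how a client might choose an endpoint from a set of SRV records. It is not Nameless's actual algorithm, which the slides do not describe. It only loosely follows the usual SRV semantics (lowest priority wins, then a weighted random pick), and `SrvRecord` and `pick_target` are hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    """One SRV answer: priority, weight, port, and target host."""
    priority: int
    weight: int
    port: int
    target: str


def pick_target(records, rng=random):
    """Pick one record: lowest priority first, then weighted-random by weight.

    Simplified assumption: if every candidate has weight 0, fall back to a
    uniform choice instead of the full ordering rules of the SRV RFC.
    """
    if not records:
        raise LookupError("no SRV records returned")
    lowest = min(r.priority for r in records)
    candidates = [r for r in records if r.priority == lowest]
    weights = [r.weight for r in candidates]
    if sum(weights) == 0:
        return rng.choice(candidates)
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Because the selection happens in every client, each client also has to cope with stale answers while DNS changes propagate, which is the trade-off the slide hints at.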
15. Traffic Director @ Spotify
● Traffic control plane for service mesh
● Fully managed by Google
● Smarter load balancing
● Built-in service discovery
● Uses open-source xDS APIs by Envoy for gRPC
16. The outage
Mar 08, 2022 6:30:44 PM
io.grpc.internal.ManagedChannelImpl$NameResolverListener
handleErrorInSyncContext
WARNING: [Channel<1>: (xds:///service2)] Failed to resolve name.
status=Status{code=NOT_FOUND, description=Requested entity was not
found., cause=null}
Service2 was not reachable because Traffic Director failed to resolve its name (NOT_FOUND)
18. The fix
● Revert all services back to using Nameless
● Service mostly restored by 19:40 CET
23. The aftermath
● ~50 million login sessions disrupted
● 3 million new duplicate accounts created over the following days and weeks
24. Lessons Learned
● Sometimes you are at the mercy of third-party SLAs
○ Login service displayed correct behavior on NOT_FOUND
○ Keep a fallback to Nameless? Lots of issues with that
○ Fewer synchronous calls on critical paths
● SSO login vs. email login usually confuses users
● Spotify is full of smart, proactive, supportive engineers who even take the time to have fun during an incident
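The "keep a fallback to Nameless?" idea from the lessons above can be sketched as a resolver that tries the primary discovery system first and only consults the secondary one on a not-found error. This is an illustration of the idea, not something the slides say was built, and the slide itself flags that the approach has many issues. `NotFound`, `resolve_with_fallback`, and the resolver callables are hypothetical names.

```python
class NotFound(Exception):
    """Stand-in for a NOT_FOUND status from the primary discovery system."""


def resolve_with_fallback(name, primary, fallback):
    """Resolve `name` via `primary`; on NotFound, try `fallback`.

    Both resolvers are assumed to be callables that take a service name and
    return a list of endpoints. Returns (endpoints, source) so callers can
    log or count how often the fallback path is actually used.
    """
    try:
        return primary(name), "primary"
    except NotFound:
        # Only NOT_FOUND triggers the fallback; other errors propagate.
        return fallback(name), "fallback"
```

Returning the source alongside the endpoints is a deliberate choice in this sketch: it makes use of the fallback path visible instead of silent.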
30. Acknowledgements
● All 100+ colleagues online throughout the incident
● My own team for coming online without hesitation
● Infrastructure team for quickly spotting the bug and contacting Google