SlideShare ist ein Scribd-Unternehmen logo
1 von 37
@jewelia
Scaling Slack
Infrastructure
🚀
Julia Grace
Senior Director of Engineering
@jewelia
InfoQ.com: News & Community Site
‱ 750,000 unique visitors/month
‱ Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
‱ Post content from our QCon conferences
‱ News 15-20 / week
‱ Articles 3-4 / week
‱ Presentations (videos) 12-15 / week
‱ Interviews 2-3 / week
‱ Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
slack-scaling-infrastructure/
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
@jewelia
Phase 0: 2015
@jewelia
~2.5M Daily
Active Users
@jewelia
Phase 1: 2016
@jewelia
~4M Daily Active
Users
@jewelia
Phase 1: 2016
Slack was originally designed for teams < 150ppl.
You make very diïŹ€erent architectural decisions when you’re building for a team of
100 people vs 500,000.
Before August 2016 we had no Infra team.
Original infrastructure built for Glitch worked very well in 2014/2015.
~150 Engineers total.
Infrastructure investments would come secondary to feature work.
@jewelia
Things were starting to break in strange,
unusual
ways.
‹
@jewelia
Phase 1: 2016
Example: User Presence
Green dot indicaFng online/away/oïŹ„ine.
Very few people noFce it, unless it’s broken (people expect it to “just work”).
Apps and bots are always online.
@jewelia
Phase 1: 2016
User Presence
IniFally broadcast all changes to all users (e.g. “Julia Grace is away”) to the whole
workspace: O(n^2).
Presence was ~80% of all web socket traffic.
Peak volume in late 2016: 16 million messages/minute over web socket.
Presence messages: 13 million/minute.
Rapidly transiFon from broadcast to publish/subscribe.
@jewelia
There were many organizational challenges as
well.
‹
@jewelia
Phase 1: 2016
How to build engineering-led org in a product-
led company?
Would we be able to get headcount, budget?
How to communicate the value of we are doing to non-technical audiences?
How do we interface with sales?
Infrastructure as a compeFFve advantage.
@jeweliahUps://www.ïŹ‚ickr.com/photos/pocheco/14833391966
@jewelia
Phase 1: 2016
Start internal evangelism on day #1.
I went on an internal PR campaign: Why our work was important, why we needed
to conFnually invest in infrastructure. Make work very visible to execs in other
funcFons.
Followed existing company process.
We did planning, status reporFng, etc. at the same cadence and in the same
meeFngs as product engineering. Don’t try to start a new group and invent new
process.
Identify executive sponsor.
@jewelia
Phase 2: 2017
@jewelia
Phase 2: 2017
Technology landscape.
Hack/PHP monolith on backend, JavaScript with no libraries on frontend.
1 service: presence and real-Fme messaging.
Building a second service: Go caching service.
These bespoke services each had to handle rate limiFng, traïŹƒc management,
deployment.
@jewelia
Phase 2: 2017
It was time to change our DB sharding strategy.
MySQL sharded by team/workspace to Vitess sharded by various keys.
Worked great! UnFl we hit scaling limits, signiïŹcant hotspots.
‹
@jewelia
Monolith
Service
A
Service
B
@jewelia
Monolith
Service
A
Service
B
@jewelia
Monolith
Service
A
Service
B
Who owns this?
@jewelia
Communication Risk
The more technically complex,
nuanced a problem is

@jewelia
Communication Risk
The more technically complex,
nuanced a problem is

The higher the communication risk.
@jewelia
Phase 2: 2017
Immense pressure to hire engineers.
Many human SPOFs (single points of failure) because team was so small.
Everyone was overextended and overcommihed.
We had to figure out how to hire Infra
engineers.
All our hiring processes were opFmized around hiring generalists: frontend
backend, iOS, Android, Ops.
We skills do we need and value? How do we test for those skills?
@jewelia
Phase 2: 2017
Decided to hire Infra engineering generalists.
Created a take home coding exercise designed
to test:
1. An understanding of servers, networking, and protocols.
2. An understanding of concurrency, performance, and resource constraints, and
an ability to anFcipate future issues and implement soluFons.
3. An ability to write clear, easy to understand code, communicate your approach,
and reason about tradeoïŹ€s that you have made.
@jewelia
Phase 2: 2017
I wore so many hats. Too many hats.
Similar to my days as a startup CTO!
I was the Engineering Director and
Forming strategy, hiring managers and ICs, evangelizing the org.

Product Manager and
Internal interface to Product Engineering/PMs building features, externally to
customers with quesFons about the integrity of our infrastructure.

Program Manager.
Running cross funcFonal iniFaFves.
@jewelia
Phase 3: 2018
@jewelia
Phase 3: 2018
“0 to 1” was over. Now time for “1 to ∞”.
ReacFve to ProacFve.
Transition from few teams to an org in 3 offices.
Team nearly 100 engineers by end of year.
Now included Data, Machine Learning, Search Infrastructure
Many orders of magnitude better performance
Things were not breaking all the time.
@jewelia
Phase 3: 2018
Services model matured significantly.
SLAs for services, consistent deployment processes, etc.
Mature incident response process.
Dividing into sub-teams made sense.
Data Stores & Cache Infra, Service Mesh & Web Serving, Distributed Messaging.
@jewelia
Phase 3: 2018
Hired Director Specialists

Had to quickly learn how to hire senior leaders whose jobs you haven’t done before.
How to do this well: talk to a lot people who currently do the job you’re trying to hire
for, deeply understand the talent market.
and Product Managers

and did an acquisition.
@jewelia
Phase 3: 2018
Challenge: coherency across a large
organization.
Example: overlap between Machine Learning and Frontend Infra was NULL.
Difficult to have a unified vision.
Stakeholders were each org were diïŹ€erent for each part of the org; Data Infra
organizaFon worked closely with G&A (ïŹnance), Search Infra did not.
I should have done more re-orgs!
@jewelia
2016:
@jewelia
2016:
2017:
@jewelia
2016:
2017:
2018:
@jewelia
Today
Infra has been around for ~3 years
400M async jobs processed/day to 2.5B
3M DAU (daily active users) to 10M DAU
1M simultaneously connected users to 7.5M
10 to ~100 engineers in SF, NYC, YVR
Generalist (ICs, Managers) to specialists
1 amazing team
@jewelia
Thank You!
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
slack-scaling-infrastructure/

Weitere Àhnliche Inhalte

Mehr von C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

Mehr von C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

KĂŒrzlich hochgeladen

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 

KĂŒrzlich hochgeladen (20)

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 

Scaling Infrastructure Engineering at Slack

  • 2. InfoQ.com: News & Community Site ‱ 750,000 unique visitors/month ‱ Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) ‱ Post content from our QCon conferences ‱ News 15-20 / week ‱ Articles 3-4 / week ‱ Presentations (videos) 12-15 / week ‱ Interviews 2-3 / week ‱ Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ slack-scaling-infrastructure/
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 8. @jewelia Phase 1: 2016 Slack was originally designed for teams < 150ppl. You make very diïŹ€erent architectural decisions when you’re building for a team of 100 people vs 500,000. Before August 2016 we had no Infra team. Original infrastructure built for Glitch worked very well in 2014/2015. ~150 Engineers total. Infrastructure investments would come secondary to feature work.
  • 9. @jewelia Things were starting to break in strange, unusual ways. ‹
  • 10. @jewelia Phase 1: 2016 Example: User Presence Green dot indicaFng online/away/oïŹ„ine. Very few people noFce it, unless it’s broken (people expect it to “just work”). Apps and bots are always online.
  • 11. @jewelia Phase 1: 2016 User Presence IniFally broadcast all changes to all users (e.g. “Julia Grace is away”) to the whole workspace: O(n^2). Presence was ~80% of all web socket traffic. Peak volume in late 2016: 16 million messages/minute over web socket. Presence messages: 13 million/minute. Rapidly transiFon from broadcast to publish/subscribe.
  • 12. @jewelia There were many organizational challenges as well. ‹
  • 13. @jewelia Phase 1: 2016 How to build engineering-led org in a product- led company? Would we be able to get headcount, budget? How to communicate the value of we are doing to non-technical audiences? How do we interface with sales? Infrastructure as a compeFFve advantage.
  • 15. @jewelia Phase 1: 2016 Start internal evangelism on day #1. I went on an internal PR campaign: Why our work was important, why we needed to conFnually invest in infrastructure. Make work very visible to execs in other funcFons. Followed existing company process. We did planning, status reporFng, etc. at the same cadence and in the same meeFngs as product engineering. Don’t try to start a new group and invent new process. Identify executive sponsor.
  • 17. @jewelia Phase 2: 2017 Technology landscape. Hack/PHP monolith on backend, JavaScript with no libraries on frontend. 1 service: presence and real-Fme messaging. Building a second service: Go caching service. These bespoke services each had to handle rate limiFng, traïŹƒc management, deployment.
  • 18. @jewelia Phase 2: 2017 It was time to change our DB sharding strategy. MySQL sharded by team/workspace to Vitess sharded by various keys. Worked great! UnFl we hit scaling limits, signiïŹcant hotspots. ‹
  • 22. @jewelia Communication Risk The more technically complex, nuanced a problem is

  • 23. @jewelia Communication Risk The more technically complex, nuanced a problem is
 The higher the communication risk.
  • 24. @jewelia Phase 2: 2017 Immense pressure to hire engineers. Many human SPOFs (single points of failure) because team was so small. Everyone was overextended and overcommihed. We had to figure out how to hire Infra engineers. All our hiring processes were opFmized around hiring generalists: frontend backend, iOS, Android, Ops. We skills do we need and value? How do we test for those skills?
  • 25. @jewelia Phase 2: 2017 Decided to hire Infra engineering generalists. Created a take home coding exercise designed to test: 1. An understanding of servers, networking, and protocols. 2. An understanding of concurrency, performance, and resource constraints, and an ability to anFcipate future issues and implement soluFons. 3. An ability to write clear, easy to understand code, communicate your approach, and reason about tradeoïŹ€s that you have made.
  • 26. @jewelia Phase 2: 2017 I wore so many hats. Too many hats. Similar to my days as a startup CTO! I was the Engineering Director and Forming strategy, hiring managers and ICs, evangelizing the org. 
Product Manager and Internal interface to Product Engineering/PMs building features, externally to customers with quesFons about the integrity of our infrastructure. 
Program Manager. Running cross funcFonal iniFaFves.
  • 28. @jewelia Phase 3: 2018 “0 to 1” was over. Now time for “1 to ∞”. ReacFve to ProacFve. Transition from few teams to an org in 3 offices. Team nearly 100 engineers by end of year. Now included Data, Machine Learning, Search Infrastructure Many orders of magnitude better performance Things were not breaking all the time.
  • 29. @jewelia Phase 3: 2018 Services model matured significantly. SLAs for services, consistent deployment processes, etc. Mature incident response process. Dividing into sub-teams made sense. Data Stores & Cache Infra, Service Mesh & Web Serving, Distributed Messaging.
  • 30. @jewelia Phase 3: 2018 Hired Director Specialists
 Had to quickly learn how to hire senior leaders whose jobs you haven’t done before. How to do this well: talk to a lot people who currently do the job you’re trying to hire for, deeply understand the talent market. and Product Managers
 and did an acquisition.
  • 31. @jewelia Phase 3: 2018 Challenge: coherency across a large organization. Example: overlap between Machine Learning and Frontend Infra was NULL. Difficult to have a unified vision. Stakeholders were each org were diïŹ€erent for each part of the org; Data Infra organizaFon worked closely with G&A (ïŹnance), Search Infra did not. I should have done more re-orgs!
  • 35. @jewelia Today Infra has been around for ~3 years 400M async jobs processed/day to 2.5B 3M DAU (daily active users) to 10M DAU 1M simultaneously connected users to 7.5M 10 to ~100 engineers in SF, NYC, YVR Generalist (ICs, Managers) to specialists 1 amazing team
  • 37. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ slack-scaling-infrastructure/