I'm No Hero: Full Stack Reliability at LinkedIn

The operations engineer is often seen as the hero, toiling away late nights on call to keep the systems running through failures of hardware and of code. While developers try as hard as possible to move quickly and break things, we stand as the voice of reason urging caution. We’re the only ones who truly understand the systems, but you’ll rarely find documentation because it’s just too complex and changeable to write down. When we’re doing our jobs well, we’re unappreciated because nobody understands how difficult it is. When things break, everyone thinks we’re doing our jobs badly. These are not the things we aspire to.

At LinkedIn, Site Reliability Engineers are one layer in a stack that starts with the way we manage our code and basic hardware, and is built with common systems for application management, monitoring, and alerting. Each layer has its own specialist engineers, focused on making their piece as resilient as it can be and building it to integrate with the rest of the stack. This lets Software Engineers concentrate on developing their applications, without having to spend time building systems to build, package, and distribute their code. SREs can dedicate their time to integrating applications with the stack, architecting and scaling deployments, as well as developing tools and documentation to make the job easier. When the inevitable failure happens, many experts come together to quickly identify and resolve the problem and improve the entire stack for everyone.

Description:
Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Twitter: @CloudRR2016

Published in: Engineering

  1. I’m No Hero: Full Stack Reliability at LinkedIn
  2. SITE RELIABILITY ENGINEERING ©2016 LinkedIn Corporation. All Rights Reserved. Todd Palino
  3. What is Site Reliability Engineering?
  4. Types of SRE
       • Embedded
       • Central (or Production SRE)
       • Tools and Infrastructure
  5. We Can’t Do It Alone
       • The Kafka SRE team is 3 people in the US, and 1.5 SREs in Bangalore
       • We manage over 6000 application instances
           – 100 Kafka clusters, with 1800 brokers
           – Over 1 trillion messages a day
       • The environment is never static from one day to the next
  6. Maslow’s Hierarchy
  7. Todd’s Hierarchy of Reliability
  8. Infrastructure as a Service
       • SREs do not deploy hardware and OS
       • Production Operations
           – Datacenter Technicians
           – Systems Operations
           – Network Operations
       • Provide all basic OS and network services
       • There is still tweaking for individual applications
  9. Common Repositories
       • All source code and configurations are committed to one place
       • Subversion and Git centrally managed
       • Consistent management
           – Precommit checks
           – ACLs and Review boards
       • Connects directly to the build systems
  10. Containerization
       • Most of our stack is Java
           – Python is well-supported
           – Always a few one-offs
       • Java applications have Tomcat and Jetty containers
           – Hooks for monitoring
           – Client libraries are managed by the team that owns the application
       • Provides a consistent control surface for applications
  11. Build and Deployment
       • When code is committed, it is automatically built
           – Successes become deployment artifacts
           – Failures are tracked via Jira
       • Build systems are centrally managed
       • Common tools
           – Dependency management and introspection
           – Version management
           – Error budgeting
           – Deployment
  12. Monitoring
       • Monitoring, graphing, and alerting as a service
       • Completely self-service
           – Applications annotate metrics and they are automatically collected
           – Monitoring dashboards can be created by anyone
       • Automatic metrics and dashboards for common features
           – HTTP servers, system and OS metrics
           – Client libraries (such as Kafka)
       • Additional metrics can be published outside the container
  13. Site Up
  14. Site Up
       • With the stack supporting it, applications sit on top
           – SREs architect and run the application
           – SRE and developers respond to failures
       • The NOC monitors high-level metrics
           – Overall site health and growth metrics
           – They also coordinate incident response
       • Incident response is blameless
           – Fix the problem, don’t fix the blame
  15. Review and Revise
       • All components are constantly improving
           – Incidents expose issues in the infrastructure
           – Feedback from usage of the tools
       • Steering committees discuss large-scale changes
           – Production Operations, SRE, and Development all have their own
           – Comprised of individual contributors, not managers
       • Open collaboration
           – Common repositories means everyone can help
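
The scale figures on the "We Can't Do It Alone" slide can be turned into rough averages. The following is a minimal sketch, not anything shown in the talk: it treats the slide's "over" figures as lower bounds, and it assumes traffic is spread evenly across the day and across brokers, which the slides do not claim. The function name `scale_summary` and the constant names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide; "over 1 trillion" is used as a lower bound.
MESSAGES_PER_DAY = 1_000_000_000_000
KAFKA_CLUSTERS = 100
KAFKA_BROKERS = 1800
# 3 people in the US plus 1.5 SREs in Bangalore.
SRE_HEADCOUNT = 3 + 1.5


def scale_summary():
    """Return rough averages. Even distribution is an assumption."""
    messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
    return {
        "messages_per_second": messages_per_second,
        "messages_per_second_per_broker": messages_per_second / KAFKA_BROKERS,
        "brokers_per_cluster": KAFKA_BROKERS / KAFKA_CLUSTERS,
        "brokers_per_sre": KAFKA_BROKERS / SRE_HEADCOUNT,
    }


if __name__ == "__main__":
    for name, value in scale_summary().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions the averages come to roughly 11.6 million messages per second, about 6,400 messages per second per broker, 18 brokers per cluster, and 400 brokers per SRE. Real traffic is likely to be uneven, so these are only order-of-magnitude figures.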
