
Getting started with Site Reliability Engineering (SRE)

"Getting started with Site Reliability Engineering (SRE): A guide to improving systems reliability at production"

This is an intro guide that shares some of the common concepts of SRE with a non-technical audience. We will look at both the technical and organizational changes that should be adopted to increase operational efficiency, ultimately benefiting global goals such as minimizing downtime and improving systems architecture & infrastructure:
- Improving incident response
- Defining error budgets
- Better monitoring of systems
- Getting the best out of systems alerting
- Eliminating manual, repetitive actions (toil) through automation
- Designing better on-call shifts/rotations
- Designing the role of the Site Reliability Engineer (who effectively works between application development teams and operations support teams)



  1. Abeer Rahman Toronto Agile Conference Oct 2018 Getting started with Site Reliability Engineering (SRE): A guide to improving systems reliability at production @abeer486
  2. 3:32 am
  3. Things are good No… it can’t be down… It’s actually down! Oh, things look bad… Things look REALLY BAD! I’m just going to cry in agony! (Mental) State transitions during an outage…
  4. Abeer Rahman Engineering Manager, Deloitte Focusing on Software delivery, Agile & DevOps for clients @abeer486 Hi, I’m Abeer Let’s shift operations from reactive to proactive
  5. Topics Closing Q&A Measuring Systems Budgeting for Failures Introduction to SRE Enabling SREs Managing Incidents Technology Enablers
  6. Introduction to Site Reliability Engineering
  7. Typical Software Lifecycle Idea Business Development Operations Customers
  8. The goal is not just for software to be pushed out, but for it to be maintained once it’s live. Sanity check: How long does it take for an incident to reach the right team? Can it be minutes instead of hours? How often does the same incident recur? Can that be reduced to never?
  9. Reliability is the ability of a system to withstand certain types of failure and still remain functional from a user’s perspective.
  10. Site Reliability Engineering… …is what happens when a software engineer is tasked with solving operations problems. Site reliability engineering (SRE) is a specialized skillset where software engineers develop and contribute to the ongoing operation of systems in production. A Site Reliability Engineer: • Spends 50% of their time on systems engineering problems: infrastructure, operating systems & networks, databases; the availability, observability, efficiency & reliability of services • Spends 50% of their time on software engineering problems: application design, data structures, service design & key integrations; development tasks that contribute to systems reliability
  11. SRE team objectives: Site Reliability Engineers spend less than 50% of their time performing operational work, which lets them spend more time improving infrastructure and automating tasks. The SRE approach to operations: • Treat operations like a software engineering problem • Automate tasks normally done by system administrators (such as deployments) • Design more reliable and operable service architectures from the ground up, not as an afterthought. SREs both support systems in production and contribute to project work.
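As a toy example of treating operations like a software engineering problem, a manual "restart the service if it keeps failing" runbook step can become testable code. The check, names, and threshold below are invented for illustration, not from the talk:

```python
# Toy example: a manual "restart the service if it keeps failing" runbook
# step turned into code. Names and the threshold are illustrative.

def should_restart(health_checks, max_consecutive_failures=3):
    """Return True when the last N health checks all failed."""
    recent = health_checks[-max_consecutive_failures:]
    return (len(recent) == max_consecutive_failures
            and all(status == "fail" for status in recent))

print(should_restart(["ok", "fail", "fail", "fail"]))  # True
print(should_restart(["fail", "ok", "fail"]))          # False
```

Once a step like this is code, it can be version-controlled, reviewed, and unit-tested like any other software.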
  12. The evolution of systems requires an evolution of systems engineers Why another role/discipline? Common issues: • Reliability is usually left to the architects to design, typically at the beginning of a product design • Non-functional requirements are not reviewed as often • Change of functionality may impact previously assumed reliability requirements • Most outages are typically caused by change, and a typical Ops mindset does not necessarily prepare for fast remedy In the new world, think of systems as software-defined.
  13. Incentives aren’t aligned Wall of confusion Development Speed Agility Operations Stability Control
  14. DevOps vs. SRE? DevOps: a set of principles & culture guidelines that helps break down the silos between development and operations/networking/security: • Reduce silos • Plan for failure • Small-batch changes • Tooling & automation • Measure everything. Site Reliability Engineering: a set of practices, with an emphasis on strong engineering capabilities, that implements the DevOps principles and defines a job role + team: • Reduce silos through shared ownership • Plan for failure using error budgets • Small-batch changes with a focus on stability • Tooling & automation of manual tasks • Measure everything by monitoring the right things. SRE implements DevOps.
  15. SREs aim to create more value by linking operations activities with development. This pattern is used by many organizations that have a high degree of organizational & engineering maturity: • SREs often act as gate-keepers for production readiness • Sub-standard software is rejected, and SREs provide support in fortifying against potential operational issues. https://web.devopstopologies.com/
  16. Shared Responsibility Model: application development teams take part in supporting applications in production • SRE work can flow into application development teams • 5% of operational work goes to developers • Developers take part in on-calls (common these days) • Give the pager to the developers once the reliability objectives are met
  17. Enabling SREs Designing & setting up SRE in an organization
  18. Enabling successful SRE teams requires buy-in across the entire organization Rollout of SRE in an Organization The values of the company must align with how SRE operates • Organizational mandate • Leadership buy-in • Shared Responsibilities
  19. Horizontal enablement of SRE across the organization. Model 1: a centralized SRE team as a shared service (diagram: a centralized SRE team serving Platform, Operations, Product and Product Mgmt. teams). A centralized SRE team can act as a shared service for the whole organization: • Common backlog of the organization’s reliability improvements, so they are rolled out evenly • SREs are brought in as part of designs, reviews, validations and go-lives • Limited level of influence within each team. Sample organizations: • Facebook (via Production Engineering)
  20. Placing SREs in vertical slices across the organization. Model 2: horizontal enablement across products/teams (diagram: an SRE embedded with each Platform, Operations and Product team). SREs are embedded as full-time team members of the product teams: • Treated like a member of the product team • Regular contribution to the overall product reliability design & implementation. Sample organizations: • Google • Bloomberg • Uber. Mature state: pair with the application team to uplift the application, then rotate out.
  21. Using these as a baseline helps to track progress. Key metrics to measure during enablement of SRE: • Mean Time to Restore (MTTR) service • Lead time to release (or rollback) • Improved monitoring to catch & detect issues earlier • Establishing error budgets to enable budget-based risk management
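MTTR, the first metric above, is straightforward to compute from incident timestamps; the incidents below are fabricated for illustration:

```python
from datetime import datetime, timedelta

# MTTR from (detected, restored) timestamp pairs; incidents are fabricated.

def mean_time_to_restore(incidents):
    """Average of (restored - detected) across incidents."""
    durations = [restored - detected for detected, restored in incidents]
    return sum(durations, timedelta()) / len(durations)

incidents = [
    (datetime(2018, 10, 1, 3, 32), datetime(2018, 10, 1, 4, 2)),    # 30 min
    (datetime(2018, 10, 5, 14, 0), datetime(2018, 10, 5, 14, 50)),  # 50 min
]
print(mean_time_to_restore(incidents))  # 0:40:00
```

Tracking the same calculation month over month gives the trendline that the slide recommends as a baseline.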
  22. Finding key talent who have a natural curiosity for software systems and automate aggressively. Hiring & onboarding SREs. Hiring for skillsets: • Ideally, multi-skilled individuals who have experience in both writing and maintaining software • Traditional Ops/sysadmins who are willing to learn scripting/programming • App developers who are willing to step up to operational roles • Exposure to and interest in systems architecture. Onboarding SREs: • SRE bootcamps that give insight into the system architecture, software delivery process and operations • On-call rotations with senior SRE members • Participation in incident response calls
  23. Budgeting for failures Error Budgets
  24. How do we go about measuring the availability of a system from the eyes of a user? Measuring reliability. Availability = actual system uptime / anticipated system uptime: • Easy to measure through systematic means (e.g., server uptime) • Difficult to judge success for distributed systems • Does not really speak on behalf of the user’s experience. Availability = good interactions / total interactions: • Measures the number of real users for whom the service is working • Better for measuring distributed systems
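As an illustration, the two availability definitions can be computed side by side; the uptime hours and request counts below are made up:

```python
# The two availability definitions from the slide, side by side.

def uptime_availability(actual_uptime, anticipated_uptime):
    """Availability = actual system uptime / anticipated system uptime."""
    return actual_uptime / anticipated_uptime

def interaction_availability(good, total):
    """Availability = good interactions / total interactions."""
    return good / total

# Hypothetical month: up 718 of 720 hours; 9,990 of 10,000 requests succeeded.
print(round(uptime_availability(718, 720), 4))            # 0.9972
print(round(interaction_availability(9_990, 10_000), 4))  # 0.999
```

Note how the two measures can disagree for the same service, which is the slide's point: uptime alone does not capture the user's experience.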
  25. Taking one step further, set up a budget that allows for taking risks Budget for unreliability 100% - Availability target = Budget of unreliability or, the Error Budget Method: 1. SRE and Product Management jointly define an availability target 2. After that, an error budget is established 3. Monitoring (via uptime) is set in place to measure performance + metrics relative to the objective set 4. A feedback loop is created for maintaining / utilizing the budget
  26. Benefits of having Error Budgets Shared responsibility for uptime Functionality failures as well as infrastructure issues fall into the same error budget Common incentive for Product teams & SREs Creates a balance between reliability and feature innovation Product teams can prioritize capabilities The team now can decide where and how to spend the error budget Reliability goals become credible & achievable Provides a more pragmatic approach to setting a realistic availability Error budgets become an explicit trackable metric that glues product feature planning to service reliability
  27. Setting up a reliability goal depends on the nature of the business
      Availability Target | Allowed Unavailability per year | per 30 days
      95%                 | 18.25 days                      | 1.5 days
      99%                 | 3.65 days                       | 7.2 hours
      99.9%               | 8.76 hours                      | 43.2 minutes
      99.99%              | 52.6 minutes                    | 4.32 minutes
      99.999%             | 5.26 minutes                    | 25.9 seconds
      Question: Why is 100% reliability not a good idea?
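The unavailability windows above follow directly from the availability target; a quick sketch (window sizes and formatting are mine) reproduces the table:

```python
# Reproduce the unavailability-window table from an availability target.

def allowed_downtime(target_pct, window_seconds):
    """Seconds of unavailability allowed by the target within a window."""
    return window_seconds * (1 - target_pct / 100)

YEAR = 365 * 24 * 3600
THIRTY_DAYS = 30 * 24 * 3600

for target in (95, 99, 99.9, 99.99, 99.999):
    per_year_days = allowed_downtime(target, YEAR) / 86400
    per_month_min = allowed_downtime(target, THIRTY_DAYS) / 60
    print(f"{target}%: {per_year_days:.2f} days/year, {per_month_min:.1f} min/30 days")
```

Each extra "nine" shrinks the budget by a factor of ten, which is why 100% is not a realistic target.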
  28. Product Management with SREs define service levels for the systems as part of product design Approach to defining Service Levels Service levels allow us to: • Balance features being built with operational reliability • Qualify and quantify the user experience • Set user expectations for availability and performance
  29. Method for defining Service Levels. Service Level Agreement (SLA): • A business contract between the service provider and the customer • Loss of service could be related to loss of business • Typically, an SLA breach would mean some form of compensation to the client. Service Level Objective (SLO): • A threshold beyond which an improvement of the service is required • The point at which users may consider opening a support ticket, the "pain threshold", e.g., YouTube buffering • Driven by business requirements, not just current performance. Service Level Indicator (SLI): • A quantitative measure of an attribute of the service, such as throughput or latency • Directly measurable & observable by the users, not an internal metric (response time vs. CPU utilization) • A way to represent the user's experience
  30. Group exercise: Setting up Service Levels Care-Van is an app that helps to find & book van owners who are available to give lifts to elderly people who have mobility issues. You are an SRE working with the product team that designs the ride request for the customers. What do you suggest are good measures/metrics of the following: • Service Level Indicator (SLI) • Service Level Objective (SLO) • Service Level Agreement (SLA) 5 minutes For Reference SLI: a measurable attribute of a service that represents availability/performance of the system SLO: defines how the service should perform, from the perspective of the user (measured via SLI) SLA: a binding contract to provide a customer some form of compensation if the service did not meet expectations
  31. Sample solution Service Level Indicator (SLI): the latency of successful HTTP responses (HTTP 200) Service Level Objective (SLO): The latency of 95% of the responses must be less than 200ms Service Level Agreement (SLA): Customer compensated if the 95th percentile latency exceeds 300ms
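A sketch of checking that sample SLO ("95% of responses under 200 ms") against a window of measured latencies; the nearest-rank percentile method and the sample values are my own choices, not from the deck:

```python
import math

# Check the sample SLO ("95% of responses under 200 ms") against a window
# of measured latencies. Nearest-rank percentile; sample data is invented.

def percentile(values, pct):
    """Nearest-rank percentile: the k-th smallest, k = ceil(pct/100 * n)."""
    ordered = sorted(values)
    k = math.ceil(pct / 100 * len(ordered))
    return ordered[k - 1]

def slo_met(latencies_ms, pct=95, threshold_ms=200):
    return percentile(latencies_ms, pct) < threshold_ms

print(slo_met([100] * 19 + [500]))      # True: 95th percentile is 100 ms
print(slo_met([100] * 15 + [500] * 5))  # False: 95th percentile is 500 ms
```

In practice the SLI would come from the monitoring system rather than a raw list, but the comparison against the SLO threshold is the same.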
  32. Exhausting an Error Budget As long as there is remaining error budget (in a rolling timeframe), new features can be deployed. Question: what can you do if you exhaust the error budget? Possibly: • No new feature launches (for a period of time) • Change the velocity of updates • Prioritize engineering efforts toward reliability
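The budget-based gate described above can be sketched as follows; the SLO value and request counts are illustrative:

```python
# Budget-based release gate: block feature launches once the
# rolling-window error budget is spent. SLO and counts are illustrative.

def remaining_error_budget(slo_target, good, total):
    """Fraction of the error budget left (negative when overspent)."""
    budget = 1.0 - slo_target        # e.g. 0.001 for a 99.9% SLO
    burned = (total - good) / total  # observed failure rate
    return 1.0 - burned / budget

def can_deploy_feature(slo_target, good, total):
    return remaining_error_budget(slo_target, good, total) > 0

print(can_deploy_feature(0.999, 999_500, 1_000_000))  # True: budget remains
print(can_deploy_feature(0.999, 998_000, 1_000_000))  # False: overspent
```

Wiring a check like this into the deployment pipeline is one way to make the "no new feature launches" policy automatic rather than a negotiation.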
  33. Measuring the Systems Monitoring & Alerting
  34. Monitoring is a foundational capability for any organization; it is the primary means of determining and maintaining the reliability of a system. Mikey Dickerson’s Hierarchy of Service Reliability (from the base up): Monitoring, Incident Response, Post Mortems, Test/Release Process, Capacity Planning, Product Development. Monitoring helps identify what needs urgent attention and helps with prioritization: • Urgency • Trends • Planning • Improvements
  35. Golden Signals of Monitoring: Latency, Traffic, Errors, Saturation. The case for better monitoring: • Uptime is often used to deduce availability, rather than user experience • System metrics like CPU, memory and storage are good indicators, but may not mean anything meaningful for the client experience • “Noise overload”. It’s better to use user-centric metrics, such as latency: are the response times taking longer & longer? It’s best to define SLIs & SLOs using these golden signals, as they account for the user’s perception of the system
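For illustration, three of the four golden signals can be derived from a window of request records; saturation needs resource-level metrics and is omitted. Field names and sample values are assumptions:

```python
# Derive three golden signals from a window of (latency_ms, status_code)
# request records; saturation needs resource metrics and is omitted.

def golden_signals(requests, window_seconds):
    latencies = sorted(latency for latency, _ in requests)
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "median_latency_ms": latencies[len(latencies) // 2],
    }

window = [(120, 200), (90, 200), (310, 500), (100, 200), (95, 200)]
print(golden_signals(window, window_seconds=60))
```

Each of these values maps naturally onto an SLI, which is why the slide recommends building SLOs from the golden signals.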
  36. Outputs of a good monitoring system, by level of urgency: logging, alerts, tickets. A carefully crafted monitoring system should be able to distinguish urgent requests from important ones, and only alert when required
  37. Alerting design considerations: an alert should only be paged for urgent requests that need immediate attention. Common issues: • Not all alerts are helpful • Alerts are not maintained regularly to keep up with product functionality being rolled out. Good practices: • Measure the system along with all of its parts, but alert only on problems that will cause or are causing end-user pain; this helps to reduce “alert fatigue” • Alerts should be actionable (consider logging for informational items)
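One hedged sketch of "alert only on end-user pain": page only when an SLI breach is sustained across several consecutive samples, and merely log a transient blip. The threshold and sample count are illustrative assumptions:

```python
# Page only when the SLI breach is sustained; log a transient blip.
# Threshold and sample count are illustrative.

def classify(error_rates, threshold=0.05, sustained_samples=3):
    """'page' for a sustained breach, 'log' for a blip, 'ok' otherwise."""
    recent = error_rates[-sustained_samples:]
    if len(recent) == sustained_samples and all(r > threshold for r in recent):
        return "page"
    if any(r > threshold for r in error_rates):
        return "log"
    return "ok"

print(classify([0.01, 0.09, 0.08, 0.07]))  # page: three breaches in a row
print(classify([0.01, 0.09, 0.01, 0.02]))  # log: transient blip
```

Requiring a sustained breach before paging is one simple way to cut the alert fatigue the slide warns about.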
  38. Managing incidents Incident Response & On-Calls
  39. An effective incident response process increases user trust. Users don’t always expect a service to be perfect; however, they expect their service provider to be transparent, fast to resolve issues, and quick to learn from mistakes. Incident response helps to create that sense of trust when there is a crisis. (Dickerson’s Hierarchy of Service Reliability)
  40. People do not typically work well in emergencies, so having a structured incident response process is essential. 1. Have sufficient monitoring • Service dashboards indicating throughput, load, latency, etc. 2. Good alerting & communication process • Notify on-call member(s) and have an automated escalation process if the primary on-call is not reachable • Notify end-users before, or as soon as, there is a degraded experience. 3. Playbook-style plans & tools • An easily accessible step-by-step checklist of actions to take when an incident occurs • Tools available for system introspection. Great resource: https://www.incidentresponse.com/playbooks/improper-computer-usage
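The automated escalation in step 2 might be sketched like this; the ladder names and the `page` callback are stand-ins for a real paging system's API:

```python
# Walk the on-call ladder until someone acknowledges the page. The
# `page` callback stands in for a real paging system's API.

def escalate(incident, oncall_ladder, page):
    """Page each responder in order; return whoever acknowledged, or None."""
    for responder in oncall_ladder:
        if page(responder, incident):  # True once the page is acknowledged
            return responder
    return None  # ladder exhausted: time for a broader alarm

reachable = {"secondary-oncall", "manager"}
acked_by = escalate(
    "db outage",
    ["primary-oncall", "secondary-oncall", "manager"],
    page=lambda who, _: who in reachable,
)
print(acked_by)  # secondary-oncall
```

The point is that the escalation order and timeout policy live in code (or a paging tool's config), not in someone's memory at 3:32 am.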
  41. A post-mortem conducted after an incident helps build a culture of self-improvement. Goals of writing a post-mortem: • All contributing causes are understood and documented • Action items are put in place to avoid repeating the same issue. Good habits: • Post-mortems are conducted & closed within 3-5 days after an incident has occurred • Document in detail the errors, logs, steps taken and decisions made that led to the incident/outage • Document fixes if an SLO is breached, even if the SLA is not impacted
  42. VW’s defeat device scandal Oct 2015  Transparency  Accountability  Blamelessness https://nypost.com/2015/10/08/volkswagen-exec-blames-rogue-engineers-for-emissions-scandal/
  43. A blameless culture enables faster experimentation without fear, because failures are part of the process of experimentation. Conducting blameless post-mortems: “Human errors” are system problems; it’s better to fix systems & processes to better support people in making good choices. If a culture of finger-pointing prevails, there is no incentive for team members to bring forth issues, for fear of punishment. Useful approaches & questions: • Conduct the Five Whys • How can you provide better information to make decisions? • Was the information misleading? Can it be fixed in an automated way? • Can the activity be automated so it requires minimal human intervention? • Could a new hire have made the same mistake the same way? John Allspaw’s reference paper on making decisions under pressure: http://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=8084520&fileOId=8084521
  44. Gitlab’s database outage post-mortem in January 2017  Transparency  Accountability  Blamelessness https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
  45. Create a work environment that balances planned/project-driven work and interrupt-driven work. Good practices for a more sustainable on-call: • Have a single person in charge of on-call (note: if no one is on call, then everyone is on call!) • The on-call person makes sure interrupt-based work is completed without impacting others who are not on-call • Create a rotation schedule to ensure relief • No more than 2 key incidents per on-call engineer per shift (or staff appropriately). Project-driven work: planned work to reduce toil. Interrupt-driven work: supporting incidents.
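A minimal sketch of a rotation schedule that guarantees regular relief; the engineer names and the weekly cadence are placeholders:

```python
from datetime import date, timedelta
from itertools import cycle

# Round-robin weekly on-call rotation; names and cadence are placeholders.

def rotation(engineers, start, weeks):
    """Yield (week_start, engineer) pairs, round-robin."""
    turns = cycle(engineers)
    for week in range(weeks):
        yield start + timedelta(weeks=week), next(turns)

for week_start, engineer in rotation(["ana", "ben", "cho"], date(2018, 10, 1), 4):
    print(week_start, engineer)
```

Generating the schedule from code makes it trivial to check fairness (everyone gets equal turns) and to feed the result into a paging tool.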
  46. Enablers Technology components
  47. Technology capabilities that help with speed & reliability Microservices Microservices provide a foundation to support automation, rapid deployment, and DevOps infrastructure. Cloud Cloud architecture is a direct response to the need for agility. Containers Infrastructure manages services and software deployments on IT resources where components are replaced rather than changed.
  48. Considerations for storage, compute, networking… Infrastructure Design Good practices: • Version control your infrastructure design and code • Offload as much work as possible to the cloud provider • Design for reliability by having the application + data situated in more than one datacenter • Design your infrastructure architecture to support Canary deployment
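The canary-deployment support mentioned above might look like this in miniature: send a small fraction of traffic to the new version and promote it only if its error rate stays close to baseline. The fractions and thresholds are assumptions, not from the talk:

```python
import random

# Send a small fraction of traffic to the canary; promote only if its
# error rate stays near baseline. Fractions and thresholds are assumptions.

def route(canary_fraction=0.05):
    """Pick a backend for one request."""
    return "canary" if random.random() < canary_fraction else "stable"

def promote_canary(canary_errors, canary_total, baseline_error_rate,
                   tolerance=0.01):
    """Promote only when the canary's error rate is within tolerance."""
    return canary_errors / canary_total <= baseline_error_rate + tolerance

print(promote_canary(2, 1000, baseline_error_rate=0.001))   # True
print(promote_canary(50, 1000, baseline_error_rate=0.001))  # False
```

Real canary analysis compares many signals (the golden signals are a natural choice), but the promote/rollback decision has the same shape.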
  49. Using the cloud as an enabler for reliability Distribute your infrastructure across multiple datacenters (or regions/zones) Sample taken from Google Cloud Platform, however, most cloud providers have this capability.
  50. Considerations for business logic, data layer and presentation… Application design Good practices: • Re-think how your features can be re-architected to support scalability (hence increasing reliability) • Favour scale-out over scale-up • De-coupled/modular architecture, such as microservices • Distribute your applications across multiple geographies • Consistent build & deploy process (as much as possible)
  51. Design your applications using the principles of the Twelve-Factor App  Enables “cloud-readiness”  Helps with increasing reliability https://12factor.net/
  52. Closing
  53. Recap Measuring Systems Budgeting for Failures Introduction to SRE Enabling SREs Managing Incidents Technology Enablers
  54. Get started with your own SRE journey! Acknowledge: failures are inevitable; it’s best to plan for them. Buy-in: from leadership & top management. Culture: create a culture of self-improvement. Skillset: look for multi-skilled individuals who thrive on collaboration. 1. Select a sample application 2. Identify its SLI/SLO 3. Create a baseline for the current state 4. Incrementally update systems & processes to meet the SLO 5. Rinse, repeat
  55. Organization-wide Reliability Improvement backlog Sources of improvement backlog: • Previous outages and incidents, especially the ones that have been recurring • Review of tools used in operations (monitoring, alerting, on-call schedule) • Operational technical debt – automation of manual scripts • Mass reboot of all your systems in the whole datacenter • Do a disaster-recovery (DR) test
  56. Using these as a baseline helps to track progress… Metrics to measure during enablement of SRE: • Mean Time to Restore (MTTR) service • Lead time to release (or rollback) • Improved monitoring to catch & detect issues earlier • Establishing error budgets to enable budget-based risk management
  57. …or make one that makes sense to you! Metrics to measure during enablement of SRE: Mean time-to-return-to-bed (MTTRTB)
  58. What to watch out for when enabling SRE… Anti-patterns! • Rebranding your existing Operations team as SRE is not SRE! • Treat SREs like any other software developers, not like operations teams • 100% is not a good SLO; it’s better to plan for failure, as so many interactions are out of your hands • Do not do the transformation in a silo. Teams are willing to change; it’s best to involve them as part of the transformation journey, e.g., Change Management, Incident Management, Security
  59. Resources
  60. @abeer486 Thank you! Let’s shift operations from reactive to proactive