I'm No Hero: Full Stack Reliability at LinkedIn

The operations engineer is often seen as the hero, toiling away late nights on call to keep the systems running through failures of hardware and of code. While developers try as hard as possible to move quickly and break things, we stand as the voice of reason urging caution. We’re the only ones who truly understand the systems, but you’ll rarely find documentation because it’s just too complex and changeable to write down. When we’re doing our jobs well, we’re unappreciated because nobody understands how difficult it is. When things break, everyone thinks we’re doing our jobs badly. These are not the things we aspire to.

At LinkedIn, Site Reliability Engineers are one layer in a stack that starts with the way we manage our code and basic hardware, and is built with common systems for application management, monitoring, and alerting. Each layer has its own specialist engineers, focused on making their piece as resilient as it can be and building it to integrate with the rest of the stack. This lets Software Engineers concentrate on developing their applications, without having to spend time building systems to build, package, and distribute their code. SREs can dedicate their time to integrating applications with the stack, architecting and scaling deployments, as well as developing tools and documentation to make the job easier. When the inevitable failure happens, many experts come together to quickly identify and resolve the problem and improve the entire stack for everyone.

Description: Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany. Organized by EIT Digital and Huawei GRC, Germany. Twitter: @CloudRR2016

I’m No Hero
Full Stack
Reliability
At LinkedIn

SITE RELIABILITY ENGINEERING©2016 LinkedIn Corporation. All Rights Reserved.
Todd Palino

What is Site Reliability Engineering?

Types of SRE
• Embedded
• Central (or Production SRE)
• Tools and Infrastructure

We Can’t Do It Alone
• The Kafka SRE team is 3 people in the US, and 1.5 SREs in Bangalore
• We manage over 6000 application instances
– 100 Kafka clusters, with 1800 brokers
– Over 1 trillion messages a day
• The environment is never static from one day to the next
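To put these figures in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it treats "over 1 trillion messages a day" as exactly one trillion, and it assumes load is spread evenly across brokers and over the day, which the slide does not claim.

```python
# Back-of-the-envelope rates derived from the figures on this slide.
# Assumptions (not from the talk): exactly 1 trillion messages per day,
# spread evenly across all brokers and across the whole day.
MESSAGES_PER_DAY = 1_000_000_000_000
BROKERS = 1800
CLUSTERS = 100
SRES = 3 + 1.5  # US team plus Bangalore

SECONDS_PER_DAY = 24 * 60 * 60


def summarize() -> dict:
    """Return average rates and ratios implied by the slide's numbers."""
    rate = MESSAGES_PER_DAY / SECONDS_PER_DAY
    return {
        "messages_per_second": rate,
        "messages_per_second_per_broker": rate / BROKERS,
        "brokers_per_cluster": BROKERS / CLUSTERS,
        "brokers_per_sre": BROKERS / SRES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.0f}")
```

Under those assumptions the average works out to roughly 11.6 million messages per second, about 6,400 per broker, with 400 brokers for each SRE. Real traffic is bursty, so peak rates would be higher than these averages.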

Maslow’s Hierarchy

Todd’s Hierarchy of Reliability

Infrastructure as a Service
• SREs do not deploy hardware and OS
• Production Operations
– Datacenter Technicians
– Systems Operations
– Network Operations
• Provide all basic OS and network services
• There is still tweaking for individual applications

Common Repositories
• All source code and configurations are committed to one place
• Subversion and Git centrally managed
• Consistent management
– Precommit checks
– ACLs and review boards
• Connects directly to the build systems
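The slide does not say which precommit checks are run, so the following is purely an illustrative sketch of what a centrally managed check might look like. The two rules (reject unresolved merge conflict markers, reject configuration files that are not valid JSON) and the names `check_file` and `precommit` are assumptions made for the example, not LinkedIn's actual checks.

```python
import json

# Hypothetical rules for illustration only; the talk does not list the real checks.
CONFLICT_MARKERS = ("<<<<<<< ", ">>>>>>> ")


def check_file(path: str, content: str) -> list[str]:
    """Return a list of human-readable problems found in one changed file."""
    problems = []
    for number, line in enumerate(content.splitlines(), start=1):
        if line.startswith(CONFLICT_MARKERS):
            problems.append(f"{path}:{number}: unresolved merge conflict marker")
    if path.endswith(".json"):
        try:
            json.loads(content)
        except json.JSONDecodeError as error:
            problems.append(f"{path}: invalid JSON ({error.msg})")
    return problems


def precommit(changed: dict[str, str]) -> list[str]:
    """Check every changed file (path -> content); an empty list means accept."""
    problems = []
    for path in sorted(changed):
        problems.extend(check_file(path, changed[path]))
    return problems
```

Because source and configuration live in the same repositories, one set of checks like this can apply to both, and a commit is rejected whenever `precommit` returns a non-empty list.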

Containerization
• Most of our stack is Java
– Python is well-supported
– Always a few one-offs
• Java applications have Tomcat and Jetty containers
– Hooks for monitoring
– Client libraries are managed by the team that owns the application
• Provides a consistent control surface for applications

Build and Deployment
• When code is committed, it is automatically built
– Successes become deployment artifacts
– Failures are tracked via Jira
• Build systems are centrally managed
• Common tools
– Dependency management and introspection
– Version management
– Error budgeting
– Deployment
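The commit-to-build flow described on this slide can be sketched as a single routing step. This is a minimal illustration, not the actual build system: `Artifact`, `Ticket`, `on_commit`, and the artifact naming scheme are all hypothetical, and `Ticket` merely stands in for an issue that the real system would file in Jira.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Artifact:
    """A successful build output that can be deployed (hypothetical shape)."""
    commit: str
    name: str


@dataclass(frozen=True)
class Ticket:
    """Stand-in for a tracked build failure; no Jira API is called here."""
    commit: str
    summary: str


def on_commit(commit: str, build: Callable[[str], None]) -> Union[Artifact, Ticket]:
    """Run the build for a commit and route the outcome."""
    try:
        build(commit)
    except Exception as error:  # any build failure becomes a tracked ticket
        return Ticket(commit=commit, summary=f"Build failed: {error}")
    return Artifact(commit=commit, name=f"app-{commit}.tar.gz")
```

The point of the shape is that every commit ends in exactly one of two tracked outcomes, so nothing is silently dropped between the repository and deployment.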

Monitoring
• Monitoring, graphing, and alerting as a service
• Completely self-service
– Applications annotate metrics and they are automatically collected
– Monitoring dashboards can be created by anyone
• Automatic metrics and dashboards for common features
– HTTP servers, system and OS metrics
– Client libraries (such as Kafka)
• Additional metrics can be published outside the container
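As a rough illustration of "annotate metrics and they are automatically collected", the sketch below uses a Python decorator as the annotation and an in-process registry as the collection point. LinkedIn's stack is mostly Java and the talk does not show its API, so `metered`, `REGISTRY`, `collect`, and the metric name are assumptions made for the example.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process registry; a real agent would ship these values elsewhere.
REGISTRY = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def metered(name):
    """Annotate a function so its call count, errors, and latency are recorded."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            stats = REGISTRY[name]
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise
            finally:
                # Runs on both success and failure, so every call is counted.
                stats["calls"] += 1
                stats["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorate


@metered("profile.lookup")
def lookup(member_id):
    if member_id < 0:
        raise ValueError("member_id must be non-negative")
    return {"id": member_id}


def collect():
    """Snapshot of every registered metric, as a collection agent might scrape it."""
    return {name: dict(stats) for name, stats in REGISTRY.items()}
```

The application author only adds the annotation; collection and dashboards can then be built on whatever `collect` exposes, without per-application monitoring code.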

Site Up

Site Up
• With the stack supporting it, applications sit on top
– SREs architect and run the application
– SRE and developers respond to failures
• The NOC monitors high-level metrics
– Overall site health and growth metrics
– They also coordinate incident response
• Incident response is blameless
– Fix the problem, don’t fix the blame

Review and Revise
• All components are constantly improving
– Incidents expose issues in the infrastructure
– Feedback from usage of the tools
• Steering committees discuss large-scale changes
– Production Operations, SRE, and Development all have their own
– Composed of individual contributors, not managers
• Open collaboration
– Common repositories mean everyone can help





Redefine Operations in a DevOps World: The New Role for Site Reliability Eng...

Todd Palino

Kafka makes so many things easier to do, from managing metrics to processing streams of data. Yet it seems that so many things we have done to this point in configuring and managing it have been object studies in how to make our lives, as the plumbers who keep the data flowing, more difficult than they have to be. What are some of our favorites? * Kafka without access controls * Multitenant clusters with no capacity controls * Worrying about message schemas * MirrorMaker inefficiencies * Hope and pray log compaction * Configurations as shared secrets * One-way upgrades We’ve made a lot of progress over the last few years improving the situation, in part by focusing some of this incredibly talented community towards operational concerns. We’ll talk about the big mistakes you can avoid when setting up multi-tenant Kafka, and some that you still can’t. And we will talk about how to continue down the path of marrying the hot, new features with operational stability so we can all continue to come back here every year to talk about it.

Running Kafka for Maximum Pain

Todd Palino

I'm No Hero: Full Stack Reliability at LinkedIn

1. I’m No Hero Full Stack Reliability At LinkedIn

6. We Can’t Do It Alone
• The Kafka SRE team is 3 people in the US, and 1.5 SREs in Bangalore
• We manage over 6000 application instances
  – 100 Kafka clusters, with 1800 brokers
  – Over 1 trillion messages a day
• The environment is never static from one day to the next

9. Infrastructure as a Service
• SREs do not deploy hardware and OS
• Production Operations
  – Datacenter Technicians
  – Systems Operations
  – Network Operations
• Provide all basic OS and network services
• There is still tweaking for individual applications

10. Common Repositories
• All source code and configurations are committed to one place
• Subversion and Git centrally managed
• Consistent management
  – Precommit checks
  – ACLs and Review boards
• Connects directly to the build systems

11. Containerization
• Most of our stack is Java
  – Python is well-supported
  – Always a few one-offs
• Java applications have Tomcat and Jetty containers
  – Hooks for monitoring
  – Client libraries are managed by the team that owns the application
• Provides a consistent control surface for applications

12. Build and Deployment
• When code is committed, it is automatically built
  – Successes become deployment artifacts
  – Failures are tracked via Jira
• Build systems are centrally managed
• Common tools
  – Dependency management and introspection
  – Version management
  – Error budgeting
  – Deployment

13. Monitoring
• Monitoring, graphing, and alerting as a service
• Completely self-service
  – Applications annotate metrics and they are automatically collected
  – Monitoring dashboards can be created by anyone
• Automatic metrics and dashboards for common features
  – HTTP servers, system and OS metrics
  – Client libraries (such as Kafka)
• Additional metrics can be published outside the container

15. Site Up
• With the stack supporting it, applications sit on top
  – SREs architect and run the application
  – SRE and developers respond to failures
• The NOC monitors high-level metrics
  – Overall site health and growth metrics
  – They also coordinate incident response
• Incident response is blameless
  – Fix the problem, don’t fix the blame

16. Review and Revise
• All components are constantly improving
  – Incidents expose issues in the infrastructure
  – Feedback from usage of the tools
• Steering committees discuss large-scale changes
  – Production Operations, SRE, and Development all have their own
  – Composed of individual contributors, not managers
• Open collaboration
  – Common repositories means everyone can help

Editor's Notes

This is not far from the truth. We go through a lot of beer. We’ll get to why I drink shortly.
Site Reliability Engineering, or SRE, combines several roles that fit together into one Operations position:
• Foremost, we are administrators. We manage all of the systems in our area.
• We are also architects. We do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together.
• And we are also developers. We identify the tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them.
This is all well and good for describing the responsibilities, but how do we do it? An SRE needs certain knowledge:
• A little knowledge of all the components
• An understanding of how they fit together
• An understanding of how to fit them into the infrastructure
Combined with the ability to build tools and automation around the applications, this knowledge is what lets SRE free the developers to focus on the application, not on running the application. At the end of the day, our job is to keep the site running, always.
At LinkedIn we have three types of SREs. The work is generally the same, but the scope is different for each.
Embedded SRE teams are closely aligned with a development team, working with a specific application. This requires deep knowledge of the application itself, and the SREs often find themselves working in the code. The development team and the SRE team work together on feature planning, with the SRE team providing their expertise in operations to inform the architecture of the application.
Central SRE teams (at LinkedIn we now call them Production SRE) oversee a number of different applications for a variety of development teams. Many of these applications are not big enough on their own to warrant their own teams, so the central SRE team is tasked with managing the operations of the applications, including making sure there’s hardware for them to run on. Production SRE is also the home of our NOC team, who provide high-level site monitoring and coordinate incidents that impact more than one team.
Tools and Infrastructure SREs are a category unto themselves. These teams are responsible for developing and deploying the infrastructure that everything at LinkedIn uses: for example, build and deployment systems, monitoring and alerting systems, and other tools that are common to all teams.
My role is that of an Embedded SRE, working directly with the development teams responsible for Streaming. So, on to why I drink.
This is an overview of the Streaming ecosystem at LinkedIn, highly simplified (it doesn’t account for multiple sites and simplifies many of the data flows). Within the Streaming organization, we have 3 teams: Data*, Kafka, and Samza.
Data* manages our change capture systems. There are several versions of these, with the latest being Brooklin. Brooklin uses Apache Kafka underneath for streaming changes from Espresso (a key-value store) to client systems.
Apache Kafka is the heart of our big data systems. Not only does it underpin Brooklin, but some of our data storage systems, such as Espresso and Voldemort (two different key-value systems), use Kafka for replication between components. We also have a number of multitenant Kafka clusters, which are used by every system and application at LinkedIn. These are used for user tracking data, system and application metrics, logging, and queuing all sorts of other messages. Because Kafka is used for metrics, driving our monitoring and alerting systems, we maintain separate monitoring systems for Kafka itself. Our team is also responsible for managing Zookeeper, used by us and many other applications.
Samza is the third team, and they manage our stream processing platform, which uses Apache Samza. It relies heavily on Kafka both to provide the data and as a place for intermediate results to be written. Some of the applications that run here are things like our data standardization systems and messaging applications.
My team is quite small. We have 3 SREs dedicated to Kafka and Zookeeper in the US, with a little more than another full SRE in our team in Bangalore, India. This is to manage a deployment with well over 6000 application instances. For the core part of that, the Kafka clusters themselves, we have over 100 separate clusters made up of more than 1800 servers. They’re processing over a trillion messages a day in total. What’s more, LinkedIn’s landscape is changing daily. There are thousands of applications running, with new versions deployed many times a day. Hardware is always changing, and we always have new features to contend with. There’s always someone who needs our help. How can we manage to run this ecosystem effectively with so small a team? The answer lies in what I call full-stack reliability.
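To put those figures in perspective, here is a rough back-of-envelope calculation in Python. It is only a sketch: the inputs are the rounded lower bounds quoted above ("over", "more than"), and the Bangalore headcount is taken as the 1.5 shown on the slide, so the results are approximate floors rather than measured values.

```python
# Back-of-envelope figures from the numbers quoted in the talk.
# All inputs are rounded lower bounds, so the outputs are rough floors.
MESSAGES_PER_DAY = 1_000_000_000_000  # "over a trillion messages a day"
BROKERS = 1800                        # "more than 1800 servers"
CLUSTERS = 100                        # "over 100 separate clusters"
APP_INSTANCES = 6000                  # "well over 6000 application instances"
SRE_HEADCOUNT = 3 + 1.5               # 3 in the US plus about 1.5 in Bangalore

SECONDS_PER_DAY = 24 * 60 * 60


def scale_summary():
    """Derive per-second and per-engineer ratios from the quoted totals."""
    return {
        "messages_per_second": MESSAGES_PER_DAY / SECONDS_PER_DAY,
        "brokers_per_cluster": BROKERS / CLUSTERS,
        "brokers_per_sre": BROKERS / SRE_HEADCOUNT,
        "instances_per_sre": APP_INSTANCES / SRE_HEADCOUNT,
    }


if __name__ == "__main__":
    for name, value in scale_summary().items():
        print(f"{name}: {value:,.0f}")
```

On these assumptions the average works out to more than eleven million messages per second, and several hundred brokers per engineer, which is the gap the rest of the stack has to close.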
Many of us will be familiar with Maslow’s hierarchy of needs. This diagram illustrates the theory that there are basic needs that must be met in order for us to function as human beings. Each need builds upon the one below it. None can stand unless the ones beneath are met. What makes the SRE teams at LinkedIn effective is that we have built our environment in a similar fashion. When building a system within a cloud environment, you have many services provided for you to take advantage of. This includes hardware, databases, load balancers, monitoring, and any number of other tools. The idea is that you want to be able to focus on your application, not on running those things that are not core to your business but are still required.
Here is what my stack looks like. I’m not as fancy as Maslow, with his colors, but the same theory stands. Each layer describes a basic need when it comes to reliability in our applications. None of the layers can stand unless the ones below them are satisfied. My stack has 6 layers, starting from the bottom:
• Infrastructure as a Service
• Common Repositories
• Containerization
• Build and Deployment
• Monitoring
• Site Up
We’ll cover each of these in turn.
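The "none can stand unless the ones below are satisfied" rule can be written down as a tiny Python sketch. The layer names come from the talk; the `healthy` status map and the function name are hypothetical, used only to illustrate the dependency ordering.

```python
# The six layers named in the talk, ordered from the bottom of the stack up.
LAYERS = [
    "Infrastructure as a Service",
    "Common Repositories",
    "Containerization",
    "Build and Deployment",
    "Monitoring",
    "Site Up",
]


def highest_standing_layer(healthy):
    """Return the topmost layer that can stand, or None if the base is unmet.

    `healthy` maps a layer name to True/False (a hypothetical status input).
    A layer stands only if it and every layer beneath it are healthy.
    """
    standing = None
    for layer in LAYERS:
        if not healthy.get(layer, False):
            break  # everything above an unmet layer cannot stand
        standing = layer
    return standing
```

For example, if every layer is healthy except Containerization, the function returns "Common Repositories": Build and Deployment, Monitoring, and Site Up may be fine in isolation, but they have nothing to stand on.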
As an SRE, I have never set foot in a LinkedIn datacenter, nor have I had my hands on one of our servers. I haven’t even installed an operating system on one of them. Likewise, I have never worked on our networking hardware, or directly made modifications to a service like DNS. All of these services are provided by a separate organization, named Production Operations. The 3 larger teams that SRE works with on a day-to-day basis are:
• Datacenter Technicians, who are the people who actually deal directly with the hardware. They are the ones on site in each datacenter to both deploy and maintain the systems.
• Systems Operations, the team responsible for the operating system deployment. They are also responsible for maintaining services like DNS.
• Network Operations, which performs a similar function for the networking, handling all the routers and switches, as well as firewalls, load balancers, and more.
The ProdOps team provides all basic OS and network services so that other teams do not have to have specialists in these areas and there is consistency across the infrastructure. For most applications, when I need to deploy new services I can allocate systems from a common pool and deploy with one command. If I need DNS changes, or network ACLs, I open a request for the change and it’s taken care of promptly. When I need to deploy a new broker, it’s a little different because brokers use custom hardware and tuning. For this, I put in a ticket for new hardware. Within a specified time, I get a hostname for the new system. I can trust that it’s already configured the way I need it, and it’s fully integrated with LinkedIn’s systems. I just need to deploy my application. How we get to that deployable application involves the next 3 layers.
Applications start as source code, and how that is managed forms the base of the application layers. We use a single set of repositories for all code and configuration, with the two kept separate from each other. These Subversion and Git repositories (we use both right now) are centrally managed by our Tools team. They have consistent precommit checks, which not only validate the basic format of certain files (like XML or YAML), but also perform more complex checks like rejecting duplicate class definitions. There are also ACLs and review boards tied in so that individual teams can make sure that changes to their applications are appropriately vetted before they are committed. These repositories are tied into our build system as well, as we’ll discuss in a later layer. This may seem like a small thing to make up such a fundamental layer, but the management of code and config is critical. We have cultural tenets of craftsmanship and openness, and this serves both of them. Precommit checks allow us to follow a set of standards as to how we write code. Having it all in one place means that anyone can check out anyone else’s code: there are no secrets. It’s also important that we maintain configurations the same way we maintain code. Reviewing changes before they are checked in means we are able to catch a lot of problems before they get out to production.
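To make the precommit checks above concrete, here is a minimal sketch of two of them: rejecting malformed XML and rejecting a class that is defined in more than one file. It is an illustration only, not LinkedIn's actual tooling. The function names, the `files` mapping of path to file contents, and the regular expressions used to find Java class names are all assumptions, and the class detection is deliberately rough.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, rough patterns for a Java package declaration and class name.
PACKAGE_RE = re.compile(r"^\s*package\s+([\w.]+)\s*;", re.MULTILINE)
CLASS_RE = re.compile(
    r"^\s*(?:public\s+)?(?:final\s+|abstract\s+)*class\s+(\w+)", re.MULTILINE
)


def check_xml(path, text):
    """Return a list with one error string if the XML is not well-formed."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return [f"{path}: malformed XML ({exc})"]
    return []


def check_duplicate_classes(files):
    """Return errors for fully qualified class names found in several files."""
    seen = defaultdict(list)
    for path, text in files.items():
        if not path.endswith(".java"):
            continue
        pkg = PACKAGE_RE.search(text)
        prefix = pkg.group(1) + "." if pkg else ""
        for name in CLASS_RE.findall(text):
            seen[prefix + name].append(path)
    return [
        f"duplicate class {fqcn}: {', '.join(sorted(paths))}"
        for fqcn, paths in sorted(seen.items())
        if len(paths) > 1
    ]


def precommit(files):
    """Run all checks on a {path: contents} mapping; an empty list means OK."""
    errors = []
    for path, text in sorted(files.items()):
        if path.endswith(".xml"):
            errors.extend(check_xml(path, text))
    errors.extend(check_duplicate_classes(files))
    return errors
```

A commit hook built this way would reject the commit whenever `precommit` returns a non-empty list and show the messages to the author.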
Most of the applications we are working with are Java. We do have a large number of Python applications, as that is the other supported language and it’s used a lot by the SRE teams for writing the tools around the applications. Of course, there are more languages than that in use: I have a few Golang apps that we have written. Because Go is not a fully supported language, I had to take a few extra steps to make sure those apps would integrate with all of our build and deployment systems. All of the Java applications run in a container, usually Tomcat or Jetty, that encapsulates the application and provides all of the common pieces for the application developer. For example, the monitoring systems (which make up a later layer) are simply hooked in here. Most client libraries are accessed via Spring here. The versions have already been vetted by other teams, and any configuration parameters either have sane defaults or are surfaced in the application’s config. The most important thing about the containers is that they provide a reliable control surface for the application. This allows the app to interact with all of the tooling within LinkedIn without needing to specifically implement it. For example, the container provides an HTTP endpoint of its own. For any app, I can quickly determine the port number of this endpoint, because there is a registry of port numbers, and I know that I can request `/admin` on that endpoint and get back either a good or a bad response, depending on the health of the application. A number of tools and automatic monitoring systems depend on this.
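As a rough sketch of how a tool might use that control surface, the snippet below looks up an application's port in a registry and requests `/admin` on it. The registry contents, the application names, and the rule that HTTP 200 means "good" are assumptions made for illustration; the talk only says that a port registry exists and that `/admin` returns a good or a bad response.

```python
import urllib.error
import urllib.request

# Hypothetical port registry; only its existence is described in the talk.
PORT_REGISTRY = {"example-frontend": 12345, "example-backend": 12346}


def admin_url(app, host, registry=PORT_REGISTRY):
    """Build the URL of the container's admin endpoint for an app on a host."""
    return f"http://{host}:{registry[app]}/admin"


def http_status(url, timeout=2.0):
    """Return the HTTP status code of a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def is_healthy(app, host, registry=PORT_REGISTRY, fetch=http_status):
    """Assumption: a 200 response means healthy; anything else does not."""
    return fetch(admin_url(app, host, registry)) == 200
```

The `fetch` parameter exists only so the check can be exercised without a live server, for example `is_healthy("example-frontend", "host1", fetch=lambda url: 503)`.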
As soon as code is committed to the repository, a build task is started. Most of us are familiar with these processes from open source projects, and we handle our internal applications the same way. Successful builds automatically become deployable artifacts and are pushed up to Artifactory. Failures have a ticket created for them, assigned to the person who checked in the code. In many cases, the bad commit is automatically reverted to keep trunk (or master) clean. As with everything else, these build systems are centrally managed by the Tools team. For all of them, we maintain helper applications that make working with apps easy. With common repositories and build systems, I can easily introspect and manage the dependency tree, for example. As the owner of the Kafka client library, this is very important to me. When I have a critical fix that needs to go out to hundreds of applications, I can push a library update into all the dependent applications with as little as a single command. We also have a system for tracking the versions of applications that are deployed. It enforces certain rules and deployment steps, which can be defined for each app, so we can set a release process that can be followed by anyone. That in turn means I can trust developers to deploy applications to production, because they will always follow the deployment path we have worked out together. Deployment is pretty amazing as well. Not only can we use the version tracking system to perform multiple steps with the push of a button, but if I need to get a little more manual it’s still only one command to deploy anywhere in our infrastructure.
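Introspecting the dependency tree largely comes down to answering "who depends on this library, directly or indirectly?". The sketch below answers that with a breadth-first walk over reversed dependency edges. The artifact names and the `DEPENDS_ON` mapping are made up for illustration; the real tooling and its data source are not described in the talk.

```python
from collections import deque

# Hypothetical data: each artifact maps to the artifacts it depends on directly.
DEPENDS_ON = {
    "kafka-client": [],
    "metrics-lib": ["kafka-client"],
    "app-a": ["kafka-client"],
    "app-b": ["metrics-lib"],
    "app-c": [],
}


def dependents_of(library, depends_on):
    """Return every artifact that depends on library, directly or transitively."""
    # Invert the edges so we can walk from a library to its users.
    reverse = {}
    for artifact, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(artifact)

    found, queue = set(), deque([library])
    while queue:
        current = queue.popleft()
        for user in reverse.get(current, ()):
            if user not in found:  # also guards against dependency cycles
                found.add(user)
                queue.append(user)
    return sorted(found)
```

Here `dependents_of("kafka-client", DEPENDS_ON)` includes `app-b` even though it only reaches the client through `metrics-lib`, which is the set a library owner would need to update for a critical fix.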
Once an application is deployed, monitoring is the most important part of running it. If there’s an application that doesn’t have some sort of monitoring on it, it may as well not exist at all. At LinkedIn, our monitoring systems, including graphing and alerting, are all provided as a service for the rest of the organization by our Infrastructure SRE team. What’s more, it is a completely self-service system. Metrics do not have to be approved and on-boarded before they can be used. If a developer wants to expose a new metric, all they have to do is annotate the sensor within the application. The container logic takes care of polling the sensor and producing the metrics into Kafka. From there, the monitoring system consumes them, and within about 5 minutes graphs are available. We can then set up a dashboard with multiple metrics, including alert thresholds. Once the metrics are in the system, they are accessible by everyone, and anyone can set up their own dashboard to watch something. Many common components have their own metrics and dashboards automatically provided without the application needing to annotate them. For example, if an application uses a Kafka client, there are a number of metrics that are produced by default. There are also dashboards for some common things, like HTTP servers. It’s also possible to publish metrics into the system separately from the container. Since we use Kafka for collecting metrics, all you have to do is publish a metrics message. We have helper REST applications for this.
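The annotate, poll, and produce flow can be sketched in a few lines. This is not LinkedIn's container code: the `sensor` decorator, the message fields, the `metrics` topic name, and the `ListProducer` stand-in for a Kafka producer are all hypothetical, chosen only to show the shape of the mechanism.

```python
import json
import time

_SENSORS = {}  # metric name -> zero-argument callable returning a number


def sensor(name):
    """Hypothetical annotation: register a function as a named metric sensor."""
    def register(func):
        _SENSORS[name] = func
        return func
    return register


class ListProducer:
    """Stand-in for a Kafka producer; it just records (topic, payload) pairs."""

    def __init__(self):
        self.sent = []

    def send(self, topic, payload):
        self.sent.append((topic, payload))


def poll_sensors(producer, app, topic="metrics", now=time.time):
    """Read every registered sensor once and publish one message per metric."""
    for name, read in sorted(_SENSORS.items()):
        message = {"app": app, "metric": name, "value": read(), "timestamp": now()}
        producer.send(topic, json.dumps(message).encode("utf-8"))


# The application developer's only job in this sketch is the annotation.
@sensor("queue.depth")
def queue_depth():
    return 42
```

In this sketch the container side would call `poll_sensors` on a timer with a real producer, so adding a metric never requires touching the monitoring system itself.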
Let’s be honest, none of this runs at 100% all the time. With applications in a constant state of change, where does this put us at the top of the stack, where I, an SRE, am trying to keep the site up? Everything is on fire all the time, and that’s OK. Hardware is always failing, but ProdOps is detecting that and resolving it. The developers are constantly checking in changes, some of them pretty sketchy, and the tooling is taking care of building the code and generating deployables. Thanks to our Infrastructure SRE team, when those sketchy changes do make it to production, there is monitoring to detect problems and help us resolve them quickly.
SRE focuses on architecting and running the application. We write tools and scripts to support this, and sometimes we write more general tools that other teams use as well. When something breaks, I work with my developers (as an embedded SRE) to get it fixed. Our NOC is there to monitor the high-level metrics, as most of the monitoring and alerting goes directly to the teams responsible. The NOC watches overall site health, and they track many metrics related to site growth. When there is a problem, they help coordinate multiple teams in fixing it. This is what we call “site up”, and it is the top priority. A big component of this is that our incident process, both the response and the followup, is blameless. It doesn’t matter who caused a problem; what is important is that we fix it and then make sure it doesn’t happen again. Trying to figure out who is at fault takes time away from other things, and only serves to make someone feel bad and less likely to contribute something meaningful in the future.
As with any system, you must review it all the time and make sure you’re headed in the right direction. Like any other application, the infrastructure components are constantly being improved. Some of the incidents we have to resolve expose deficiencies, whether it’s monitoring that we missed or a process that needs to be changed to be safer. As users of the tools and infrastructure, SREs and developers provide feedback on what works and what doesn’t. For bigger changes to what we’re doing, we have several steering committees that can be engaged to provide broader input and direction. The ProdOps, SRE, and development organizations each have their own committee covering different areas, and we collaborate with each other as needed. The committees are made up of individual contributors, the higher-level technical employees, and not managers. This is important, because it feeds into our culture of strong technical leadership. Most importantly, our systems are set up to provide for open collaboration between all teams. Common code and config repositories are one aspect of this: when everyone can see what’s going on, everyone can contribute. This means that when I find a problem with a tool, I can create a fix and send the owner a patch to review. The alternative is just giving them the feedback, after which they need to set aside time to look at it among all the other things they have to do, duplicate the problem, create a fix, and get it reviewed.

Editor's Notes