URP? Excuse You! The Three Metrics You Have to Know

•

1 like•965 views

(Todd Palino, LinkedIn) Kafka Summit SF 2018 What do you really know about how to monitor a Kafka cluster for problems? Is your most reliable monitoring your users telling you there’s something broken? Are you capturing more metrics than the actual data being produced? Sure, we all know how to monitor disk and network, but when it comes to the state of the brokers, many of us are still unsure of which metrics we should be watching, and what their patterns mean for the state of the cluster. Kafka has hundreds of measurements, from the high-level numbers that are often meaningless to the per-partition metrics that stack up by the thousands as our data grows. We will thoroughly explore three key monitoring concepts in the broker, that will leave you an expert in identifying problems with the least amount of pain: -Under-replicated Partitions: The mother of all metrics -Request Latencies: Why your users complain -Thread pool utilization: How could 80% be a problem? We will also discuss the necessity of availability monitoring and how to use it to get a true picture of what your users see, before they come beating down your door!

Technology

URP? Excuse You!
Todd Palino
Senior Staff Engineer, Site Reliability
LinkedIn

• What is Kafka
• Encyclopedia of Monitoring
• Automation
What This
Talk Is Not

Monitoring is not Alerting
• Collect everything
• Alert on nothing
• Events are better than metrics
• Tests are better than alerts
• Sleep is best in life

• What’s an SLA?
• Availability
• Latency
• Customer Guarantees
Service
Level
Objectives

The Three Metrics You Need to Know
Partitions that are not
fully replicated within
the cluster
URP
The overall utilization
of an Apache Kafka
broker
Request Handlers
How long requests are
taking, in which stage
of processing
Request Timing

Under-Replicated Partitions
• Highly discussed
• Overall cluster health
• Replication is a consumer and producer

Under-Replicated Partitions
EXAMPLE: FAILED BROKER

Under-Replicated Partitions
EXAMPLE: CONSUMER PROBLEMS

Under-Replicated Partitions
EXAMPLE: PRODUCER PROBLEMS

Under-Replicated Partitions
• Overrated
• Doesn’t map to SLO
• Often not actionable
• Collect, but don’t alert

Everybody
In The
Pool
• Specialized thread pools
• Clients deal with network and
request pools
• Request handlers do most of the
work

Request
Handlers
• Decode and validate
• Perform task
• Wait for other brokers
• Assemble response

Request Handler Problems
• Anything that causes Kafka
to expend CPU cycles
• Includes problems related to
failing disks (IO wait)
• SSL and compression work
both can use a lot of CPU
CPU Time Timeout Deadlock
• Most often due to failing to
process controller requests
• Intra-cluster requests tend to
be bound by partition counts
• Rapidly starves the pool of
threads
• Should always be a code bug
• Usually looks exactly like a
timeout problem
• Rare, but hard to identify

Request Handler Problems
EXAMPLE: TIMEOUT OR DEADLOCK

Brokers Don’t Shouldn’t Do Compression
• Kafka brokers are running a new version
• Message format has been set to the new version
• Clients haven’t upgraded
Up Conversion Down Conversion
• Kafka brokers are running a new version
• Message format is set to an older version due to
clients
• Producer clients update to new version

Request Timing
• Remote – Waiting for other brokers
• Response Queue – Waiting to send
• Response Send - Send to client
• Total – Request handling, end to end
• Request Queue – Waiting to process
• Local – Work local to the broker

Request Timing
EXAMPLE: PRODUCE TOTAL TIME

Request Timing
EXAMPLE: PRODUCE LOCAL TIME

Request Timing
EXAMPLE: PRODUCE REMOTE TIME

Availability
Monitoring
• SLO, part 2
• Measured externally
• Client focused
• github.com/linkedin/kafka-monitor

Operating System
And Hardware
Metrics
• What do they mean?
• What application is causing it?
• Don’t alert unless:
• 100% clear signal
• 100% clear response

Capacity
Planning
• Plan in advance
• Multi-factor
• Don’t alert for capacity

Capacity
Metrics
• Request Handler Idle Ratio
• Disk Utilization
• Partition Count
• Network Utilization

If You Remember Nothing Else…
• Define your service level objectives
• Monitor your service level objectives
• Metrics that cover many problems are noisy
• Buy Kafka: The Definitive Guide

Getting (and Giving) Help
• Kafka Monitor
• https://github.com/linkedin/kafka-monitor
• Burrow
• https://github.com/linkedin/Burrow
• Cruise Control
• https://github.com/linkedin/cruise-control
• kafka-tools
• https://github.com/linkedin/kafka-tools
LinkedIn Open Source Get Involved
• Community
• users@kafka.apache.org
• dev@kafka.apache.org
• Bugs and Work:
• https://issues.apache.org/jira/projects/KAFKA

What's hot

(Preston Thompson, Braze) Kafka Summit SF 2018 If you collect billions of data points every day and create billions more sending and tracking messages, then you know you need to get your infrastructure right. Our clients use Braze to engage their users over their lifecycle via push notifications, emails, in-app messages and more. Using our Currents product, clients can enable multiple configurable integrations to export this event data in real time to a variety of third-party systems, allowing them to tightly integrate with the rest of their operations and understand the impacts of their engagement strategy. We use Kafka and the Kafka ecosystem to power this high volume real-time export. As you’d expect in a big data environment, we take data collected from a variety of sources—our SDKs, email partner APIs, our own systems—and produce it to Kafka, with topics for each type of event (about 30 types). Kafka Streams filters and transforms this data according to the configurations set by our clients. Clients can choose which types of events should be sent to which third-party systems. Kafka Connect helps to export the data to third-party systems in real time using custom developed connectors. We run a connector instance for each integration for each customer that consumes from the integration-specific topic. On top of it all, we built a service to manage the pipeline. The service provides configurations to the Streams application and also creates topics for new integrations and uses the Connect REST API to create and manage connectors. In this talk, I will discuss: -How we started our journey in designing this large-scale streaming architecture -Why streaming technologies were necessary to solve our technology and business issues -The lessons we learned along the way that can help you with your Kafka-based architecture

Real-Time Dynamic Data Export Using the Kafka Ecosystem

confluent

While we frequently talk about how to build interesting products on top of machine and event data, the reality is that collecting, organizing, providing access to, and managing this data is where most people get stuck. Many organizations understand the use cases around their data – fraud detection, quality of service and technical operations, user behavior analysis, for example – but are not necessarily data infrastructure experts. In this session, we’ll follow the flow of data through an end to end system built to handle tens of terabytes an hour of event-oriented data, providing real time streaming, in-memory, SQL, and batch access to this data. We’ll go into detail on how open source systems such as Hadoop, Kafka, Solr, and Impala/Hive are actually stitched together; describe how and where to perform data transformation and aggregation; provide a simple and pragmatic way of managing event metadata; and talk about how applications built on top of this platform get access to data and extend its functionality. Attendees will leave this session knowing not just which open source projects go into a system such as this, but how they work together, what tradeoffs and decisions need to be addressed, and how to present a single general purpose data platform to multiple applications. This session should be attended by data infrastructure engineers and architects planning, building, or maintaining similar systems.

Building an Event-oriented Data Platform with Kafka, Eric Sammer

confluent

The grace period is a parameter of windowed operations such as Window or Session aggregates, or stream-stream joins. This configuration determines how long after a window ends any new data will still be processed. Events or records arriving after the expiration of the grace period are dropped from such windows. Prior to KIP-633, the default grace period was 24 hours which resulted in an endless stream of unexpected behaviors, problems, confusion and bewilderment for inexperienced users where results would not show up for 24 hours. In this talk, we cover the public API changes to the TimeWindows, SessionWindows, JoinWindows and SlidingWindows as well as the new guidance going forward.

The New Way of Configuring Grace Periods for Windowed Operations in Kafka Str...

HostedbyConfluent

Have you ever migrated Kafka clusters from one data center to another being completely transparent to client applications? At PayPal, as part of a massive datacenter migration initiative, Kafka team successfully moved all PayPal Kafka traffic across data centers. This initiative involved migrating 20+ Kafka clusters (1000+ broker and zookeeper nodes), as well as 60+ mirrormaker groups which seamlessly handle Kafka traffic volumes as high as 1 trillion messages per day. Throughout the course of this migration, applications required no modification, encountered 0% service outage, 0% message loss and duplicated messages. The whole migration process was fully transparent to Kafka applications. In this session, you will learn the strategies, techniques and tools the PayPal Kafka team has utilized for managing the migration process. You will also learn the lessons and pitfalls they experienced during this exercise, as well as the secret sauce of making the migration successful.

How did we move the mountain? - Migrating 1 trillion+ messages per day across...

HostedbyConfluent

When moving to a cloud native architecture Moogsoft knew they needed more scale than Rabbit could provide. Moogsoft moved into Kafka which is known for quick writing and driving heavy event driven workloads on top of niceties such as replayability. Choosing the tool was easy, finding a vendor that ticked all their boxes was not. They needed to ensure scalability, upgradability, builds via existing IAC pipelines, and observability via existing tools. When Moogsoft found Aiven, they were impressed with their offering and ability to scale on demand. During this presentation we will explore how Moogsoft used Aiven for Kafka to manage and scale their data in the cloud.

The Road Most Traveled: A Kafka Story | Heikki Nousiainen, Aiven

HostedbyConfluent

Apache Kafka is the backbone for building architectures that deal with billions of events a day. Chris Castle, Developer Advocate, will show you where it might fit in your roadmap. - What Apache Kafka is and how to use it on Heroku - How Kafka enables you to model your data as immutable streams of events, introducing greater parallelism into your applications - How you can use it to solve scale problems across your stack such as managing high throughput inbound events and building data pipelines Learn more at https://www.heroku.com/kafka Reveal.js version of slides: http://slides.com/christophercastle/deck#/

Event Driven Architectures with Apache Kafka on Heroku

Heroku

Monitoring Apache Kafka When you are running systems in production, clearly you want to make sure they are up and running at all times. But in a distributed system such as Apache Kafka… what does “up and running” even mean? Experienced Apache Kafka users know what is important to monitor, which alerts are critical and how to respond to them. They don’t just collect metrics - they go the extra mile and use additional tools to validate availability and performance on both the Kafka cluster and their entire data pipelines. In this presentation, we’ll discuss best practices of monitoring Apache Kafka. We’ll look at which metrics are critical to alert on, which are useful in troubleshooting and what may actually misleading. We’ll review a few “worst practices” - common mistakes that you should avoid. We’ll then look at what metrics don’t tell you - and how to cover those essential gaps.

Monitoring Apache Kafka

confluent

Developing cloud native microservices introduced us to many new challenges. One of the most difficult is to build reliable microservices integrations and their data exchange patterns. In this session I will share my 10 years of experience with building microservices and application runtime platforms with some of the largest European organisations. I will introduce basic principles of developing Java Spring Boot with Apache Kafka. These patterns can be used for: microservices communication decoupling, implementing microservices state stores, avoiding dependencies on traditional database systems. This session is targeted for developers who are interested in learning new cloud native development practices and understanding how event streaming microservices improve their current work. Demo application code will be available to participants.

Building Event Streaming Microservices with Spring Boot and Apache Kafka | Ja...

HostedbyConfluent

As a data professional, you are the glue that makes cross-platform integrations possible. With the increase in adoption of hybrid cloud architectures, Kafka is an increasingly relevant tool for building data pipelines between platforms and accelerating delivery on cloud projects. Early exposure to Kafka on Azure capabilities gives you an edge to build better mousetraps at the design phase. Customers already running Kafka on premises and are looking to extend Kafka systems to Azure can get started quickly with Confluent Cloud. Additionally, DevOps for self-managed options can be easily scalable with Ansible for Virtual Machines or containers via Azure Kubernetes Services or Azure Container Instances. This session is presented from the Microsoft Solution Architect perspective by Israel Ekpo, Microsoft Cloud Solution Architect and Alicia Moniz, Microsoft MVP. They will cover use cases and scenarios, along with key Azure integration points and architecture patterns.

Confluent On Azure: Why you should add Confluent to your Azure toolkit | Alic...

HostedbyConfluent

You have learned about Kafka event sourcing with streams and using Kafka as a database, but you may be having a tough time wrapping your head around what that means and what challenges you will face. Kafka’s exactly once semantics, data retention rules, and stream DSL make it a great database for real-time transaction processing. This talk will focus on how to use Kafka events as a database. We will talk about using KTables vs GlobalKTables, and how to apply them to patterns we use with traditional databases. We will go over a real-world example of joining events against existing data and some issues to be aware of. We will finish covering some important things to remember about state stores, partitions, and streams to help you avoid problems when your data sets become large.

Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...

HostedbyConfluent

Watch this talk here: https://www.confluent.io/online-talks/integrating-apache-kafka-into-your-environment-on-demand Integrating Apache Kafka with other systems in a reliable and scalable way is a key part of an event streaming platform. This session will show you how to get streams of data into and out of Kafka with Kafka Connect and REST Proxy, maintain data formats and ensure compatibility with Schema Registry and Avro, and build real-time stream processing applications with Confluent KSQL and Kafka Streams. This session is part 4 of 4 in our Fundamentals for Apache Kafka series.

Integrating Apache Kafka Into Your Environment

confluent

Building a Web Application with Kafka as your Database

confluent

Tale of two streaming frameworks (Karthik D - Walmart)

KafkaZone

As an AWS shop, Zillow engineering teams have been using various messaging and streaming services for years. As Zillow 2.0 piled through, new requirements and pain points made us rethink our streaming stack. The need for high data quality, decoupling producers & consumers and real time homes data called for a new platform which would empower developers, enable data governance and reduce incidents caused by bad data. In this session, you will learn why Zillow decided to go with Kafka for that platform, what tools we built to meet developers where they are and what common challenges you could face as you migrate other streaming solutions to Kafka.

How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...

HostedbyConfluent

Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...

confluent

With 3.6 million paid print and digital subscriptions, how did The New York Times remain a leader in an evolving industry that once relied on print? It fundamentally changed its infrastructure at the core to keep up with the new expectations of the digital age and its consumers. Now every piece of content ever published by The New York Times throughout the past 166 years and counting is stored in Apache Kafka. Join The New York Times' Director of Engineering Boerge Svingen to learn how the innovative news giant of America transformed the way it sources content while still maintaining searchability, accuracy and accessibility through a variety of applications and services—all through the power of a real-time streaming platform. In this talk, Boerge will: -Provide an overview of what the publishing infrastructure used to look like -Deep dive into the log-based architecture of The New York Times’ Publishing Pipeline -Explain the schema, monolog and skinny log used for storing articles -Share challenges and lessons learned -Answer live questions submitted by the audience Watch the recording: https://videos.confluent.io/watch/SURnGMNNzsvDHYCmnCkJEY?

Apache Kafka® Delivers a Single Source of Truth for The New York Times

confluent

How to over-engineer things and have fun? | Oto Brglez, OPALAB

HostedbyConfluent

APAC Kafka Summit - Best Of

confluent

Is your organization adopting Kafka as their messaging bus but you've found that it will take too long to migrate your existing service-oriented architecture to a log-oriented architecture? Some of the biggest challenges in building a new stream processor can be implementing all the business logic again. It has become increasingly common for companies with high-throughput source streams and change-data-capture logs to want to build systems fast. At Ticketmaster, we have found a solution to the problem by leveraging the business logic in our existing services and calling them from our Java based KafkaStreams processor applications in an efficient manner. In this talk, we will examine the initial challenges we faced in our transition, then we will explore the solutions we built to address the use cases at Ticketmaster. The primary focus will address our workflow around calling services to bring stream processor applications to market fast. We will review our challenges and share tips for success.

Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...

confluent

William Hill is one of the UK’s largest, most well-established gaming companies with a global presence across 9 countries with over 16,000 employees. In recent years the gaming industry and in particular sports betting, has been revolutionised by technology. Customers now demand a wide range of events and markets to bet on both pre-game and in-play 24/7. This has driven out a business need to process more data, provide more updates and offer more markets and prices in real time. At William Hill, we have invested in a completely new trading platform using Apache Kafka. We process vast quantities of data from a variety of feeds, this data is fed through a variety of odds compilation models, before being piped out to UI apps for use by our trading teams to provide events, markets and pricing data out to various end points across the whole of William Hill. We deal with thousands of sporting events, each with sometimes hundreds of betting markets, each market receiving hundreds of updates. This scales up to vast numbers of messages flowing through our system. We have to process, transform and route that data in real time. Using Apache Kafka, we have built a high throughput, low latency pipeline, based on Cloud hosted Microservices. When we started, we were on a steep learning curve with Kafka, Microservices and associated technologies. This led to fast learnings and fast failings. In this session, we will tell the story of what we built, what went well, what didn’t go so well and what we learnt. This is a story of how a team of developers learnt (and are still learning) how to use Kafka. We hope that you will be able to take away lessons and learnings of how to build a data processing pipeline with Apache Kafka.

Building High-Throughput, Low-Latency Pipelines in Kafka

confluent

What's hot (20)

Real-Time Dynamic Data Export Using the Kafka Ecosystem

Building an Event-oriented Data Platform with Kafka, Eric Sammer

The New Way of Configuring Grace Periods for Windowed Operations in Kafka Str...

How did we move the mountain? - Migrating 1 trillion+ messages per day across...

The Road Most Traveled: A Kafka Story | Heikki Nousiainen, Aiven

Event Driven Architectures with Apache Kafka on Heroku

Monitoring Apache Kafka

Building Event Streaming Microservices with Spring Boot and Apache Kafka | Ja...

Confluent On Azure: Why you should add Confluent to your Azure toolkit | Alic...

Using Kafka as a Database For Real-Time Transaction Processing | Chad Preisle...

Integrating Apache Kafka Into Your Environment

Building a Web Application with Kafka as your Database

Tale of two streaming frameworks (Karthik D - Walmart)

How Zillow Unlocked Kafka to 50 Teams in 8 months | Shahar Cizer Kobrinsky, Z...

Kafka Summit NYC 2017 - Every Message Counts: Kafka as a Foundation for Highl...

Apache Kafka® Delivers a Single Source of Truth for The New York Times

How to over-engineer things and have fun? | Oto Brglez, OPALAB

APAC Kafka Summit - Best Of

Leveraging services in stream processor apps at Ticketmaster (Derek Cline, Ti...

Building High-Throughput, Low-Latency Pipelines in Kafka

Similar to URP? Excuse You! The Three Metrics You Have to Know

Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)

Ontico

Kafka at scale facebook israel

Gwen (Chen) Shapira

Apache Kafka lies at the heart of the largest data pipelines, handling trillions of messages and petabytes of data every day. Learn the right approach for getting the most out of Kafka from the experts at LinkedIn and Confluent. Todd Palino and Gwen Shapira demonstrate how to monitor, optimize, and troubleshoot performance of your data pipelines—from producer to consumer, development to production—as they explore some of the common problems that Kafka developers and administrators encounter when they take Apache Kafka from a proof of concept to production usage. Too often, systems are overprovisioned and underutilized and still have trouble meeting reasonable performance agreements. Topics include: - What latencies and throughputs you should expect from Kafka - How to select hardware and size components - What you should be monitoring - Design patterns and antipatterns for client applications - How to go about diagnosing performance bottlenecks - Which configurations to examine and which ones to avoid

Putting Kafka Into Overdrive

Todd Palino

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1pGpnbd. Bhakti Mehta approaches best practices for building resilient, stable and predictable services: preventing cascading failures, timeouts pattern, retry pattern, circuit breakers and other techniques which have been pervasively used at Blue Jeans Network. Filmed at qconsf.com. Bhakti Mehta is the author of "RESTful Java Patterns and Best practices” and "Developing RESTful Services with JAX-RS 2.0, WebSockets, and JSON”. Bhakti is a Senior Software Engineer at Blue Jeans Network. As part of her current role, she works on developing RESTful services that can be consumed by ISV partners and the developer community.

Resilience Planning & How the Empire Strikes Back

C4Media

A commonly used version control system in the ColdFusion community is Subversion -- a centralized system that relies on being connected to a central server. The next generation version control systems are “decentralized”, in that version control tasks do not rely on a central server. Decentralized version control systems are more efficient and offer a more practical way of software development. In this session, Indy takes you through the considerations in moving from Subversion to Git, a decentralized version control system. You also get to understand the pros and cons of each and hear of the practical experience of migrating projects to decentralized version control. Version control is often used in conjunction with a testing framework and continuous integration. To complete the picture, Indy walks you through how to integrate Git with a testing framework, MXUnit, and a continuous integration server, Hudson.

Make It Cooler: Using Decentralized Version Control

indiver

Asynchronous programming using CompletableFutures in Java

Oresztész Margaritisz

Fault Tolerance in Distributed Environment

Orkhan Gasimov

Production Ready Microservices at Scale

Rajeev Bharshetty

View full webinar on demand at http://bit.ly/nginxbenchmarking Whether you’re doing performance testing or planning for infrastructure needs, benchmarking can be a big deal. Join us for this webinar where we cover NGINX benchmarking best practices, including: - the test environment - configuring NGINX - using benchmarking tools - and more! You’ll learn how to approach doing benchmarks so that you obtain results that are more accurate, better understood, and do a better job of addressing the needs of your project.

Benchmarking NGINX for Accuracy and Results

NGINX, Inc.

Best practices for highly available and large scale SolrCloud

Anshum Gupta

Cassandra is pretty awesome, sure I am biased, but it rocks. Always on, tuneable consistency and multi-master architecture? Let’s get our web scale on and build a highly available app that never goes down! Hold on a second. There is one key piece of the puzzle that has a massive impact on your applications availability: the client driver. In this talk we will go through the how to best configure your clients to make the most of failure handling and tuneable consistency in Cassandra.

Client Drivers and Cassandra, the Right Way

DataStax Academy

Kafka is a high-throughput, fault-tolerant, scalable platform for building high-volume near-real-time data pipelines. This presentation is about tuning Kafka pipelines for high-performance. Select configuration parameters and deployment topologies essential to achieve higher throughput and low latency across the pipeline are discussed. Lessons learned in troubleshooting and optimizing a truly global data pipeline that replicates 100GB data under 25 minutes is discussed.

Tuning kafka pipelines

Sumant Tambe

My talk at Confoo 2016 Montreal It is well said that "The more you sweat on the field, the less you bleed in war". Failures are an inevitable part of complex systems. Accepting that failures happen, will help you design the system's reactions to specific failures. This talks on best practices for building resilient, stable and predictable services: preventing Cascading failures, Timeouts pattern, Retry pattern,Circuit breakers and many more techniques in microservices

Expect the unexpected: Prepare for failures in microservices

Bhakti Mehta

It's possible to introduce real-time features to PHP applications without deep modifications of the current codebase. Using WAMP you can build distributed systems out of application components which are loosely coupled and communicate in (soft) real-time. There is no need to learn a whole new language, with the implications it has. It also opens the door to write reactive, event-based, distributed architectures and to achieve easier scalability by distributing messages to multiple systems.

Adding Real-time Features to PHP Applications

Ronny López

Design Review Best Practices - SREcon 2014

Mandi Walls

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1vfO62b. Lisa Van Gelder provides simple tips and tricks for improving delivery without investing lots of time up front creating complex deployment frameworks. Filmed at qconsf.com. Lisa Van Gelder is a Senior Consultant at Cyrus Innovation where she works with companies to build and deliver software solutions, improve their software development process, and speed up delivery.

Continuous Delivery for the Rest of Us

C4Media

CoAP Talk

Basuke Suzuki

Design Reviews for Operations - Velocity Europe 2014

Mandi Walls

Stream Processing @ Lyft

Jamie Grier

Project Sherpa: How RightScale Went All in on Docker

RightScale

Similar to URP? Excuse You! The Three Metrics You Have to Know (20)

Metrics are Not Enough: Monitoring Apache Kafka / Gwen Shapira (Confluent)

Kafka at scale facebook israel

Putting Kafka Into Overdrive

Resilience Planning & How the Empire Strikes Back

Make It Cooler: Using Decentralized Version Control

Asynchronous programming using CompletableFutures in Java

Fault Tolerance in Distributed Environment

Production Ready Microservices at Scale

Benchmarking NGINX for Accuracy and Results

Best practices for highly available and large scale SolrCloud

Client Drivers and Cassandra, the Right Way

Tuning kafka pipelines

Expect the unexpected: Prepare for failures in microservices

Adding Real-time Features to PHP Applications

Design Review Best Practices - SREcon 2014

Continuous Delivery for the Rest of Us

CoAP Talk

Design Reviews for Operations - Velocity Europe 2014

Stream Processing @ Lyft

Project Sherpa: How RightScale Went All in on Docker

More from confluent

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...

confluent

Santander Stream Processing with Apache Flink

confluent

Unlocking the Power of IoT: A comprehensive approach to real-time insights

confluent

El Stream processing es un requisito previo de la pila de data streaming, que impulsa aplicaciones y pipelines en tiempo real. Permite una mayor portabilidad de datos, una utilización optimizada de recursos y una mejor experiencia del cliente al procesar flujos de datos en tiempo real. En nuestro taller práctico híbrido, aprenderás cómo filtrar, unir y enriquecer fácilmente datos en tiempo real dentro de Confluent Cloud utilizando nuestro servicio Flink sin servidor.

Workshop híbrido: Stream Processing con Flink

confluent

Our talk will explore the transformative impact of integrating Confluent, HiveMQ, and SparkPlug in Industry 4.0, emphasizing the creation of a Unified Namespace. In addition to the creation of a Unified Namespace, our webinar will also delve into Stream Governance and Scaling, highlighting how these aspects are crucial for managing complex data flows and ensuring robust, scalable IIoT-Platforms. You will learn how to ensure data accuracy and reliability, expand your data processing capabilities, and optimize your data management processes. Don't miss out on this opportunity to learn from industry experts and take your business to the next level.

Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...

confluent

La arquitectura impulsada por eventos (EDA) será el corazón del ecosistema de MAPFRE. Para seguir siendo competitivas, las empresas de hoy dependen cada vez más del análisis de datos en tiempo real, lo que les permite obtener información y tiempos de respuesta más rápidos. Los negocios con datos en tiempo real consisten en tomar conciencia de la situación, detectar y responder a lo que está sucediendo en el mundo ahora.

AWS Immersion Day Mapfre - Confluent

confluent

Eventos y Microservicios - Santander TechTalk

confluent

Q&A with Confluent Experts: Navigating Networking in Confluent Cloud

confluent

Citi TechTalk Session 2: Kafka Deep Dive

confluent

Traditional data pipelines often face scalability issues and challenges related to cost, their monolithic design, and reliance on batch data processing. They also typically operate under the premise that all data needs to be stored in a single centralized data source before it's put to practical use. Confluent Cloud on Amazon Web Services (AWS) provides a fully managed cloud-native platform that helps you simplify the way you build real-time data flows using streaming data pipelines and Apache Kafka.

Build real-time streaming data pipelines to AWS with Confluent

confluent

Q&A with Confluent Professional Services: Confluent Service Mesh

confluent

Citi Tech Talk: Event Driven Kafka Microservices

confluent

An in depth look at how Confluent is being used in the financial services industry. Gain an understanding of how organisations are utilising data in motion to solve common problems and gain benefits from their real time data capabilities. It will look more deeply into some specific use cases and show how Confluent technology is used to manage costs and mitigate risks. This session is aimed at Solutions Architects, Sales Engineers and Pre Sales, and also the more technically minded business aligned people. Whilst this is not a deeply technical session, a level of knowledge around Kafka would be helpful.

Confluent & GSI Webinars series - Session 3

confluent

Transforming applications built with traditional messaging solutions such as TIBCO, MQ and Solace to be scalable, reliable and ready for the move to cloud How can applications built with traditional messaging technologies like TIBCO, Solace and IBM MQ be modernised and be made cloud ready? What are the advantages to Event Streaming approaches to pub/sub vs traditional message queues? What are the strengeths and weaknesses of both approaches, and what use cases and requirements are actually a better fit for messaging than Kafka?

Citi Tech Talk: Messaging Modernization

confluent

Citi Tech Talk: Data Governance for streaming and real time data

confluent

Confluent & GSI Webinars series: Session 2

confluent

Vous apprendrez également à : • Créer plus rapidement des produits et fonctionnalités à l’aide d’une suite complète de connecteurs et d’outils de gestion des flux, et à connecter vos environnements à des pipelines de données • Protéger vos données et charges de travail les plus critiques grâce à des garanties intégrées en matière de sécurité, de gouvernance et de résilience • Déployer Kafka à grande échelle en quelques minutes tout en réduisant les coûts et la charge opérationnelle associés

Data In Motion Paris 2023

confluent

Confluent Partner Tech Talk with Synthesis

confluent

The Future of Application Development - API Days - Melbourne 2023

confluent

The Playful Bond Between REST And Data Streams

confluent

More from confluent (20)

Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...

Santander Stream Processing with Apache Flink

Unlocking the Power of IoT: A comprehensive approach to real-time insights

Workshop híbrido: Stream Processing con Flink

Industry 4.0: Building the Unified Namespace with Confluent, HiveMQ and Spark...

AWS Immersion Day Mapfre - Confluent

Eventos y Microservicios - Santander TechTalk

Q&A with Confluent Experts: Navigating Networking in Confluent Cloud

Citi TechTalk Session 2: Kafka Deep Dive

Build real-time streaming data pipelines to AWS with Confluent

Q&A with Confluent Professional Services: Confluent Service Mesh

Citi Tech Talk: Event Driven Kafka Microservices

Confluent & GSI Webinars series - Session 3

Citi Tech Talk: Messaging Modernization

Citi Tech Talk: Data Governance for streaming and real time data

Confluent & GSI Webinars series: Session 2

Data In Motion Paris 2023

Confluent Partner Tech Talk with Synthesis

The Future of Application Development - API Days - Melbourne 2023

The Playful Bond Between REST And Data Streams

Recently uploaded

[BuildWithAI] Introduction to Gemini.pdf

Sandro Moreira

How to Troubleshoot Apps for the Modern Connected Worker

ThousandEyes

Retrieval augmented generation (RAG) is the most popular style of large language model application to emerge from 2023. The most basic style of RAG works by vectorizing your data and injecting it into a vector database like Milvus for retrieval to augment the text output generated by an LLM. This is just the beginning. One of the ways that we can extend RAG, and extend AI, is through multilingual use cases. Typical RAG is done in English using embedding models that are trained in English. In this talk, we’ll explore how RAG could work in languages other than English. We’ll explore French, Chinese, and Polish.

Introduction to Multilingual Retrieval Augmented Generation (RAG)

Zilliz

CNIC Information System with Pakdata Cf In Pakistan

danishmna97

Keynote 2: APIs in 2030: The Risk of Technological Sleepwalk Paolo Malinverno, Growth Advisor - The Business of Technology Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...

apidays

Elevate Developer Efficiency & build GenAI Application with Amazon Q

Bhuvaneswari Subramani

In this keynote, Asanka Abeysinghe, CTO,WSO2 will explore the shift towards platformless technology ecosystems and their importance in driving digital adaptability and innovation. We will discuss strategies for leveraging decentralized architectures and integrating diverse technologies, with a focus on building resilient, flexible, and future-ready IT infrastructures. We will also highlight WSO2's roadmap, emphasizing our commitment to supporting this transformative journey with our evolving product suite.

Platformless Horizons for Digital Adaptability

WSO2

The value of a flexible API Management solution for Open Banking Steve Melan, Manager for IT Innovation and Architecture - State's and Saving's Bank of Luxembourg Apidays New York 2024: The API Economy in the AI Era (April 30 & May 1, 2024) ------ Check out our conferences at https://www.apidays.global/ Do you want to sponsor or talk at one of our conferences? https://apidays.typeform.com/to/ILJeAaV8 Learn more on APIscene, the global media made by the community for the community: https://www.apiscene.io Explore the API ecosystem with the API Landscape: https://apilandscape.apiscene.io/

Apidays New York 2024 - The value of a flexible API Management solution for O...

apidays

Discover the innovative features and strategic vision that keep WSO2 an industry leader. Explore the exciting 2024 roadmap of WSO2 API management, showcasing innovations, unified APIM/APK control plane, natural language API interaction, and cloud native agility. Discover how open source solutions, microservices architecture, and cloud native technologies unlock seamless API management in today's dynamic landscapes. Leave with a clear blueprint to revolutionize your API journey and achieve industry success!

WSO2's API Vision: Unifying Control, Empowering Developers

WSO2

MINDCTI Revenue Release Quarter One 2024

MIND CTI

Corporate and higher education. Two industries that, in the past, have had a clear divide with very little crossover. The difference in goals, learning styles and objectives paved the way for differing learning technologies platforms to evolve. Now, those stark lines are blurring as both sides are discovering they have content that’s relevant to the other. Join Tammy Rutherford as she walks through the pros and cons of corporate and higher ed collaborating. And the challenges of these different technology platforms working together for a brighter future.

Corporate and higher education May webinar.pptx

Rustici Software

ICT role in 21st century education and its challenges

rafiqahmad00786416

DBX First Quarter 2024 Investor Presentation

Dropbox

Understanding the FAA Part 107 License ..

Christopher Logan Kennedy

Strategies for Landing an Oracle DBA Job as a Fresher

Remote DBA Services

Exploring Multimodal Embeddings with Milvus

Zilliz

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...

Zilliz

Following the popularity of "Cloud Revolution: Exploring the New Wave of Serverless Spatial Data," we're thrilled to announce this much-anticipated encore webinar. In this sequel, we'll dive deeper into the Cloud-Native realm by uncovering practical applications and FME support for these new formats, including COGs, COPC, FlatGeoBuf, GeoParquet, STAC, and ZARR. Building on the foundation laid by industry leaders Michelle Roby of Radiant Earth and Chris Holmes of Planet in the first webinar, this second part offers an in-depth look at the real-world application and behind-the-scenes dynamics of these cutting-edge formats. We will spotlight specific use-cases and workflows, showcasing their efficiency and relevance in practical scenarios. Discover the vast possibilities each format holds, highlighted through detailed discussions and demonstrations. Our expert speakers will dissect the key aspects and provide critical takeaways for effective use, ensuring attendees leave with a thorough understanding of how to apply these formats in their own projects. Elevate your understanding of how FME supports these cutting-edge technologies, enhancing your ability to manage, share, and analyze spatial data. Whether you're building on knowledge from our initial session or are new to the serverless spatial data landscape, this webinar is your gateway to mastering cloud-native formats in your workflows.

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Safe Software

MS Copilot expands with MS Graph connectors

Nanddeep Nachan

Whatsapp Number Escorts Call girls 8617370543 Available 24x7 Mcleodganj Call Girls Service Offer Genuine VIP Model Escorts Call Girls in Your Budget. Mcleodganj Call Girls Service Provide Real Call Girls Number. Make Your Sexual Pleasure Memorable with Our Mcleodganj Call Girls at Affordable Price. Top VIP Escorts Call Girls, High Profile Independent Escorts Call Girls, Housewife Women Escorts Call Girl, College Girls Escorts Call Girls, Russian Escorts Call girls Service in Your Budget.

Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model

Deepika Singh

Recently uploaded (20)

[BuildWithAI] Introduction to Gemini.pdf

How to Troubleshoot Apps for the Modern Connected Worker

Introduction to Multilingual Retrieval Augmented Generation (RAG)

CNIC Information System with Pakdata Cf In Pakistan

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...

Elevate Developer Efficiency & build GenAI Application with Amazon Q

Platformless Horizons for Digital Adaptability

Apidays New York 2024 - The value of a flexible API Management solution for O...

WSO2's API Vision: Unifying Control, Empowering Developers

MINDCTI Revenue Release Quarter One 2024

Corporate and higher education May webinar.pptx

ICT role in 21st century education and its challenges

DBX First Quarter 2024 Investor Presentation

Understanding the FAA Part 107 License ..

Strategies for Landing an Oracle DBA Job as a Fresher

Exploring Multimodal Embeddings with Milvus

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

MS Copilot expands with MS Graph connectors

Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model

URP? Excuse You! The Three Metrics You Have to Know

1. URP? Excuse You! Todd Palino Senior Staff Engineer, Site Reliability LinkedIn

2. • What is Kafka • Encyclopedia of Monitoring • Automation What This Talk Is Not

3. Why Talk About Monitoring?

4. Messages per Day at LinkedIn

5. What is Monitoring (not)?

6. Monitoring is not Alerting • Collect everything • Alert on nothing • Events are better than metrics • Tests are better than alerts • Sleep is best in life

7. • What’s an SLA? • Availability • Latency • Customer Guarantees Service Level Objectives

8. Key Kafka Metrics

9. The Three Metrics You Need to Know Partitions that are not fully replicated within the cluster URP The overall utilization of an Apache Kafka broker Request Handlers How long requests are taking, in which stage of processing Request Timing

10. Under-Replicated Partitions • Highly discussed • Overall cluster health • Replication is a consumer and producer

11. Under-Replicated Partitions EXAMPLE: FAILED BROKER

12. Under-Replicated Partitions EXAMPLE: CONSUMER PROBLEMS

13. Under-Replicated Partitions EXAMPLE: PRODUCER PROBLEMS

14. Under-Replicated Partitions • Overrated • Doesn’t map to SLO • Often not actionable • Collect, but don’t alert

15. Everybody In The Pool • Specialized thread pools • Clients deal with network and request pools • Request handlers do most of the work

16. Request Handlers • Decode and validate • Perform task • Wait for other brokers • Assemble response

17. Request Handler Problems • Anything that causes Kafka to expend CPU cycles • Includes problems related to failing disks (IO wait) • SSL and compression work both can use a lot of CPU CPU Time Timeout Deadlock • Most often due to failing to process controller requests • Intra-cluster requests tend to be bound by partition counts • Rapidly starves the pool of threads • Should always be a code bug • Usually looks exactly like a timeout problem • Rare, but hard to identify

18. Request Handler Problems EXAMPLE: TIMEOUT OR DEADLOCK

19. Request Handler Problems • Anything that causes Kafka to expend CPU cycles • Includes problems related to failing disks (IO wait) • SSL and compression work both can use a lot of CPU CPU Time Timeout Deadlock • Most often due to failing to process controller requests • Intra-cluster requests tend to be bound by partition counts • Rapidly starves the pool of threads • Should always be a code bug • Usually looks exactly like a timeout problem • Rare, but hard to identify

20. Brokers Don’t Do Compression

21. Brokers Don’t Shouldn’t Do Compression • Kafka brokers are running a new version • Message format has been set to the new version • Clients haven’t upgraded Up Conversion Down Conversion • Kafka brokers are running a new version • Message format is set to an older version due to clients • Producer clients update to new version

22. Request Timing • Remote – Waiting for other brokers • Response Queue – Waiting to send • Response Send - Send to client • Total – Request handling, end to end • Request Queue – Waiting to process • Local – Work local to the broker

23. Request Timing EXAMPLE: PRODUCE TOTAL TIME

24. Request Timing EXAMPLE: PRODUCE LOCAL TIME

25. Request Timing EXAMPLE: PRODUCE REMOTE TIME

26. Thank you?

27. What’s Missing?

28. Availability Monitoring • SLO, part 2 • Measured externally • Client focused • github.com/linkedin/kafka-monitor

29. Operating System And Hardware Metrics • What do they mean? • What application is causing it? • Don’t alert unless: • 100% clear signal • 100% clear response

30. Capacity Planning • Plan in advance • Multi-factor • Don’t alert for capacity

31.

32. Capacity Metrics • Request Handler Idle Ratio • Disk Utilization • Partition Count • Network Utilization

33. Wrapping Up

34. If You Remember Nothing Else… • Define your service level objectives • Monitor your service level objectives • Metrics that cover many problems are noisy • Buy Kafka: The Definitive Guide

35. Getting (and Giving) Help • Kafka Monitor • https://github.com/linkedin/kafka-monitor • Burrow • https://github.com/linkedin/Burrow • Cruise Control • https://github.com/linkedin/cruise-control • kafka-tools • https://github.com/linkedin/kafka-tools LinkedIn Open Source Get Involved • Community • users@kafka.apache.org • dev@kafka.apache.org • Bugs and Work: • https://issues.apache.org/jira/projects/KAFKA

36. Thank you

URP? Excuse You! The Three Metrics You Have to Know

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to URP? Excuse You! The Three Metrics You Have to Know

Similar to URP? Excuse You! The Three Metrics You Have to Know (20)

More from confluent

More from confluent (20)

Recently uploaded

Recently uploaded (20)

URP? Excuse You! The Three Metrics You Have to Know