A consumer proxy that solves head-of-line blocking for
Kafka consumers
1
Hey everyone, and welcome to our presentation. I'm Aashish and together with Aryan, Jordan, and Michael - our team built Triage, a consumer proxy that solves head-of-line blocking for Kafka consumers.
Overview
1. Microservices & Event-Driven Architecture (EDA)
2. Message Queues & Apache Kafka
3. Head-of-Line Blocking
4. Existing Solutions
5. Introducing Triage
6. Triage Design Challenges
7. Future Work
8. Q&A
2
Here’s a quick overview of what you can expect. First, we’ll address the larger context of microservices and event-driven architecture. From there, we’ll take a look at
message queues and focus on Apache Kafka, with a few details on how it works. Next, we’ll examine the problem of head-of-line blocking and its consequences, after
which we’ll share our research on some existing solutions. At that point, we’ll present Triage and our approach to solving head-of-line blocking, along with some
interesting design challenges we faced. We’ll end with some ideas for future work, and leave some room for a Q&A. We’re excited to show you what we built so let’s get
started!
“63% of enterprises have adopted
microservice architectures, and it’s
only expected to grow in the
coming decade.”
3
Microservice architecture has really gained in popularity over the last decade, and in 2020 it was estimated that over 63% of enterprises had adopted microservices and were satisfied with the tradeoffs.
4
Shopping
App
API Logic
DB
Orders Microservice
API Logic
DB
Products Microservice
API Logic
DB
Stock Microservice
Here’s an example of a microservice architecture for a shopping app.
The takeaway here is to notice how the services are isolated into separate pieces. The orders, products, and stock inventory services all have their own logic and data
stores, and the shopping app can communicate with all of them.
What do microservices offer?
1. Development work can occur in parallel
2. Scalability becomes easier
3. Polyglot environment
5
Since services can be decoupled in this way, work can be done in parallel, which leads to faster development times. Additionally, there's a benefit in the ability to take individual components and scale them independently.
Often, multiple technologies and programming languages are used in these setups, which is known as a polyglot microservice environment. Given the use of these different languages, an important question is:
“How do we successfully
achieve intra-system
communication?”
6
How do we successfully achieve the required intra-system communication, for the system to function properly? One option is to use a request-response model, which is
commonly used on the web.
Request
Response
7
Request
Response
Request
Response
Imagine a number of interconnected microservices where services can send a request, and wait for responses. The issue is that if a single service in this chain experiences a
slowdown, the request lifecycle of any connected service will also be delayed.
To overcome this problem, a common choice is to implement an event-driven architecture, or an EDA.
EDAs are centered around events -
which are changes in state - or
notifications about a change.
8
EDAs are centered around events, which can be thought of as changes in state, or notifications about a change.
In an EDA, services can operate
independently without concern for
the state of any other service.
9
The key here is that services can operate independently without concern for the state of any other service.
10
Event-Driven Architecture
The service on the left can communicate with all 3 services on the right, independently. This architecture bypasses the problem where a delayed service causes a
slowdown throughout the entire system.
In order to achieve this decoupling, EDAs can be implemented using message queues.
Message Queue Functionality
Queue
Producer
11
Consumer
Here we have two producers to the left of the message queue. These applications write events to the queue. The consumer, which is to the right, reads these events off of the queue.
Traditional message queues:
events are read and then
removed.
Log-based message queues:
events are persisted on a log.
12
In traditional message queues, events are read and then removed from the queue. An alternative approach is to use log-based message queues. Here, all the events are
persisted on a log so you don’t lose them once they’re read.
13
Powered by
Among log-based message queues, Kafka is the most popular - over 80% of Fortune 100 companies across industries use it as part of their architecture.
What does Kafka offer?
•Scalability
•Parallelism
•Decoupling
14
Kafka is designed for scalability and parallelism, and it maintains the intended decoupling of an EDA. It’s worth taking a look at what’s unique about Kafka and how it
works.
In Kafka, events are
called messages.
15
In the context of Kafka, events are called messages and this is how we’ll refer to them.
Topic
Kafka
Partition 2
Partition 1
16
In this image, messages are grouped using a named identifier - called a topic.
Kafka achieves scalability by writing all the messages of a topic to partitions.
So in this example, messages in a single topic are written to two different partitions.
17
Topic 1
Topic 2
Partition 2
Consumer Group A
Consumer 1
Consumer 2
Consumer 3
Consumer 4
Producer 1
Producer 2
Partition 1
Kafka
Partition 2
Partition 1
Consumer Group B
If we add the other pieces of the architecture, it’ll look something like this.
Producers, seen on the left, write messages to a topic.
Consumers, on the right, are organized into groups with a group ID. If a consumer wants to read messages, it can subscribe to a specific topic; then, individual consumer instances can read messages from a partition.
Want more scalability?
Add more partitions.
18
Need more parallelism?
Use consumer instances.
To achieve more scalability, you could simply increase the number of partitions per topic.
Additionally, the use of multiple consumer instances means that messages can be processed in parallel.
While a consumer instance can
consume from more than one
partition, a partition can only be
consumed by a single consumer
instance.
19
It is important to note that while a consumer instance can consume from more than one partition, a partition can only be consumed by a single consumer instance. In other words, two different consumer instances can't consume from the same partition.
Kafka commits
20
• Offset: A number that indicates the position of the message in the queue.
• A consumer periodically commits offsets back to Kafka to acknowledge the last message it successfully processed.
• In case of a crash, Kafka will remember where to resume message delivery from.
Kafka uses commits to know which messages have been successfully processed.
The way this works is that every message on a Kafka partition has an offset - this is a number that indicates the position of the message in the queue. Think of it like an index in an array.
A consumer periodically commits offsets back to Kafka, indicating the last message it successfully processed. If a consumer instance crashes, Kafka will remember where to resume message delivery from.
21
Offset 48 49 50 51
Producer
Consumer
Last Committed Offset
Kafka
Here, once the consumer commits offset #50, Kafka knows that the messages from 48-50 have all been successfully processed. The consumer can continue consuming before it commits the next offset.
1. Producers write messages to a specific topic.
2. Kafka routes these messages to partitions.
3. Consumers subscribe to a specific topic to receive messages and commit offsets.
4. Each partition in a topic can only be consumed by one consumer instance.
22
Recap
To recap, producers write messages to a specific topic. Kafka then routes these messages to partitions.
Consumers subscribe to a topic to receive messages and commit offsets.
Each partition in a topic can only be consumed by one consumer instance.
Overview
23
1. Microservices & Event-Driven Architecture (EDA)
2. Message Queues & Apache Kafka
3. Head-of-Line Blocking
4. Existing Solutions
5. Introducing Triage
6. Triage Design Challenges
7. Future Work
8. Q&A
Now that we've shown the larger context, Jordan from our team will explain the problem of head-of-line blocking in message queues.
Head-of-Line Blocking
24
A real-world example of head-of-line blocking that we are all likely familiar with is when you're at the supermarket and the person at the front of the line is taking a long time to finish paying. Perhaps they're trying to use expired coupons, or have multiple fruits each with their own ID, or they're trying to pay with bitcoin. It slows down the entire line, and everyone behind them has to wait.
Head-of-Line Blocking - Message Queues
25
Processing
in
Progress
Message queues can also suffer from head-of-line blocking. In this example, there are four messages. The first green message is processed quickly.
Animation
The orange one though takes longer to process, and crucially, while it’s being processed, all of the other messages have to wait.
Animation
Once the slow message is processed, the rest of the queue can proceed.
Animation
26
Poison Pills
Non-Uniform Consumer Latency
There are two major causes of head-of-line blocking when it comes to message queues. The first is poison pills.
Head-of-Line Blocking - Poison Pills
27
In this example, the circles are regular messages and the skull and crossbones represents a poison pill.
A poison pill message is one that the consumer does not know how to handle. For example, if the application developer is expecting an order quantity as an integer but
receives one as a string, and has not written error handling to handle this scenario, the application may crash. This will prevent processing of all of the messages behind
the poison pill message in the queue.
The first message is consumed quickly.
Animate
but the poison pill message crashes the consumer application.
Animate
No further messages can be processed.
Head-of-Line Blocking - Non-Uniform Consumer Latency
Orange Service
Green Service
28
Processing
in
Progress
The second main cause of head-of-line blocking is non-uniform consumer latency.
Suppose we have a consumer application that calls one of two external services depending on the content of a message… for green messages the application calls the
green service and for orange messages the application calls the orange service.
The first message is processed normally since the green service is healthy.
Trigger Animation
Now imagine that the orange service is slower than usual to respond, perhaps due to network issues.
Trigger Animation
This means that the processing of all the messages in the queue is slowed, even though the green messages have nothing to do with the orange service. The messages are not able to be processed until the orange service completes. Once the orange service finishes, the block is lifted, and the rest of the messages can be processed.
Trigger Animation
Solution Requirements
Polyglot
Data Loss Prevention
29
Open Source
Handle
Poison Pills
Handle Non-Uniform Consumer Latency
In determining our desired approach to solving head-of-line blocking, we decided on five solution requirements.
The first two were handling the two main causes of head-of-line blocking.
The third requirement was that data loss was prevented. A naive way of handling head-of-line blocking would be to just drop messages that are causing it. This might be appropriate for non-critical scenarios, such as tracking likes on social media, where it's not critical that every like is captured. However, for critical situations such as those involving orders, it is crucial that every order is captured; otherwise potential revenue may be lost. We wanted a solution that could prevent data loss.
The fourth requirement was that the potential solution could be easily integrated into polyglot microservice environments.
Lastly, the fifth requirement was that the potential solution would be open source and easily available to developers.
Overview
30
1. Microservices & Event-Driven Architecture (EDA)
2. Message Queues & Apache Kafka
3. Head-of-Line Blocking
4. Existing Solutions
5. Introducing Triage
6. Triage Design Challenges
7. Future Work
8. Q&A
With these solution requirements in mind, we’ll now look at the existing solutions that we found that addressed head-of-line-blocking…
Existing Solutions
31
1. Confluent Parallel Consumer
2. DoorDash's Worker Model
3. Uber’s Consumer Proxy
- The three solutions we found were Confluent Parallel Consumer, DoorDash's Worker Model, and Uber's Consumer Proxy.
Existing Solutions Comparison
32
Polyglot
Data Loss
Prevention
Open
Source
DoorDash
Kafka
Workers
Uber
Consumer
Proxy
Confluent Parallel Consumer
Handles
Poison Pill
Handle Non-Uniform Consumer Latency
- Confluent Parallel Consumer fixes head-of-line blocking caused by both poison pills as well as non-uniform consumer latency. But it doesn't have a way to store poison pill messages, and since we cannot tolerate data loss, this solution was not viable for our use case. Also, their library is written in Java, meaning developers would have to write their applications in Java as well; this was counter to our goal of finding a solution that worked well in a polyglot environment.
- While using Kafka, DoorDash experienced spikes in latency in their consumer applications. Individual slow messages were causing delayed processing for all messages in a given partition - a real-world example of non-uniform consumer latency. To address this, they introduced something they called "Kafka Workers". This solution, however, failed to address poison pills, and with no mechanism to prevent data loss, it was insufficient.
- Lastly, Uber’s Consumer Proxy solves head-of-line blocking resulting from both poison pills and from non-uniform consumer latency - Poison pills are handled without
data loss, and non-uniform consumer latency is addressed by parallel consumption of messages. Uber built Consumer Proxy as its own piece of infrastructure in order
to work well in polyglot environments. However, as an in-house solution, it is not available for us or other developers to use.
Overview
1. Microservices & Event-Driven Architecture (EDA)
2. Message Queues & Apache Kafka
3. Head-of-Line Blocking
4. Existing Solutions
5. Introducing Triage
6. Triage Design Challenges
7. Future Work
8. Q&A
33
- Given that none of the existing solutions fit all of our requirements, we decided to build Triage. Next, Aryan will discuss what Triage is, and how it handles both causes of head-of-line blocking.
What is Triage?
34
Kafka
Cluster
Consumer
Application
Thanks, Jordan… Triage acts as a proxy for consumer applications. It ingests messages from the Kafka Cluster and sends
them to downstream consumer applications.
Triage Instance
Triage at a high level
35
Partition Application Logic
DynamoDB
Instance
Consumer
Application
Partition
Partition
Kafka Topic
- Here’s a high-level view of a Triage instance in the cloud.
- Triage consumes from a single partition, just like any other Kafka consumer.
- Triage's functionality consists of the application logic, running in an AWS container, and a DynamoDB instance.
- Problematic messages are stored in Dynamo for examination at a later time
- This pattern is known as the "dead-letter pattern"
Messages
36
Dead
Letter
Store
Dead-Letter Pattern
- In dead-letter patterns, problematic messages (referred to as dead letters) are removed from the consumer application and
persisted to an external data store for later processing.
Partition
Commit Tracker - Overview
37
Triage Application Logic
msg ack/nack
ack
nack
ack
ack
Consumer
Instance
Consumer
Instance
Consumer
Instance
To manage commits back to Kafka, Triage uses an internal system of acknowledgements with a component we call Commit
Tracker.
Consumers can send an “ack”, a positive acknowledgement, back to Triage, indicating that a message was successfully
processed or a “nack”, a negative acknowledgement, to indicate a poison pill message.
Commit Tracker - Ack/Nack
38
[Slide: Commit Tracker hashmap mapping offsets 0-9 to acked? flags. Entries flip from false to true as acks arrive; a nacked entry flips only after the message is stored in the dead-letter store.]
Commit Tracker
Using the Commit Tracker, Triage can calculate which offsets to commit back to Kafka. This ensures that the health of the partition is maintained.
Let's take a look at how Commit Tracker works, since it's central to the functionality of Triage.
Triage first ingests a large batch of messages and stores them in a hashmap.
TRIGGER ANIMATION
The keys of the hashmap are the message offsets and the values are a custom struct with two fields: the message itself and a boolean, indicating whether it has been acknowledged.
As Triage receives "acks" from consumers, we update the commit hash accordingly.
TRIGGER ANIMATION
When a message is "nacked", however, we cannot update the commit hash immediately.
TRIGGER ANIMATION
We must first ensure the message has been successfully written to our dead-letter store, which is a DynamoDB table, and only then do we update the commit hash.
TRIGGER ANIMATION
Next, the rest of the messages are processed by the consumers, including one, the orange message, that takes a long time
to be processed by the consumer. As a result, the faster green messages are processed and acked before the orange one
is.
TRIGGER ANIMATION
39
Commit Tracker - Commit Calculator
[Slide: Commit Calculator scans the commit hashmap and commits offset 5 back to Kafka - the greatest offset for which it and all lower offsets are acknowledged.]
It's important to note that since we always wait for confirmation from Dynamo before updating the commit hash, at this point whether a message has been "acked" or "nacked" isn't important - we only want to know that a message has been acknowledged in some way.
So, how do we calculate which offset to commit back to Kafka? We want to commit as many offsets as we can, so we need to find the greatest committable offset.
Periodically, a component called "Commit Calculator" runs in the background. It checks the commit hash to see the greatest offset with a value of true, for which all lower offsets also have a value of true.
TRIGGER ANIMATION
Triage can then commit this offset back to Kafka.
TRIGGER ANIMATION
Once we receive confirmation from Kafka that the commit was successful, we can then delete all entries up to and including that offset from Commit Tracker, since they're no longer needed.
TRIGGER ANIMATION
How Triage Solves Head-of-Line Blocking
40
With this understanding of Commit Tracker and the core functionality of Triage, let's take a look at how we solve Head of
Line Blocking due to both Poison Pills and Non-Uniform Consumer Latency
msg ack/nack
nack
ack
DynamoDB
Dead Letter Store
ack
ack
How Triage Solves Poison Pills
41
Let's start with Poison Pills - here we can see a consumer application receiving a poison pill message.
- Trigger Animation
Consumer applications can tell Triage that the message they've received is a poison pill by sending a "nack".
- Trigger Animation
Triage sends that message to a DynamoDB table, so that it can be handled at a later time. This frees up the consumer to
continue processing messages.
- Trigger Animation
How Triage Solves Non-Uniform Consumer Latency
42
To address non-uniform consumer latency, Triage enables the parallel consumption of messages from a single partition.
Here, we have two instances of a single consumer application that rely on one of two external services based on the
contents of a message. For orange messages, the application calls the orange external service; for greens, the green
service.
- Trigger Animation
Here, you can see that because the orange service is slow, the consumer instance at the top is taking an unusually long time
to process a message.
- Trigger Animation (TALK OVER)
Because of the one-to-many pattern enabled by Triage, healthy consumer instances are able to continue consumption, so
the queue keeps moving.
Overview
43
1. Microservices & Event-Driven Architecture (EDA)
2. Message Queues & Apache Kafka
3. Head-of-Line Blocking
4. Existing Solutions
5. Introducing Triage
6. Triage Design Challenges
7. Future Work
8. Q&A
Now that you know how Triage solves head-of-line blocking, Mike will cover some of the challenges that we faced when
building Triage as well as our plans for some improvements we'd like to build out.
Triage Design Challenges
• Achieving Parallel Consumption via Concurrency
• Polyglot Support
• Ease of Deployment
44
Based on our requirements and our intended design for Triage, there were three notable challenges that we'd like to discuss.
- Achieving Parallel Consumption via Concurrency,
- Polyglot Support,
- and Ease of Deployment.
For each of these challenges, I'll talk a little about them and discuss our respective solutions.
Let's start with parallel consumption via concurrency
Parallel Consumption
Kafka Partition
45
We need a one-to-many relationship between Triage and instances of a consumer application to solve head-of-line blocking caused by non-uniform consumer latency.
Challenge: Achieving Concurrency
46
Our solution was to write the application logic of Triage in Go. Go is designed with concurrency in mind via what are called Goroutines.
We can think of Goroutines as lightweight, concurrently executing functions managed by the Go runtime. Several - think thousands - can run in the background with very little resource overhead.
Challenge: Parallel Consumption
Solution: Go & Goroutines
Goroutine C
Goroutine B
Goroutine A
47
Triage
Within Triage, we run a dedicated Goroutine for each downstream consumer instance. These Goroutines pull messages and send them to consumer instances, allowing
us to consume from a single partition in parallel.
Concurrency in Triage
48
Concurrency via Go also allowed us to implement Triage as a single application. Each major component of Triage exists as a Goroutine, and these in turn spawn other Goroutines. We achieved communication across these Goroutines using channels.
Challenge: Achieving Concurrency
49
Goroutine 1 Goroutine 2
Channel
Channels are strongly-typed, queue-like structures. Goroutines can place messages on a channel for other Goroutines to receive.
Importantly, when a message is received, it is removed from the channel.
Challenge: Achieving Concurrency
50
Goroutine 2
Goroutine 1
Goroutine 3
Goroutine 5
Goroutine 4
Goroutine 6
Channel
Because messages are removed, we can have multiple senders and receivers without worrying about unintended data duplication.
Animate
Concurrency in Triage
51
Fetcher
Commit Tracker
Consumer Manager
messagesChannel
newConsumersChannel
Dispatch
senderRoutine A
senderRoutine B
Connection Request
Let's take a look at some of the major components of Triage and how we take advantage of concurrency.
At a high level, we need a process to continually ingest messages from Kafka - this Goroutine is called Fetcher, in blue. It then needs to pipe these messages via the
"messages channel" to a Goroutine called Dispatch and write them to our Commit Tracker in green.
While all this is happening, we need another process to listen for incoming connection requests from consumer instances - we call this Goroutine "Consumer Manager".
When it receives a request, after authenticating it, Consumer Manager places the network address of the consumer instance onto a "newConsumers" channel.
When Dispatch receives a network address via this channel, it creates yet another Goroutine called "senderRoutine" that pulls messages from the messagesChannel.
These senderRoutines, as their names imply, send messages to their respective consumer instances.
52
Connection Request
Dynamo
DB
Fetcher
Commit
Tracker
Consumer
Manager
messagesChannel
newConsumersChannel
Dispatch
senderRoutine A
senderRoutine B
Commit
Calculator
commitsChannel
messages
commits
acknowledgementsChannel
Filter
Reaper
deadLettersChannel
consumerRoutine
Triage Application Logic
committerRoutine
Zooming out a little bit really hammers home the benefits we gain from concurrency. All of the components inside Triage that you can see on the screen are Goroutines, many of which rely on other Goroutines.
While implementing all of this functionality without Go is certainly possible, Go made it very intuitive for us, cementing it as the correct language for the job.
Triage Design Challenges
• Achieving Parallel Consumption via Concurrency
• Polyglot Support
• Ease of Deployment
53
The next challenge we faced was polyglot support.
Challenge: Polyglot Support
Java Consumer
Go Consumer
NodeJS Consumer
54
Kafka
Cluster
Triage Instances
As you can see on the right side of the diagram, we needed Triage to be able to support consumer applications written in a host of different languages.
Challenge: Polyglot Support
• Solution:
• Implementation: Service + Thin Client Library
• Network Protocol: gRPC
55
Our solution was to implement Triage as a service coupled with a thin client library, in addition to our choice of gRPC as our primary network communication protocol.
Service vs Client Library
56
Before choosing our implementation model, we considered both a pure client library and pure service approach.
Potential Client Library Implementation
Consumer Application
57
Kafka
Cluster
A potential pure client library implementation would have all the application logic of Triage exist as imported code within the consumer application.
This comes with the benefit of not having to introduce new pieces of infrastructure to a user's system, and it makes testing Triage simpler. But supporting additional languages would require a complete rewrite of Triage.
Maintaining Triage would also be pretty difficult, since any change to a system's Kafka version would require updating all versions of Triage.
We considered these to be poor tradeoffs.
An alternative would be to implement Triage as a service.
Service Implementation
58
Kafka
Cluster
Triage Service Consumer Application
With the pure service approach, Triage would act as a piece of infrastructure that sits between the Kafka cluster and consumer applications. This allows us to avoid the
aforementioned cons of a client library implementation, but we still wanted to make connecting to Triage simple for developers.
Challenge: Polyglot Support
Solution: Service + Thin Client Library
59
Triage Service
Kafka
Cluster
Consumer Application
Triage
Client
We decided on a hybrid approach.
The core application logic of Triage exists on a container running in AWS.
Consumer applications use a thin client library to manage communicating with Triage.
This lightweight client exists within each instance of a consumer application.
It provides convenience methods for sending an initial connection request and exposing an endpoint to receive messages from Triage.
Multi-language Support
60
Kafka
Cluster
Triage Service
Triage Client
Triage Client
Triage Client
While we don't gain the full language agnosticism that a pure service approach might offer, building out multi-language support only requires us to rewrite our simple client library in another language.
1. Sends an initial HTTP request to Triage to request a
connection.
2. Runs a gRPC server to receive messages from Triage.
61
The client library:
Ultimately, the client library only
1) sends an initial HTTP request to Triage to request a connection
and
2) runs a gRPC server to receive messages.
Because it's operationally very simple, rewriting the client library is far more manageable than rewriting Triage in its entirety.
Machine A
Machine B
doWork()
code execution
gRPC - an RPC Framework
Machine A
doWork()
code execution
62
Local Procedure Call Remote Procedure Call
To manage the communication between Triage and consumer instances, we chose gRPC as a network protocol, primarily for the ease with which we could build out multi-language support. I think it's helpful to talk a little bit about what gRPC is.
gRPC is an RPC framework, created by Google, where RPC stands for remote procedure call. We can think of procedure calls as simple function calls or invocations.
With a local procedure call, everything exists on a single host machine. In the figure on the left, the function "doWork" is executed on Machine A, resulting in code being executed on Machine A.
Remote procedure calls, however, allow us to execute code on a different machine. In the figure on the right, "doWork" is being called on Machine A, resulting in code being executed on Machine B.
gRPC & Triage
63
gRPC Client
processMessage(message)
Triage
Consumer Instance
gRPC Server
code execution
It's helpful to understand that gRPC uses the same client-server model that we're familiar with.
With Triage, the Triage container acts as a gRPC client, and calls "processMessage()", with the message as an argument.
The consumer instance runs a gRPC server that listens for this procedure call. It then executes code to process the message before sending a response, the "ack" or
"nack" we've talked about before, back to Triage.
64
Code Generation
gRPC Server gRPC Server
gRPC Server
• function name
• Parameters
• Return value
gRPC Service Definition
The biggest reason we decided on gRPC is its code generation feature.
Using what's called a gRPC service definition, client and server implementations can be automatically generated in all major programming languages.
Creating a gRPC service definition is pretty straightforward. You simply define a function interface - that is, what is the name of the function, what parameters does it have, and what does it return.
Because the most complicated part of building the Triage client library is handled for us via this code generation, we can write support for other languages with relative ease.
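For illustration, a service definition of the shape described - a function name, its parameters, and its return value - might look like the following. Triage's actual .proto file is not shown in this talk, so every name here is hypothetical:

```protobuf
syntax = "proto3";

package triage;

// Hypothetical service definition: Triage acts as the gRPC client and
// calls ProcessMessage on the consumer instance's gRPC server.
service Consumer {
  rpc ProcessMessage (Message) returns (Acknowledgement);
}

message Message {
  int64 offset  = 1;
  bytes payload = 2;
}

message Acknowledgement {
  bool ack = 1; // true = ack, false = nack (poison pill)
}
```

From a definition like this, the protoc toolchain generates client and server stubs in each supported language.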
Triage Design Challenges
• Achieving Parallel Consumption via Concurrency
• Polyglot Support
• Ease of Deployment
65
The final challenge we faced was making Triage easy to deploy for application developers.
Challenge: Ease of Deployment
• Solution:
• AWS CDK
• Triage CLI
66
Our solution was to create an automated deployment script using AWS's Cloud Development Kit. Developers can use our command line tool, Triage CLI, to easily deploy
Triage to AWS using this CDK script.
Challenge: Ease of Deployment
67
Kafka Topic ECS
Partition
Partition
Partition
Solution: AWS CDK
Because Triage operates on a per-partition basis, we needed to deploy a container running Triage for each partition in a given Kafka topic. To do this, we used Elastic Container Service, specifically with Fargate as our deployment vehicle.
With ECS, we can define a minimum number of Triage containers running at any given time - were one to crash for some reason, another would be provisioned to replace it automatically.
Using Fargate means management of individual compute resources is abstracted away from our users and allows them to think only about containers.
The key for us was that by using CDK, we could write a reusable script to deploy Triage containers via Fargate. That said, we still needed to answer the question of how to interpolate user-specific information, such as Kafka authentication credentials, into Triage during deployment.
Challenge: Ease of Deployment
Solution: Triage CLI
• triage init
• triage deploy
• Triage network address
• Authentication Key
68
To do so, we created a command line tool called Triage CLI. It can be downloaded as an NPM package and features a 2-step deployment process.
triage init installs any necessary dependencies for deployment and generates a configuration file where developers can supply authentication and Kafka-specific information.
triage deploy interpolates the data in this configuration file into the CDK script. It also creates an internal config file used by individual Triage containers. It then deploys these containers to AWS.
Finally, it returns the network address and authentication key needed for consuming applications to connect to Triage. Using Triage CLI, a developer can leverage our CDK script to deploy Triage to the cloud.
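The two-step flow above might look like the following sketch. The config schema is hypothetical - the talk doesn't show the real file - so the keys and environment-variable names below are assumptions for illustration only.

```typescript
// Illustrative sketch of the two-step CLI flow: `triage init` generates a
// config template; `triage deploy` interpolates the filled-in values into
// the context each Triage container receives. Keys are hypothetical.

interface TriageConfig {
  bootstrapServers: string; // Kafka broker addresses
  topic: string;            // topic Triage will consume
  saslUsername: string;     // Kafka authentication credentials
  saslPassword: string;
}

// Step 1 (`triage init`): emit an empty template for the developer to fill in.
function initConfig(): TriageConfig {
  return { bootstrapServers: "", topic: "", saslUsername: "", saslPassword: "" };
}

// Step 2 (`triage deploy`): turn the config into the environment variables
// passed to each container by the CDK script.
function toContainerEnv(cfg: TriageConfig): Record<string, string> {
  return {
    KAFKA_BOOTSTRAP_SERVERS: cfg.bootstrapServers,
    KAFKA_TOPIC: cfg.topic,
    KAFKA_SASL_USERNAME: cfg.saslUsername,
    KAFKA_SASL_PASSWORD: cfg.saslPassword,
  };
}

const cfg = { ...initConfig(), bootstrapServers: "broker:9092", topic: "orders" };
console.log(toContainerEnv(cfg).KAFKA_TOPIC); // orders
```

Keeping the developer-facing config separate from the generated container config is what lets the same CDK script be reused across deployments.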
Triage Design Challenges
• Achieving Parallel Consumption via Concurrency
• Polyglot Support
• Ease of Deployment
69
Having solved these major challenges, we were able to build Triage without compromising on any of our design requirements. For a more in-depth exploration of how Triage works and its implementation details, check out our write-up, linked in the Zoom meeting description.
Overview
70
1. Microservices & Event-Driven Architecture (EDA)
2. Message Queues & Apache Kafka
3. Head-of-Line Blocking
4. Existing Solutions
5. Introducing Triage
6. Triage Design Challenges
7. Future Work
8. Q&A
Before we open up for questions, we'd like to cover some features we plan to add.
Future Work
1. Extend client library language support
2. Cause of Failure for Dead Letter Table
3. Dead Letter Notifications
71
We'd first like to build out additional language support for our thin client library. As we've discussed, doing so shouldn't be difficult, since the majority of the work is done for us via gRPC code generation. Supporting other popular languages like JavaScript or Ruby would help us serve more developers.
We'd also like to add a cause-of-failure column to the table that stores dead-letter messages - it would contain failure reasons that developers could supply when sending a "nack" back to Triage for poison pills. This would aid in analyzing and remedying faulty messages.
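The proposed column could be sketched like this. This is a minimal illustration under stated assumptions: an in-memory array stands in for the DynamoDB dead-letter table, and the field names are hypothetical.

```typescript
// Sketch of the proposed cause-of-failure column: when a consumer nacks a
// poison pill, the developer-supplied reason is stored next to the message.
// An in-memory array stands in for the DynamoDB dead-letter table; field
// names are illustrative, not Triage's real schema.

interface DeadLetter {
  offset: number;        // the message's partition offset
  payload: string;       // the raw message body
  failureReason: string; // the new column: why the consumer nacked
  nackedAt: number;      // epoch millis, useful when analyzing failures
}

const deadLetterTable: DeadLetter[] = [];

// Called when Triage receives a nack carrying a failure reason.
function recordNack(offset: number, payload: string, reason: string): DeadLetter {
  const entry: DeadLetter = {
    offset,
    payload,
    failureReason: reason,
    nackedAt: Date.now(),
  };
  deadLetterTable.push(entry);
  return entry;
}

recordNack(51, '{"qty":"twelve"}', "expected integer qty, got string");
console.log(deadLetterTable[0].failureReason);
// expected integer qty, got string
```

With the reason persisted alongside the payload, a developer inspecting the dead-letter table can see at a glance why each message failed instead of re-deriving it from the payload.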
Finally, we'd like to add a simple notification system that could alert developers when poison pills are stored in the dead-letter table, allowing for rapid response. We think this is perhaps the easiest to implement and is likely our next step.
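A simple publish-subscribe hook would be enough for such a notification system. The sketch below is an assumption about how it might look: a callback list stands in for a real delivery channel (email, a Slack webhook, etc.), and all names are hypothetical.

```typescript
// Sketch of the proposed dead-letter notification: subscribers are alerted
// whenever a poison pill lands in the dead-letter table. A callback list
// stands in for a real channel (email, Slack webhook, etc.).

type DeadLetterListener = (offset: number, reason: string) => void;
const listeners: DeadLetterListener[] = [];

// Developers register a listener for dead-letter events.
function onDeadLetter(listener: DeadLetterListener): void {
  listeners.push(listener);
}

// Triage would call this right after persisting a dead letter.
function notifyDeadLetter(offset: number, reason: string): void {
  for (const listener of listeners) listener(offset, reason);
}

const alerts: string[] = [];
onDeadLetter((offset, reason) =>
  alerts.push(`poison pill at offset ${offset}: ${reason}`)
);
notifyDeadLetter(51, "unparseable payload");
console.log(alerts[0]);
// poison pill at offset 51: unparseable payload
```

Firing the notification at the same point the dead letter is persisted keeps the two features consistent: anything in the table has been announced, and vice versa.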
72
Questions?
github.com/Team-Triage
Aashish Balaji · Jordan Swartz · Michael Jung · Aryan Binazir
Toronto, Canada · San Diego, CA · Los Angeles, CA · Chapel Hill, NC
With that, I'd like to thank you all for joining us this afternoon and we'll open the floor for questions!
Weitere ähnliche Inhalte

Ähnlich wie Triage Presentation

MuleSoft Meetup Singapore #8 March 2021
MuleSoft Meetup Singapore #8 March 2021MuleSoft Meetup Singapore #8 March 2021
MuleSoft Meetup Singapore #8 March 2021Julian Douch
 
Cluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQCluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQShameera Rathnayaka
 
Kafka Fundamentals
Kafka FundamentalsKafka Fundamentals
Kafka FundamentalsKetan Keshri
 
Enterprise messaging with jms
Enterprise messaging with jmsEnterprise messaging with jms
Enterprise messaging with jmsSridhar Reddy
 
Session 23 - Kafka and Zookeeper
Session 23 - Kafka and ZookeeperSession 23 - Kafka and Zookeeper
Session 23 - Kafka and ZookeeperAnandMHadoop
 
A Quick Guide to Refresh Kafka Skills
A Quick Guide to Refresh Kafka SkillsA Quick Guide to Refresh Kafka Skills
A Quick Guide to Refresh Kafka SkillsRavindra kumar
 
Apache kafka- Onkar Kadam
Apache kafka- Onkar KadamApache kafka- Onkar Kadam
Apache kafka- Onkar KadamOnkar Kadam
 
Microservices in a Streaming World
Microservices in a Streaming WorldMicroservices in a Streaming World
Microservices in a Streaming WorldHans Jespersen
 
Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Timothy Spann
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenDimosthenis Botsaris
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Drivenarconsis
 
Removing performance bottlenecks with Kafka Monitoring and topic configuration
Removing performance bottlenecks with Kafka Monitoring and topic configurationRemoving performance bottlenecks with Kafka Monitoring and topic configuration
Removing performance bottlenecks with Kafka Monitoring and topic configurationKnoldus Inc.
 

Ähnlich wie Triage Presentation (20)

Apache kafka
Apache kafkaApache kafka
Apache kafka
 
MuleSoft Meetup Singapore #8 March 2021
MuleSoft Meetup Singapore #8 March 2021MuleSoft Meetup Singapore #8 March 2021
MuleSoft Meetup Singapore #8 March 2021
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Cluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQCluster_Performance_Apache_Kafak_vs_RabbitMQ
Cluster_Performance_Apache_Kafak_vs_RabbitMQ
 
Apache Kafka
Apache Kafka Apache Kafka
Apache Kafka
 
Kafka Fundamentals
Kafka FundamentalsKafka Fundamentals
Kafka Fundamentals
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Enterprise messaging with jms
Enterprise messaging with jmsEnterprise messaging with jms
Enterprise messaging with jms
 
Kafka Deep Dive
Kafka Deep DiveKafka Deep Dive
Kafka Deep Dive
 
Session 23 - Kafka and Zookeeper
Session 23 - Kafka and ZookeeperSession 23 - Kafka and Zookeeper
Session 23 - Kafka and Zookeeper
 
A Quick Guide to Refresh Kafka Skills
A Quick Guide to Refresh Kafka SkillsA Quick Guide to Refresh Kafka Skills
A Quick Guide to Refresh Kafka Skills
 
Event Driven Architecture
Event Driven ArchitectureEvent Driven Architecture
Event Driven Architecture
 
SA UNIT II KAFKA.pdf
SA UNIT II KAFKA.pdfSA UNIT II KAFKA.pdf
SA UNIT II KAFKA.pdf
 
Apache kafka- Onkar Kadam
Apache kafka- Onkar KadamApache kafka- Onkar Kadam
Apache kafka- Onkar Kadam
 
Microservices in a Streaming World
Microservices in a Streaming WorldMicroservices in a Streaming World
Microservices in a Streaming World
 
Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)Hello, kafka! (an introduction to apache kafka)
Hello, kafka! (an introduction to apache kafka)
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
 
Introduction to Kafka and Event-Driven
Introduction to Kafka and Event-DrivenIntroduction to Kafka and Event-Driven
Introduction to Kafka and Event-Driven
 
Adsa u4 ver 1.0
Adsa u4 ver 1.0Adsa u4 ver 1.0
Adsa u4 ver 1.0
 
Removing performance bottlenecks with Kafka Monitoring and topic configuration
Removing performance bottlenecks with Kafka Monitoring and topic configurationRemoving performance bottlenecks with Kafka Monitoring and topic configuration
Removing performance bottlenecks with Kafka Monitoring and topic configuration
 

Kürzlich hochgeladen

Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdfKamal Acharya
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 

Kürzlich hochgeladen (20)

Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
Call Girls in Nagpur Suman Call 7001035870 Meet With Nagpur Escorts
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 

Triage Presentation

  • 1. A consumer proxy that solves head-of-line blocking for Kafka consumers 1 Hey everyone, and welcome to our presentation. I’m Aashish and together with Aryan, Jordan, and Michael - our team built Triage, a consumer proxy that solves head-of- line blocking for Kafka consumers.
  • 2. Overview 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A 2 Here’s a quick overview of what you can expect. First, we’ll address the larger context of microservices and event-driven architecture. From there, we’ll take a look at message queues and focus on Apache Kafka, with a few details on how it works. Next, we’ll examine the problem of head-of-line blocking and its consequences, after which we’ll share our research on some existing solutions. At that point, we’ll present Triage and our approach to solving head-of-line blocking, along with some interesting design challenges we faced. We’ll end with some ideas for future work, and leave some room for a Q&A. We’re excited to show you what we built so let’s get started!
  • 3. “63% of enterprises have adopted microservice architectures, and it’s only expected to grow in the coming decade.” 3 Microservice architecture has really gained in popularity over the last decade and in 2020, it was estimated that over 63% of enterprises had adopted microservices and were satis fi ed with the tradeo ff s.
  • 4. 4 Shopping App API Logic DB Orders Microservice API Logic DB Products Microservice API Logic DB Stock Microservice Here’s an example of a microservice architecture for a shopping app. The takeaway here is to notice how the services are isolated into separate pieces. The orders, products, and stock inventory services all have their own logic and data stores, and the shopping app can communicate with all of them.
  • 5. What do microservices offer? 1. Development work can occur in parallel 2. Scalability becomes easier 3. Polyglot environment 5 Since services can be decoupled in this way, work can be done in parallel which leads to faster development times. Additionally, there’s a bene fi t in the ability to take individual components and scale them independently. Often, multiple technologies and programming languages are used in these setups, which is known as a polyglot microservice environment. Given the use of these di ff erent languages, an important question is:
  • 6. “How do we successfully achieve intra-system communication?” 6 How do we successfully achieve the required intra-system communication, for the system to function properly? One option is to use a request-response model, which is commonly used on the web.
  • 7. Request Response 7 Request Response Request Response Imagine a # of interconnected microservices where services can send a request, and wait for responses. The issue is that if a single service in this chain experiences a slowdown, the request lifecycle of any connected service will also be delayed. To overcome this problem, a common choice is to implement an event-driven architecture, or an EDA.
  • 8. EDAs are centered around events - which are changes in state - or noti fi cations about a change. 8 EDAs are centered around events, which can be thought of as changes in state, or noti fi cations about a change.
  • 9. In an EDA, services can operate independently without concern for the state of any other service. 9 The key here is that services can operate independently without concern for the state of any other service.
  • 10. 10 Event-Driven Architecture The service on the left can communicate with all 3 services on the right, independently. This architecture bypasses the problem where a delayed service causes a slowdown throughout the entire system. In order to achieve this decoupling, EDAs can be implemented using message queues.
  • 11. Message Queue Functionality Queue Producer 11 Consumer Here we have two producers to the left of the message queue. These applications write events to the queue. The consumer, which is to the right, reads these events o ff of the queue.
  • 12. Traditional message queues: events are read and then removed. Log-based message queues: events are persisted on a log. 12 In traditional message queues, events are read and then removed from the queue. An alternative approach is to use log-based message queues. Here, all the events are persisted on a log so you don’t lose them once they’re read.
  • 13. 13 Powered by Among log-based message queues, Kafka is the most popular - over 80% of Fortune 100 companies across industries use it as part of their architecture.
  • 14. What does Kafka offer? •Scalability •Parallelism •Decoupling 14 Kafka is designed for scalability and parallelism, and it maintains the intended decoupling of an EDA. It’s worth taking a look at what’s unique about Kafka and how it works.
  • 15. In Kafka, events are called messages. 15 In the context of Kafka, events are called messages and this is how we’ll refer to them.
  • 16. Topic Kafka Partition 2 Partition 1 16 In this image, messages are grouped using a named identi fi er - called a topic. Kafka achieves scalability by writing all the messages of a topic to partitions. So in this example, messages in a single topic are written to two di ff erent partitions.
  • 17. 17 Topic 1 Topic 2 Partition 2 Consumer Group A Consumer 1 Consumer 2 Consumer 3 Consumer 4 Producer 1 Producer 2 Partition 1 Kafka Partition 2 Partition 1 Consumer Group B If we add the other pieces of the architecture, it’ll look something like this. Producers, seen on the left, write messages to a topic. Consumers, on the right, are organized into groups with a group ID. If a consumer wants to read messages, it can subscribe to a speci fi c topic; then, individual consumer instances can read messages from a partition.
  • 18. Want more scalability? Add more partitions. 18 Need more parallelism? Use consumer instances. To achieve more scalability, you could simply increase the # of partitions per topic. Additionally, the use of multiple consumer instances means that the messages can be processed in parallel.
  • 19. While a consumer instance can consume from more than one partition, a partition can only be consumed by a single consumer instance. 19 It is important to note that while a consumer instance can consume from more than one partition, a partition can only be consumed by a single consumer instance. In other words, 2 di ff erent consumer instances can’t consume from the same partition.
  • 20. Kafka commits 20 • O ff set: A number that indicates the position of the message in the queue. • A consumer periodically commits o ff sets back to Kafka to acknowledge the last message it successfully processed. • In case of a crash, Kafka will remember where to resume message delivery from. Kafka uses commits to know which messages have been successfully processed. The way this works is that every message on a Kafka partition has an o ff set - this is a number that indicates the position of the message in the queue. Think of it like an index in an array. A consumer periodically commits o ff sets back to Kafka, indicating the last message it successfully processed. If a consumer instance crashes, Kafka will remember where to resume message delivery from.
  • 21. 21 O ff set 48 49 50 51 Producer Consumer Last Committed O ff set Kafka Here, once the consumer commits o ff set #50, Kafka knows that the messages from 48-50 have all been successfully processed. The consumer can continue consuming before it commits the next o ff set.
  • 22. 1. Producers write messages to a speci fi c topic. 2. Kafka routes these messages to partitions. 3. Consumers subscribe to a speci fi c topic to receive messages f and commit o ff sets. 4. Each partition in a topic can only be consumed by one d consumer instance. 22 Recap To recap, producers write messages to a speci fi c topic. Kafka then routes these messages to partitions. Consumers subscribe to a topic to receive messages and commit o ff sets. Each partition in a topic can only be consumed by one consumer instance.
  • 23. Overview 23 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A Now that we've shown the larger context, Jordan from our team will explain the problem of head-of-line blocking in message queues.
  • 24. Head-of-Line Blocking 24 A real-world example of head of line blocking that we are all likely familiar with is when you're at the supermarket and the person in the front of the line is taking a long time to fi nish paying. Perhaps they're trying to use expired coupons or have multiple fruits each with their own ID or they're trying to pay with bitcoin. It slows down the entire line and everyone behind them has to wait.
  • 25. Head-of-Line Blocking - Message Queues 25 Processing in Progress Message queues can also su ff er from head of line blocking. In this example, there are four messages. The fi rst green message is processed quickly. Animation The orange one though takes longer to process, and crucially, while it’s being processed, all of the other messages have to wait. Animation Once the slow message is processed, the rest of the queue can proceed. Animation
  • 26. 26 Poison Pills Non-Uniform Consumer Latency There are two major causes of head of line blocking when it comes to message queues. The fi rst is poison pills.
  • 27. Head-of-Line Blocking - Poison Pills 27 In this example, the circles are regular messages and the skull and crossbones represents a poison pill. A poison pill message is one that the consumer does not know how to handle. For example, if the application developer is expecting an order quantity as an integer but receives one as a string, and has not written error handling to handle this scenario, the application may crash. This will prevent processing of all of the messages behind the poison pill message in the queue. The fi rst message is consumed quickly. Animate but the poison pill message crashes the consumer application. Animate No further messages can be processed.
  • 28. Head-of-Line Blocking - Non-Uniform Consumer Latency Orange Service Green Service 28 Processing in Progress The second main cause of head of line blocking is non-uniform, consumer latency. Suppose we have a consumer application that calls one of two external services depending on the content of a message… for green messages the application calls the green service and for orange messages the application calls the orange service. The fi rst message is processed normally since the green service is healthy. Trigger Animation Now imagine that the orange service is slower than usual to respond, perhaps due to network issues. Trigger Animation This means that the processing of all the messages in the queue is slowed, even though the green messages have nothing to do with the orange service. The messages are not able to be processed until the orange service completes. Once the orange service fi nishes, the block is lifted, and the rest of the messages can be processed Trigger Animation
  • 29. Solution Requirements Polyglot Data Loss Prevention 29 Open Source Handle Poison Pills Handle Non- Uniform Consumer Latency In determining our desired approach to solving head of line blocking, we decided on fi ve solution requirements. The fi rst two were handling the two main causes of head-of-line-blocking. The third requirement was that data loss was prevented. A naive way of handling head-of-line-blocking would be to just drop messages that are causing it. This might be appropriate for non-critical scenarios, such as tracking likes on social media where it's not critical that every like is captured. However, for critical situations such as those involving orders, it is crucial that every order is captured, otherwise potential revenue may be lost. We wanted a solution that could prevent data loss. The fourth requirement was that the potential solution could be easily integrated into polyglot microservice environments. Lastly, the fi fth requirement was that the potential solution would be open source and easily available to developers.
  • 30. Overview 30 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A With these solution requirements in mind, we’ll now look at the existing solutions that we found that addressed head-of-line-blocking…
  • 31. Existing Solutions 31 1. Con fl uent Parallel Consumer 2. DoorDash's Worker Model 3. Uber’s Consumer Proxy - The three solutions we found were Con fl uent Parallel Consumer, DoorDash’s Worker Model, and Uber’s Consumer Proxy.
  • 32. Existing Solutions Comparison 32 Polyglot Data Loss Prevention Open Source DoorDash Kafka Workers Uber Consumer Proxy Con fl uent Parallel Consumer Handles Poison Pill Handle Non- Uniform Consumer Latency - Con fl uent Parallel Consumer fi xes head-of-line blocking caused by both poison pills as well as non-uniform consumer latency. But it doesn’t have a way to store poison pill messages, and since we cannot tolerate data loss, this solution was not viable for our use case. Also, their library is written in Java, meaning developers would have to write their applications in Java as well; this was counter to our goal of fi nding a solution that worked well in a polyglot environment. - While using Kafka, DoorDash experienced spikes in latency in their consumer applications. Individual slow messages were causing delayed processing for all messages in a given partition - a real world example of non-uniform consumer latency. To address this, they introduced something they called "Kafka Workers". This solution, however, failed to address poison pills, and with no mechanism to prevent data loss, this solution was insu ffi cient. - Lastly, Uber’s Consumer Proxy solves head-of-line blocking resulting from both poison pills and from non-uniform consumer latency - Poison pills are handled without data loss, and non-uniform consumer latency is addressed by parallel consumption of messages. Uber built Consumer Proxy as its own piece of infrastructure in order to work well in polyglot environments. However, as an in-house solution, it is not available for us or other developers to use.
  • 33. Overview 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A 33 - Given that none of the existing solutions fi t all of our requirements, we decided to build Triage. Next, Aryan will discuss what Triage is, and how it handles both causes of head of line blocking.
  • 34. What is Triage? 34 Kafka Cluster Consumer Application Thanks, Jordan… Triage acts as a proxy for consumer applications. It ingests messages from the Kafka Cluster and sends them to downstream consumer applications.
  • 35. Triage Instance Triage at a high level 35 Partition Application Logic DynamoDB Instance Consumer Application Partition Partition Kafka Topic - Here’s a high-level view of a Triage instance in the cloud. - Triage consumes from a single partition, just like any other Kafka consumer. - Triage's functionality consists of the application logic, running in an AWS container, and a DynamoDB instance. - Problematic messages are stored in Dynamo for examination at a later time - This pattern is known as the "dead-letter pattern"
  • 36. Messages 36 Dead Letter Store Dead-Letter Pattern - In dead-letter patterns, problematic messages (referred to as dead letters) are removed from the consumer application and persisted to an external data store for later processing.
  • 37. Partition Commit Tracker - Overview 37 Triage Application Logic msg ack/nack ack nack ack ack Consumer Instance Consumer Instance Consumer Instance To manage commits back to Kafka, Triage uses an internal system of acknowledgements with a component we call Commit Tracker. Consumers can send an “ack”, a positive acknowledgement, back to Triage, indicating that a message was successfully processed or a “nack”, a negative acknowledgement, to indicate a poison pill message.
  • 38. Commit Tracker - Ack/Nack 38 0 1 2 3 Offset: 4 5 6 7 8 9 offset msg acked? 0 1 2 3 4 5 6 7 8 9 false false false false false false false false false false Ack true true true true true Stored Ack Nack true true true true Commit Tracker Using the Commit Tracker, Triage can calculate which offsets to commit back to Kafka. This ensures that the health of the partition is maintained. Let's take a look at how Commit Tracker works, since it's central to the functionality of Triage. Triage first ingests a large batch of messages and stores them in a hashmap. TRIGGER ANIMATION The keys of the hashmap are the message offsets and the values are a custom struct with two fields: the message itself and a boolean, indicating whether it has been acknowledged. As Triage receives "acks" from consumers, we update the commit hash accordingly. TRIGGER ANIMATION
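A minimal Go sketch of the bookkeeping described in the notes above — a hashmap keyed by offset whose values hold the message and an acknowledged flag. All names here are illustrative, not Triage's actual identifiers:

```go
package main

import "fmt"

// trackedMessage mirrors the custom struct described above: the message
// itself plus a boolean recording whether it has been acknowledged.
// (Field names are our invention for illustration.)
type trackedMessage struct {
	payload string
	acked   bool
}

// ack marks an offset as acknowledged once a consumer confirms it.
func ack(tracker map[int64]*trackedMessage, offset int64) {
	if m, ok := tracker[offset]; ok {
		m.acked = true
	}
}

func main() {
	// Ingest a batch of messages keyed by offset, all initially unacked.
	tracker := map[int64]*trackedMessage{
		0: {payload: "order-created"},
		1: {payload: "order-paid"},
	}
	ack(tracker, 0)
	fmt.Println(tracker[0].acked, tracker[1].acked) // true false
}
```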
  • 39. When a message is "nacked", however, we cannot update the commit hash immediately. TRIGGER ANIMATION We must first ensure the message has been successfully written to our dead-letter store, which is a DynamoDB table, and only then do we update the commit hash. TRIGGER ANIMATION Next, the rest of the messages are processed by the consumers, including one, the orange message, that takes a long time to be processed by the consumer. As a result, the faster green messages are processed and acked before the orange one is. TRIGGER ANIMATION
  • 40. 39 0 1 2 3 Offset Committed: Commit 5 Offset: 4 5 6 7 8 9 offset msg acked? 0 1 2 3 4 5 6 7 8 9 false true true true true true 39 Commit Tracker - Commit Calculator true true true true Commit Tracker It's important to note that since we always wait for confirmation from Dynamo before updating the commit hash, at this point, whether a message has been "acked" or "nacked" isn't important - we only want to know that a message has been acknowledged in some way. So, how do we calculate which offset to commit back to Kafka? We want to commit as many offsets as we can, so we need to find the greatest committable offset. Periodically, a component called "Commit Calculator" runs in the background. It checks the commit hash to see the greatest offset with a value of true, for which all lower offsets also have a value of true. TRIGGER ANIMATION Triage can then commit this offset back to Kafka.
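The "greatest committable offset" rule described above can be sketched in a few lines of Go — a scan from the lowest tracked offset that stops at the first unacknowledged one. This is a sketch of the behaviour, not Triage's actual Commit Calculator code:

```go
package main

import "fmt"

// greatestCommittable returns the highest offset for which it and every
// lower tracked offset have been acknowledged; -1 means nothing is
// committable yet. (Illustrative only.)
func greatestCommittable(acked map[int64]bool, lowest int64) int64 {
	committable := int64(-1)
	for off := lowest; ; off++ {
		done, present := acked[off]
		if !present || !done {
			break // first gap or unacked offset stops the scan
		}
		committable = off
	}
	return committable
}

func main() {
	// Offsets 0-3 are acknowledged; 4 is still in flight; 5 was acked
	// out of order. Only offset 3 is safe to commit back to Kafka.
	acked := map[int64]bool{0: true, 1: true, 2: true, 3: true, 4: false, 5: true}
	fmt.Println(greatestCommittable(acked, 0)) // 3
}
```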
  • 41. TRIGGER ANIMATION Once we receive confirmation from Kafka that the commit was successful, we can then delete all entries up to and including that offset from Commit Tracker, since they're no longer needed. TRIGGER ANIMATION
  • 42. How Triage Solves Head-of-Line Blocking 40 With this understanding of Commit Tracker and the core functionality of Triage, let's take a look at how we solve head-of-line blocking due to both poison pills and non-uniform consumer latency.
  • 43. msg ack/nack nack ack DynamoDB Dead Letter Store ack ack How Triage Solves Poison Pills 41 Let's start with Poison Pills - here we can see a consumer application receiving a poison pill message. - Trigger Animation Consumer applications can tell Triage that the message they've received is a poison pill by sending a "nack". - Trigger Animation Triage sends that message to a DynamoDB table, so that it can be handled at a later time. This frees up the consumer to continue processing messages. - Trigger Animation
  • 44. How Triage Solves Non-Uniform Consumer Latency 42 To address non-uniform consumer latency, Triage enables the parallel consumption of messages from a single partition. Here, we have two instances of a single consumer application that rely on one of two external services based on the contents of a message. For orange messages, the application calls the orange external service; for greens, the green service. - Trigger Animation Here, you can see that because the orange service is slow, the consumer instance at the top is taking an unusually long time to process a message. - Trigger Animation (TALK OVER) Because of the one-to-many pattern enabled by Triage, healthy consumer instances are able to continue consumption, so the queue keeps moving.
  • 45. Overview 43 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A Now that you know how Triage solves head-of-line blocking, Mike will cover some of the challenges that we faced when building Triage as well as our plans for some improvements we'd like to build out.
  • 46. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 44 Based on our requirements and our intended design for Triage, there were three notable challenges that we'd like to discuss: - Achieving Parallel Consumption via Concurrency, - Polyglot Support, - and Ease of Deployment. For each of these challenges, I'll talk a little about them and discuss our respective solutions. Let's start with parallel consumption via concurrency.
  • 47. Parallel Consumption Kafka Partition 45 We need a one-to-many relationship between Triage and instances of a consumer application to solve head-of-line blocking caused by non-uniform consumer latency.
  • 48. Challenge: Achieving Concurrency 46 Our solution was to write the application logic of Triage in Go. Go is designed with concurrency in mind via what are called Goroutines. We can think of Goroutines as non-blocking function calls. Many Goroutines - think thousands - can run in the background with very little resource overhead.
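As a small illustration of that cheapness, the sketch below launches thousands of goroutines, each doing a trivial unit of work — something that would be prohibitively expensive with OS threads:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runWorkers launches n goroutines that each do a tiny unit of work and
// returns how many completed. Goroutines start with only a few kilobytes
// of stack, so launching thousands is cheap.
func runWorkers(n int) int64 {
	var done int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddInt64(&done, 1) // stand-in for real work
		}()
	}
	wg.Wait() // block until every goroutine has finished
	return done
}

func main() {
	fmt.Println(runWorkers(5000)) // 5000
}
```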
  • 49. Challenge: Parallel Consumption Solution: Go & Goroutines Goroutine C Goroutine B Goroutine A 47 Triage Within Triage, we run a dedicated Goroutine for each downstream consumer instance. These Goroutines pull messages and send them to consumer instances, allowing us to consume from a single partition in parallel.
  • 50. Concurrency in Triage 48 Concurrency via Go also allowed us to implement Triage as a single application. Each major component of Triage exists as a Goroutine, and these in turn make use of other Goroutines. We achieved communication across these Goroutines using channels.
  • 51. Challenge: Achieving Concurrency 49 Goroutine 1 Goroutine 2 Channel Channels are strongly-typed, queue-like structures. Goroutines can place messages on a channel for other Goroutines to receive. Importantly, when a message is received, it is removed from the channel.
  • 52. Challenge: Achieving Concurrency 50 Goroutine 2 Goroutine 1 Goroutine 3 Goroutine 5 Goroutine 4 Goroutine 6 Channel Because messages are removed, we can have multiple senders and receivers without worrying about unintended data duplication. Animate
  • 53. Concurrency in Triage 51 Fetcher Commit Tracker Consumer Manager messagesChannel newConsumersChannel Dispatch senderRoutine A senderRoutine B Connection Request Let's take a look at some of the major components of Triage and how we take advantage of concurrency. At a high level, we need a process to continually ingest messages from Kafka - this Goroutine is called Fetcher, in blue. It then needs to pipe these messages via the "messages channel" to a Goroutine called Dispatch and write them to our Commit Tracker in green. While all this is happening, we need another process to listen for incoming connection requests from consumer instances - we call this Goroutine "Consumer Manager". When it receives a request, after authenticating it, Consumer Manager places the network address of the consumer instance onto a "newConsumers" channel. When Dispatch receives a network address via this channel, it creates yet another Goroutine called "senderRoutine" that pulls messages from the messagesChannel. These senderRoutines, as their names imply, send messages to their respective consumer instances.
  • 54. 52 Connection Request Dynamo DB Fetcher Commit Tracker Consumer Manager messagesChannel newConsumersChannel Dispatch senderRoutine A senderRoutine B Commit Calculator commitsChannel messages commits acknowledgementsChannel Filter Reaper deadLettersChannel consumerRoutine Triage Application Logic committerRoutine Zooming out a little bit really hammers home the benefits we gain from concurrency. All of the components inside Triage that you can see on the screen are Goroutines, many of which rely on other Goroutines. While implementing all of this functionality without Go is certainly possible, Go made it very intuitive for us, cementing it as the correct language for the job.
  • 55. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 53 The next challenge we faced was polyglot support.
  • 56. Challenge: Polyglot Support Java Consumer Go Consumer NodeJS Consumer 54 Kafka Cluster Triage Instances As you can see on the right side of the diagram, we needed Triage to be able to support consumer applications written in a host of different languages.
  • 57. Challenge: Polyglot Support • Solution: • Implementation: Service + Thin Client Library • Network Protocol: gRPC 55 Our solution was to implement Triage as a service coupled with a thin client library, in addition to our choice of gRPC as our primary network communication protocol.
  • 58. Service vs Client Library 56 Before choosing our implementation model, we considered both a pure client library and a pure service approach.
  • 59. Potential Client Library Implementation Consumer Application 57 Kafka Cluster A potential pure client library implementation would have all the application logic of Triage exist as imported code within the consumer application. This comes with the benefit of not having to introduce new pieces of infrastructure to a user's system and makes testing Triage simpler. But, supporting additional languages would require a complete rewrite of Triage. Maintaining Triage would be pretty difficult, since any change to a system's Kafka version would require updating all versions of Triage. We considered these to be poor tradeoffs. An alternative would be to implement Triage as a service.
  • 60. Service Implementation 58 Kafka Cluster Triage Service Consumer Application With the pure service approach, Triage would act as a piece of infrastructure that sits between the Kafka cluster and consumer applications. This allows us to avoid the aforementioned cons of a client library implementation, but we still wanted to make connecting to Triage simple for developers.
  • 61. Challenge: Polyglot Support Solution: Service + Thin Client Library 59 Triage Service Kafka Cluster Consumer Application Triage Client We decided on a hybrid approach. The core application logic of Triage exists on a container running in AWS. Consumer applications use a thin client library to manage communication with Triage. This lightweight client exists within each instance of a consumer application. It provides convenience methods for sending an initial connection request and exposes an endpoint to receive messages from Triage.
  • 62. Multi-language Support 60 Kafka Cluster Triage Service Triage Client Triage Client Triage Client While we don't gain the full language agnosticism that a pure service approach might offer, building out multi-language support only requires us to rewrite our simple client library in another language.
  • 63. 1. Sends an initial HTTP request to Triage to request a connection. 2. Runs a gRPC server to receive messages from Triage. 61 The client library: Ultimately, the client library only 1) sends an initial HTTP request to Triage to request a connection and 2) runs a gRPC server to receive messages. Because it's operationally very simple, rewriting the client library is far more manageable than rewriting Triage in its entirety.
  • 64. Machine A Machine B doWork() code execution gRPC - an RPC Framework Machine A doWork() code execution 62 Local Procedure Call Remote Procedure Call To manage the communication between Triage and consumer instances, we chose gRPC as a network protocol, primarily for the ease with which we could build out multi-language support. I think it's helpful to talk a little bit about what gRPC is. gRPC is an RPC framework, created by Google, where RPC stands for remote procedure call. We can think of procedure calls as simple function calls or invocations. With a local procedure call, everything exists on a single host machine. In the figure on the left, the function "doWork" is executed on Machine A, resulting in code being executed on Machine A. Remote procedure calls, however, allow us to execute code on a different machine. In the figure on the right, "doWork" is being called on Machine A, resulting in code being executed on Machine B.
  • 65. gRPC & Triage 63 gRPC Client processMessage(message) Triage Consumer Instance gRPC Server code execution It's helpful to understand that gRPC uses the same client-server model that we're familiar with. With Triage, the Triage container acts as a gRPC client, and calls "processMessage()", with the message as an argument. The consumer instance runs a gRPC server that listens for this procedure call. It then executes code to process the message before sending a response, the "ack" or "nack" we've talked about before, back to Triage.
  • 66. 64 Code Generation gRPC Server gRPC Server gRPC Server • function name • Parameters • Return value gRPC Service Definition The biggest reason we decided on gRPC is its code generation feature. Using what's called a gRPC service definition, client and server implementations can be automatically generated in all major programming languages. Creating a gRPC service definition is pretty straightforward. You simply define a function interface - that is, what is the name of the function, what parameters does it have, and what does it return. Because the most complicated part of building the Triage client library is handled for us via this code generation, we can write support for other languages with relative ease.
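As a rough illustration, a service definition for the processMessage call described earlier might look like the following protobuf sketch. All service, message, and field names here are our guesses for illustration, not Triage's actual .proto file:

```protobuf
syntax = "proto3";

package triage;

// Hypothetical service definition. Triage acts as the gRPC client and
// each consumer instance runs the server side of this service.
service MessageProcessor {
  // Deliver one Kafka message; the response carries the ack or nack.
  rpc ProcessMessage(Message) returns (Acknowledgement);
}

message Message {
  int64 offset = 1;
  bytes key = 2;
  bytes value = 3;
}

message Acknowledgement {
  int64 offset = 1;
  bool ack = 2; // true = ack, false = nack (poison pill)
}
```

From a definition like this, `protoc` can generate client and server stubs in each of the languages the thin client library needs to support.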
  • 67. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 65 The final challenge we faced was making Triage easy to deploy for application developers.
  • 68. Challenge: Ease of Deployment • Solution: • AWS CDK • Triage CLI 66 Our solution was to create an automated deployment script using AWS's Cloud Development Kit. Developers can use our command line tool, Triage CLI, to easily deploy Triage to AWS using this CDK script.
  • 69. Challenge: Ease of Deployment 67 Kafka Topic ECS Partition Partition Partition Solution: AWS CDK Because Triage operates on a per-partition basis, we needed to deploy a container running Triage for each partition in a given Kafka topic. To do this, we used Elastic Container Service, specifically with Fargate as our deployment vehicle. With ECS, we can define a minimum number of Triage containers running at any given time - were one to crash for some reason, another would be provisioned to replace it automatically. Using Fargate means management of individual compute resources is abstracted away from our users, allowing them to think only about containers. The key for us was that by using CDK, we could write a reusable script to deploy Triage containers via Fargate. That being said, we still needed to answer the question of how to interpolate user-specific information, such as Kafka authentication credentials, into Triage during deployment.
  • 70. Challenge: Ease of Deployment Solution: Triage CLI • triage init • triage deploy • Triage network address • Authentication Key 68 To do so, we created a command line tool called Triage CLI. It can be downloaded as an NPM package and features a 2-step deployment process. triage init installs any necessary dependencies for deployment and generates a configuration file where developers can supply authentication and Kafka-specific information. triage deploy interpolates the data in this configuration file into the CDK script. It also creates an internal config file used by individual Triage containers. It then deploys these containers to AWS. Finally, it returns the network address and authentication key needed for consuming applications to connect to Triage. Using Triage CLI, a developer can leverage our CDK script to deploy Triage to the cloud.
  • 71. Triage Design Challenges • Achieving Parallel Consumption via Concurrency • Polyglot Support • Ease of Deployment 69 Having solved these major challenges, we were able to build Triage without compromising on any of our design requirements. For a more in-depth exploration of how Triage works and its implementation details, check out our write-up, linked in the Zoom meeting description.
  • 72. Overview 70 1. Microservices & Event-Driven Architecture (EDA) 2. Message Queues & Apache Kafka 3. Head-of-Line Blocking 4. Existing Solutions 5. Introducing Triage 6. Triage Design Challenges 7. Future Work 8. Q&A Before we open up for questions, we'd like to cover some features we'd like to add.
  • 73. Future Work 1. Extend client library language support 2. Cause of Failure for Dead Letter Table 3. Dead Letter Notifications 71 We'd first like to build out additional language support for our thin client library. As we've discussed, doing so shouldn't be difficult, since the majority of the work is done for us via gRPC code generation. Supporting other popular languages like JavaScript or Ruby would help us serve more developers. We'd also like to add a cause-of-failure column to the table that stores dead-letter messages - it would contain failure reasons that developers could supply when sending a "nack" back to Triage for poison pills. This would aid in analyzing and remedying faulty messages. Finally, we'd like to add a simple notification system that could alert developers when poison pills are stored in the dead-letter table, allowing for rapid response. We think this is perhaps the easiest to implement and is likely our next step.
  • 74. 72 Questions? github.com/Team-Triage Aashish Balaji Jordan Swartz Michael Jung Aryan Binazir Toronto, Canada San Diego, CA Los Angeles, CA Chapel Hill, NC With that, I'd like to thank you all for joining us this afternoon and we'll open the floor for questions!