The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day. Many of these are new applications, but there have also been more migrations from existing online and offline applications. To support the influx of new use cases, we have improved the flexibility, efficiency and reliability of Apache Samza.
In this talk, we will take a brief look at the broader streaming ecosystem at LinkedIn, then we will zoom in on a few representative use cases and explain how they are powered by recent advancements to Apache Samza including a unified high level API, flexible deployment model, batch processing, and more.
Unified Stream Processing at Scale with Apache Samza - BDS2017
1. 1
Unified Stream Processing at Scale with Apache
Samza
Jake Maes
Staff SW Engineer at LinkedIn
Apache Samza PMC
2. 2
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
3. 3
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
4. 4
About
● Stream processing framework
● Production at LinkedIn since 2014
● Apache top level project since 2014
● 16 Committers
● 74 Contributors
● Known for
Scale
Managed local state
Pluggability
Kafka integration
5. 5
Traditional Stream Processing
● Low latency
● One message at a time
● Checkpointing, durable state
● All I/O with high-performance message brokers
7. 7
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
8. 8
Stream Processing Use Cases at LinkedIn
● Anti abuse
● Derived data
● Search indexing
● Geographic filtering
● A/B testing infrastructure
● Many, many more…
9. 9
Stream Processing Ecosystem – The Dream
[Diagram: Samza at the center. Applications and services publish events to Kafka; Brooklin ingests external streams and storage change capture into Kafka; Samza processes the streams and writes results back to Kafka, which feeds storage & serving systems.]
11. 11
Expansion of Stream Processing at LinkedIn
● Influx of applications
10 -> 200+ over 3 years
13K containers processing 260B events/day
● Migrations of existing applications
Online services
Offline jobs
● Incoming applications have different expectations
Services
12. 12
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
13. 13
Case Study – Notification Scheduler
[Diagram: three input streams (User Chat Event, User Action Event, Connection Activity Event) feed a processor containing an Aggregation Engine and Channel Selection logic, backed by a local state store; the processor also queries a remote member profile database and calls RESTful services.
① Local data access ② Remote database lookup ③ Remote service call (output)]
14. 14
Online Service + Stream Processing
Why use stream processor?
● Richer framework than Kafka clients
Requirements:
● Deployment model
Cluster (YARN) environment not suitable
● Remote I/O
Dependencies on other services
I/O latency stalls the single-threaded processor
Container parallelism - too much overhead
Services
18. 18
Performance for Remote I/O
[Chart: throughput of a single thread (baseline) vs. sync I/O with multithreading, at thread pool size = 10 with max concurrency = 1 and max concurrency = 3]
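To illustrate why multithreading helps here (a toy simulation, not Samza's actual run loop), the sketch below compares serialized blocking lookups against the same lookups overlapped on a thread pool; the ~50 ms delay, event count, and pool size of 10 are invented numbers:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RemoteIoDemo {
    // Simulate a blocking remote call taking ~50 ms.
    static void remoteLookup(int key) {
        try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    // Baseline: a single-threaded event loop serializes every lookup.
    static long runSerialMs(int events) {
        long t0 = System.nanoTime();
        for (int i = 0; i < events; i++) remoteLookup(i);
        return (System.nanoTime() - t0) / 1_000_000;
    }

    // Thread pool: up to poolSize lookups in flight at once.
    static long runPooledMs(int events, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long t0 = System.nanoTime();
        CompletableFuture<?>[] futures = new CompletableFuture<?>[events];
        for (int i = 0; i < events; i++) {
            final int key = i;
            futures[i] = CompletableFuture.runAsync(() -> remoteLookup(key), pool);
        }
        CompletableFuture.allOf(futures).join();
        pool.shutdown();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("serial: ~" + runSerialMs(20) + " ms");    // ~1000 ms
        System.out.println("pooled: ~" + runPooledMs(20, 10) + " ms"); // ~100 ms
    }
}
```

With 20 events, the serialized loop takes roughly 20 × 50 ms, while the pool of 10 overlaps the waits, which is the effect the chart above measures.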
19. 19
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
20. 20
Case Study – Unified Metrics with Samza
[Diagram: an analyst authors a Pig script; UMP "compiles" it, generating fluent Samza code plus runtime config, which is then deployed.]
21. 21
Offline Jobs
Why use stream processor?
● Lower latency
Requirements:
● HDFS I/O
● Same app in batch and streaming
Best of both worlds
● Composable API
26. 26
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
27. 27
What’s Next?
● SQL
Prototyped 2015
Now getting full time attention
● High Level API extensions
Better config, I/O, windowing, and more
● Beam Runner
Samza performance with Beam API
● Table support
28. 28
Thank You
Contact:
● Email dev@samza.apache.org
● Social http://twitter.com/jakemaes
Links:
● http://samza.apache.org
● http://github.com/apache/samza
● https://engineering.linkedin.com/blog
30. 30
High Level API – Composable Operators

Stateless functions:
filter – select a subset of messages from the stream
map – map one input message to an output message
flatMap – map one input message to 0 or more output messages
merge – union all inputs into a single output stream

I/O functions:
partitionBy – re-partition the input messages based on a specific field
sendTo – send the result to an output stream
sink – send the result to an external system (e.g. an external DB)

Stateful functions:
window – window aggregation on the input stream
join – join messages from two input streams
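As a loose analogy for this composable style (plain java.util.stream chaining, not the Samza API; the PageView type, field names, and filter condition are invented for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OperatorChainDemo {
    record PageView(String memberId, String page) {}

    // filter -> re-key (analogous to partitionBy) -> aggregate (a stand-in
    // for a windowed count), chained in the same composable style as the
    // operator table above.
    static Map<String, Long> viewsPerMember(List<PageView> input) {
        return input.stream()
                .filter(pv -> !pv.page().equals("/jobs"))           // filter
                .collect(Collectors.groupingBy(PageView::memberId,  // partitionBy
                        Collectors.counting()));                    // count
    }

    public static void main(String[] args) {
        List<PageView> input = List.of(
                new PageView("alice", "/feed"),
                new PageView("bob", "/jobs"),
                new PageView("alice", "/profile"));
        System.out.println(viewsPerMember(input)); // {alice=2}
    }
}
```

The point of the operator vocabulary is exactly this: small, single-purpose functions that compose into a pipeline instead of one monolithic process() method.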
32. 32
Typical Flow - Two Stages Minimum
Re-
partition
window map sendTo
PageVie
w
Event
PageViewEvent
ByMemberId
PageViewEventP
er
MemberStream
PageViewRepartitionTask PageViewByMemberIdCounterTask
Speaker notes
This talk is an evolution story of stream processing at LinkedIn:
A few years ago, services, batch, and stream processing were isolated
Now stream processing is used everywhere
Talk focuses on LI, but should apply if your organization is looking to adopt or expand its usage of stream processing.
Title of the talk used the word “Unified” = Stream processing framework that can be used by itself, embedded in an online service, or deployed in both streaming and batch environments seamlessly
Latency Spectrum
Samza-Kafka interaction optimized
Performance:
Most processors can handle 10K msg/s per container. Yes, even with state!
Have seen trivial processors like a repartitioner handle as much as 50K msg/s per container
Have run benchmarks showing 1.2M msg/s on a single host
Under the hood:
Partitions
Data parallelism
Could be files on HDFS, partitions in Kafka, etc.
Tasks
Logical unit of execution
Isolated local state
Processor
Computational parallelism (coarse grained, 1 JVM)
Single threaded event loop
Work assignment
Input partitions are assigned to task instances
A particular task instance usually processes 1 partition from each input (topic)
A task instance will often write to all output partitions
Checkpoints are used to track progress
Changelog for state durability
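A minimal sketch of the work assignment described above, where partition i of every input is grouped onto task i (the topic names echo the notification-scheduler inputs but are hypothetical, as is the partition count):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaskAssignmentDemo {
    // Group-by-partition-id assignment: task i owns partition i of every
    // input topic, so co-partitioned keys meet in one task's local state.
    static Map<Integer, List<String>> assign(List<String> topics, int partitionsPerTopic) {
        Map<Integer, List<String>> assignment = new TreeMap<>();
        for (String topic : topics) {
            for (int p = 0; p < partitionsPerTopic; p++) {
                assignment.computeIfAbsent(p, k -> new ArrayList<>())
                          .add(topic + "[" + p + "]");
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        var tasks = assign(List.of("user-chat", "user-action", "connection-activity"), 4);
        System.out.println(tasks.get(0));
        // [user-chat[0], user-action[0], connection-activity[0]]
    }
}
```

This is why a task instance "usually processes 1 partition from each input": the partition id itself is the unit of assignment.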
So, how does this fit into the broader ecosystem?
Stream processing center of the world
Left is storage data at rest
Brooklin is stream ingestion normalization layer
CDC plus ingestion from other streams
Events come into Kafka from brooklin or apps & services
Processed by Samza and back out to Kafka
Ingested by other storage and serving components
Common pattern
Optimized for streams (everything is a stream)
Realistic? no
Stream processor is optimized for interacting with streams, it makes sense to pursue an architecture which provides access to all the necessary data sources and sinks as streams.
In reality, streaming applications often need to interface with a number of other systems. Why?
Too expensive to replicate everything into Kafka [Kafka Connect]
Processor was written offline but for latency purposes needs to also run in streaming mode
Some datasources are shared with other services that need Random Access that is easier to provide from a database or serving layer and we don’t want multiple sources of truth
Because some systems don’t have the ability to ingest from a stream (either because it wasn’t created, or they just wouldn’t be able to do it fast enough)
Because sometimes for security purposes, it’s better for an application to interact directly with another
Where does this come from? Over the remainder of this talk, I'll describe how stream processing has changed at LinkedIn and dig into 2 sources of the evolving requirements and how we adapted to them.
As I mentioned earlier, at LinkedIn we’ve been using Samza in production for over 3 years.
In that time it has grown from 10 to over 200 applications.
We now have over 1300 app instances, with an average of 10 containers each, handling over 260B events per day
(conservative numbers)
These applications are not all new stream processors; many of them are migrations of existing applications that can be divided into 2 main categories:
Preexisting online services, and offline jobs
Why use stream processor?
Abstractions for input output streams, checkpointing, durable state, etc.
Existing services often don’t fit with the YARN deployment model
May already have dedicated hardware they want to use
May require a more static host assignment. e.g. if they’re exposing a RESTful endpoint
Also tend to depend on other services
Datasources with only RESTful or client APIs (not streaming)
Remote I/O introduces huge latency into single-threaded event loop.
Workaround: users would manage their own thread pool and use manual commit in window()
Workaround2: users would just use a massive number of containers to get more parallelism
Metrics used to be computed daily with Pig scripts
Now the same script can be compiled into a Samza processor that runs both online and offline. No need to rewrite for the new platform.
Real-time metrics
From a single definition (e.g. a Pig script) generate batch and real-time flows. Flip a config!
How to associate the keys from multiple streams?
Copartitioning
Each input has:
Same partition key
Same partition count
Same partition algorithm (usually hash + modulo)
The result of the co-partitioning requirement:
Most stateful jobs include a repartition stage, which re-keys or reshuffles the inputs to achieve co-partitioning
Often implemented as separate processors that are deployed at the same time.
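The co-partitioning requirement can be sketched as follows; partitionFor is a stand-in for the typical hash-plus-modulo partitioner mentioned above, and the key and stream names are hypothetical:

```java
public class CoPartitionDemo {
    // Typical partitioner: hash the key, then modulo the partition count.
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 8; // co-partitioning: both inputs use the same count

        // Same key, same partition count, same algorithm => this member's
        // records land in the same partition of each input stream, so a
        // single task instance sees all of them and can join locally.
        String memberId = "member-42";
        int fromPageViews = partitionFor(memberId, partitions);    // input 1
        int fromProfileEdits = partitionFor(memberId, partitions); // input 2
        System.out.println(fromPageViews == fromProfileEdits); // true
    }
}
```

If either input had a different key, count, or algorithm, the guarantee breaks, which is why most stateful jobs lead with the repartition stage described above.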