The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day. Many of these are new applications, but there have also been more migrations from existing online and offline applications. To support the influx of new use cases, we have improved the flexibility, efficiency and reliability of Apache Samza.
In this talk, we will take a brief look at the broader streaming ecosystem at LinkedIn, then we will zoom in on a few representative use cases and explain how they are powered by recent advancements to Apache Samza including a unified high level API, flexible deployment model, batch processing, and more.
Unified Stream Processing at Scale with Apache Samza - BDS2017
1. 1
Unified Stream Processing at Scale with Apache
Samza
Jake Maes
Staff SW Engineer at LinkedIn
Apache Samza PMC
2. 2
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
3. 3
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
4. 4
About
● Stream processing framework
● Production at LinkedIn since 2014
● Apache top level project since 2014
● 16 Committers
● 74 Contributors
● Known for
Scale
Managed local state
Pluggability
Kafka integration
5. 5
Traditional Stream Processing
● Low latency
● One message at a time
● Checkpointing, durable state
● All I/O with high-performance message brokers
7. 7
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
8. 8
Stream Processing Use Cases at LinkedIn
● Anti abuse
● Derived data
● Search indexing
● Geographic filtering
● A/B testing infrastructure
● Many, many more…
9. 9
Stream Processing Ecosystem – The Dream
[Diagram: Samza at the center. Applications and services publish events to Kafka; Brooklin ingests external streams and storage change capture into Kafka; Samza processes the streams and writes results back to Kafka, which feeds storage & serving systems.]
11. 11
Expansion of Stream Processing at LinkedIn
● Influx of applications
10 -> 200+ over 3 years
13K containers processing 260B events/day
● Migrations of existing applications
Online services
Offline jobs
● Incoming applications have different expectations
Services
12. 12
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
13. 13
Case Study – Notification Scheduler
[Diagram: three input streams (User Chat Event, User Action Event, Connection Activity Event) feed a processor containing an Aggregation Engine and Channel Selection logic, backed by a local state store; the processor also queries a remote member profile database and calls RESTful services.
① Local data access ② Remote database lookup ③ Remote service call (output)]
14. 14
Online Service + Stream Processing
Why use stream processor?
● Richer framework than Kafka clients
Requirements:
● Deployment model
Cluster (YARN) environment not suitable
● Remote I/O
Dependencies on other services
I/O latency stalls the single-threaded processor
Container parallelism - too much overhead
Services
18. 18
Performance for Remote I/O
[Chart: throughput of a single thread (baseline) vs. sync I/O with multithreading, at thread pool size = 10 with max concurrency = 1 and max concurrency = 3]
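To illustrate why multithreading helps here (a toy simulation, not Samza's actual run loop), the sketch below compares serialized blocking lookups against the same lookups overlapped on a thread pool; the ~50 ms delay, event count, and pool size of 10 are invented numbers:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class RemoteIoDemo {
    // Simulate a blocking remote call taking ~50 ms.
    static void remoteLookup(int key) {
        try { Thread.sleep(50); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    // Baseline: a single-threaded event loop serializes every lookup.
    static long runSerialMs(int events) {
        long t0 = System.nanoTime();
        for (int i = 0; i < events; i++) remoteLookup(i);
        return (System.nanoTime() - t0) / 1_000_000;
    }

    // Thread pool: up to poolSize lookups in flight at once.
    static long runPooledMs(int events, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        long t0 = System.nanoTime();
        CompletableFuture<?>[] futures = new CompletableFuture<?>[events];
        for (int i = 0; i < events; i++) {
            final int key = i;
            futures[i] = CompletableFuture.runAsync(() -> remoteLookup(key), pool);
        }
        CompletableFuture.allOf(futures).join();
        pool.shutdown();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println("serial: ~" + runSerialMs(20) + " ms");    // ~1000 ms
        System.out.println("pooled: ~" + runPooledMs(20, 10) + " ms"); // ~100 ms
    }
}
```

With 20 events, the serialized loop takes roughly 20 × 50 ms, while the pool of 10 overlaps the waits, which is the effect the chart above measures.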
19. 19
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
20. 20
Case Study – Unified Metrics with Samza
[Diagram: an analyst authors a Pig script; UMP "compiles" it, generating fluent Samza code plus runtime config, which is then deployed.]
21. 21
Offline Jobs
Why use stream processor?
● Lower latency
Requirements:
● HDFS I/O
● Same app in batch and streaming
Best of both worlds
● Composable API
26. 26
Agenda
Intro to Stream Processing
Stream Processing Ecosystem at LinkedIn
Use Case: Pre-Existing Online Service
Use Case: Batch Streaming
Future
27. 27
What’s Next?
● SQL
Prototyped 2015
Now getting full time attention
● High Level API extensions
Better config, I/O, windowing, and more
● Beam Runner
Samza performance with Beam API
● Table support
28. 28
Thank You
Contact:
● Email dev@samza.apache.org
● Social http://twitter.com/jakemaes
Links:
● http://samza.apache.org
● http://github.com/apache/samza
● https://engineering.linkedin.com/blog
30. 30
High Level API – Composable Operators

Stateless functions:
filter – select a subset of messages from the stream
map – map one input message to an output message
flatMap – map one input message to 0 or more output messages
merge – union all inputs into a single output stream

I/O functions:
partitionBy – re-partition the input messages based on a specific field
sendTo – send the result to an output stream
sink – send the result to an external system (e.g. an external DB)

Stateful functions:
window – window aggregation on the input stream
join – join messages from two input streams
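As a loose analogy for this composable style (plain java.util.stream chaining, not the Samza API; the PageView type, field names, and filter condition are invented for illustration):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class OperatorChainDemo {
    record PageView(String memberId, String page) {}

    // filter -> re-key (analogous to partitionBy) -> aggregate (a stand-in
    // for a windowed count), chained in the same composable style as the
    // operator table above.
    static Map<String, Long> viewsPerMember(List<PageView> input) {
        return input.stream()
                .filter(pv -> !pv.page().equals("/jobs"))           // filter
                .collect(Collectors.groupingBy(PageView::memberId,  // partitionBy
                        Collectors.counting()));                    // count
    }

    public static void main(String[] args) {
        List<PageView> input = List.of(
                new PageView("alice", "/feed"),
                new PageView("bob", "/jobs"),
                new PageView("alice", "/profile"));
        System.out.println(viewsPerMember(input)); // {alice=2}
    }
}
```

The point of the operator vocabulary is exactly this: small, single-purpose functions that compose into a pipeline instead of one monolithic process() method.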
32. 32
Typical Flow - Two Stages Minimum
Re-
partition
window map sendTo
PageVie
w
Event
PageViewEvent
ByMemberId
PageViewEventP
er
MemberStream
PageViewRepartitionTask PageViewByMemberIdCounterTask
Speaker notes
This talk is an evolution story of stream processing at LinkedIn:
A few years ago, services, batch, and stream processing were isolated
Now stream processing is used everywhere
Talk focuses on LI, but should apply if your organization is looking to adopt or expand its usage of stream processing.
Title of the talk used the word “Unified” = Stream processing framework that can be used by itself, embedded in an online service, or deployed in both streaming and batch environments seamlessly
Latency Spectrum
Samza-Kafka interaction optimized
Performance:
Most processors can handle 10K msg/s per container. Yes, even with state!
Have seen trivial processors like a repartitioner handle as much as 50K msg/s per container
Have run benchmarks showing 1.2M msg/s on a single host
Under the hood:
Partitions
Data parallelism
Could be files on HDFS, partitions in Kafka, etc.
Tasks
Logical unit of execution
Isolated local state
Processor
Computational parallelism (coarse grained, 1 JVM)
Single threaded event loop
Work assignment
Input partitions are assigned to task instances
A particular task instance usually processes 1 partition from each input (topic)
A task instance will often write to all output partitions
Checkpoints are used to track progress
Changelog for state durability
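A minimal sketch of the work assignment described above, where partition i of every input is grouped onto task i (the topic names echo the notification-scheduler inputs but are hypothetical, as is the partition count):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TaskAssignmentDemo {
    // Group-by-partition-id assignment: task i owns partition i of every
    // input topic, so co-partitioned keys meet in one task's local state.
    static Map<Integer, List<String>> assign(List<String> topics, int partitionsPerTopic) {
        Map<Integer, List<String>> assignment = new TreeMap<>();
        for (String topic : topics) {
            for (int p = 0; p < partitionsPerTopic; p++) {
                assignment.computeIfAbsent(p, k -> new ArrayList<>())
                          .add(topic + "[" + p + "]");
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        var tasks = assign(List.of("user-chat", "user-action", "connection-activity"), 4);
        System.out.println(tasks.get(0));
        // [user-chat[0], user-action[0], connection-activity[0]]
    }
}
```

This is why a task instance "usually processes 1 partition from each input": the partition id itself is the unit of assignment.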
So, how does this fit into the broader ecosystem?
Stream processing center of the world
Left is storage data at rest
Brooklin is stream ingestion normalization layer
CDC plus ingestion from other streams
Events come into Kafka from brooklin or apps & services
Processed by Samza and back out to Kafka
Ingested by other storage and serving components
Common pattern
Optimized for streams (everything is a stream)
Realistic? no
Stream processor is optimized for interacting with streams, it makes sense to pursue an architecture which provides access to all the necessary data sources and sinks as streams.
In reality, streaming applications often need to interface with a number of other systems. Why?
Too expensive to replicate everything into Kafka [Kafka Connect]
Processor was written offline but for latency purposes needs to also run in streaming mode
Some datasources are shared with other services that need Random Access that is easier to provide from a database or serving layer and we don’t want multiple sources of truth
Because some systems don’t have the ability to ingest from a stream (either because it wasn’t created, or they just wouldn’t be able to do it fast enough)
Because sometimes for security purposes, it’s better for an application to interact directly with another
Where does this come from? Over the remainder of this talk, I'll describe how stream processing has changed at LinkedIn and dig into 2 sources of the evolving requirements and how we adapted to them.
As I mentioned earlier, at LinkedIn we’ve been using Samza in production for over 3 years.
In that time it has grown from 10 to over 200 applications.
We now have over 1300 app instances, with an average of 10 containers each, handling over 260B events per day
(conservative numbers)
These applications are not all new stream processors; many of them are migrations of existing applications that can be divided into 2 main categories:
Preexisting online services, and offline jobs
Why use stream processor?
Abstractions for input output streams, checkpointing, durable state, etc.
Existing services often don’t fit with the YARN deployment model
May already have dedicated hardware they want to use
May require a more static host assignment. e.g. if they’re exposing a RESTful endpoint
Also tend to depend on other services
Datasources with only RESTful or client APIs (not streaming)
Remote I/O introduces huge latency into single-threaded event loop.
Workaround: users would manage their own thread pool and use manual commit in window()
Workaround2: users would just use a massive number of containers to get more parallelism
Metrics used to be computed daily with Pig scripts
Now the same script can be compiled into a Samza processor that runs both online and offline. No need to rewrite for the new platform.
Real-time metrics
From a single definition (e.g. a Pig script) generate batch and real-time flows. Flip a config!
How to associate the keys from multiple streams?
Copartitioning
Each input has:
Same partition key
Same partition count
Same partition algorithm (usually hash + modulo)
The result of the co-partitioning requirement:
Most stateful jobs include a repartition stage, which re-keys or reshuffles the inputs to achieve co-partitioning
Often implemented as separate processors that are deployed at the same time.
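The co-partitioning requirement can be sketched as follows; partitionFor is a stand-in for the typical hash-plus-modulo partitioner mentioned above, and the key and stream names are hypothetical:

```java
public class CoPartitionDemo {
    // Typical partitioner: hash the key, then modulo the partition count.
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        int partitions = 8; // co-partitioning: both inputs use the same count

        // Same key, same partition count, same algorithm => this member's
        // records land in the same partition of each input stream, so a
        // single task instance sees all of them and can join locally.
        String memberId = "member-42";
        int fromPageViews = partitionFor(memberId, partitions);    // input 1
        int fromProfileEdits = partitionFor(memberId, partitions); // input 2
        System.out.println(fromPageViews == fromProfileEdits); // true
    }
}
```

If either input had a different key, count, or algorithm, the guarantee breaks, which is why most stateful jobs lead with the repartition stage described above.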