More Related Content Similar to Streaming analytics better than batch – when and why by Dawid Wysakowicz and Adam Kawa at Big Data Spain 2017 (20) More from Big Data Spain (20) Streaming analytics better than batch – when and why by Dawid Wysakowicz and Adam Kawa at Big Data Spain 20172. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Streaming analytics better
than batch - when and why ?
_Adam Kawa - Dawid Wysakowicz -_
Krzysztof Zarzycki_
3. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Have you ever
built cool
Big Data
pipelines?
4. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
5. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Example Use-Case
■ Can be done in batch and
real-time
■ User session analytics at Spotify
● Simple stats
■ Duration, number of songs,
skips, searches etc.
● Advanced analytics
■ Mood, physical activity, real-time
content, ads
6. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Example Output
How long do users listen
to a new edition of
Discover Weekly?
_1. Dashboards_
7. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Example Output
How long do users listen
to a new edition of
Discover Weekly?
Australian users are
listening to Discover
Weekly too short !!!
_1. Dashboards_ _2. Alerts_
8. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Example Output
How long do users listen
to a new edition of
Discover Weekly?
Australian users are
listening to Discover
Weekly too short !!!
Recommend songs
and ads based on
current activity.
_1. Dashboards_ _2. Alerts_ _3. Content_
9. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
1st
- Batch Architecture
1h
1h
1h
1h - 1d
1h
User
Events
User
Sessions
10. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
1st
- Batch Architecture
1h
1h
1h
1d
1h
User
Events
User
Sessions
11. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
The More Moving Parts …
⬇ The higher learning curve
⬇ The more gluing code
⬇ The larger administrative effort
⬇ The more error-prone solution
12. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Long Waiting Time
Image source: “Continuous Analytics: Stream Query Processing in Practice”, Michael J Franklin, Professor, UC Berkley, Dec 2009 and
http://www.slideshare.net/JoshBaer/shortening-the-feedback-loop-big-data-spain-external
13. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
2nd
- Micro-Batch Architecture
1m - 1h
14. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
♪ ♪
No Built-In Session Windows
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
[10:00 - 11:00) [11:00 - 12:00)
15. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
♪ ♪
No Built-In Session Windows
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
[10:00 - 11:00) [11:00 - 12:00)
16. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Late Data …
♪ ♪ ♪ ♪ ♪ ♪ Event Time
14:55 - 16:35
Processing
Time
17. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
... Included in Current Batch
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪
♪
14:55 - 16:35 16:50 - …
Event Time
Processing
Time
18. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Out-Of-Order Data …
♪ ♫ ♪ Event Time
Processing
Time
19. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Out-Of-Order Data …
♪ ♫ ♪ ♪ ♪ ♫
♪ ♪ ♫
Event Time
Processing
Time
20. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Out-Of-Order Data …
♪ ♫ ♪ ♪ ♪ ♫
♪ ♪ ♫
Event Time
Processing
Time
21. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
... Breaks Correctness
♪ ♫ ♪ ♪ ♪ ♫ ♪ ♫ ♪ ♬ ♫
♪ ♪ ♫ ♪ ♫ ♪ ♬ ♫
♪
Event Time
Processing
Time
22. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Problems
FILES,
BATCHES,
DATA LAKES
23. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Solving Streaming Problem With Batch?
24. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
3rd
- Streaming-First Architecture
25. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
User Session Windows
♪Case A ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
Case B ♪ ♪ ♪ ♪ ♪ ♪
Session gap
eg. 15
minutes
♪
♪♪
5
26. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
User Session Windows
♪Case A ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
Case B ♪ ♪ ♪ ♪ ♪ ♪
Session gap
eg. 15
minutes
♪
♪♪
5
[3,2]
27. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Reading From Kafka
val sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪♪ ♪ ♪ ♪ ♪ ♪ ♪
28. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Session Windows With Gap
val sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪
User 1
User 2
29. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Session Windows With Gap
val sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
User 1 ♪ ♪ ♪ ♪ ♪ ♪
Session gap
- 15 minutes
♪♪
30. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Analyzing User Session
val sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
.apply(new CountSessionStats())
User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪
31. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Handling Late Events
val sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
.allowedLateness(Time.minutes(60))
.apply(new CountSessionStats())
User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪ ♪
32. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Triggering Early Results
val sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
.trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
.allowedLateness(Time.minutes(60))
.apply(new CountSessionStats())
User 1 ♪ ♪ ♪ ♪ ♪ ♪ ♪♪
33. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Sessionization Example
val sessionStream : DataStream[SessionStats] = sEnv
.addSource(new KafkaConsumer(...))
.keyBy(_.userId)
.window(EventTimeSessionWindows.withGap(Time.minutes(15)))
.trigger(EarlyTriggeringTrigger.every(Time.minutes(10)))
.allowedLateness(Time.minutes(60))
.apply(new CountSessionStats())
Working example:
https://github.com/getindata/flink-use-case
34. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Modern Stream Processing Engines
■ Rich stream processing semantic
● Built-in support for event-time windows
● Accurate results for late / out-of-order events and replays
● Early triggers
■ Low latency and high-throughput
■ Exactly-once stateful processing
35. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Modern Stream Processing Engines
■ Rich stream processing semantic
● Built-in support for event-time windows
● Accurate results for late / out-of-order events and replays
● Early triggers
■ Low latency and high-throughput
■ Exactly-once stateful processing
User survey:
http://data-artisans.com/flink-user-survey-2016-part-1
http://data-artisans.com/flink-user-survey-2016-part-2
36. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
37. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
How can I reprocess data?
38. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Reprocessing Events In Flink
1. Take periodic snapshots of a job
● It stores Kafka offsets, on-flight sessions, application state
2. Restart a job from a savepoint rather than from a beginning
39. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
What if data is no longer in Kafka?
40. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Consuming Data From HDFS
■ Run your streaming code on HDFS (bounded data)
● You need to read data in event-time based order
● Implement mechanism of proper watermark generation
41. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
What are usual stream processing
applications?
42. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Stream Analytics
Image source: https://www.slideshare.net/sinisalyh/storm-at-spotify
43. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Stream 24/7 Applications
44. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
When is batch processing good?
45. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Batch Processing Use-Cases
■ Ad-hoc analytics and data exploration
● Notebooks, Spark/Flink/Hive, Parquet, complete data sets
■ Technical advantages
● A large swaths of historical data in HDFS
● High-level libraries in mature batch technologies
46. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Batch Processing Use-Cases
■ Ad-hoc analytics and data exploration
● Notebooks, Spark/Flink/Hive, Parquet, complete data sets
■ Implementation advantages
● Offline experiments over large historical data
■ Historical events are usually stored in HDFS, not Kafka
● High-level libraries in batch processing technologies
■ Spark MLlib, H2O
(when data arrives continuously)
don’t solve
streaming problem
with batch jobs
47. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Who Are You, actually?
■ At GetInData, we build custom Big Data solutions
● Hadoop, Flink, Spark, Kafka and more
■ Our team is today represented by
Krzysztof
Zarzycki
Dawid
Wysakowicz
Adam
Kawa
48. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
■ Stream often the natural representation of your data
■ Stream processing is not only about low latency
Summary
49. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Q&A
50. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Thanks !
51. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
52. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Log Abstraction
11:00 -
12:00
12:00 -
13:00
…
…
10:00 - …
10:00 - …
10:00 -
11:00
53. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Spark Structured Streaming
⬇ Operates on top of micro-batches (Spark SQL engine)
■ The ALPHA version and the experimental API until July 11,
2017
⬆ Easy-to-learn API (Dataset/DataFrame)
⬆ Rich ecosystem of tools and libraries e.g. MLlib
⬆ Supports event-time
⬇ Sessionization not yet supported - SPARK-10816
⬇ Queryable state not yet supported - SPARK-16738
54. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Kafka Streams
⬇ No exactly-once (just at-least-once)
⬇ Kafka as the only data source
⬇ No bounded streams (batch) optimizations
⬆ Simplicity
⬆ Embedded into application
⬆ Supports event-time
⬇ Lack of session windows
55. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Apache Beam
⬆ Unified API for batch and streaming
⬆ Rich streaming processing semantics
⬆ Complex TriggerDSL
⬆ Multiple runtime environments
⬆ Spark, Flink, Apex, Dataflow
⬆ Side inputs and outputs
⬇ Verbose Java API
⬇ New project - Top level since 01/2017
56. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Google Dataflow
■ Runtime environment for Apache Beam in Google Cloud
⬇ No support for Iterative Computations
⬆ Supports Side Outputs
⬆ Works with every Google Cloud Service (Pub/Sub, BigTable
etc.)
57. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
How to join with other data
sets/streams?
58. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Join With Other Datasets / Streams
■ Flink can join windowed streams easily
■ Join of data stream with data set is WIP
● Even with slowly changing data set!
● Even keyed data
Stream 2
Stream 1
Joined Stream Input Stream Joined Stream
+
Id Name
1 John Doe
2 Jane Doe
Dataset
+
59. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
I like this streaming API.
Can I use it for batch?
60. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Unified batch and streaming API
■ Not with raw Flink API
■ But with Flink Table API
■ Apache Beam
61. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Is Flink production ready?
62. © Copyright. All rights reserved. Not to be reproduced without prior written consent.
Powered By Flink