High-volume event streams (traditional network data, media, IoT sensor data, activity events on social media, etc.) are becoming widespread in the telecom industry. In particular, live analysis of telco log files and performance metrics allows network operators to observe the status of the system and identify possible problems using online aggregations and machine-learning algorithms. (Offline batch analysis of streams with tools like MapReduce is often too slow to react to what is happening right now, which makes it a poor fit for this task.)
Ignacio Manuel Mulas Viela and Nicolas Seyvet demonstrate an analytics pipeline setup for a telco use case that processes an unbounded dataset of logs and performance metrics. Raw data, logs, and cloud telemetry information are extracted from a production cloud infrastructure using Collectd, OpenStack Ceilometer, and Logstash. This is piped into a distributed messaging system, Kafka, then analyzed by Apache Flink—a distributed stream analysis framework capable of analyzing thousands of messages per second and extracting insights that can be monitored by humans—and visualized using the ELK (Elasticsearch, Logstash, Kibana) stack.
Ignacio and Nicolas discuss the challenges and benefits of building an analytics pipeline following the Kappa architecture paradigm using the aforementioned tools and demonstrate Kappa’s value through an example use case. The use case analyzes and extracts statistical information from a stream of data and uses machine-learning techniques to develop an advanced anomaly detector, using two online machine-learning algorithms implemented on top of Flink: the online k-means detector and the Bayesian detector.
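The online k-means detector mentioned above runs inside Flink in the actual pipeline; as a stand-alone illustration of the underlying idea, here is a minimal sketch. The cluster positions, decaying learning rate, and distance threshold are illustrative choices, not values from the talk.

```python
# Minimal online k-means anomaly detector: centroids are updated one
# sample at a time; a point far from every centroid is flagged.
import math

class OnlineKMeans:
    def __init__(self, centroids, threshold):
        self.centroids = [list(c) for c in centroids]
        self.counts = [1] * len(centroids)
        self.threshold = threshold

    def observe(self, point):
        # Find the nearest centroid by Euclidean distance.
        dists = [math.dist(c, point) for c in self.centroids]
        i = min(range(len(dists)), key=dists.__getitem__)
        anomaly = dists[i] > self.threshold
        # Move the winning centroid toward the point (decaying step size),
        # so the model keeps adapting to the stream.
        self.counts[i] += 1
        eta = 1.0 / self.counts[i]
        self.centroids[i] = [c + eta * (p - c)
                             for c, p in zip(self.centroids[i], point)]
        return anomaly

detector = OnlineKMeans(centroids=[(0.0, 0.0), (10.0, 10.0)], threshold=3.0)
normal = detector.observe((0.5, 0.2))    # near a centroid -> not anomalous
outlier = detector.observe((50.0, 50.0)) # far from both -> anomalous
```

Because the update is one sample at a time, the same logic fits naturally into a stateful stream operator.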
2. Ericsson Internal | 2011-10-19 | Page 4
Once Upon A Time…
Flink Forward | Ignacio Mulas | 12-October-2015
Flink Meetup | Ignacio Mulas | 26-November-2015
Strata London | Ignacio Mulas & Nicolas Seyvet | 3-June-2016
“I want an advanced real-time analytics system to
monitor my cloud infrastructure.”
… By your most precious client
› Data source
– Events (metrics, logs) from physical and virtual servers
› Analytics:
– Real-time
– Statistical analysis
– Anomaly or novelty detection
High Level View
…
Data source
Analytics
› Bounded: a start and an end
– Finite; ingestion stops
› Unbounded: a start but no end
– Infinite, ever-growing
Data Set
Bounded: t0, t1, t2, t3, …, tn
Unbounded: t0, t1, t2, t3, …, t∞
› Twitter’s Nathan Marz
› But
– Two independent pipelines
– Complex maintenance
– Complex merge
Lambda Architecture
Kappa Architecture
› New model to abstract data processing
– MillWheel, Spark Streaming, Dataflow, Stratosphere (Flink)
› Stream engines
› Correctness
– Strong consistency
– Exactly-once processing
› Resilience, fault tolerance
› Tools that can deal with time *
› APIs
The (Short) Evolution
Principles
Kappa Architecture
› Everything is a stream
› Immutable data sources
› Single analytics framework
› Stream replay
› Stream representation
– Unbounded dataset composed of a sequence of events
› Data pipeline:
– Sequence of transformations on an unbounded data set that generates another set with more insightful data
– UNIX pipes
Basics
…
Pub/Sub
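The UNIX-pipe analogy above can be sketched with lazy iterators: each stage is a transformation over an (in principle unbounded) stream of events, and nothing runs until a consumer pulls. The field names and the alert threshold below are hypothetical, for illustration only.

```python
# A data pipeline as composed transformations over an iterator of events.
import itertools

def parse(lines):
    # First transformation: raw text -> structured events.
    for line in lines:
        host, value = line.split(",")
        yield {"host": host, "value": float(value)}

def enrich(events):
    # Second transformation: add insight (here, a toy alert flag).
    for e in events:
        e["alert"] = e["value"] > 90.0  # illustrative threshold
        yield e

raw = iter(["web1,42.0", "web2,97.5", "web1,13.2"])  # stand-in for Kafka
pipeline = enrich(parse(raw))        # composed; nothing runs yet
results = list(itertools.islice(pipeline, 3))
```

Like a shell pipe, each stage consumes the previous stage's output incrementally, so the composition works the same whether the input ever ends or not.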
Our Stack
[Diagram: data sources → Kafka (raw data) → Flink (Analytics job 1, Analytics job 2, …) → per-job results back to Kafka → Elasticsearch → Kibana]
First Data Pipeline
[Diagram: raw data → statistical analysis → enriched data → dashboard]
› Event time, which is when an event occurred
› Processing time, which is when an event is observed in the system
Time
[Diagram: processing time vs. event time; ideally the two coincide ("reality"), but in practice arrival lags and skew accumulates]
› Time drifts
› Unordered events
Event Time
[Diagram: events e0…e3 occur at times t0…t3 and enter the system tagged with processing time, <tp0,e0>…<tp3,e3>; an EventTimeExtractor() with enableTimestamps() re-tags them with event time, <te0,e0>…<te3,e3>; window() then groups them into windows w0, w1, w2 over execution time, with a watermark per window]
e: event; tp: processing time; te: event time
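The effect shown here, grouping by event time regardless of arrival order, can be sketched without any framework. The tumbling-window size is illustrative, and this toy version assumes all events have arrived (no watermark logic).

```python
# Group out-of-order events into tumbling windows by *event* time,
# ignoring processing (arrival) order.
from collections import defaultdict

def tumbling_windows(events, size):
    """events: (event_time, payload) pairs in arrival order."""
    windows = defaultdict(list)
    for te, payload in events:
        # Window index depends only on the event timestamp.
        windows[te // size].append(payload)
    return dict(windows)

# Arrival order != event-time order: e2 is late.
arrivals = [(1, "e0"), (3, "e1"), (11, "e3"), (2, "e2")]
wins = tumbling_windows(arrivals, size=10)
# e0, e1 and the late e2 all land in window 0; e3 in window 1.
```

In a real stream engine the watermark decides when a window like w0 can be closed despite possible stragglers; here that concern is simply elided.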
2nd Client meeting…
“I want an advanced real-time analytics system to
monitor my cloud infrastructure.”
… By your most precious client
It is nice, but… I cannot look at thousands of
numbers simultaneously, can you do better?
› Machine learning
– Automatically detect anomalies in the infrastructure
– Learn using raw and advanced metrics
› … add a new transformation to my unbounded data!
Advanced Data Pipeline
[Diagram: data source → stats → ML analytics → …]
› Unsupervised machine learning
› Create a statistical model for “normal” behavior
– Poisson: count-based parameters
– Gaussian: value-based parameters
› Model adapts over time
Bayesian Detector
[Diagram: metric over time, with regions labeled OK and ANOMALY]
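A minimal sketch of the Gaussian (value-based) half of such a detector: a running mean and variance model "normal" behavior and adapt over time, using Welford's online update. The 3-sigma decision rule is an illustrative choice, not necessarily the talk's.

```python
# Online Gaussian anomaly detector with an adapting model.
import math

class GaussianDetector:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def observe(self, x):
        # Score against the *current* model first...
        verdict = "OK"
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) > 3 * std:
                verdict = "ANOMALY"
        # ...then fold the sample in (Welford's update), so the notion
        # of "normal" keeps adapting as the stream evolves.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return verdict

det = GaussianDetector()
for v in [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.1, 9.9]:
    det.observe(v)            # learn "normal" around 10
verdict = det.observe(500.0)  # far outside the learned distribution
```

The Poisson (count-based) case follows the same pattern, with the rate parameter updated online instead of mean and variance.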
Log-Frequency Novelty Detector
[Diagram: events in each time window are mapped to template frequencies (Frequency_1 … Frequency_n) and compared against history; Phase 1: LEARN! builds the frequency model, Phase 2: DETECT! flags each window as OK or NOK]
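The LEARN/DETECT phases above can be sketched as follows, assuming log lines have already been reduced to templates. Keeping a min/max frequency envelope per template is an illustrative modeling choice, not necessarily the one used in the talk.

```python
# Log-frequency novelty detector: learn per-window template frequencies,
# then flag windows whose frequencies fall outside the learned range.
from collections import Counter

class FrequencyDetector:
    def __init__(self):
        self.low = Counter()   # lowest count seen per template
        self.high = Counter()  # highest count seen per template

    def learn(self, window):
        # Phase 1: LEARN! Widen the per-template frequency envelope.
        counts = Counter(window)
        for template in set(self.low) | set(counts):
            c = counts[template]
            if template not in self.low:
                self.low[template] = self.high[template] = c
            else:
                self.low[template] = min(self.low[template], c)
                self.high[template] = max(self.high[template], c)

    def detect(self, window):
        # Phase 2: DETECT! Any template outside its envelope -> NOK.
        counts = Counter(window)
        for template in set(self.low) | set(counts):
            if not (self.low[template] <= counts[template] <= self.high[template]):
                return "NOK"
        return "OK"

det = FrequencyDetector()
det.learn(["login", "login", "heartbeat"])
det.learn(["login", "heartbeat"])
ok = det.detect(["login", "login", "heartbeat"])   # within history
nok = det.detect(["error"] * 5 + ["heartbeat"])    # novel template
```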
Multi-Variable Detector
[Diagram: .keyBy(host) splits the stream so each host (hi, hk, hm) keeps its own detector state and windows from t0]
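The per-host partitioning that .keyBy(host) provides in Flink can be sketched with a dictionary of per-key states. The detector here is a toy running sum with a hypothetical threshold; the point is only that each key carries independent state.

```python
# Route each event to per-key state so every host is scored by its own
# detector, mimicking keyed stateful processing.
from collections import defaultdict

def key_by(events, key, make_state, update):
    states = defaultdict(make_state)
    out = []
    for e in events:
        k = key(e)
        # update returns (new_state, verdict) for this key only.
        states[k], verdict = update(states[k], e)
        out.append((k, verdict))
    return out

events = [{"host": "h1", "v": 1}, {"host": "h2", "v": 9}, {"host": "h1", "v": 2}]
results = key_by(
    events,
    key=lambda e: e["host"],
    make_state=int,  # per-host state starts at 0
    update=lambda s, e: (s + e["v"], "ALERT" if s + e["v"] > 8 else "OK"),
)
```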
Improved Data Pipeline
[Diagram: raw data → statistical analysis → enriched data → Bayesian novelty detector → anomalies → dashboard]
3rd Client meeting
“I want an advanced real-time analytics system to
monitor my cloud infrastructure.”
… By your most precious client
Great! I can now spot when and where changes occur… I'll buy it!
› Tools, abstractions and APIs unifying stream/batch
› Consistency, resiliency, fault-tolerance
› Event time handling
› Kappa architecture simplifies Big Data
– One stack, many pipelines (batch/stream)
– Flexible/extensible architecture
› Machine learning can be applied on unbounded data sets
– Treated as a complex transformation
– Some caveats
Summary
Please feel free to contact us if you have suggestions/comments/questions:
ignacio.mulas.viela@ericsson.com / @immulvi
nicolas.seyvet@ericsson.com / @NicolasSeyvet
Thank you!
Editor's Notes
2015-09-30
Ericsson is a telecommunications equipment supplier.
Ever growing volumes of data, shorter time constraints and increasing needs for accuracy are defining the new analytics environment. In the telecom industry, traditional user and network data co-exists with machine-to-machine (M2M) traffic, media data, social activities, etc. In terms of volumes, this can be referred to as an “explosion” of data. This is a great business opportunity for Telco operators and a key angle to take full advantage of current infrastructure investments (4G, LTE). Add some animations with trucks and sensors, etc.
Ericsson is moving to cloud and running Virtual Network Functions, which are basically cloud-based telco applications for the core network.
There are OSS and monitoring systems on the market but how could we do this better?
Build a data pipeline to stream events and perform real-time analytics in order to eventually do some machine learning.
The story is that, in general, there are two kinds of data sets: either it is a bunch of data, i.e., there is a beginning and an end to it, or it is infinite.
https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101, The term “streaming” is used today to mean a variety of different things (and for simplicity, I’ve been using it somewhat loosely up until now), which can lead to misunderstandings about what streaming really is, or what streaming systems are actually capable of. As such, I would prefer to define the term somewhat precisely.
The crux of the problem is that many things that ought to be described by what they are (e.g., unbounded data processing, approximate results, etc.), have come to be described colloquially by how they historically have been accomplished (i.e., via streaming execution engines). This lack of precision in terminology clouds what streaming really means, and in some cases burdens streaming systems themselves with the implication that their capabilities are limited to characteristics frequently described as “streaming,” such as approximate or speculative results. Given that well-designed streaming systems are just as capable (technically more so) of producing correct, consistent, repeatable results as any existing batch engine, the term is better reserved for a type of data processing engine designed with infinite data sets in mind.
The principles: Bounding unbounded data with windows
We use the term unbounded data for an infinite, ever-growing data stream, and the term bounded data for a data stream that happens to have a beginning and an end (data ingestion stops after a while). It is clear that the notion of an unbounded data stream includes (is a superset of) the notion of a bounded data set:
Streaming applications create bounded data from unbounded data using windows, i.e., creating bounds using some characteristic of the data, most prominently the timestamps of events. For example, one can choose to create a window of all records that belong to the same session (a session being defined as a period of activity followed by a period of inactivity). The simplest form of a window, when we know the input is bounded, is to include all the data in one window. Let's call this a “global window”. This way, we have created a streaming program that does “batch processing”:
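The session windows described above can be sketched directly: a new session starts whenever the gap since the previous event exceeds the inactivity threshold. The gap value is illustrative.

```python
# Group event timestamps into session windows: a session ends after a
# period of inactivity longer than `gap`.
def session_windows(timestamps, gap):
    sessions, current = [], []
    for t in sorted(timestamps):
        if current and t - current[-1] > gap:
            sessions.append(current)  # inactivity -> close the session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

sessions = session_windows([1, 2, 3, 20, 21, 50], gap=5)
# -> [[1, 2, 3], [20, 21], [50]]
```

With `gap` set to infinity this degenerates into one window containing everything, i.e., the "global window" that turns a streaming program into batch processing.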
Early streaming systems suffered from efficiency problems (record-by-record event processing and ACKs). This led to the belief that a streaming layer can only complement a batch system, or that hybrids of streaming and batching (micro-batching) are required for efficiency.
Lambda advocates using a batch system for the heavy lifting, augmented with a streaming system that keeps up with data ingestion and produces early but incomplete (approximate) results. Serving logic then tries to merge the two sets of results: the streaming system gives you low-latency but inaccurate results (either because it uses an approximation algorithm or because it does not itself provide correctness), and some time later a batch system rolls along and provides the correct output.
Well known.
Merge -> synchronization problems.
The Lambda architecture had well-known disadvantages, in particular that the merging process was often painful, as was the fact that two separate codebases that express the same logic need to be maintained.
Later, Jay Kreps advocated that only one system, the stream processor, should be used for the entirety of data transformations, drastically simplifying the whole architecture:
https://www.oreilly.com/ideas/questioning-the-lambda-architecture
Given that well-designed streaming systems are just as capable (technically more so) of producing correct, consistent, repeatable results as any existing batch engine, I prefer to isolate the term streaming to a very specific meaning: a type of data processing engine that is designed with infinite data sets in mind.
Early streaming systems suffered from efficiency problems due to design choices that sacrificed throughput, in particular, record-by-record event processing and acknowledgement
Spark lineage in batch vs check-pointing. Something that is easy to do with batch is much harder with streams.
Tools for reasoning about time
Mostly correct is not good enough: exactly-once processing is required for repeatable results, and without repeatable results streaming cannot replace batch. Correctness: this gets you parity with batch.
http://data-artisans.com/why-apache-beam/
Taking a cue from this foundational work, we rewrote Flink's DataStream API in Flink 0.10 to incorporate many of the concepts described in the Dataflow paper, moving away from the old Flink 0.9 DataStream API. We retained this API with Flink 1.0 and made it stable and backwards compatible.
As you can see from these tables, Flink is the runner which currently fulfills those requirements. With Flink, Beam becomes a truly compelling platform for the industry.
Lambda -> batch + processing
Kappa -> everything is a stream
Immutable data sources -> deterministic results and the possibility to generate different views (see Martin Kleppmann). You store the logs, so you store the events and their sequence.
Whenever the pipeline evolves, you can replay the sequence of events to restore the state of the computations and restart.
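The replay idea above can be sketched as a fold over the stored events: the log never changes, and a new pipeline version simply re-runs its transformation over the same sequence. The two "views" below are hypothetical pipeline versions, purely for illustration.

```python
# Rebuild state by replaying an immutable event log through a transform.
def replay(log, transform, initial_state):
    state = initial_state
    for event in log:
        state = transform(state, event)
    return state

log = [("cpu", 40), ("cpu", 80), ("cpu", 60)]  # stored once, never mutated

# v1 of the pipeline tracked the maximum load seen...
v1 = replay(log, lambda s, e: max(s, e[1]), 0)
# ...v2 counts high-load events instead; same log, different view.
v2 = replay(log, lambda s, e: s + (e[1] >= 60), 0)
```

Because the log is the source of truth, both views are deterministic and can be regenerated at any time, which is what makes evolving a Kappa pipeline safe.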
Single analytics framework: you have transformations and operators to perform analytics.
Add some content
Make it better, change image
Can take a subset. The stream is the main abstraction over the dataset: a list of ordered events, a single object with a representation and a set of operators.
Put distributed data source with Logstash
Frequency, gradients, median, std dev,
To do this we needed to take into consideration aspects like time, since this is an unbounded data set.
How to deal with time?
Batch slices datasets into bounded data sets, then computes.
But how to deal with late events, events that might never arrive, latency in the network leading to a distortion between expected time and actual arrival time?
New data will arrive, old data may be retracted or updated. Any system should be able to cope with these facts on its own, with completeness being a convenient optimization rather than a semantic necessity.
Varying event time skew, meaning it is not possible to expect most of the data for a given event time X within some constant e of time Y (Y= X + e).