Independent of the source of the data, the integration of event streams into an Enterprise Architecture is becoming more and more important in a world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams in HDFS or a NoSQL datastore is feasible and no longer much of a challenge. But if you want to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later; you have to include part of your analytics right after you consume the event streams. Products for event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the last 3 years, another family of products has appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open-source products/frameworks such as Apache Storm, Spark Streaming and Apache Samza, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations of Event and Stream Processing, show what differences you might find between the more traditional CEP and the more modern Stream Processing solutions, and show that a combination of both brings the most value.
2. Guido Schmutz
Working for Trivadis for more than 19 years
Oracle ACE Director for Fusion Middleware and SOA
Co-Author of different books
Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
Member of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 25 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Twitter: gschmutz
7. Big Data Definition (4 Vs)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Time to action? => Big Data + Real-Time = Stream Processing
8. The world is changing …
The model of Generating/Consuming Data has changed ….
Old Model: few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
9. Who is generating Big Data?
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize and discover knowledge from the collected data in a timely manner and in a scalable fashion
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
10. Traditional Data Processing - Challenges
• Introduces too much “decision latency”
• Responses are delivered “after the fact”
• Maximum value of the identified situation is lost
• Decisions are made on old and stale data
• “Data at Rest”
11. The New Era: Streaming Data Analytics / Fast Data
• Events are analyzed and processed in real-time as they arrive
• Decisions are timely, contextual and based on fresh data
• Decision latency is eliminated
• “Data in Motion”
12. Real Time Analytics Use Cases
• Algorithmic Trading
• Online Fraud Detection
• Geo Fencing
• Proximity/Location Tracking
• Intrusion detection systems
• Traffic Management
• Recommendations
• Churn detection
• Internet of Things (IoT) / Intelligent Sensors
• Social Media/Data Analytics
• Gaming Data Feed
• …
14. Internet of Things – Sensors are/will be everywhere
There are more devices tapping into the internet than people on earth
How do we prepare our systems/architecture for the future?
Source: Cisco; Source: The Economist
15. Different Types of Stream/Event Processing
Simple Event Processing (SEP)
Event Stream Processing (ESP)
16. Different Types of Stream/Event Processing
Complex Event Processing (CEP)
17. Native Streaming vs. Micro-Batching
Native Streaming
• Events processed as they arrive
• + low latency
• - lower throughput
• - fault tolerance is expensive
Micro-Batching
• Splits the incoming stream into small batches
• + high(er) throughput
• + easier fault tolerance
• - higher latency
Source: Distributed Real-Time Stream Processing: Why and How by Petr Zapletal
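The latency/throughput trade-off above can be illustrated with a tiny, framework-free Python sketch (the batch size and the doubling function are arbitrary toy choices, not taken from any of the products discussed):

```python
def native_streaming(events, process):
    """Process each event immediately as it arrives (low latency)."""
    return [process(e) for e in events]

def micro_batching(events, process, batch_size=3):
    """Collect events into small batches, then process each batch at once
    (higher throughput per scheduling decision, but an event waits
    until its batch is complete)."""
    results, batch = [], []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            results.extend(process(x) for x in batch)
            batch = []
    if batch:  # flush the last, possibly partial batch
        results.extend(process(x) for x in batch)
    return results

events = [1, 2, 3, 4, 5, 6, 7]
double = lambda x: x * 2
# Same results either way; what differs is *when* each event is processed
assert native_streaming(events, double) == micro_batching(events, double)
```

With micro-batching, the first event of a batch waits for up to batch_size - 1 peers before anything is processed; that waiting is exactly the extra latency the slide refers to.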
18. How to design a Streaming Analytics Solution?
Three basic pipeline shapes:
• Event Stream → Data Ingestion → Persist (Queue)
• Event Stream → Data Ingestion → Analytics → result
• Event Stream → combined Data Ingestion/Analytics → result
19. Demo Use Case – Truck Sensors
Trucks send Movement events through Data Ingestion to a Geo-Fencing processor (emitting NEAR/ENTER events) and a Reckless Driving Detector (emitting Reckless Driver events); results feed a Truck Driver Dashboard.
Raw movement event (pipe-delimited):
2016-06-02 14:39:56.605|98|27|Mark Lochbihler|803014426|Wichita to Little Rock Route 2|Normal|38.65|-90.21|5187297736652502631
Movement event (JSON):
{"timestamp": "2016-06-02 14:39:56.991", "truckId": 99, "driverId": 31, "driverName": "Rommel Garcia", "routeId": 1565885487, "routeName": "Springfield to KC Via Hanibal", "eventType": "Normal", "latitude": 37.16, "longitude": "-94.46", "correlationId": 5187297736652502631}
21. How to design a Streaming Analytics System?
It usually starts very simple … just one data pipeline:
Event Stream → Data Ingestion → Analytics
22. New Event Stream sources are added …
Event Stream, 2nd Event Stream, 3rd Event Stream, … nth Event Stream: each new stream gets its own Data Ingestion (2nd, 3rd, … Nth) feeding events into the Analytics.
23. New Processors are interested in the events …
A 2nd Analytics consumer now also reads the events from the existing Data Ingestion pipelines.
24. … and the solution becomes the problem
With 2nd, 3rd, … Nth Analytics consumers each wired directly to the 2nd, 3rd, … Nth Data Ingestion pipelines, the number of point-to-point connections explodes.
26. … and the solution becomes the problem
A concrete example: New Customers (via CDC Ingestion), Operational Logs (via Log Ingestion), Click Stream (via Click Stream Ingestion) and Meter Readings (via Sensor Ingestion), consumed by Hadoop/Data Warehouse, Recommendation System, Log Search and Fraud Detection: a tangle of direct connections.
27. Decouple event streams from consumers
Introduce a „Unified Log“ (remember the Enterprise Service Bus (ESB)?): Event Stream Ingestion (CDC, Log, Click Stream and Sensor Ingestion of New Customers, Operational Logs, Click Stream and Meter Readings) publishes into an Enterprise Event Bus, and the Event Stream Analytics consumers (Hadoop/Data Warehouse, Recommendation System, Log Search, Fraud Detection) subscribe to it.
What is the idea of a Unified Log?
28. Unified Log – What is it?
By Unified Log, we do not mean this ….
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114
137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 -
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809
… but this:
• a structured log (records are numbered beginning with 0, based on the order they are written)
• aka commit log or journal
The log holds records at offsets 0 … 11; the 1st record sits at offset 0 and the next record is written at the end.
29. Central Unified Log for (real-time) subscription
Take all the organization’s data (events) and put it into a central log for subscription
Properties of the Unified Log:
• Unified: “Enterprise”, single deployment
• Append-Only: events are appended, no update in place => immutable
• Ordered: each event has an offset, which is unique within a shard
• Fast: should be able to handle thousands of messages / sec
• Distributed: lives on a cluster of machines
A Collector writes records to the log (offsets 0 … 11); Consumer System A reads at offset 6 and Consumer System B at offset 10, each tracking its own position.
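The properties above can be made concrete with a minimal Python sketch of such a log (a toy model to illustrate offsets and independent consumers, not the Kafka implementation):

```python
class UnifiedLog:
    """A minimal append-only, ordered log: records get sequential offsets
    and are never updated in place (immutable)."""
    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Read up to max_records starting at the given offset."""
        return self._records[offset:offset + max_records]

log = UnifiedLog()
for event in ["e0", "e1", "e2", "e3"]:
    log.append(event)

# Two independent consumers, each tracking its own read position
consumer_a_offset = 1          # consumer A is behind ...
consumer_b_offset = 3          # ... consumer B is almost caught up
assert log.read(consumer_a_offset) == ["e1", "e2", "e3"]
assert log.read(consumer_b_offset) == ["e3"]
```

Because the log is immutable and ordered, adding a new consumer never affects existing ones; it simply starts reading at its own offset.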
31. Apache Kafka - Overview
Distributed publish-subscribe messaging system
Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, …)
Initially developed at LinkedIn, now part of Apache
Does not use the JMS API and standards
Kafka maintains feeds of messages in topics
Producers publish messages to the Kafka Cluster; Consumers subscribe to them
32. Apache Kafka - Motivation
LinkedIn’s motivation for Kafka was:
• “A unified platform for handling all the real-time data feeds a large company might
have.”
Must haves
• High throughput to support high volume event feeds.
• Support real-time processing of these feeds to create new, derived feeds.
• Support large data backlogs to handle periodic ingestion from offline systems.
• Support low-latency delivery to handle more traditional messaging use cases.
• Guarantee fault-tolerance in the presence of machine failures.
36. Apache Kafka - Partition offsets
Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
• Consumers track their pointers via (offset, partition, topic) tuples
37. Apache Kafka - Performance
Kafka at LinkedIn => over 1100 brokers / 60 clusters:
• 800 billion messages/day
• 175 TB produced/day, 650 TB consumed/day
• 13 million messages/second and 2.75 GB/second at the busiest time of day
Kafka performance on our own setup => 6 brokers (VM) / 1 cluster:
• 445’622 messages/second
• 31 MB/second
• 3.0405 ms average latency between producer and consumer
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
https://engineering.linkedin.com/kafka/running-kafka-scale
38. Demo Use Case – Truck Sensors
42. StreamSets Data Collector
• Founded by ex-Cloudera, Informatica employees
• Continuous open source, intent-driven, big data ingest
• Visible, record-oriented approach fixes combinatorial explosion
• Batch or stream processing
• Standalone, Spark cluster, MapReduce cluster
• IDE for pipeline development by ‘civilians’
• Relatively new: first public release September 2015
• So far, the vast majority of commits are from StreamSets staff
43. Apache NiFi
• Originated at the NSA as Niagarafiles
• Open sourced December 2014, Apache TLP July 2015
• Opaque, file-oriented payload
• Distributed system of processors with centralized control
• Based on flow-based programming concepts
• Data Provenance
• Web-based user interface
44. Demo Use Case – Truck Sensors
49. History of Oracle Stream Analytics
Product lineage: BEA WebLogic Event Server → Oracle Complex Event Processing (OCEP) → Oracle Event Processing (OEP) → Oracle Stream Explorer (SX) → Oracle Stream Analytics (OSA)
Related: Oracle Event Processing for Java Embedded, Oracle Edge Analytics (OEA), Oracle CQL, Oracle IoT Cloud Service
Timeline: 2007, 2008, 2012, 2013, 2015, 2016
50. OEA – Oracle Stream Analytics: From Noise to Value
Edge Analytics runs on devices/gateways at the computing edge (the FOG, a “sea of data”); Stream Analytics runs on services in the enterprise, turning raw events into macro-events that are high-value, actionable and in-context.
Edge Analytics:
• Filtering, Correlation, Aggregation, Pattern matching
• High Volume, Continuous Streaming, Extreme Low Latency
• Disparate Sources, Temporal Processing, Pattern Matching, Machine Learning
Stream Analytics:
• High Volume, Continuous Streaming, Sub-Millisecond Latency
• Disparate Sources, Time-Window Processing, Pattern Matching
• High Availability / Scalability, Coherence Integration
• Geospatial, Geofencing, Big Data Integration
• Business Event Visualization, Action!
51. Oracle Stream Analytics Platform
What it does
• Compelling, friendly and visually stunning real-time streaming analytics user experience for business users to dynamically create and implement Instant Insight solutions
Key Features
• Analyze simulated or live data feeds to determine event patterns, correlation, aggregation & filtering
• Pattern library for industry-specific solutions
• Streams, References, Maps & Explorations
Benefits
• Accelerated delivery time
• Hides all challenges & complexities of the underlying real-time event-driven infrastructure
52. Oracle Stream Analytics - Connecting Everything & Anything of Interest to the Business
Understanding of CQL Filtering, Correlation, Pattern: NOT NEEDED
Understanding of IT Deployment and Management: NOT NEEDED
Understanding of Development, Java, Best Practices: NOT NEEDED
Understanding of the Event Driven Platform: NOT NEEDED
53. Business accessibility to Geo-Streaming Analytics
Real-time streaming solutions face an increasing need to track “assets of interest” and initiate actions based on encroachment of boundaries, proximity to fixed and moving objects, and other geographic, temporal, or event conditions.
Key concepts: Geo-Fence, Fence, Polygon; Geo-Streaming
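A geo-fence check like the NEAR/ENTER events in the demo can be sketched in a few lines of Python. This is a toy, framework-free illustration: the circular fence, the radius values and the use of the haversine distance are my assumptions, not the Oracle Stream Analytics implementation:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    r = 6371.0  # mean earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geofence_event(lat, lon, fence_lat, fence_lon, radius_km, near_km):
    """Classify a position against a circular geo-fence:
    ENTER when inside the fence, NEAR when within near_km of it."""
    d = haversine_km(lat, lon, fence_lat, fence_lon)
    if d <= radius_km:
        return "ENTER"
    if d <= radius_km + near_km:
        return "NEAR"
    return None

# Hypothetical 5 km fence around the demo truck's reported position
assert geofence_event(38.65, -90.21, 38.65, -90.21, 5, 10) == "ENTER"
assert geofence_event(38.70, -90.21, 38.65, -90.21, 5, 10) == "NEAR"
```

Real products use polygons rather than circles, but the pattern is the same: each incoming movement event is evaluated against the fences, and only boundary transitions are emitted downstream.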
58. Stream Analytics – Terminology for Business Users
Explorer: the application user interface
Catalog: the repository for browsing resources
59. Stream Analytics – Terminology for Business Users
Stream: incoming flow of events that you want to analyze (CSV, Kafka, JMS, REST, MQTT, …)
Exploration: application that correlates events from streams and data sources, using filters, groupings, summaries, ranges, and more
60. Stream Analytics – Terminology for Business Users
Shape: a blueprint of an event in a stream or of data in a data source; how the business data is represented in the selected stream
Map: collection of geo-fences
Reference: a connection to static data that is joined to a stream to enrich it and/or to be used in business logic and output
61. Stream Analytics – Terminology for Business Users
Pattern: a pre-built Exploration that addresses a particular business scenario in a focused and simplified user interface
Connection: collection of metadata required to connect to an external system
Target: defines an interface with a downstream system
62. Demo Use Case – Truck Sensors
68. Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 in UC Berkeley’s AMPLab
• Based on the 2007 Microsoft Dryad paper
• Written in Scala; supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
• Open sourced in 2010; part of the Apache Software Foundation since 2014
85. Discretized Stream (DStream)
The Input Stream is divided along time into micro-batches: the messages arriving in each interval (time 1, time 2, time 3, …, time n) become one RDD, and the sequence of these RDDs forms the Event DStream. A transformation such as map() yields a MappedDStream, whose RDD @time i holds f(message 1), f(message 2), …, f(message n) for that interval's messages and produces result 1, result 2, …, result n. DStream transformations only build up a lineage; actions such as saveAsHadoopFiles() trigger the actual Spark jobs, one per batch interval.
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
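The DStream model boils down to "a function mapped over a sequence of micro-batches". A minimal Python analogy of that idea (not the Spark API, just the concept; each inner list plays the role of one interval's RDD):

```python
def dstream_map(batches, f):
    """Apply f to every message of every micro-batch, like DStream.map():
    the result is again a sequence of per-interval batches."""
    return [[f(m) for m in batch] for batch in batches]

# Messages arriving in three batch intervals (time 1, time 2, time 3)
input_stream = [["m1", "m2"], ["m3"], ["m4", "m5"]]

mapped = dstream_map(input_stream, str.upper)
assert mapped == [["M1", "M2"], ["M3"], ["M4", "M5"]]
```

Note that nothing crosses a batch boundary: each interval is processed as its own small job, which is exactly why micro-batching gives easy fault tolerance (rerun the batch) at the cost of latency (wait for the interval to close).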
86. Demo Use Case – Truck Sensors
88. Apache Storm
A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
• highly distributed real-time computation system
• provides general primitives to do real-time computation
• simplifies working with queues & workers
• scalable and fault-tolerant
Originated at BackType, acquired by Twitter in 2011
Open sourced late 2011
Part of Apache since September 2013
89. Apache Storm – Core concepts
Tuple
• Immutable set of key/value pairs
Stream
• an unbounded sequence of tuples that can be processed in parallel by Storm
Topology
• Wires data and functions via a DAG (directed acyclic graph)
• Executes on many machines, similar to a MapReduce job in Hadoop
Spout
• Source of data streams (tuples)
• can be run in “reliable” and “unreliable” mode
Bolt
• Consumes 1+ streams and produces new streams
• Complex operations often require multiple steps and thus multiple bolts
Example topology: one Spout is the source of stream A, another of stream B; one Bolt subscribes to A and emits C, another subscribes to A and emits D, a third subscribes to A & B, and a final Bolt subscribes to C & D.
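The spout/stream/bolt wiring can be mimicked with plain Python generators. This is a conceptual toy, not the Storm API; the word-count topology is the classic introductory example:

```python
def sentence_spout():
    """Spout: source of a (conceptually unbounded) stream of tuples."""
    for sentence in ["the truck moves", "the truck stops"]:
        yield sentence

def split_bolt(stream):
    """Bolt: consumes one stream of sentences, emits a new stream of words."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: terminal step, aggregates word counts."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Chaining spout -> bolt -> bolt forms the (here linear) DAG of a mini topology
counts = count_bolt(split_bolt(sentence_spout()))
assert counts["the"] == 2 and counts["truck"] == 2
```

What Storm adds on top of this picture is distribution: each spout and bolt runs as many parallel tasks across machines, with the stream groupings of the next slides deciding which task receives which tuple.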
90. Demo Use Case – Truck Sensors
91. Apache Storm – How does it work?
A Truck Movement spout emits the movement events, which are distributed via Shuffle Grouping across the parallel instances of a Geo Hashing bolt.
Input tuple:
{ "timestamp" : "2016-06-02 12:56:02.362", "truckId" : 35, "driverId" : 26, "driverName" : "Michael Aube", "routeId" : 1090292248, "eventType" : "Normal", "latitude" : 40.86, "longitude" : "-89.91"}
The Geo Hashing bolt enriches the tuple with the computed cell, e.g. “geohash” : “dp206n3d“
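Geohashing itself is a public, well-defined algorithm: repeatedly bisect the longitude and latitude ranges, interleave the resulting bits (longitude first), and encode every 5 bits as one base-32 character. A compact Python version, independent of Storm:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash base-32 alphabet

def geohash(lat, lon, precision=8):
    """Encode a lat/lon position as a geohash string of the given length."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, chars = 0, 0, []
    even = True  # geohash starts with a longitude bit
    while len(chars) < precision:
        rng, value = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:           # upper half -> bit 1
            bits = (bits << 1) | 1
            rng[0] = mid
        else:                      # lower half -> bit 0
            bits <<= 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:         # 5 bits = one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# The slide's truck position falls into the "dp2..." cell
assert geohash(40.86, -89.91).startswith("dp2")
```

Nearby positions share a geohash prefix, so grouping tuples by geohash keeps events from the same area together, which is useful for the geo-fencing step downstream.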
94. Apache Storm – Core concepts
Each Spout or Bolt runs N instances in parallel (e.g. 1st … nth GeoHashing and GeoFencing bolt instances consuming the Truck Movement stream).
Stream groupings control how tuples are distributed among the bolt tasks:
• Shuffle grouping: random grouping
• Fields grouping: grouped by value, such that equal values go to the same task
• All grouping: replicates to all tasks
• Global grouping: makes all tuples go to one task
• None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
• Direct grouping: the producer (the task that emits) controls which consumer will receive
• Local or shuffle grouping: similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; falls back to shuffle grouping behavior
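The difference between the two most common groupings is easy to see in a toy Python model: shuffle assigns tuples to tasks at random, while fields grouping hashes the key so that equal values always land on the same task (the event fields and task counts here are invented for illustration):

```python
import random

def shuffle_grouping(tuples, n_tasks, seed=42):
    """Shuffle grouping: spread tuples (pseudo-)randomly over the tasks."""
    rnd = random.Random(seed)
    tasks = [[] for _ in range(n_tasks)]
    for t in tuples:
        tasks[rnd.randrange(n_tasks)].append(t)
    return tasks

def fields_grouping(tuples, n_tasks, key):
    """Fields grouping: equal key value always lands on the same task."""
    tasks = [[] for _ in range(n_tasks)]
    for t in tuples:
        tasks[hash(t[key]) % n_tasks].append(t)
    return tasks

events = [{"truckId": i % 3, "speed": 60 + i} for i in range(9)]

# With fields grouping on truckId, each task only ever sees one truckId,
# so per-truck state (e.g. a reckless-driving window) stays local to a task
for task in fields_grouping(events, 4, "truckId"):
    assert len({e["truckId"] for e in task}) <= 1
```

This is why stateful per-key operations need a fields grouping, while stateless steps can use shuffle grouping for even load distribution.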
98. How to scale a Streaming Analytics System?
Scale out every stage independently: run multiple Collecting processes (Process 1, Process 2) against the Event Stream, and run each processing step (Processing A, Processing B) as multiple processes, each with multiple threads (Thread 1 … Thread n) consuming events from their own queues (Q1 … Qn).
99. How to make a Streaming Analytics System reliable?
Faults and stragglers are inevitable in large clusters running big data applications
Streaming applications must recover from them quickly
[Diagram: the scaled-out collecting/processing pipeline of the previous slide, shown before and after a process failure]
100. How to deal with “Stragglers”?
A consumer goes slow; every option hurts:
• Drop data? No thanks
• Backpressure? Other jobs grind to a halt
• Queue up? Run out of memory
• Spill to disk?
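Backpressure can be sketched as a bounded queue between a fast producer and a slow consumer: nothing is dropped and memory stays bounded, but the producer is forced to wait. A toy Python simulation (rates and limits are arbitrary assumptions):

```python
from collections import deque

def run_pipeline(events, queue_limit=3, consumer_rate=1, producer_rate=2):
    """Toy backpressure: a bounded queue between producer and consumer.
    When the queue is full, the producer must wait instead of dropping
    data or queueing up without bound."""
    queue = deque()
    pending = list(events)
    consumed, waits = [], 0
    while pending or queue:
        # producer tries to enqueue up to producer_rate events per tick
        for _ in range(producer_rate):
            if pending and len(queue) < queue_limit:
                queue.append(pending.pop(0))
            elif pending:
                waits += 1          # backpressure: producer blocked this tick
        # the slow consumer drains consumer_rate events per tick
        for _ in range(consumer_rate):
            if queue:
                consumed.append(queue.popleft())
    return consumed, waits

consumed, waits = run_pipeline(range(10))
assert consumed == list(range(10))   # nothing dropped, order preserved
assert waits > 0                     # but the producer had to wait
```

The wait count is the cost the slide alludes to: backpressure propagates upstream, so a single slow consumer can throttle everything feeding into it.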
101. How to make a Streaming Analytics System reliable?
Solution 1: using an active/passive system (hot replication)
• Both systems process the full load, each keeping its state in-memory and/or on-disk
• In case of a failure, automatically switch and use the “passive” system
• Stragglers slow down both the active and the passive system
102. How to make a Streaming Analytics System reliable?
Solution 2: Upstream backup
• Nodes buffer sent messages and replay them to a new node in case of failure
• Stragglers are treated as failures
(State = state in-memory and/or on-disk; Buffer = buffer for replay in-memory and/or on-disk)
103. Message Delivery Semantics
At most once [0,1]
• Messages may be lost
• Messages are never redelivered
At least once [1..n]
• Messages will never be lost
• but messages may be redelivered (might be OK if the consumer can handle it)
Exactly once [1]
• Messages are never lost
• Messages are never redelivered
• Perfect message delivery
• Incurs higher latency for transactional semantics
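The difference between at-most-once and at-least-once comes down to when the consumer commits its offset relative to processing. A toy Python simulation of a consumer that crashes mid-stream and restarts (the crash logic is a deliberately simplified assumption):

```python
def consume(messages, commit_before_processing, crash_at=None):
    """Committing the offset BEFORE processing gives at-most-once
    (a message can be lost on a crash); committing AFTER processing
    gives at-least-once (a message can be processed twice)."""
    processed, offset = [], 0
    crashed = False
    while offset < len(messages):
        if commit_before_processing:          # at-most-once
            committed = offset + 1            # commit first ...
            if not crashed and offset == crash_at:
                crashed = True
                offset = committed            # restart from committed offset:
                continue                      # ... message crash_at is lost
            processed.append(messages[offset])
            offset = committed
        else:                                 # at-least-once
            if not crashed and offset == crash_at:
                processed.append(messages[offset])  # processed ...
                crashed = True                # ... but crashed before the
                continue                      # commit, so it is redelivered
            processed.append(messages[offset])
            offset += 1
    return processed

msgs = ["m0", "m1", "m2"]
assert consume(msgs, commit_before_processing=True, crash_at=1) == ["m0", "m2"]
assert consume(msgs, commit_before_processing=False, crash_at=1) == ["m0", "m1", "m1", "m2"]
```

Exactly-once needs processing and offset commit to happen atomically (a transaction), which is where the extra latency the slide mentions comes from.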
105. “Traditional Architecture” for Big Data
Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) → Channel → Data Collection → Stage / Raw Data (Reservoir) → Batch compute → Result Store (Computed Information) → Query Engine → Data Consumers (Reports, Service, Analytic Tools, Alerting Tools)
Only the first hop is data in motion; everything after collection works on data at rest.
106. Streaming Analytics Architecture for Big Data (aka (Complex) Event Processing)
Data Sources (Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine) → Channel → Data Collection → Messaging → (Analytical) Real-Time Data Processing (Stream/Event Processing, with its own Result Stores) → Messaging / Result Store → Data Consumers (Reports, Service, Analytic Tools, Alerting Tools)
The events stay in motion from collection to consumption.
107. Keep raw event data
Same streaming architecture as before, but the collected raw events are additionally persisted into a Raw Data (Reservoir) so that (Analytical) Batch Data Processing can run over the full history, in parallel to the (Analytical) Real-Time Data Processing (Stream/Event Processing).
108. “Lambda Architecture” for Big Data
Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) → Channel → Data Collection, feeding two layers in parallel:
• (Analytical) Batch Data Processing: Raw Data (Reservoir) → Batch compute → Result Store (Computed Information)
• (Analytical) Real-Time Data Processing: Messaging → Stream/Event Processing → Result Store
A Query Engine merges both result stores to answer the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools).
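The Lambda idea of merging a complete-but-stale batch view with a fresh speed view at query time can be sketched in a few lines of Python (hypothetical per-user sums; not tied to any product):

```python
def batch_view(raw_events):
    """Batch layer: complete but slow recomputation over all raw data."""
    view = {}
    for user, amount in raw_events:
        view[user] = view.get(user, 0) + amount
    return view

def speed_view(recent_events):
    """Speed layer: the same aggregation, but only over the events
    that arrived after the last batch run."""
    return batch_view(recent_events)

def query(user, batch, speed):
    """Serving layer: merge the pre-computed batch view with the
    real-time view to answer with fresh data."""
    return batch.get(user, 0) + speed.get(user, 0)

historical = [("alice", 10), ("bob", 5), ("alice", 7)]
recent = [("alice", 3)]  # arrived after the last batch run

assert query("alice", batch_view(historical), speed_view(recent)) == 20
assert query("bob", batch_view(historical), speed_view(recent)) == 5
```

The price of Lambda is visible even in this toy: the same aggregation logic has to exist twice, once in the batch layer and once in the speed layer, which is exactly the duplication the Kappa Architecture on the next slide tries to eliminate.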
109. “Kappa Architecture” for Big Data
Data Sources (Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine) → Data Collection → Messaging, where the message log itself acts as the “Raw Data Reservoir”. Both the (Analytical) Real-Time Data Processing (Stream/Event Processing) and any batch compute read from this log; the results (Computed Information) land in Result Stores for the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools). Reprocessing means replaying the log rather than maintaining a separate batch layer.
110. “Unified Architecture” for Big Data
Like the Lambda Architecture, but the (Analytical) Batch Data Processing calculates Prediction Models from the incoming data, and the (Analytical) Real-Time Data Processing (Stream/Event Processing) applies those models to the live event stream. The results of both layers (Result Store, Computed Information) are merged by a Query Engine for the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools).
112. Summary
More and more use cases (such as IoT) make Streaming Analytics necessary
Treat events as events! Infrastructures for handling lots of events are available!
Platforms such as Oracle Stream Analytics enable the business to work directly on streaming data (empower the business analyst) => the user experience of an Excel sheet on streaming data
Platforms such as Apache Storm and Apache Spark Streaming provide a highly scalable and fault-tolerant infrastructure for streaming analytics => Oracle Stream Analytics can use Spark Streaming as the runtime infrastructure
Platforms such as Kafka provide a high-volume event broker infrastructure, a.k.a. Event Hub
113. Comparison
                           Oracle Stream Analytics                    Spark Streaming                       Apache Storm
Community                  n.a.                                       > 280 contributors                    > 100 contributors
Language Options           Java, CQL                                  Java, Scala, Python                   Java, Clojure, Scala, …
Processing Model           Event-Streaming                            Micro-Batching                        Event-Streaming
Processing DSL             Yes                                        Yes                                   No
Stateful Ops               Yes                                        Yes                                   No
Pattern Detection          Yes                                        No                                    No
Scalability & Reliability  limited                                    yes                                   yes
Distributed RPC            No                                         No                                    Yes
Delivery Guarantees        At Least Once                              Exactly Once                          At Most Once / At Least Once
Latency                    sub-second                                 seconds                               sub-second
”Self-service” for Biz     Yes                                        No                                    No
Platform                   OEP server, Spark Streaming (YARN, Mesos)  YARN, Mesos Standalone, DataStax EE   Storm Cluster, YARN