Independent of the source of the data, the integration of event streams into an Enterprise Architecture is becoming more and more important in a world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams in HDFS or a NoSQL datastore is feasible and no longer much of a challenge. But if you want to react fast, with minimal latency, you cannot afford to first store the data and do the analysis/analytics later; you have to include part of your analytics right after you consume the event streams. Products for event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the last 3 years, another family of products has appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open-source products/frameworks such as Apache Storm, Spark Streaming and Apache Samza, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations of Event and Stream Processing, show what differences you might find between the more traditional CEP and the more modern Stream Processing solutions, and show that a combination of both brings the most value.
2. Guido Schmutz
Working for Trivadis for more than 19 years
Oracle ACE Director for Fusion Middleware and SOA
Co-Author of different books
Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
Member of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 25 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Twitter: gschmutz
7. Big Data Definition (4 Vs)
Characteristics of Big Data: its Volume, Velocity and Variety in combination
+ Time to action? => Big Data + Real-Time = Stream Processing
8. The world is changing …
The model of Generating/Consuming Data has changed ….
Old Model: few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
9. Who is generating Big Data?
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize and discover knowledge from the collected data in a timely manner and in a scalable fashion
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
10. Traditional Data Processing - Challenges
• Introduces too much “decision latency”
• Responses are delivered “after the fact”
• Maximum value of the identified situation is lost
• Decisions are made on old and stale data
• “Data at Rest”
11. The New Era: Streaming Data Analytics / Fast Data
• Events are analyzed and processed in real-time as they arrive
• Decisions are timely, contextual and based on fresh data
• Decision latency is eliminated
• “Data in Motion”
12. Real Time Analytics Use Cases
• Algorithmic Trading
• Online Fraud Detection
• Geo Fencing
• Proximity/Location Tracking
• Intrusion detection systems
• Traffic Management
• Recommendations
• Churn detection
• Internet of Things (IoT) / Intelligent Sensors
• Social Media/Data Analytics
• Gaming Data Feed
• …
14. Internet of Things – Sensors are/will be everywhere
There are more devices tapping into the internet than people on earth
How do we prepare our systems/architecture for the future?
Source: Cisco; Source: The Economist
15. Different Types of Stream/Event Processing
Simple Event Processing (SEP)
Event Stream Processing (ESP)
16. Different Types of Stream/Event Processing
Complex Event Processing (CEP)
17. Native Streaming vs. Micro-Batching
Native Streaming
• Events processed as they arrive
• + low latency
• - lower throughput
• - fault tolerance is expensive
Micro-Batching
• Splits the incoming stream into small batches
• + high(er) throughput
• + easier fault tolerance
• - higher latency
Source: Distributed Real-Time Stream Processing: Why and How by Petr Zapletal
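The latency/throughput trade-off above can be illustrated with a tiny, framework-free Python sketch (the batch size and the doubling function are arbitrary toy choices, not taken from any of the products discussed):

```python
def native_streaming(events, process):
    """Process each event immediately as it arrives (low latency)."""
    return [process(e) for e in events]

def micro_batching(events, process, batch_size=3):
    """Collect events into small batches, then process each batch at once
    (higher throughput per scheduling decision, but an event waits
    until its batch is complete)."""
    results, batch = [], []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            results.extend(process(x) for x in batch)
            batch = []
    if batch:  # flush the last, possibly partial batch
        results.extend(process(x) for x in batch)
    return results

events = [1, 2, 3, 4, 5, 6, 7]
double = lambda x: x * 2
# Same results either way; what differs is *when* each event is processed
assert native_streaming(events, double) == micro_batching(events, double)
```

With micro-batching, the first event of a batch waits for up to batch_size - 1 peers before anything is processed; that waiting is exactly the extra latency the slide refers to.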
18. How to design a Streaming Analytics Solution?
Three basic pipeline shapes:
• Event Stream → Data Ingestion → Persist (Queue)
• Event Stream → Data Ingestion → Analytics → result
• Event Stream → combined Data Ingestion/Analytics → result
19. Demo Use Case – Truck Sensors
Trucks send Movement events through Data Ingestion to a Geo-Fencing processor (emitting NEAR/ENTER events) and a Reckless Driving Detector (emitting Reckless Driver events); results feed a Truck Driver Dashboard.
Raw movement event (pipe-delimited):
2016-06-02 14:39:56.605|98|27|Mark Lochbihler|803014426|Wichita to Little Rock Route 2|Normal|38.65|-90.21|5187297736652502631
Movement event (JSON):
{"timestamp": "2016-06-02 14:39:56.991", "truckId": 99, "driverId": 31, "driverName": "Rommel Garcia", "routeId": 1565885487, "routeName": "Springfield to KC Via Hanibal", "eventType": "Normal", "latitude": 37.16, "longitude": "-94.46", "correlationId": 5187297736652502631}
21. How to design a Streaming Analytics System?
It usually starts very simple … just one data pipeline:
Event Stream → Data Ingestion → Analytics
22. New Event Stream sources are added …
Event Stream, 2nd Event Stream, 3rd Event Stream, … nth Event Stream: each new stream gets its own Data Ingestion (2nd, 3rd, … Nth) feeding events into the Analytics.
23. New Processors are interested in the events …
A 2nd Analytics consumer now also reads the events from the existing Data Ingestion pipelines.
24. … and the solution becomes the problem
With 2nd, 3rd, … Nth Analytics consumers each wired directly to the 2nd, 3rd, … Nth Data Ingestion pipelines, the number of point-to-point connections explodes.
26. … and the solution becomes the problem
A concrete example: New Customers (via CDC Ingestion), Operational Logs (via Log Ingestion), Click Stream (via Click Stream Ingestion) and Meter Readings (via Sensor Ingestion), consumed by Hadoop/Data Warehouse, Recommendation System, Log Search and Fraud Detection: a tangle of direct connections.
27. Decouple event streams from consumers
Introduce a „Unified Log“ (remember the Enterprise Service Bus (ESB)?): Event Stream Ingestion (CDC, Log, Click Stream and Sensor Ingestion of New Customers, Operational Logs, Click Stream and Meter Readings) publishes into an Enterprise Event Bus, and the Event Stream Analytics consumers (Hadoop/Data Warehouse, Recommendation System, Log Search, Fraud Detection) subscribe to it.
What is the idea of a Unified Log?
28. Unified Log – What is it?
By Unified Log, we do not mean this ….
137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114
137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 -
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809
… but this:
• a structured log (records are numbered beginning with 0, based on the order they are written)
• aka commit log or journal
The log holds records at offsets 0 … 11; the 1st record sits at offset 0 and the next record is written at the end.
29. Central Unified Log for (real-time) subscription
Take all the organization’s data (events) and put it into a central log for subscription
Properties of the Unified Log:
• Unified: “Enterprise”, single deployment
• Append-Only: events are appended, no update in place => immutable
• Ordered: each event has an offset, which is unique within a shard
• Fast: should be able to handle thousands of messages / sec
• Distributed: lives on a cluster of machines
A Collector writes records to the log (offsets 0 … 11); Consumer System A reads at offset 6 and Consumer System B at offset 10, each tracking its own position.
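The properties above can be made concrete with a minimal Python sketch of such a log (a toy model to illustrate offsets and independent consumers, not the Kafka implementation):

```python
class UnifiedLog:
    """A minimal append-only, ordered log: records get sequential offsets
    and are never updated in place (immutable)."""
    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Read up to max_records starting at the given offset."""
        return self._records[offset:offset + max_records]

log = UnifiedLog()
for event in ["e0", "e1", "e2", "e3"]:
    log.append(event)

# Two independent consumers, each tracking its own read position
consumer_a_offset = 1          # consumer A is behind ...
consumer_b_offset = 3          # ... consumer B is almost caught up
assert log.read(consumer_a_offset) == ["e1", "e2", "e3"]
assert log.read(consumer_b_offset) == ["e3"]
```

Because the log is immutable and ordered, adding a new consumer never affects existing ones; it simply starts reading at its own offset.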
31. Apache Kafka - Overview
Distributed publish-subscribe messaging system
Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, …)
Initially developed at LinkedIn, now part of Apache
Does not use the JMS API and standards
Kafka maintains feeds of messages in topics
Producers publish messages to the Kafka Cluster; Consumers subscribe to them
32. Apache Kafka - Motivation
LinkedIn’s motivation for Kafka was:
• “A unified platform for handling all the real-time data feeds a large company might
have.”
Must haves
• High throughput to support high volume event feeds.
• Support real-time processing of these feeds to create new, derived feeds.
• Support large data backlogs to handle periodic ingestion from offline systems.
• Support low-latency delivery to handle more traditional messaging use cases.
• Guarantee fault-tolerance in the presence of machine failures.
36. Apache Kafka - Partition offsets
Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
• Consumers track their pointers via (offset, partition, topic) tuples
37. Apache Kafka - Performance
Kafka at LinkedIn => over 1100 brokers / 60 clusters:
• 800 billion messages/day
• 175 TB produced/day, 650 TB consumed/day
• 13 million messages/second and 2.75 GB/second at the busiest time of day
Kafka performance on our own setup => 6 brokers (VM) / 1 cluster:
• 445’622 messages/second
• 31 MB/second
• 3.0405 ms average latency between producer and consumer
http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
https://engineering.linkedin.com/kafka/running-kafka-scale
38. Demo Use Case – Truck Sensors
42. StreamSets Data Collector
• Founded by ex-Cloudera, Informatica employees
• Continuous open source, intent-driven, big data ingest
• Visible, record-oriented approach fixes combinatorial explosion
• Batch or stream processing
• Standalone, Spark cluster, MapReduce cluster
• IDE for pipeline development by ‘civilians’
• Relatively new: first public release September 2015
• So far, the vast majority of commits are from StreamSets staff
43. Apache NiFi
• Originated at the NSA as Niagarafiles
• Open sourced December 2014, Apache TLP July 2015
• Opaque, file-oriented payload
• Distributed system of processors with centralized control
• Based on flow-based programming concepts
• Data Provenance
• Web-based user interface
44. Demo Use Case – Truck Sensors
49. History of Oracle Stream Analytics
Product lineage: BEA WebLogic Event Server → Oracle Complex Event Processing (OCEP) → Oracle Event Processing (OEP) → Oracle Stream Explorer (SX) → Oracle Stream Analytics (OSA)
Related: Oracle Event Processing for Java Embedded, Oracle Edge Analytics (OEA), Oracle CQL, Oracle IoT Cloud Service
Timeline: 2007, 2008, 2012, 2013, 2015, 2016
50. OEA – Oracle Stream Analytics: From Noise to Value
Edge Analytics runs on devices/gateways at the computing edge (the FOG, a “sea of data”); Stream Analytics runs on services in the enterprise, turning raw events into macro-events that are high-value, actionable and in-context.
Edge Analytics:
• Filtering, Correlation, Aggregation, Pattern matching
• High Volume, Continuous Streaming, Extreme Low Latency
• Disparate Sources, Temporal Processing, Pattern Matching, Machine Learning
Stream Analytics:
• High Volume, Continuous Streaming, Sub-Millisecond Latency
• Disparate Sources, Time-Window Processing, Pattern Matching
• High Availability / Scalability, Coherence Integration
• Geospatial, Geofencing, Big Data Integration
• Business Event Visualization, Action!
51. Oracle Stream Analytics Platform
What it does
• Compelling, friendly and visually stunning real-time streaming analytics user experience for business users to dynamically create and implement Instant Insight solutions
Key Features
• Analyze simulated or live data feeds to determine event patterns, correlation, aggregation & filtering
• Pattern library for industry-specific solutions
• Streams, References, Maps & Explorations
Benefits
• Accelerated delivery time
• Hides all challenges & complexities of the underlying real-time event-driven infrastructure
52. Oracle Stream Analytics - Connecting Everything & Anything of Interest to the Business
Understanding of CQL Filtering, Correlation, Pattern: NOT NEEDED
Understanding of IT Deployment and Management: NOT NEEDED
Understanding of Development, Java, Best Practices: NOT NEEDED
Understanding of the Event Driven Platform: NOT NEEDED
53. Business accessibility to Geo-Streaming Analytics
Real-time streaming solutions face an increasing need to track “assets of interest” and initiate actions based on encroachment of boundaries, proximity to fixed and moving objects, and other geographic, temporal, or event conditions.
Key concepts: Geo-Fence, Fence, Polygon; Geo-Streaming
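A geo-fence check like the NEAR/ENTER events in the demo can be sketched in a few lines of Python. This is a toy, framework-free illustration: the circular fence, the radius values and the use of the haversine distance are my assumptions, not the Oracle Stream Analytics implementation:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    r = 6371.0  # mean earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geofence_event(lat, lon, fence_lat, fence_lon, radius_km, near_km):
    """Classify a position against a circular geo-fence:
    ENTER when inside the fence, NEAR when within near_km of it."""
    d = haversine_km(lat, lon, fence_lat, fence_lon)
    if d <= radius_km:
        return "ENTER"
    if d <= radius_km + near_km:
        return "NEAR"
    return None

# Hypothetical 5 km fence around the demo truck's reported position
assert geofence_event(38.65, -90.21, 38.65, -90.21, 5, 10) == "ENTER"
assert geofence_event(38.70, -90.21, 38.65, -90.21, 5, 10) == "NEAR"
```

Real products use polygons rather than circles, but the pattern is the same: each incoming movement event is evaluated against the fences, and only boundary transitions are emitted downstream.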
58. Stream Analytics – Terminology for Business Users
Explorer: the application user interface
Catalog: the repository for browsing resources
59. Stream Analytics – Terminology for Business Users
Stream: incoming flow of events that you want to analyze (CSV, Kafka, JMS, REST, MQTT, …)
Exploration: application that correlates events from streams and data sources, using filters, groupings, summaries, ranges, and more
60. Stream Analytics – Terminology for Business Users
Shape: a blueprint of an event in a stream or of data in a data source; how the business data is represented in the selected stream
Map: collection of geo-fences
Reference: a connection to static data that is joined to a stream to enrich it and/or to be used in business logic and output
61. Stream Analytics – Terminology for Business Users
Pattern: a pre-built Exploration that addresses a particular business scenario in a focused and simplified user interface
Connection: collection of metadata required to connect to an external system
Target: defines an interface with a downstream system
62. Demo Use Case – Truck Sensors
68. Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 in UC Berkeley’s AMPLab
• Based on the 2007 Microsoft Dryad paper
• Written in Scala; supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
• Open sourced in 2010; part of the Apache Software Foundation since 2014
85. Discretized Stream (DStream)
The Input Stream is divided along time into micro-batches: the messages arriving in each interval (time 1, time 2, time 3, …, time n) become one RDD, and the sequence of these RDDs forms the Event DStream. A transformation such as map() yields a MappedDStream, whose RDD @time i holds f(message 1), f(message 2), …, f(message n) for that interval's messages and produces result 1, result 2, …, result n. DStream transformations only build up a lineage; actions such as saveAsHadoopFiles() trigger the actual Spark jobs, one per batch interval.
Adapted from Chris Fregly: http://slidesha.re/11PP7FV
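The DStream model boils down to "a function mapped over a sequence of micro-batches". A minimal Python analogy of that idea (not the Spark API, just the concept; each inner list plays the role of one interval's RDD):

```python
def dstream_map(batches, f):
    """Apply f to every message of every micro-batch, like DStream.map():
    the result is again a sequence of per-interval batches."""
    return [[f(m) for m in batch] for batch in batches]

# Messages arriving in three batch intervals (time 1, time 2, time 3)
input_stream = [["m1", "m2"], ["m3"], ["m4", "m5"]]

mapped = dstream_map(input_stream, str.upper)
assert mapped == [["M1", "M2"], ["M3"], ["M4", "M5"]]
```

Note that nothing crosses a batch boundary: each interval is processed as its own small job, which is exactly why micro-batching gives easy fault tolerance (rerun the batch) at the cost of latency (wait for the interval to close).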
86. Demo Use Case – Truck Sensors
88. Apache Storm
A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
• highly distributed real-time computation system
• provides general primitives to do real-time computation
• simplifies working with queues & workers
• scalable and fault-tolerant
Originated at BackType, acquired by Twitter in 2011
Open sourced late 2011
Part of Apache since September 2013
89. Apache Storm – Core concepts
Tuple
• Immutable set of key/value pairs
Stream
• an unbounded sequence of tuples that can be processed in parallel by Storm
Topology
• Wires data and functions via a DAG (directed acyclic graph)
• Executes on many machines, similar to a MapReduce job in Hadoop
Spout
• Source of data streams (tuples)
• can be run in “reliable” and “unreliable” mode
Bolt
• Consumes 1+ streams and produces new streams
• Complex operations often require multiple steps and thus multiple bolts
Example topology: one Spout is the source of stream A, another of stream B; one Bolt subscribes to A and emits C, another subscribes to A and emits D, a third subscribes to A & B, and a final Bolt subscribes to C & D.
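The spout/stream/bolt wiring can be mimicked with plain Python generators. This is a conceptual toy, not the Storm API; the word-count topology is the classic introductory example:

```python
def sentence_spout():
    """Spout: source of a (conceptually unbounded) stream of tuples."""
    for sentence in ["the truck moves", "the truck stops"]:
        yield sentence

def split_bolt(stream):
    """Bolt: consumes one stream of sentences, emits a new stream of words."""
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    """Bolt: terminal step, aggregates word counts."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Chaining spout -> bolt -> bolt forms the (here linear) DAG of a mini topology
counts = count_bolt(split_bolt(sentence_spout()))
assert counts["the"] == 2 and counts["truck"] == 2
```

What Storm adds on top of this picture is distribution: each spout and bolt runs as many parallel tasks across machines, with the stream groupings of the next slides deciding which task receives which tuple.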
90. Demo Use Case – Truck Sensors
91. Apache Storm – How does it work?
A Truck Movement spout emits the movement events, which are distributed via Shuffle Grouping across the parallel instances of a Geo Hashing bolt.
Input tuple:
{ "timestamp" : "2016-06-02 12:56:02.362", "truckId" : 35, "driverId" : 26, "driverName" : "Michael Aube", "routeId" : 1090292248, "eventType" : "Normal", "latitude" : 40.86, "longitude" : "-89.91"}
The Geo Hashing bolt enriches the tuple with the computed cell, e.g. “geohash” : “dp206n3d“
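Geohashing itself is a public, well-defined algorithm: repeatedly bisect the longitude and latitude ranges, interleave the resulting bits (longitude first), and encode every 5 bits as one base-32 character. A compact Python version, independent of Storm:

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash base-32 alphabet

def geohash(lat, lon, precision=8):
    """Encode a lat/lon position as a geohash string of the given length."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, chars = 0, 0, []
    even = True  # geohash starts with a longitude bit
    while len(chars) < precision:
        rng, value = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:           # upper half -> bit 1
            bits = (bits << 1) | 1
            rng[0] = mid
        else:                      # lower half -> bit 0
            bits <<= 1
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:         # 5 bits = one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# The slide's truck position falls into the "dp2..." cell
assert geohash(40.86, -89.91).startswith("dp2")
```

Nearby positions share a geohash prefix, so grouping tuples by geohash keeps events from the same area together, which is useful for the geo-fencing step downstream.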
94. Apache Storm – Core concepts
Each Spout or Bolt runs N instances in parallel (e.g. 1st … nth GeoHashing and GeoFencing bolt instances consuming the Truck Movement stream).
Stream groupings control how tuples are distributed among the bolt tasks:
• Shuffle grouping: random grouping
• Fields grouping: grouped by value, such that equal values go to the same task
• All grouping: replicates to all tasks
• Global grouping: makes all tuples go to one task
• None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
• Direct grouping: the producer (the task that emits) controls which consumer will receive
• Local or shuffle grouping: similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; falls back to shuffle grouping behavior
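The difference between the two most common groupings is easy to see in a toy Python model: shuffle assigns tuples to tasks at random, while fields grouping hashes the key so that equal values always land on the same task (the event fields and task counts here are invented for illustration):

```python
import random

def shuffle_grouping(tuples, n_tasks, seed=42):
    """Shuffle grouping: spread tuples (pseudo-)randomly over the tasks."""
    rnd = random.Random(seed)
    tasks = [[] for _ in range(n_tasks)]
    for t in tuples:
        tasks[rnd.randrange(n_tasks)].append(t)
    return tasks

def fields_grouping(tuples, n_tasks, key):
    """Fields grouping: equal key value always lands on the same task."""
    tasks = [[] for _ in range(n_tasks)]
    for t in tuples:
        tasks[hash(t[key]) % n_tasks].append(t)
    return tasks

events = [{"truckId": i % 3, "speed": 60 + i} for i in range(9)]

# With fields grouping on truckId, each task only ever sees one truckId,
# so per-truck state (e.g. a reckless-driving window) stays local to a task
for task in fields_grouping(events, 4, "truckId"):
    assert len({e["truckId"] for e in task}) <= 1
```

This is why stateful per-key operations need a fields grouping, while stateless steps can use shuffle grouping for even load distribution.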
98. How to scale a Streaming Analytics System?
Scale out every stage independently: run multiple Collecting processes (Process 1, Process 2) against the Event Stream, and run each processing step (Processing A, Processing B) as multiple processes, each with multiple threads (Thread 1 … Thread n) consuming events from their own queues (Q1 … Qn).
99. How to make a Streaming Analytics System reliable?
Faults and stragglers are inevitable in large clusters running big data applications
Streaming applications must recover from them quickly
[Diagram: the scaled-out collecting/processing pipeline of the previous slide, shown before and after a process failure]
100. How to deal with “Stragglers”?
A consumer goes slow; every option hurts:
• Drop data? No thanks
• Backpressure? Other jobs grind to a halt
• Queue up? Run out of memory
• Spill to disk?
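Backpressure can be sketched as a bounded queue between a fast producer and a slow consumer: nothing is dropped and memory stays bounded, but the producer is forced to wait. A toy Python simulation (rates and limits are arbitrary assumptions):

```python
from collections import deque

def run_pipeline(events, queue_limit=3, consumer_rate=1, producer_rate=2):
    """Toy backpressure: a bounded queue between producer and consumer.
    When the queue is full, the producer must wait instead of dropping
    data or queueing up without bound."""
    queue = deque()
    pending = list(events)
    consumed, waits = [], 0
    while pending or queue:
        # producer tries to enqueue up to producer_rate events per tick
        for _ in range(producer_rate):
            if pending and len(queue) < queue_limit:
                queue.append(pending.pop(0))
            elif pending:
                waits += 1          # backpressure: producer blocked this tick
        # the slow consumer drains consumer_rate events per tick
        for _ in range(consumer_rate):
            if queue:
                consumed.append(queue.popleft())
    return consumed, waits

consumed, waits = run_pipeline(range(10))
assert consumed == list(range(10))   # nothing dropped, order preserved
assert waits > 0                     # but the producer had to wait
```

The wait count is the cost the slide alludes to: backpressure propagates upstream, so a single slow consumer can throttle everything feeding into it.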
101. How to make a Streaming Analytics System reliable?
Solution 1: using an active/passive system (hot replication)
• Both systems process the full load, each keeping its state in-memory and/or on-disk
• In case of a failure, automatically switch and use the “passive” system
• Stragglers slow down both the active and the passive system
102. How to make a Streaming Analytics System reliable?
Solution 2: Upstream backup
• Nodes buffer sent messages and replay them to a new node in case of failure
• Stragglers are treated as failures
(State = state in-memory and/or on-disk; Buffer = buffer for replay in-memory and/or on-disk)
103. Message Delivery Semantics
At most once [0,1]
• Messages may be lost
• Messages are never redelivered
At least once [1..n]
• Messages will never be lost
• but messages may be redelivered (might be OK if the consumer can handle it)
Exactly once [1]
• Messages are never lost
• Messages are never redelivered
• Perfect message delivery
• Incurs higher latency for transactional semantics
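The difference between at-most-once and at-least-once comes down to when the consumer commits its offset relative to processing. A toy Python simulation of a consumer that crashes mid-stream and restarts (the crash logic is a deliberately simplified assumption):

```python
def consume(messages, commit_before_processing, crash_at=None):
    """Committing the offset BEFORE processing gives at-most-once
    (a message can be lost on a crash); committing AFTER processing
    gives at-least-once (a message can be processed twice)."""
    processed, offset = [], 0
    crashed = False
    while offset < len(messages):
        if commit_before_processing:          # at-most-once
            committed = offset + 1            # commit first ...
            if not crashed and offset == crash_at:
                crashed = True
                offset = committed            # restart from committed offset:
                continue                      # ... message crash_at is lost
            processed.append(messages[offset])
            offset = committed
        else:                                 # at-least-once
            if not crashed and offset == crash_at:
                processed.append(messages[offset])  # processed ...
                crashed = True                # ... but crashed before the
                continue                      # commit, so it is redelivered
            processed.append(messages[offset])
            offset += 1
    return processed

msgs = ["m0", "m1", "m2"]
assert consume(msgs, commit_before_processing=True, crash_at=1) == ["m0", "m2"]
assert consume(msgs, commit_before_processing=False, crash_at=1) == ["m0", "m1", "m1", "m2"]
```

Exactly-once needs processing and offset commit to happen atomically (a transaction), which is where the extra latency the slide mentions comes from.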
105. “Traditional Architecture” for Big Data
Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) → Channel → Data Collection → Stage / Raw Data (Reservoir) → Batch compute → Result Store (Computed Information) → Query Engine → Data Consumers (Reports, Service, Analytic Tools, Alerting Tools)
Only the first hop is data in motion; everything after collection works on data at rest.
106. Streaming Analytics Architecture for Big Data (aka (Complex) Event Processing)
Data Sources (Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine) → Channel → Data Collection → Messaging → (Analytical) Real-Time Data Processing (Stream/Event Processing, with its own Result Stores) → Messaging / Result Store → Data Consumers (Reports, Service, Analytic Tools, Alerting Tools)
The events stay in motion from collection to consumption.
107. Keep raw event data
Same streaming architecture as before, but the collected raw events are additionally persisted into a Raw Data (Reservoir) so that (Analytical) Batch Data Processing can run over the full history, in parallel to the (Analytical) Real-Time Data Processing (Stream/Event Processing).
108. “Lambda Architecture” for Big Data
Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) → Channel → Data Collection, feeding two layers in parallel:
• (Analytical) Batch Data Processing: Raw Data (Reservoir) → Batch compute → Result Store (Computed Information)
• (Analytical) Real-Time Data Processing: Messaging → Stream/Event Processing → Result Store
A Query Engine merges both result stores to answer the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools).
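The Lambda idea of merging a complete-but-stale batch view with a fresh speed view at query time can be sketched in a few lines of Python (hypothetical per-user sums; not tied to any product):

```python
def batch_view(raw_events):
    """Batch layer: complete but slow recomputation over all raw data."""
    view = {}
    for user, amount in raw_events:
        view[user] = view.get(user, 0) + amount
    return view

def speed_view(recent_events):
    """Speed layer: the same aggregation, but only over the events
    that arrived after the last batch run."""
    return batch_view(recent_events)

def query(user, batch, speed):
    """Serving layer: merge the pre-computed batch view with the
    real-time view to answer with fresh data."""
    return batch.get(user, 0) + speed.get(user, 0)

historical = [("alice", 10), ("bob", 5), ("alice", 7)]
recent = [("alice", 3)]  # arrived after the last batch run

assert query("alice", batch_view(historical), speed_view(recent)) == 20
assert query("bob", batch_view(historical), speed_view(recent)) == 5
```

The price of Lambda is visible even in this toy: the same aggregation logic has to exist twice, once in the batch layer and once in the speed layer, which is exactly the duplication the Kappa Architecture on the next slide tries to eliminate.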
109. “Kappa Architecture” for Big Data
Data Sources (Social, Logfiles, Sensor, RDBMS, ERP, Mobile, Machine) → Data Collection → Messaging, where the message log itself acts as the “Raw Data Reservoir”. Both the (Analytical) Real-Time Data Processing (Stream/Event Processing) and any batch compute read from this log; the results (Computed Information) land in Result Stores for the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools). Reprocessing means replaying the log rather than maintaining a separate batch layer.
110. “Unified Architecture” for Big Data
Like the Lambda Architecture, but the (Analytical) Batch Data Processing calculates Prediction Models from the incoming data, and the (Analytical) Real-Time Data Processing (Stream/Event Processing) applies those models to the live event stream. The results of both layers (Result Store, Computed Information) are merged by a Query Engine for the Data Consumers (Reports, Service, Analytic Tools, Alerting Tools).
112. Summary
More and more use cases (such as IoT) make Streaming Analytics necessary
Treat events as events! Infrastructures for handling lots of events are available!
Platforms such as Oracle Stream Analytics enable the business to work directly on streaming data (empower the business analyst) => the user experience of an Excel sheet on streaming data
Platforms such as Apache Storm and Apache Spark Streaming provide a highly scalable and fault-tolerant infrastructure for streaming analytics => Oracle Stream Analytics can use Spark Streaming as the runtime infrastructure
Platforms such as Kafka provide a high-volume event broker infrastructure, a.k.a. Event Hub
113. Comparison
                           Oracle Stream Analytics                    Spark Streaming                       Apache Storm
Community                  n.a.                                       > 280 contributors                    > 100 contributors
Language Options           Java, CQL                                  Java, Scala, Python                   Java, Clojure, Scala, …
Processing Model           Event-Streaming                            Micro-Batching                        Event-Streaming
Processing DSL             Yes                                        Yes                                   No
Stateful Ops               Yes                                        Yes                                   No
Pattern Detection          Yes                                        No                                    No
Scalability & Reliability  limited                                    yes                                   yes
Distributed RPC            No                                         No                                    Yes
Delivery Guarantees        At Least Once                              Exactly Once                          At Most Once / At Least Once
Latency                    sub-second                                 seconds                               sub-second
”Self-service” for Biz     Yes                                        No                                    No
Platform                   OEP server, Spark Streaming (YARN, Mesos)  YARN, Mesos Standalone, DataStax EE   Storm Cluster, YARN