SlideShare ist ein Scribd-Unternehmen logo
1 von 42
Downloaden Sie, um offline zu lesen
Scalable Online Analytics for Monitoring
LISA15, Nov. 13, 2015, Washington, D.C.
Heinrich Hartmann, PhD, Chief Data Scientist, Circonus
I’m Heinrich
· From Mainz, Germany
· Studied Math. in Mainz, Bonn, Oxford
· PhD in Algebraic Geometry
· Sensor Data Mining at Univ. Koblenz
· Freelance IT consultant
· Chief Data Scientist at Circonus
Heinrich.Hartmann@Circonus.com
@HeinrichHartman(n)
Circonus is ...
@HeinrichHartman(n)
· monitoring and telemetry analysis platform
· scalable to millions of incoming metrics
· available as public and private SaaS
· built-in histograms, forecasting, anomaly-detection, ...
@HeinrichHartman(n)
This talk is about...
I. The Future of Monitoring
II. Patterns of Telemetry Analysis
III. Design of the Online Analytics Engine ‘Beaker’
Part I - The Future of Monitoring
Monitor this:
· 100 containers in one fleet
· 200 metrics per-container
· 50 code pushes a day
· For each push all containers are recreated
The “cloud monitoring challenge” 2015
@HeinrichHartman(n)
Line-plots work well for small numbers of nodes
@HeinrichHartman(n)
Node 1
Node 2
Node 3
Node 4
Node 5
Node 6
CPU utilization for a DB cluster. Source: Internal.
... but can get polluted easily
@HeinrichHartman(n)
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
CPU utilization for a DB cluster. Source: Internal.
“Information is not a scarce resource. Attention is.”
Herbert A. Simon
@HeinrichHartman(n) Source: http://www.circonus.com/the-future-of-monitoring-qa-with-john-allspaw/
Store Histograms and Alert on Percentiles
@HeinrichHartman(n)
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
N
90% percentile
CPU Utilization of a db service with a variable number of nodes.
@HeinrichHartman(n)
Anomaly Detection for Surfacing relevant data
Mockup of an metric overview page. Source: Circonus
Part II - Patterns of Telemetry Analysis
@HeinrichHartman(n)
Telemetry analysis comes in two forms
Online Analytics
· Anomaly detection
· Percentile alerting
· Smart alerting rules
· Smart dashboards
Offline Analytics
· Post mortem analysis
· Assisted thresholding
Stream
Processing
‘Big Data’
@HeinrichHartman(n)
New Components in Circonus
BunsenBeaker
@HeinrichHartman(n)
Beaker & Bunsen in the Circonus Architecture
metrics Ernie
Alerting Service
Snowth
Metric Store
alerts
Web-UI
Q
Beaker
Online AE
Bunsen
Offline AE
@HeinrichHartman(n)
Pattern 1: Windowing
Examples
· Supervised Machine Learning
· Fourier Transformation
· Anomaly Detection (etsy)
Remarks
· Tradeoff: window size
rich features vs. memory
· Tradeoff: overlap
latency vs. CPU
windows = window(y_stream)
def results():
for w in windows:
yield z = method(w)
@HeinrichHartman(n)
Pattern 3: Processing Units
processing_unit = {
state = ...
update = function(self, y) ... end
}
Example
· Exponential Smoothing
· Holt Winters forecasting
· Anomaly Detection (Circonus)
Remarks
· Fast updates
· Fully general
· Cost: maintains state
Circonus Anomaly Detection
@HeinrichHartman(n)
local q = 0.9
exponential_smoothing = {
s = 0,
update = function(self, y)
self.s = (y * q) + (self.s * (1 - q))
return self.s
end
}
For
A Processing Unit for Exponential Smoothing
Exponential smoothing applied to a dns-duration metric
More examples on http://heinrichhartmann.com/.../Generative-Models-for-Time-Series
Processing units are convenient
· Primitive transformation are readily implemented:
Arithmetic, Smoothing, Forecasting, ...
· Fully general. Allow window-based processing as well
· Composable. Compose several PUs to get a new one!
@HeinrichHartman(n)
Circonus Analytics Query Language
· Create your own customized processing unit from primitives
· UNIX-inspired syntax with pipes ‘|’
· Native support for histogram metrics
@HeinrichHartman(n)
@HeinrichHartman(n)
CAQL: Example 1 - Low frequency AD
Pre-process a metric before feeding into anomaly detection
metric:average(<>) | rolling:mean(30m) | anomaly_detection()
@HeinrichHartman(n)
CAQL: Example 2 - Histogram Aggregation
Histograms are first-class citizens
metric:average(<uuid>) | window:histogram(1h) | histogram:percentile(95)
Part III - The Design of the Online Analytics
Engine ‘Beaker’
1. Read messages from a message queue
2. Execute a processing units over incoming metrics
3. Publish computed values to message queue
Beaker v. 0.1: Simple stream processing
Beaker Service
output
Q
input
Q
Beaker: Basic Data Flow
Challenge 1: Rollup metrics by the minute
@HeinrichHartman(n)
· Metrics arrive asynchronously on the input queue
· PUs expect exactly one sample per minute
· Rollup logic needs to allow for
· late arrival, e.g. when broker is behind
· out of order arrival
· errors in system time (time-zones, drifts)
Consequence: No real-time processing in Beaker
· Rolling up data in 1m periods causes an average delay
of 30sec for data processing.
· Real-time threshold-based alerting still available.
· Approach: Avoid rolling up data for stateless PUs.
@HeinrichHartman(n)
Challenge 2: Multiple input slots
input metric
input metric
input metric
input metric
input metric
processing unit
processing unit
processing unit output metric
output metric
output metric
Beaker: Logical Data Flow
@HeinrichHartman(n)
Challenge 3:
Synchronize roll-ups for multiple inputs is tricky
@HeinrichHartman(n)
Challenge 4: Fault Tolerance
Definition (Birman): A service is fault tolerant if it
tolerates the failure of nodes without producing wrong
output messages.
Failed nodes must be able to recover from a crash and
rejoin the cluster.
The time to recovery should be as low as possible.
@HeinrichHartman(n) Source: K. Birman - Reliable Distributed Systems, Springer, 2005
Beaker v0.5:
· Automated restarts
a. Service Management Facilities (svcadm) in OmniOS
b. systemd or watchdogs in Linux
· But, recovery can take a long time
State has to be rebuilt from input stream.
· Need a way to recover faster from errors:
a. Persist processing unit state (software updates!)
b. Access persisted metric data
@HeinrichHartman(n)
Beaker v. 0.5: Use db to rebuild state on startup
@HeinrichHartman(n)
Beaker Service
output
Q
input
Q
Snowth db
Challenge 5: High Availability
Definition (Birman): A service is highly available it
continues to publish valid messages during a node failure
after a small reconfiguration period.
For Beaker we require a reconfiguration period of less
than 1 minute. In this time messages may be delayed (e.g.
30sec) and duplicated messages may be published.
@HeinrichHartman(n) Source: K. Birman - Reliable Distributed Systems, Springer, 2005
Beaker v. 0.5 -- HA Cluster
Beaker HA Cluster
Beaker Master
output
Q
input
Q
Beaker
Slave
Beaker
Slave On
Failover:
Master
election
@HeinrichHartman(n)
Snowth db
@HeinrichHartman(n)
Challenge 6: Scalability
Beaker needs to scale in the following dimensions
· Number of processing units up to an unlimited amount.
· In the number of incoming metrics up to ~100M
metrics.
Beaker v.0.6 -- Multiple - HA Clusters
Beaker HA Cluster II (3 slaves)
Beaker HA Cluster I (5 slaves)
..
.
..
Beaker HA Cluster III (1 slave)
BI-M BI-S1 BI-S5
BII-M BII-S1 BII-S3
BIII-M BIII-S1
Meta Service
Msg
Broker
Msg
Broker
BIII-S2
@HeinrichHartman(n)
Snowth db
..
@HeinrichHartman(n)
Done! Great. This works...
Can we do better?
@HeinrichHartman(n)
· Divide Beaker into multiple services
· Avoid master election in processing service, by allowing duplicates
· Upside: Only the processing service needs to scale out
Beaker Service
Beaker v. 1.0 -- Divide and Conquer
@HeinrichHartman(n)
Q Q
Processing
Service
Dedup
Service
Rollup
Service
Scaling the Processing Service is simplified
· No rollup logic in workers
· All workers publish messages
· No failover logic necessary
· Replicate each PUs on multiple
workers
· No crosstalk between nodes
( = 0 in USL).
@HeinrichHartman(n)
Q Q
...
snowth db
Worker
Worker
Worker
Meta Service
Processing Service:
Data Flow
Benchmark results on prototype
· PU type anomaly_detection
PU count 100
Throughput 15 kHz
· PU type anomaly_detection
PU count 10k
Throughput 4.2 kHz
Machine: 6-core Xeon E5 2630 v2 at 2.6 GHz
@HeinrichHartman(n)
Conclusion
· Stateful processing units allow implementation of next
generation of monitoring analytics
· Use CAQL to build your own processing units
· Service orientation facilitates scaling
· Beaker will be out soon
@HeinrichHartman(n)
Credits
Joint work with:
· Jonas Kunze
· Theo Schlossnagle
Image Credits:
Example of Kummer Surface, Claudio Rocchini, CC-BY-SA, https://en.wikipedia.org/wiki/File:Kummer_surface.png
Indian truck, by strudelt, CC-BY, https://commons.wikimedia.org/wiki/File:Truck_in_India_-_overloaded.jpg
Kafka, Public Domain, https://en.wikipedia.org/wiki/File:Kafka_portrait.jpg
Thinker, Public Domain, https://www.flickr.com/photos/mustangjoe/5966894496
@HeinrichHartman(n)

Weitere ähnliche Inhalte

Was ist angesagt?

Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingKostas Tzoumas
 
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...Jonas Traub
 
Monitoring Flink with Prometheus
Monitoring Flink with PrometheusMonitoring Flink with Prometheus
Monitoring Flink with PrometheusMaximilian Bode
 
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...Flink Forward
 
Flink Forward Berlin 2018: Krzysztof Zarzycki & Alexey Brodovshuk - "Assistin...
Flink Forward Berlin 2018: Krzysztof Zarzycki & Alexey Brodovshuk - "Assistin...Flink Forward Berlin 2018: Krzysztof Zarzycki & Alexey Brodovshuk - "Assistin...
Flink Forward Berlin 2018: Krzysztof Zarzycki & Alexey Brodovshuk - "Assistin...Flink Forward
 
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...Ververica
 
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Flink Forward
 
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...Ververica
 
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisShah Zaib
 
Using Kafka to integrate DWH and Cloud Based big data systems
Using Kafka to integrate DWH and Cloud Based big data systemsUsing Kafka to integrate DWH and Cloud Based big data systems
Using Kafka to integrate DWH and Cloud Based big data systemsconfluent
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?confluent
 
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...Ververica
 
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkTzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkVerverica
 
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Ververica
 
Worldsensing: A Real World Use Case for Flux by Albert Zaragoza, CTO & Head o...
Worldsensing: A Real World Use Case for Flux by Albert Zaragoza, CTO & Head o...Worldsensing: A Real World Use Case for Flux by Albert Zaragoza, CTO & Head o...
Worldsensing: A Real World Use Case for Flux by Albert Zaragoza, CTO & Head o...InfluxData
 
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...Flink Forward
 
Continuously Updating Query Results over Real-Time Linked Data
Continuously Updating Query Results over Real-Time Linked DataContinuously Updating Query Results over Real-Time Linked Data
Continuously Updating Query Results over Real-Time Linked DataRuben Taelman
 
Flink Forward Berlin 2018: Tobias Lindener - "Approximate standing queries on...
Flink Forward Berlin 2018: Tobias Lindener - "Approximate standing queries on...Flink Forward Berlin 2018: Tobias Lindener - "Approximate standing queries on...
Flink Forward Berlin 2018: Tobias Lindener - "Approximate standing queries on...Flink Forward
 

Was ist angesagt? (20)

Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
I²: Interactive Real-Time Visualization for Streaming Data with Apache Flink ...
 
Monitoring Flink with Prometheus
Monitoring Flink with PrometheusMonitoring Flink with Prometheus
Monitoring Flink with Prometheus
 
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...
Flink Forward Berlin 2018: Xingcan Cui - "Stream Join in Flink: from Discrete...
 
Flink Forward Berlin 2018: Krzysztof Zarzycki & Alexey Brodovshuk - "Assistin...
Flink Forward Berlin 2018: Krzysztof Zarzycki & Alexey Brodovshuk - "Assistin...Flink Forward Berlin 2018: Krzysztof Zarzycki & Alexey Brodovshuk - "Assistin...
Flink Forward Berlin 2018: Krzysztof Zarzycki & Alexey Brodovshuk - "Assistin...
 
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
Keynote: Stephan Ewen - Stream Processing as a Foundational Paradigm and Apac...
 
Stream Processing
Stream Processing Stream Processing
Stream Processing
 
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
Fabian Hueske_Till Rohrmann - Declarative stream processing with StreamSQL an...
 
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
Aljoscha Krettek - Apache Flink® and IoT: How Stateful Event-Time Processing ...
 
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance AnalysisParallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
 
Using Kafka to integrate DWH and Cloud Based big data systems
Using Kafka to integrate DWH and Cloud Based big data systemsUsing Kafka to integrate DWH and Cloud Based big data systems
Using Kafka to integrate DWH and Cloud Based big data systems
 
EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?EDA Meets Data Engineering – What's the Big Deal?
EDA Meets Data Engineering – What's the Big Deal?
 
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
2018-04 Kafka Summit London: Stephan Ewen - "Apache Flink and Apache Kafka fo...
 
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache FlinkTzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
Tzu-Li (Gordon) Tai - Stateful Stream Processing with Apache Flink
 
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
Aljoscha Krettek - Apache Flink for IoT: How Event-Time Processing Enables Ea...
 
Vlsi 2015 2016 ieee project list-(m)
Vlsi 2015 2016 ieee project list-(m)Vlsi 2015 2016 ieee project list-(m)
Vlsi 2015 2016 ieee project list-(m)
 
Worldsensing: A Real World Use Case for Flux by Albert Zaragoza, CTO & Head o...
Worldsensing: A Real World Use Case for Flux by Albert Zaragoza, CTO & Head o...Worldsensing: A Real World Use Case for Flux by Albert Zaragoza, CTO & Head o...
Worldsensing: A Real World Use Case for Flux by Albert Zaragoza, CTO & Head o...
 
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
 
Continuously Updating Query Results over Real-Time Linked Data
Continuously Updating Query Results over Real-Time Linked DataContinuously Updating Query Results over Real-Time Linked Data
Continuously Updating Query Results over Real-Time Linked Data
 
Flink Forward Berlin 2018: Tobias Lindener - "Approximate standing queries on...
Flink Forward Berlin 2018: Tobias Lindener - "Approximate standing queries on...Flink Forward Berlin 2018: Tobias Lindener - "Approximate standing queries on...
Flink Forward Berlin 2018: Tobias Lindener - "Approximate standing queries on...
 

Ähnlich wie Scalable Online Analytics for Monitoring

OSMC 2016 | Friends and foes in API Monitoring by Heinrich Hartmann
OSMC 2016 | Friends and foes in API Monitoring by Heinrich HartmannOSMC 2016 | Friends and foes in API Monitoring by Heinrich Hartmann
OSMC 2016 | Friends and foes in API Monitoring by Heinrich HartmannNETWAYS
 
OSMC 2016 - Friends and foes by Heinrich Hartmann
OSMC 2016 - Friends and foes by Heinrich HartmannOSMC 2016 - Friends and foes by Heinrich Hartmann
OSMC 2016 - Friends and foes by Heinrich HartmannNETWAYS
 
S4x16 europe krotofil_granular_dataflowsics
S4x16 europe krotofil_granular_dataflowsicsS4x16 europe krotofil_granular_dataflowsics
S4x16 europe krotofil_granular_dataflowsicsMarina Krotofil
 
ODSC 2019: Sessionisation via stochastic periods for root event identification
ODSC 2019: Sessionisation via stochastic periods for root event identificationODSC 2019: Sessionisation via stochastic periods for root event identification
ODSC 2019: Sessionisation via stochastic periods for root event identificationKuldeep Jiwani
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in StreamsJamie Grier
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Kostas Tzoumas
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemC4Media
 
Global C4IR-1 Masterclass Adryan - Zuehlke Engineering 2017
Global C4IR-1 Masterclass Adryan - Zuehlke Engineering 2017Global C4IR-1 Masterclass Adryan - Zuehlke Engineering 2017
Global C4IR-1 Masterclass Adryan - Zuehlke Engineering 2017Justin Hayward
 
Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in ...
Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in ...Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in ...
Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in ...Florian Lautenschlager
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Flink Forward
 
Post quantum cryptography in vault (hashi talks 2020)
Post quantum cryptography in vault (hashi talks 2020)Post quantum cryptography in vault (hashi talks 2020)
Post quantum cryptography in vault (hashi talks 2020)Mitchell Pronschinske
 
Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Dieter Plaetinck
 
HTM & Apache Flink (2016-06-27)
HTM & Apache Flink (2016-06-27)HTM & Apache Flink (2016-06-27)
HTM & Apache Flink (2016-06-27)Eron Wright
 
DEF CON 27 - ANDREAS BAUMHOF - are quantum computers really a threat to crypt...
DEF CON 27 - ANDREAS BAUMHOF - are quantum computers really a threat to crypt...DEF CON 27 - ANDREAS BAUMHOF - are quantum computers really a threat to crypt...
DEF CON 27 - ANDREAS BAUMHOF - are quantum computers really a threat to crypt...Felipe Prado
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of dataconfluent
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsDebasish Ghosh
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
Detecting Hacks: Anomaly Detection on Networking Data
Detecting Hacks: Anomaly Detection on Networking DataDetecting Hacks: Anomaly Detection on Networking Data
Detecting Hacks: Anomaly Detection on Networking DataJames Sirota
 

Ähnlich wie Scalable Online Analytics for Monitoring (20)

OSMC 2016 | Friends and foes in API Monitoring by Heinrich Hartmann
OSMC 2016 | Friends and foes in API Monitoring by Heinrich HartmannOSMC 2016 | Friends and foes in API Monitoring by Heinrich Hartmann
OSMC 2016 | Friends and foes in API Monitoring by Heinrich Hartmann
 
OSMC 2016 - Friends and foes by Heinrich Hartmann
OSMC 2016 - Friends and foes by Heinrich HartmannOSMC 2016 - Friends and foes by Heinrich Hartmann
OSMC 2016 - Friends and foes by Heinrich Hartmann
 
S4x16 europe krotofil_granular_dataflowsics
S4x16 europe krotofil_granular_dataflowsicsS4x16 europe krotofil_granular_dataflowsics
S4x16 europe krotofil_granular_dataflowsics
 
ODSC 2019: Sessionisation via stochastic periods for root event identification
ODSC 2019: Sessionisation via stochastic periods for root event identificationODSC 2019: Sessionisation via stochastic periods for root event identification
ODSC 2019: Sessionisation via stochastic periods for root event identification
 
Counting Elements in Streams
Counting Elements in StreamsCounting Elements in Streams
Counting Elements in Streams
 
S4x16_Europe_Krotofil
S4x16_Europe_KrotofilS4x16_Europe_Krotofil
S4x16_Europe_Krotofil
 
Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016Apache Flink at Strata San Jose 2016
Apache Flink at Strata San Jose 2016
 
Mantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing SystemMantis: Netflix's Event Stream Processing System
Mantis: Netflix's Event Stream Processing System
 
Global C4IR-1 Masterclass Adryan - Zuehlke Engineering 2017
Global C4IR-1 Masterclass Adryan - Zuehlke Engineering 2017Global C4IR-1 Masterclass Adryan - Zuehlke Engineering 2017
Global C4IR-1 Masterclass Adryan - Zuehlke Engineering 2017
 
Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in ...
Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in ...Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in ...
Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in ...
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
 
Post quantum cryptography in vault (hashi talks 2020)
Post quantum cryptography in vault (hashi talks 2020)Post quantum cryptography in vault (hashi talks 2020)
Post quantum cryptography in vault (hashi talks 2020)
 
Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016Next generation alerting and fault detection, SRECon Europe 2016
Next generation alerting and fault detection, SRECon Europe 2016
 
HTM & Apache Flink (2016-06-27)
HTM & Apache Flink (2016-06-27)HTM & Apache Flink (2016-06-27)
HTM & Apache Flink (2016-06-27)
 
DEF CON 27 - ANDREAS BAUMHOF - are quantum computers really a threat to crypt...
DEF CON 27 - ANDREAS BAUMHOF - are quantum computers really a threat to crypt...DEF CON 27 - ANDREAS BAUMHOF - are quantum computers really a threat to crypt...
DEF CON 27 - ANDREAS BAUMHOF - are quantum computers really a threat to crypt...
 
Real-time processing of large amounts of data
Real-time processing of large amounts of dataReal-time processing of large amounts of data
Real-time processing of large amounts of data
 
Nexmark with beam
Nexmark with beamNexmark with beam
Nexmark with beam
 
Approximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming ApplicationsApproximation Data Structures for Streaming Applications
Approximation Data Structures for Streaming Applications
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Detecting Hacks: Anomaly Detection on Networking Data
Detecting Hacks: Anomaly Detection on Networking DataDetecting Hacks: Anomaly Detection on Networking Data
Detecting Hacks: Anomaly Detection on Networking Data
 

Mehr von Heinrich Hartmann

Linux System Monitoring with eBPF
Linux System Monitoring with eBPFLinux System Monitoring with eBPF
Linux System Monitoring with eBPFHeinrich Hartmann
 
Seminar on Motivic Hall Algebras
Seminar on Motivic Hall AlgebrasSeminar on Motivic Hall Algebras
Seminar on Motivic Hall AlgebrasHeinrich Hartmann
 
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONS
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONSGROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONS
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONSHeinrich Hartmann
 
Related-Work.net at WeST Oberseminar
Related-Work.net at WeST OberseminarRelated-Work.net at WeST Oberseminar
Related-Work.net at WeST OberseminarHeinrich Hartmann
 
Pushforward of Differential Forms
Pushforward of Differential FormsPushforward of Differential Forms
Pushforward of Differential FormsHeinrich Hartmann
 
Dimensionstheorie Noetherscher Ringe
Dimensionstheorie Noetherscher RingeDimensionstheorie Noetherscher Ringe
Dimensionstheorie Noetherscher RingeHeinrich Hartmann
 
Hecke Curves and Moduli spcaes of Vector Bundles
Hecke Curves and Moduli spcaes of Vector BundlesHecke Curves and Moduli spcaes of Vector Bundles
Hecke Curves and Moduli spcaes of Vector BundlesHeinrich Hartmann
 
Dimension und Multiplizität von D-Moduln
Dimension und Multiplizität von D-ModulnDimension und Multiplizität von D-Moduln
Dimension und Multiplizität von D-ModulnHeinrich Hartmann
 
Nodale kurven und Hilbertschemata
Nodale kurven und HilbertschemataNodale kurven und Hilbertschemata
Nodale kurven und HilbertschemataHeinrich Hartmann
 
Local morphisms are given by composition
Local morphisms are given by compositionLocal morphisms are given by composition
Local morphisms are given by compositionHeinrich Hartmann
 
Notes on Intersection theory
Notes on Intersection theoryNotes on Intersection theory
Notes on Intersection theoryHeinrich Hartmann
 
Cusps of the Kähler moduli space and stability conditions on K3 surfaces
Cusps of the Kähler moduli space and stability conditions on K3 surfacesCusps of the Kähler moduli space and stability conditions on K3 surfaces
Cusps of the Kähler moduli space and stability conditions on K3 surfacesHeinrich Hartmann
 

Mehr von Heinrich Hartmann (19)

Linux System Monitoring with eBPF
Linux System Monitoring with eBPFLinux System Monitoring with eBPF
Linux System Monitoring with eBPF
 
Geometric Aspects of LSA
Geometric Aspects of LSAGeometric Aspects of LSA
Geometric Aspects of LSA
 
Geometric Aspects of LSA
Geometric Aspects of LSAGeometric Aspects of LSA
Geometric Aspects of LSA
 
Seminar on Complex Geometry
Seminar on Complex GeometrySeminar on Complex Geometry
Seminar on Complex Geometry
 
Seminar on Motivic Hall Algebras
Seminar on Motivic Hall AlgebrasSeminar on Motivic Hall Algebras
Seminar on Motivic Hall Algebras
 
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONS
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONSGROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONS
GROUPOIDS, LOCAL SYSTEMS AND DIFFERENTIAL EQUATIONS
 
Topics in Category Theory
Topics in Category TheoryTopics in Category Theory
Topics in Category Theory
 
Related-Work.net at WeST Oberseminar
Related-Work.net at WeST OberseminarRelated-Work.net at WeST Oberseminar
Related-Work.net at WeST Oberseminar
 
Komplexe Zahlen
Komplexe ZahlenKomplexe Zahlen
Komplexe Zahlen
 
Pushforward of Differential Forms
Pushforward of Differential FormsPushforward of Differential Forms
Pushforward of Differential Forms
 
Dimensionstheorie Noetherscher Ringe
Dimensionstheorie Noetherscher RingeDimensionstheorie Noetherscher Ringe
Dimensionstheorie Noetherscher Ringe
 
Polynomproblem
PolynomproblemPolynomproblem
Polynomproblem
 
Hecke Curves and Moduli spcaes of Vector Bundles
Hecke Curves and Moduli spcaes of Vector BundlesHecke Curves and Moduli spcaes of Vector Bundles
Hecke Curves and Moduli spcaes of Vector Bundles
 
Dimension und Multiplizität von D-Moduln
Dimension und Multiplizität von D-ModulnDimension und Multiplizität von D-Moduln
Dimension und Multiplizität von D-Moduln
 
Nodale kurven und Hilbertschemata
Nodale kurven und HilbertschemataNodale kurven und Hilbertschemata
Nodale kurven und Hilbertschemata
 
Local morphisms are given by composition
Local morphisms are given by compositionLocal morphisms are given by composition
Local morphisms are given by composition
 
Notes on Intersection theory
Notes on Intersection theoryNotes on Intersection theory
Notes on Intersection theory
 
Cusps of the Kähler moduli space and stability conditions on K3 surfaces
Cusps of the Kähler moduli space and stability conditions on K3 surfacesCusps of the Kähler moduli space and stability conditions on K3 surfaces
Cusps of the Kähler moduli space and stability conditions on K3 surfaces
 
Log Files
Log FilesLog Files
Log Files
 

Kürzlich hochgeladen

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 

Kürzlich hochgeladen (20)

Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 

Scalable Online Analytics for Monitoring

  • 1. Scalable Online Analytics for Monitoring LISA15, Nov. 13, 2015, Washington, D.C. Heinrich Hartmann, PhD, Chief Data Scientist, Circonus
  • 2. I’m Heinrich · From Mainz, Germany · Studied Math. in Mainz, Bonn, Oxford · PhD in Algebraic Geometry · Sensor Data Mining at Univ. Koblenz · Freelance IT consultant · Chief Data Scientist at Circonus Heinrich.Hartmann@Circonus.com @HeinrichHartman(n)
  • 3. Circonus is ... @HeinrichHartman(n) · monitoring and telemetry analysis platform · scalable to millions of incoming metrics · available as public and private SaaS · built-in histograms, forecasting, anomaly-detection, ...
  • 4. @HeinrichHartman(n) This talk is about... I. The Future of Monitoring II. Patterns of Telemetry Analysis III. Design of the Online Analytics Engine ‘Beaker’
  • 5. Part I - The Future of Monitoring
  • 6. Monitor this: · 100 containers in one fleet · 200 metrics per-container · 50 code pushes a day · For each push all containers are recreated The “cloud monitoring challenge” 2015 @HeinrichHartman(n)
  • 7. Line-plots work well for small numbers of nodes @HeinrichHartman(n) Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 CPU utilization for a DB cluster. Source: Internal.
  • 8. ... but can get polluted easily @HeinrichHartman(n) N N N N N N N N N N N N N N N N N N N N CPU utilization for a DB cluster. Source: Internal.
  • 9. “Information is not a scarce resource. Attention is.” Herbert A. Simon @HeinrichHartman(n) Source: http://www.circonus.com/the-future-of-monitoring-qa-with-john-allspaw/
  • 10. Store Histograms and Alert on Percentiles @HeinrichHartman(n) N N N N N N N N N N N N N N N N N N N N N N N 90% percentile CPU Utilization of a db service with a variable number of nodes.
  • 11. @HeinrichHartman(n) Anomaly Detection for Surfacing relevant data Mockup of an metric overview page. Source: Circonus
  • 12. Part II - Patterns of Telemetry Analysis
  • 13. @HeinrichHartman(n) Telemetry analysis comes in two forms Online Analytics · Anomaly detection · Percentile alerting · Smart alerting rules · Smart dashboards Offline Analytics · Post mortem analysis · Assisted thresholding Stream Processing ‘Big Data’
  • 14. @HeinrichHartman(n) New Components in Circonus BunsenBeaker
  • 15. @HeinrichHartman(n) Beaker & Bunsen in the Circonus Architecture metrics Ernie Alerting Service Snowth Metric Store alerts Web-UI Q Beaker Online AE Bunsen Offline AE
  • 16. @HeinrichHartman(n) Pattern 1: Windowing Examples · Supervised Machine Learning · Fourier Transformation · Anomaly Detection (etsy) Remarks · Tradeoff: window size rich features vs. memory · Tradeoff: overlap latency vs. CPU windows = window(y_stream) def results(): for w in windows: yield z = method(w)
  • 17. @HeinrichHartman(n) Pattern 3: Processing Units processing_unit = { state = ... update = function(self, y) ... end } Example · Exponential Smoothing · Holt Winters forecasting · Anomaly Detection (Circonus) Remarks · Fast updates · Fully general · Cost: maintains state Circonus Anomaly Detection
  • 18. @HeinrichHartman(n) local q = 0.9 exponential_smoothing = { s = 0, update = function(self, y) self.s = (y * q) + (self.s * (1 - q)) return self.s end } For A Processing Unit for Exponential Smoothing Exponential smoothing applied to a dns-duration metric More examples on http://heinrichhartmann.com/.../Generative-Models-for-Time-Series
  • 19. Processing units are convenient · Primitive transformation are readily implemented: Arithmetic, Smoothing, Forecasting, ... · Fully general. Allow window-based processing as well · Composable. Compose several PUs to get a new one! @HeinrichHartman(n)
  • 20. Circonus Analytics Query Language · Create your own customized processing unit from primitives · UNIX-inspired syntax with pipes ‘|’ · Native support for histogram metrics @HeinrichHartman(n)
  • 21. @HeinrichHartman(n) CAQL: Example 1 - Low frequency AD Pre-process a metric before feeding into anomaly detection metric:average(<>) | rolling:mean(30m) | anomaly_detection()
  • 22. @HeinrichHartman(n) CAQL: Example 2 - Histogram Aggregation Histograms are first-class citizens metric:average(<uuid>) | window:histogram(1h) | histogram:percentile(95)
  • 23. Part III - The Design of the Online Analytics Engine ‘Beaker’
  • 24. 1. Read messages from a message queue 2. Execute a processing units over incoming metrics 3. Publish computed values to message queue Beaker v. 0.1: Simple stream processing Beaker Service output Q input Q Beaker: Basic Data Flow
  • 25. Challenge 1: Rollup metrics by the minute @HeinrichHartman(n) · Metrics arrive asynchronously on the input queue · PUs expect exactly one sample per minute · Rollup logic needs to allow for · late arrival, e.g. when broker is behind · out of order arrival · errors in system time (time-zones, drifts)
  • 26. Consequence: No real-time processing in Beaker · Rolling up data in 1m periods causes an average delay of 30sec for data processing. · Real-time threshold-based alerting still available. · Approach: Avoid rolling up data for stateless PUs. @HeinrichHartman(n)
  • 27. Challenge 2: Multiple input slots input metric input metric input metric input metric input metric processing unit processing unit processing unit output metric output metric output metric Beaker: Logical Data Flow @HeinrichHartman(n)
  • 28. Challenge 3: Synchronize roll-ups for multiple inputs is tricky @HeinrichHartman(n)
  • 29. Challenge 4: Fault Tolerance Definition (Birman): A service is fault tolerant if it tolerates the failure of nodes without producing wrong output messages. Failed nodes must be able to recover from a crash and rejoin the cluster. The time to recovery should be as low as possible. @HeinrichHartman(n) Source: K. Birman - Reliable Distributed Systems, Springer, 2005
  • 30. Beaker v0.5: · Automated restarts a. Service Management Facilities (svcadm) in OmniOS b. systemd or watchdogs in Linux · But, recovery can take a long time State has to be rebuilt from input stream. · Need a way to recover faster from errors: a. Persist processing unit state (software updates!) b. Access persisted metric data @HeinrichHartman(n)
  • 31. Beaker v. 0.5: Use db to rebuild state on startup @HeinrichHartman(n) Beaker Service output Q input Q Snowth db
  • 32. Challenge 5: High Availability Definition (Birman): A service is highly available it continues to publish valid messages during a node failure after a small reconfiguration period. For Beaker we require a reconfiguration period of less than 1 minute. In this time messages may be delayed (e.g. 30sec) and duplicated messages may be published. @HeinrichHartman(n) Source: K. Birman - Reliable Distributed Systems, Springer, 2005
  • 33. Beaker v. 0.5 -- HA Cluster Beaker HA Cluster Beaker Master output Q input Q Beaker Slave Beaker Slave On Failover: Master election @HeinrichHartman(n) Snowth db
  • 34. @HeinrichHartman(n) Challenge 6: Scalability Beaker needs to scale in the following dimensions · Number of processing units up to an unlimited amount. · In the number of incoming metrics up to ~100M metrics.
  • 35. Beaker v.0.6 -- Multiple - HA Clusters Beaker HA Cluster II (3 slaves) Beaker HA Cluster I (5 slaves) .. . .. Beaker HA Cluster III (1 slave) BI-M BI-S1 BI-S5 BII-M BII-S1 BII-S3 BIII-M BIII-S1 Meta Service Msg Broker Msg Broker BIII-S2 @HeinrichHartman(n) Snowth db ..
  • 37. Can we do better? @HeinrichHartman(n)
  • 38. · Divide Beaker into multiple services · Avoid master election in processing service, by allowing duplicates · Upside: Only the processing service needs to scale out Beaker Service Beaker v. 1.0 -- Divide and Conquer @HeinrichHartman(n) Q Q Processing Service Dedup Service Rollup Service
  • 39. Scaling the Processing Service is simplified · No rollup logic in workers · All workers publish messages · No failover logic necessary · Replicate each PUs on multiple workers · No crosstalk between nodes ( = 0 in USL). @HeinrichHartman(n) Q Q ... snowth db Worker Worker Worker Meta Service Processing Service: Data Flow
  • 40. Benchmark results on prototype · PU type anomaly_detection PU count 100 Throughput 15 kHz · PU type anomaly_detection PU count 10k Throughput 4.2 kHz Machine: 6-core Xeon E5 2630 v2 at 2.6 GHz @HeinrichHartman(n)
  • 41. Conclusion · Stateful processing units allow implementation of next generation of monitoring analytics · Use CAQL to build your own processing units · Service orientation facilitates scaling · Beaker will be out soon @HeinrichHartman(n)
  • 42. Credits Joint work with: · Jonas Kunze · Theo Schlossnagle Image Credits: Example of Kummer Surface, Claudio Rocchini, CC-BY-SA, https://en.wikipedia.org/wiki/File:Kummer_surface.png Indian truck, by strudelt, CC-BY, https://commons.wikimedia.org/wiki/File:Truck_in_India_-_overloaded.jpg Kafka, Public Domain, https://en.wikipedia.org/wiki/File:Kafka_portrait.jpg Thinker, Public Domain, https://www.flickr.com/photos/mustangjoe/5966894496 @HeinrichHartman(n)