2. #1 risk
Slow or bad (500) API responses
Auto-healing
because humans are slow
SLA, Failover, Degradation, Throttling
Alerting
Detect, Filter, Alert, Diagnostics
3. SLA
                   Performance       Data Loss     Business Logic
TX Processing      Low Latency       Nope          Best Effort
Stream Processing  High Throughput   Best Effort   Best Effort
Batch Processing   High Volume       Nope          Reconciliation
4. Automatic Failover
http fencing (Incapsula)
http load balancing (ELB)
instance restart (Scaling Group)
process restart (upstart)
exceptions bubble up and crash
6. Throttling (without back-pressure)
request priority reduced when TX/sec > thresh
Different priority → Different queue → Different worker
lower priority inside queue for test probes
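The throttling rule above can be sketched roughly as follows. This is a minimal Python illustration, not the production implementation: the class name `PriorityThrottler`, the one-second sliding window, and the queue names `normal` and `low` are all assumptions made for the example. Requests are never rejected (no back-pressure); they are only demoted to a lower-priority queue, and test probes sort behind real traffic inside whichever queue they land in.

```python
import heapq
import itertools
from collections import deque


class PriorityThrottler:
    """Hypothetical sketch: demote requests when TX/sec exceeds a threshold."""

    def __init__(self, threshold_per_sec):
        self.threshold = threshold_per_sec
        self.arrivals = deque()                  # arrival times in the last second
        self.queues = {"normal": [], "low": []}  # one heap per priority level
        self._seq = itertools.count()            # tie-breaker keeps FIFO order

    def submit(self, request, now, is_probe=False):
        # Drop arrivals that fell out of the one-second window.
        while self.arrivals and self.arrivals[0] <= now - 1.0:
            self.arrivals.popleft()
        self.arrivals.append(now)
        # Above the threshold the request is demoted, never refused.
        queue_name = "low" if len(self.arrivals) > self.threshold else "normal"
        # Test probes rank behind real traffic inside the same queue.
        rank = 1 if is_probe else 0
        heapq.heappush(self.queues[queue_name], (rank, next(self._seq), request))
        return queue_name

    def take(self, queue_name):
        """Called by the worker dedicated to this queue."""
        queue = self.queues[queue_name]
        return heapq.heappop(queue)[2] if queue else None
```

Each queue would be drained by its own worker, so a burst of demoted requests cannot starve the normal-priority path.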
25. Riemann's Index
key (host+service)   event                      TTL
199.25.1.1-1234      {"state":"loaded"}         300
199.25.2.1-4567      {"state":"downloaded"}     300
199.25.3.1-8901      {"state":"loaded"}         300
For our use case:
host=browser-ip, service=cookie
26. Riemann's state machine
(index)
stores last event and creates expired events (TTL)
(by [:host :service] stream)
creates a new stream for each host/service
(by-host-service stream) - forter's fork only
also closes stream when TTL expires
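To make the index behaviour concrete, here is a small Python sketch of the idea: keep the last event per (host, service) key and emit a synthetic "expired" event once the TTL has passed. It is only a model of the concept, not Riemann's actual Clojure implementation; the class name `TtlIndex` and its methods are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class TtlIndex:
    """Hypothetical model: last event per (host, service), expiring by TTL."""

    def __init__(self):
        self._entries = {}  # (host, service) -> (event, expires_at)

    def update(self, event, now):
        # A new event overwrites the previous one and resets the TTL clock.
        key = (event["host"], event["service"])
        self._entries[key] = (event, now + event["ttl"])

    def lookup(self, host, service):
        entry = self._entries.get((host, service))
        return entry[0] if entry else None

    def expire(self, now):
        """Remove entries past their TTL and return 'expired' events for them."""
        expired = []
        for key, (_event, expires_at) in list(self._entries.items()):
            if expires_at <= now:
                del self._entries[key]
                expired.append(
                    {"host": key[0], "service": key[1], "state": "expired"}
                )
        return expired
```

For the use case above, `host` would be the browser IP and `service` the cookie, so a browser that stops reporting produces an "expired" event 300 seconds after its last update.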
32. Alert me if
the probability that we decline
more than k out of n transactions
given probability p
is less than 1 in a million (t = 0.0001%)
n = number of transactions (last 30 minutes)
k = number of declined transactions (last 30 minutes)
p = per-customer decline rate, declined/total (last 24 hours)
t = alert threshold
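One way to read this rule is as a binomial tail test: if each of the n transactions is declined independently with probability p, how likely is it to see k or more declines? The sketch below computes that tail in plain Python. The function names `binomial_tail` and `should_alert` are hypothetical, and treating the slide's "more than k" as the inclusive tail P(X >= k) is an assumption (k being the observed count), as is the independence of transactions.

```python
import math


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    if k > n or p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    # Log of each term C(n, i) * p^i * (1 - p)^(n - i), for i = k..n.
    terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * log_p + (n - i) * log_q
        for i in range(k, n + 1)
    ]
    largest = max(terms)
    return math.exp(largest) * sum(math.exp(term - largest) for term in terms)


def should_alert(k, n, p, t=1e-6):
    """Alert when the observed decline count is this unlikely under rate p."""
    return binomial_tail(k, n, p) < t
```

For example, with a 24-hour decline rate of p = 0.05, seeing 30 declines out of 100 transactions in the last 30 minutes would trigger the alert, while 6 out of 100 would not.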
Riemann is an event stream processing server, written in Clojure and tailored for monitoring. It receives input from the infrastructure (OS, queues, DBs) and from instrumented applications (in any programming language), processes the events (enrichment, filtering, aggregation), and forwards them for visualization and alerting.
Explain the pros and cons of infrastructure monitoring (plugins), application monitoring (custom events), and system probes; the need for processing before a machine wakes you up at night (the role of PagerDuty in all of this); and the need for visualization when you wake up at night.
On the left: an example of a CloudWatch alert on the load balancer, detecting REST API error 500 (internal server error).
On the right: an example of escalation policies, with an alert for a system test (probe) that failed.
PagerDuty has an alerts panel. An alert is either triggered or resolved. PD assigns the alert to whoever is on duty and then notifies them based on predefined rules (for example, call if the alert has not been resolved for 15 minutes).
However, the PagerDuty API has a throttler, so we cannot send every test result to PD every time a test runs. In essence, we want to replicate the Riemann state into PD: we update PD only when the Riemann state changes (a test that previously passed now fails, or vice versa). Riemann makes it easy to define a state machine by storing the last event per [host+service] combination.
But what happens if we manually resolve the alert on PD and the test keeps failing? We would want the alert to be reopened. This means that we need to filter passed-test events using the state machine, but not filter failed tests at all. We just throttle them to no more than one per minute per [host service] combination.
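The forwarding rule described here can be sketched in a few lines of Python. This is an illustration of the logic only, not the Riemann configuration and not the PagerDuty API: the class name `AlertForwarder`, the state names "passed" and "failed", and the choice to forward a pass only right after a failure are assumptions made for the example.

```python
class AlertForwarder:
    """Hypothetical sketch: decide which test results get forwarded to PD."""

    def __init__(self, min_interval_sec=60.0):
        self.min_interval = min_interval_sec
        self.last_state = {}         # (host, service) -> last seen state
        self.last_failure_sent = {}  # (host, service) -> time of last forward

    def should_forward(self, host, service, state, now):
        key = (host, service)
        previous = self.last_state.get(key)
        self.last_state[key] = state
        if state == "passed":
            # State machine: a pass matters only as a failed -> passed change,
            # which resolves the open alert.
            return previous == "failed"
        # Failures are never filtered by state, so a manually resolved alert
        # is reopened; they are only rate-limited per (host, service).
        last_sent = self.last_failure_sent.get(key)
        if last_sent is not None and now - last_sent < self.min_interval:
            return False
        self.last_failure_sent[key] = now
        return True
```

With this rule a test that keeps failing produces at most one PD update per minute, and a test that keeps passing produces none.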
Filtering widgets that would not have been possible without Riemann enrichment.
The quick filters allow a fast zoom-in on the branch you are working on, and on whether you would like to see test probes or not.