video: https://www.youtube.com/watch?v=IBC9gcYqNR4
In this talk Efim Dimenstein, Chief Architect at LivePerson, will cover the rules and guidelines for building resilient systems, how to implement them in real life, and the lessons learned during the process. The talk focuses on achieving resilience in practice and features many examples and lessons learned from building systems that currently run in production at extreme scale.
Efim will talk about:
· General resilience guidelines
· How they are implemented in practice
· What changes needed to be implemented to achieve resilience
· Lessons learned
· Summary
resilience @ scale
● multi layered solution
● requires monitoring and testing
● ingrained in the company culture
● keep things simple
● trust and empower your engineers
● break stuff
Reduce the many services to a few types
Group them into layers - Biz. critical -> Mission Critical
Focus on mission critical
Go in order of priority
Dependency only downward
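The "dependency only downward" rule can be checked mechanically. The sketch below is only an illustration: it assumes that mission critical is the lowest layer and that a service may depend on its own layer or a lower one, which the slides do not spell out. The service names and the `find_upward_dependencies` helper are hypothetical.

```python
# Assumed layer order: a lower rank means a lower (more critical) layer.
LAYER_RANK = {"mission_critical": 0, "business_critical": 1}


def find_upward_dependencies(layer_of, depends_on):
    """Return (service, dependency) pairs where the dependency sits in a higher layer."""
    violations = []
    for service, deps in depends_on.items():
        for dep in deps:
            if LAYER_RANK[layer_of[dep]] > LAYER_RANK[layer_of[service]]:
                violations.append((service, dep))
    return violations


# Hypothetical services, for illustration only.
layer_of = {
    "chat": "mission_critical",
    "routing": "mission_critical",
    "reports": "business_critical",
}
depends_on = {
    "chat": ["routing", "reports"],  # chat -> reports points upward
    "routing": [],
    "reports": ["chat"],             # business critical -> mission critical is fine
}
violations = find_upward_dependencies(layer_of, depends_on)
```

A check like this could run in CI so that a mission critical service never silently picks up a dependency on a less critical one.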
Service dependencies
Unpredictable behaviours
Domino failures
Partial failures
Retries
Define a retry mechanism (Client to server & server to server)
Never give up - recovery after failure
Beware of DDoS-ing yourself with retries
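One common way to combine "never give up" with "don't DDoS yourself" is exponential backoff with a capped delay and random jitter. The slides do not say which mechanism was used, so the sketch below is an assumption; `retry` and its parameters are hypothetical names.

```python
import random
import time


def retry(operation, base_delay=0.1, max_delay=30.0, max_attempts=None,
          sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds.

    max_attempts=None means "never give up"; the capped, jittered delay
    keeps many clients from retrying in lockstep against a recovering server.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Exponential growth, capped at max_delay (exponent capped to stay small).
            ceiling = min(max_delay, base_delay * (2 ** min(attempt, 32)))
            # Jitter: sleep a random fraction of the ceiling.
            sleep(rng() * ceiling)
```

The same helper can wrap client-to-server and server-to-server calls; passing `max_attempts` bounds the retries where an unbounded wait is not acceptable.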
Fallback
Fake it until you make it
provide a fallback
Scoring - flip a coin,
use previous value,
tell the client to come back soon
write somewhere else (file instead of DB)
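The scoring fallbacks above can be sketched as a small chain: try the live service, fall back to the previous value, and flip a coin as a last resort. This is an illustrative sketch, not the implementation from the talk; `score_with_fallback`, the in-memory `_last_known` store and the 0.0/1.0 scores are assumptions.

```python
import random

# Hypothetical in-memory store of the last successful score per visitor.
_last_known = {}


def score_with_fallback(visitor_id, scoring_service, rng=random.random):
    """Return (score, source) and never raise to the caller."""
    try:
        score = scoring_service(visitor_id)
        _last_known[visitor_id] = score
        return score, "live"
    except Exception:
        # Fallback 1: reuse the previous value if we have one.
        if visitor_id in _last_known:
            return _last_known[visitor_id], "previous"
        # Fallback 2: flip a coin.
        return (1.0 if rng() < 0.5 else 0.0), "coin"
```

Returning the source alongside the score lets monitoring count how often the system is "faking it".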
Cache
simplest resilience technique
might be used as a fallback or as an abstraction level over a service
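A minimal sketch of a cache used both as an abstraction level over a service and as a fallback: fresh entries are served without calling the service, and a stale entry is served if the service fails. `CachedService`, the TTL policy and the `fetch` callable are assumptions made for illustration.

```python
import time


class CachedService:
    """Wraps a fetch(key) callable with a TTL cache that also serves stale data on failure."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, stored_at)

    def get(self, key):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh hit: the service is not called
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]  # service failed: serve the stale value
            raise  # nothing cached, so the failure is surfaced
        self._entries[key] = (value, now)
        return value
```

Callers talk only to `get`, so the backing service can be slow or briefly unavailable without the caller noticing, as long as the key was fetched at least once.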
up to now in the talk we talked about things from 30K feet up
time to descend to ground level
let’s see what is required on a day to day basis to make this work
so resilience requires an ongoing never-ending effort
get company wide buy-in
show the current damage without pointing fingers at anyone
trust your engineers
you can’t do it alone!
example:
lessons-learned sessions including… + follow up
architectural resilience evaluation of design
periodic re-evaluation of services with score-cards
provide R&D training during new employee onboarding
encourage transition from Java to functional immutable code using Scala
bi-monthly meetings with tech-leads and architects
resilience guidelines
CI and test environments to simulate production
E2E tests run 24/7 to make sure entire system works
every build passes through CI
every release first passes through a canary-like production environment before GA
dev teams manage their services’ deployment packages
operations deploy to clusters
changes made only through deployments
no work and changes on a per machine basis
remember the scale Efim mentioned in the start of the talk?
metrics collected in realtime & processed by Zabbix
pushed in realtime from web and app via Kafka to logstash -> ElasticSearch
±250 tests
run over Jenkins
user experience monitoring
errors include video of the UI
historical data is saved to ElasticSearch and presented...
Support Tier1-3, NOC, ScS, experts, monitoring, E2E
visual dashboards showing processed data from all inputs
very early on we realized: best preparation for worst => break things
you don’t want to rely on your engineers’ statements that “everything will be ok”
where to test?
what to test for?
when and how often to test?
system and service readiness
concentrate on
stuff that happens most in production
the big important things (they matter the most)
the low hanging fruit (easy wins)
where to start? => mission critical, business critical
base on data-flow
initial focus on clients
so we want to break things… OMG
how do we get visibility inside, at high resolution and granularity?
API monitors:
as small as can be
run at high frequency
broad coverage of perimeter services (user experience)
VMware-based
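An API monitor "as small as can be" might be a single probe that records success and latency for one endpoint, run at high frequency by a scheduler. The sketch below is an assumption about what such a probe could look like, not the tooling from the talk; `check_endpoint` and the example URL are hypothetical.

```python
import time
import urllib.request


def check_endpoint(url, timeout=2.0, opener=urllib.request.urlopen,
                   clock=time.monotonic):
    """Probe one URL and report whether it answered with a 2xx status, and how fast."""
    started = clock()
    try:
        with opener(url, timeout=timeout) as response:
            ok = 200 <= response.status < 300
    except Exception:
        # Timeouts, connection errors and HTTP errors all count as a failed probe.
        ok = False
    return {"url": url, "ok": ok, "latency_s": clock() - started}


# Example call (hypothetical URL):
# result = check_endpoint("https://example.com/health")
```

Keeping each probe this small makes it cheap to run often and easy to spread across many perimeter services.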
automation
VMware
API-M
E2E
app & web logs
Kafka -> logstash -> ElasticSearch -> Kibana
runs on test env.
the entire system runs there
process => nights + weekends + holidays
results
complicated scenarios that can’t be reproduced in the test env. are tested in production DR
teams need to opt-in
process
if you are looking to achieve resilience at a large scale, keep these things in mind
multi layered solution
requires monitoring and testing
you need resilience to be ingrained in the culture of your company at all levels, to get not just support and cooperation but contribution and initiative
remember to keep things simple
trust and empower your engineers
and go break stuff