Guidelines for mere mortals. This is a collection of guidelines picked up in the field. Hopefully they will help developers and SREs who are building or modernizing apps to ensure the highest level of availability for their applications.
Availability in a cloud native world v1.6 (Feb 2019)
1. IBM Services – Continuous Availability
Availability in a Cloud-Native World. Guidelines for mere mortals. v1.6
Tuesday, February 26, 2019
Haytham Elkhoja
Global Tech Leader and Chief Architect
IBM Services Continuous Availability – IBM Services
haytham.elkhoja@ibm.com
@haythamelkhoja
Herbie Pearthree
Chief Technical Officer, Senior Technical Staff Member
IBM Services Continuous Availability – IBM Services
hpear3@us.ibm.com
@herbiepear3
6. Definition
Availability. Everything breaks, so you should plan for it. The business must be active in multiple availability zones to mitigate failures (fires, floods and fools).
It also allows zero downtime for planned changes and minimizes maintenance windows.
7. Definition
Availability in a Cloud Native
World.
Cloud Native and Microservices
- Parallel, agile, polyglot development.
- Choose the right tool for the job.
- Microservices and Loosely-Coupled Components.
- Pets vs. cattle.
Continuous Availability / Always On / Zero Downtime
- First impression, last impression.
- Cost of downtime: there are 8,760 hours in a year, so make them count.
- Availability, resilience, performance and scalability go hand in hand.
- Blue Green and canary deployments per region/cloud for non-disruptive change management.
- Redirect users to their closest region/cloud, right cloud/region for the right job.
- No HA and stretched clustering = no failure domains.
- 3 regions/clouds cheaper than 2.
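To make the "8,760 hours in a year" point concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the sample targets are illustrative assumptions, not part of the original deck.

```python
HOURS_PER_YEAR = 8_760  # figure quoted in the slide (365 days x 24 hours)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR * 60


# Illustrative targets only; pick the ones your business actually commits to.
for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):.2f} min/year")
```

Each extra "nine" cuts the budget by a factor of ten: 99.99% leaves roughly 52.6 minutes per year, which planned maintenance windows alone can easily consume.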
15. Guideline
Speaking of fallacies, here’s a bunch:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn't change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
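One common way to stop assuming that the network is reliable is to wrap remote calls in bounded retries with backoff. The sketch below is an assumption-laden illustration: `call_with_retries` and its parameters are hypothetical names, and `operation` stands in for whatever remote call your application makes.

```python
import random
import time


def call_with_retries(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on a network-style error, back off and try again.

    `operation` is a hypothetical zero-argument callable wrapping a remote call.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff with jitter so clients do not retry in lockstep.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
```

Retries are only safe for idempotent operations, and the attempt limit keeps a struggling dependency from being flooded.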
16. Guideline
Bleeding edge is an attitude.
Technology is changing every day.
What you knew yesterday is
already legacy (or deprecated).
17. Guideline
Understand consistency.
Consistency
Weak
• After a write, reads may or may not see it.
• Best effort only.
• Memcached, VoIP, live video streaming.
Eventual
• After a write, reads will eventually see it.
• Write will happen... Eventually.
• Object Storage, SMTP, DNS.
• Asynchronous data replication.
Strong
• After a write, reads will see it.
• Don’t continue until the write is committed.
• Filesystems, RDBMS.
• Synchronous data replication.
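The difference between synchronous and asynchronous replication can be sketched with a toy primary/replica pair. This is an in-memory model under simplifying assumptions (one replica, no failures); `ReplicatedStore` and its methods are hypothetical names, not any product's API.

```python
from collections import deque


class ReplicatedStore:
    """Toy primary/replica pair; not a model of any specific database."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = {}
        self.replica = {}
        self._pending = deque()  # writes not yet shipped to the replica

    def write(self, key, value):
        self.primary[key] = value
        if self.synchronous:
            # Strong: do not return until the replica has the write too.
            self.replica[key] = value
        else:
            # Eventual: acknowledge now, ship the write later.
            self._pending.append((key, value))

    def replicate(self):
        """Drain the backlog, as an asynchronous replication job would."""
        while self._pending:
            key, value = self._pending.popleft()
            self.replica[key] = value

    def read_from_replica(self, key):
        return self.replica.get(key)


eventual = ReplicatedStore(synchronous=False)
eventual.write("order", "paid")
stale = eventual.read_from_replica("order")   # replica has not caught up yet
eventual.replicate()
fresh = eventual.read_from_replica("order")   # now the write is visible
```

With `synchronous=True`, the same read returns the new value immediately, at the cost of every write waiting on the replica.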
18. Guideline
Make CAP Theorem decisions early on.
Partition tolerance cannot be sacrificed, so pick Consistency or Availability.
Consistency
All distributed nodes have a single up-to-date copy of all data at all times.
Availability
Every request receives a success or failure response.
Partition-tolerance
System continues to run despite arbitrary message loss or failure of part of the system.
[Figure: CAP triangle (C, A, P; pick two), with example data stores grouped as: Cassandra, CouchDB, HBase, etc.; MongoDB, Redis, etc.; Oracle, DB2, MySQL, etc.]
Distributed systems data persistence decisions
C+A
To have consistent and available data, partition tolerance must be sacrificed. This means that data can only be consistent in a single place at any moment in time.
C+P
To ensure data consistency and partition tolerance, availability must be sacrificed. This means that data is accessible only if all data nodes are available.
A+P
To ensure availability and partition tolerance, consistency must be sacrificed. This means some data nodes aren’t necessarily in sync in case of a networking disruption.
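The C+P versus A+P trade-off can be illustrated with a two-node toy store that behaves differently once a partition occurs. This is a simplified sketch: `TwoNodeStore`, `Unavailable` and the `mode` flag are hypothetical, and real data stores make this choice in far more nuanced ways.

```python
class Unavailable(Exception):
    """Raised when the store refuses a request to protect consistency."""


class TwoNodeStore:
    """Toy two-node store; `mode` is "CP" or "AP" (hypothetical flag)."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.nodes = [{}, {}]

    def write(self, node: int, key, value):
        if not self.partitioned:
            for data in self.nodes:  # healthy network: update both nodes
                data[key] = value
            return
        if self.mode == "CP":
            # Consistency first: refuse the write rather than let nodes diverge.
            raise Unavailable("peer unreachable; write rejected")
        # Availability first: accept locally and reconcile after the partition.
        self.nodes[node][key] = value

    def read(self, node: int, key):
        return self.nodes[node].get(key)
```

During a partition, the CP store returns an error while the AP store keeps answering, but its two nodes may disagree until they resynchronize.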
19. Guideline
Love DevOps? Wait till you meet
SRE.
https://landing.google.com/sre/
“SRE is what happens when you ask a software engineer to design an operations team.”
50. Guideline
Rolling update strategies for zero downtime deployments.
Account for the time the application needs to start up.
- Deploy by adding an instance, then removing an old one.
- Deploy by removing an instance, then adding a new one.
- Deploy by updating instances as fast as possible.
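The first two strategies differ in how much serving capacity remains during the rollout. The sketch below simulates a one-instance-at-a-time update; `rolling_update` is a hypothetical helper, and it assumes a new instance counts as capacity only once it has finished starting up. The "as fast as possible" strategy is not modelled.

```python
def rolling_update(instances: int, add_first: bool) -> list[int]:
    """Return the serving capacity after each step of a one-at-a-time rollout.

    Assumption: a new instance is counted only after it has started and
    is ready to serve, which is why startup time matters.
    """
    old, new = instances, 0
    capacity = []
    while old > 0:
        if add_first:
            new += 1                      # start a new instance first...
            capacity.append(old + new)
            old -= 1                      # ...then retire an old one
            capacity.append(old + new)
        else:
            old -= 1                      # retire an old instance first...
            capacity.append(old + new)
            new += 1                      # ...then start its replacement
            capacity.append(old + new)
    return capacity


surge = rolling_update(3, add_first=True)    # capacity never drops below 3
shrink = rolling_update(3, add_first=False)  # capacity dips to 2 at each step
```

Adding first needs temporary headroom for one extra instance; removing first needs none but runs below full capacity for as long as each new instance takes to start.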
52. Guideline
You don’t choose Chaos Monkey.
Chaos Monkey chooses you.
“Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production.”
https://principlesofchaos.org
53. Guideline
When pursuing Chaos
Engineering, start small and
observe and learn.
[Figure: example latency attack, plotting latency (ms) against # of instances, annotated "start here".]
I. Plan an experiment
II. Contain the blast radius
III. Scale or squash
How to conduct Chaos Engineering attacks:
• Test (latency, DNS, leap seconds, disk fill, kill processes, etc…).
• Expected results?
• Observed results.
• Document.
Remember to start small and gradually increase blast radius.
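The plan, contain, then scale-or-squash loop can be sketched as a small experiment runner. Everything here is an assumption for illustration: `measure_error_rate` is a hypothetical callable that injects latency into the given percentage of instances and returns the observed error rate, and the radii and threshold are placeholder values.

```python
def run_latency_experiment(measure_error_rate, radii=(1, 5, 25, 100),
                           max_error_rate=0.01):
    """Widen the blast radius step by step; squash at the first breach.

    Returns a log of expected versus observed results for documentation.
    """
    log = []
    for radius in radii:
        observed = measure_error_rate(radius)
        passed = observed <= max_error_rate
        log.append({
            "radius_percent": radius,
            "expected_max_error_rate": max_error_rate,
            "observed_error_rate": observed,
            "passed": passed,
        })
        if not passed:
            break  # squash: stop before the attack reaches more instances
    return log


# Usage with a stand-in measurement that degrades at 25% of instances.
report = run_latency_experiment(lambda radius: 0.0 if radius < 25 else 0.05)
```

The returned log covers the "expected results", "observed results" and "document" bullets in one place, and the early break keeps a failed hypothesis from spreading.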
54. Guideline
Data patterns differ. Not all data
are created equal.
[Diagram labels: Messaging, BPM, CEP, APP]
- Active standby or active/query
- Hot standby or configured active/active for fast switchover
- Multi-master or peer-to-peer write anywhere
- Data distribution filter and push
- Data warehouse integration and federation
- Data through messaging filter and push distribution
56. [Diagram. Network zones: public network, cloud network, enterprise network. Components: user, global load balancers, firewalls, transformation & connectivity, application cloud sites 1 to 3 (each with microservice application 1, microservice application 2 and a NoSQL database), enterprise databases in datacenter 1 and datacenter 2. Legend: application, infrastructure, data store, security, DevOps, user, scalable.]
3-Active Microservices Systems of Engagement with Active-Active Enterprise SoR
1. The global load balancer responds to the DNS request and points the user to the best responding site
2. The user request is sent to the best site to consume the business service application
3. Cloud native microservice #1 (using a circuit breaker) connects to the best Enterprise SoR
4. Cloud native microservice #2 performs CRUD on the NoSQL database in its site
5. The NoSQL database replication set performs the operation on each of its peers
6. The Enterprise SoR replication set performs CRUD on its peer
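Step 3 relies on a circuit breaker between the microservice and the Enterprise SoR. The deck does not say which implementation is used, so the following is a minimal generic sketch; the class name, thresholds and injected `clock` are assumptions made for illustration.

```python
import time


class CircuitOpen(Exception):
    """Raised when the breaker fails fast instead of calling the backend."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpen("backend recently unhealthy; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

When `CircuitOpen` is raised, the caller can immediately try the next-best Enterprise SoR instead of waiting on a timeout against the unhealthy one.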
[Diagram availability labels: 99.99%, 99.999%]