Preparing for Multi-Cloud

©2019, Intechsystems SIA
Preparing for Multi-Cloud Operation
20.06.2019
By Konstantin Tjuterev & Oleg Andreyev

Why 2 speakers?
• Fewer slides to prepare for each of us
• We’d like to show our case from 2 perspectives
• Business and high-level architecture
• Nitty-gritty details of implementation

About Oleg Andreyev
• Senior Software Architect @ Intexsys
• 7+ years in Software Development

About Konstantin Tjuterev
• Founder and Chief Architect @ Intexsys
• 20+ years in Software Development

Agenda
• Why do we need Multi-cloud?
• Challenges – why it’s complicated
• Addressing the challenges

Initial state
• Top 500 Online Retailer in the USA
• Existing Proprietary E-commerce Platform
• Multi-component stack (PHP/Symfony, MySQL, Elasticsearch, Cassandra, RabbitMQ, HAProxy, Varnish, Node.js)
• 15 online stores
• 1 000 000+ items sold
• $300 million annual turnover
• Hosted on AWS since 2018

Goals
• Average of $820K daily sales
• Downtime cost is at least $500/minute ($820K / 24 / 60 ≈ $570)
• In reality, it can go as high as $5000/minute during Black Friday
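
The per-minute figure follows from simple arithmetic. A minimal Python sketch, assuming sales are spread evenly over the day (a simplification; the constant and function names are illustrative):

```python
# Figures taken from the slides; everything else is plain arithmetic.
ANNUAL_TURNOVER_USD = 300_000_000
MINUTES_PER_DAY = 24 * 60


def downtime_cost_per_minute(daily_sales_usd: float) -> float:
    """Average revenue lost per minute of downtime, assuming uniform sales."""
    return daily_sales_usd / MINUTES_PER_DAY


average_daily_sales = ANNUAL_TURNOVER_USD / 365      # about $822K per day
average_cost = downtime_cost_per_minute(820_000)     # about $569 per minute
peak_multiplier = 5_000 / average_cost               # Black Friday peak vs. average

print(f"daily sales  ~ ${average_daily_sales:,.0f}")
print(f"avg downtime ~ ${average_cost:,.0f}/minute")
print(f"peak is about {peak_multiplier:.1f}x the average")
```

Real traffic is not uniform, which is why the Black Friday peak is close to nine times the daily average.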

What IF?
• What if AWS goes down?
• Never happened?
• But it DID
• And multiple times

What an AWS outage causes
The four-hour AWS outage caused S&P 500 companies to lose $150
million, Cyence, a startup that models the economic impact of cyber
risk, estimated, a Cyence spokeswoman said via email. US financial
services companies lost $160 million, the firm estimated.
That estimate doesn’t include countless other businesses that rely on
S3, on other AWS services that rely on S3, or on service providers that
built their services on Amazon’s cloud
https://www.datacenterknowledge.com/archives/2017/03/02/aws-outage-that-broke-the-internet-caused-by-mistyped-command

What happened?

What IF?
• What if we have a major problem in one of the (clustered) services?
• Elasticsearch cluster issue
• MySQL master issue
• What if we push the wrong button in some infrastructure/deployment automation tool?

Disaster Recovery Options
• Restoring from back-ups
• Snapshots of virtual machines/Database dumps
• Will have to spin up the whole infrastructure
• Cold stand-by
• A set of prepared but stopped virtual machines
• Database can be started, but dump must be restored
• Hot stand-by
• A set of running virtual machines not serving the traffic
• Running database replicas

Criteria
• Single/Shared points of failure
• Time to recovery / potential losses from outage ($5K/minute)
• Time to switch back after the primary infrastructure is restored / potential losses
• Cost of implementation / Complexity
• Cost of maintenance
• Additional benefits

Comparison
Option | Point of failure | Time to recovery/switching back | Complexity/Costs
Backup | Single, if backups are in the same cloud (AWS) or are restored to the same cloud | Very long: 24h at best (spinning up and reconfiguring the whole infrastructure) | Low / Very low (just storage)
Cold stand-by | Depends (can be put in a different cloud/data center) | Medium: 12+h (database restore), if the cold infrastructure is up to date | High / Medium
Hot stand-by | Depends (can be put in a different cloud/data center) | Low: less than 1h | Very High / High
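
Multiplying each option's recovery time by the peak loss rate gives a rough idea of what an outage costs. A small sketch using the slide figures ($5K/minute, 24h, 12h, 1h); the names are illustrative, and 24h and 12h are themselves best-case recovery times, so the real losses for those options could be higher:

```python
PEAK_LOSS_PER_MINUTE_USD = 5_000

# Recovery times from the comparison above, in hours.
RECOVERY_HOURS = {
    "Backup": 24,         # "24h at best"
    "Cold stand-by": 12,  # "12+h"
    "Hot stand-by": 1,    # "less than 1h"
}


def outage_loss(hours: float, loss_per_minute: float = PEAK_LOSS_PER_MINUTE_USD) -> float:
    """Revenue lost while the store is down for the given number of hours."""
    return hours * 60 * loss_per_minute


losses = {option: outage_loss(hours) for option, hours in RECOVERY_HOURS.items()}

for option, loss in losses.items():
    print(f"{option:<14} ${loss:>12,.0f}")
```

At peak rates this comes to $7.2M for a backup restore, $3.6M for cold stand-by and at most $300K for hot stand-by, which puts the higher running cost of the hot option into perspective.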

What is Multi-Cloud Operation?
• Not disaster recovery: production traffic always runs from multiple independent clouds
• No single point of failure
• Almost instant recovery from a cloud outage: all traffic is simply served by the surviving cloud
• No “failover/switching back”: when a cloud is restored after an outage, we just start sending traffic there again
• High complexity/cost, but much better reliability
• Continuously live-tested (monitoring, deployment, real customers)

Additional benefits
• Blue/green deployments at whole-infrastructure scale
• Running infrastructure-related experiments in an isolated but production environment
• Ability to benefit from cost differences between cloud providers (given that we’re paying for disaster recovery anyway)

Why not just AWS Multi-AZ?
• Sometimes AWS fails in all Availability Zones
• Vendor lock-in
• Complexity of a Multi-AZ setup is similar to Multi-Cloud, just shifted
• Single-cloud setup becomes easier (just use 1 AZ)
• Cross-cloud setup becomes more complicated
• With the same overall complexity we can get better results
• Better protection: no single point of failure

Challenges
• Pushing source data to Multiple Clouds
• Data Synchronization between Clouds
• Deployment
• Dependencies
• Scheduled jobs
• Traffic balancing
• Monitoring/Alerting

Pushing data
• RabbitMQ in the office
• Clouds pull messages and update data in real time
• Incoming traffic to the clouds is free
• Read-only database replication from the office

Data Synchronization between Clouds
• This is the most challenging part
• We need to replicate relational data (such as orders, users) between
multiple clouds
• We’re using MySQL and are not planning to change that
• So, how to replicate data between clouds?

MySQL Real Master-Master replication
• Master MySQL nodes run in different clouds
• Both write binary logs and execute each other’s
• With Multi-Cloud we need to support writes from both clouds
• Initially we were using an auto-increment primary key (as everyone does)
• That won’t work with Master-Master

What will happen if…
• John Doe and Peter Doe both create an account/order at the same time
• The requests are handled by different clouds, and both rows get the same auto-increment ID

Replication conflict
• Replication stops
• Replication can be resumed manually by skipping the error
• But the clouds are now out of sync

How to avoid such a situation?
• Set up MySQL Cluster
• or set up Percona XtraDB Cluster
• or set up MariaDB Galera Cluster

But...
• “All or nothing” approach
• Your application needs to handle failures at COMMIT
• COMMIT is as slow as the slowest node in the cluster
• Network round-trip time / Certification time / Local apply
• We are not building a cluster…

Other solutions
• Primary Key Auto Increment step for each server (even/odd)
• Primary Key that will not collide
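
The collision and the even/odd workaround can be illustrated with a short sketch. MySQL exposes the step and start as the `auto_increment_increment` and `auto_increment_offset` system variables; the Python generator below only mimics the resulting ID sequence and is not how the server implements it:

```python
from itertools import count


def auto_increment(offset: int = 1, increment: int = 1):
    """Yield IDs like an auto-increment column: offset, offset + increment, ..."""
    return count(start=offset, step=increment)


def first_ids(generator, n: int) -> set:
    """Collect the first n IDs handed out by a generator."""
    return {next(generator) for _ in range(n)}


# Default settings on both masters: each hands out 1, 2, 3, ...
cloud_a = first_ids(auto_increment(), 5)
cloud_b = first_ids(auto_increment(), 5)
colliding = cloud_a & cloud_b            # every ID is used twice

# Odd/even split: same step, different offset on each master.
cloud_a_odd = first_ids(auto_increment(offset=1, increment=2), 5)
cloud_b_even = first_ids(auto_increment(offset=2, increment=2), 5)
disjoint = cloud_a_odd & cloud_b_even    # empty: no shared IDs

print(sorted(colliding), sorted(disjoint))
```

The split removes primary-key collisions, but the step has to be re-planned whenever another master is added, which is one reason to consider keys that cannot collide by construction.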

Universally unique identifier - UUID
• It’s a 128-bit number
• It’s written as 32 hexadecimal digits (128/4)
• Also referred to as a GUID

Versions of UUID
• Nil UUID – special case in which all 128 bits are zero
• UUID v1 – generated from a timestamp and a node ID (MAC address)
• UUID v2 – generated from a local identifier, a timestamp, and a node ID
• UUID v3 – generated by hashing a namespace and a name (MD5)
• UUID v4 – generated from random or pseudo-random numbers
• UUID v5 – same as v3, but using SHA-1
• UUID v6 – field-reordered, sortable version of UUID v1 (unofficial in 2019, since standardized in RFC 9562)

UUID v1
• It is time-based (sorting will not suffer much)
• It can be stored compactly in 16 bytes
• Maximum average rate: 163 billion per second per node
• Can be traced back to the server that created it
• B-Tree friendly once the time fields are reordered
• 16 bytes need less storage than 32 characters
• To UUID or not to UUID?
• Storing UUID Values in MySQL
UUID v1 Structure

Optimized UUID v1
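
The slide's diagram is not reproduced in this transcript. As an illustration, the sketch below shows one common reordering, the same swap of the time-low and time-high groups that MySQL 8.0 performs in `UUID_TO_BIN(uuid, 1)`. Whether the authors used exactly this layout is an assumption, and `uuid1_from_timestamp` is a hypothetical helper that exists only to build predictable test values:

```python
import uuid


def to_ordered_bytes(u: uuid.UUID) -> bytes:
    """Reorder a v1 UUID as time_hi + time_mid + time_low + rest,
    so that byte order follows creation time."""
    b = u.bytes
    return b[6:8] + b[4:6] + b[0:4] + b[8:16]


def from_ordered_bytes(data: bytes) -> uuid.UUID:
    """Inverse of to_ordered_bytes."""
    return uuid.UUID(bytes=data[4:8] + data[2:4] + data[0:2] + data[8:16])


def uuid1_from_timestamp(ts: int, clock_seq: int = 0,
                         node: int = 0x0123456789AB) -> uuid.UUID:
    """Hypothetical helper: build a v1 UUID from a 60-bit timestamp."""
    time_low = ts & 0xFFFFFFFF
    time_mid = (ts >> 32) & 0xFFFF
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000      # version 1
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


# Two UUIDs one tick apart, chosen so the low 32 timestamp bits wrap around.
earlier = uuid1_from_timestamp(0x01E9ABCDFFFFFFFF)
later = uuid1_from_timestamp(0x01E9ABCE00000000)

raw_is_chronological = earlier.bytes < later.bytes                          # False
ordered_is_chronological = to_ordered_bytes(earlier) < to_ordered_bytes(later)  # True
```

Because the fastest-changing timestamp bits come first in the standard layout, raw v1 values scatter across the index; after the swap, new rows land near the end of the B-Tree.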

But conflicts are still possible…
• Conflicts are still possible, but not on the primary key
• Conflicts can be caused by other unique keys

But it’s very unlikely to happen because
• Normal replication delay is < 1s
• A customer cannot send requests with the same data to different clouds that fast

What data to replicate between Clouds?
• Every piece of data has its source, and we need to clearly understand it
• Data generated by the end user (customer or server)
• Data pushed into the Cloud by us

How to do Database Migrations
• Follow “zero-downtime” migration practices
• Avoid table locking
• Use ALGORITHM=INPLACE, LOCK=NONE when possible
• Do not deploy code that writes to a new column before the column exists
• Always think about backward compatibility; usually there is no revert
• Run DROP and RENAME only after you are fully satisfied
• It’s better to run ALTER manually: it is more predictable
• Always remember that you are running in Multi-Cloud/Hot-standby

Another safety-check for developers
• Create separate users with DDL permissions for the two types of tables
• Tables that are populated by customers
• Tables that are populated by us
• Remove DDL permissions from the main user
• Group migrations by “category”
• Before deploying to another Cloud, make sure it has SBM (Seconds_Behind_Master) = 0
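
The replication-lag check can be automated as a deployment gate. This is a minimal sketch, assuming the deployment tooling has already fetched one row of `SHOW SLAVE STATUS` as a dictionary (how that row is obtained is left out, and `safe_to_deploy` is an illustrative name):

```python
def safe_to_deploy(replica_status: dict) -> bool:
    """Return True only if both replication threads run and the lag is zero.

    replica_status is one row of SHOW SLAVE STATUS keyed by column name.
    Seconds_Behind_Master is NULL (None) when replication is not running.
    """
    lag = replica_status.get("Seconds_Behind_Master")
    if lag is None:
        return False
    return (
        replica_status.get("Slave_IO_Running") == "Yes"
        and replica_status.get("Slave_SQL_Running") == "Yes"
        and int(lag) == 0
    )


healthy = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
           "Seconds_Behind_Master": 0}
lagging = {**healthy, "Seconds_Behind_Master": 12}
stopped = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No",
           "Seconds_Behind_Master": None}

print(safe_to_deploy(healthy), safe_to_deploy(lagging), safe_to_deploy(stopped))
```

Treating a missing or NULL lag as "not safe" keeps a broken replication channel from being mistaken for a caught-up one.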

How to deploy to Multi-Cloud
• Make sure your application is Cloud agnostic
• Store config in the environment (The Twelve-Factor App)
• Do not deploy application to all Clouds simultaneously
• Backward compatibility
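
Keeping configuration in the environment means the same build runs unchanged in every cloud. A minimal sketch; the variable names (`APP_CLOUD`, `DATABASE_DSN`, `ELASTICSEARCH_URL`) and the example values are hypothetical, not taken from the authors' setup:

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Hypothetical variable names, chosen for illustration only.
REQUIRED_VARIABLES = ("APP_CLOUD", "DATABASE_DSN", "ELASTICSEARCH_URL")


@dataclass(frozen=True)
class CloudConfig:
    cloud: str
    database_dsn: str
    elasticsearch_url: str


def load_config(env: Mapping[str, str] = os.environ) -> CloudConfig:
    """Build the configuration from environment variables, failing fast if any is missing."""
    missing = [name for name in REQUIRED_VARIABLES if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return CloudConfig(
        cloud=env["APP_CLOUD"],
        database_dsn=env["DATABASE_DSN"],
        elasticsearch_url=env["ELASTICSEARCH_URL"],
    )


# The same code, two environments (example values only).
aws_env = {"APP_CLOUD": "aws", "DATABASE_DSN": "mysql://db.aws.internal/shop",
           "ELASTICSEARCH_URL": "http://es.aws.internal:9200"}
azure_env = {**aws_env, "APP_CLOUD": "azure",
             "DATABASE_DSN": "mysql://db.azure.internal/shop"}

print(load_config(aws_env).cloud, load_config(azure_env).cloud)
```

Failing at start-up when a variable is missing surfaces a misconfigured cloud before it receives traffic.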

How to deploy assets (JS/CSS)
• Figure out the asset lifetime
• Make sure you support a few old versions of assets (cache)
• Make sure your assets are backward compatible
• If you have persisted data in customer space (cookies, local storage), make sure it is compatible between versions
• Monitor and log your assets
• Make sure that the asset hash is generated automatically
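
One way to generate the hash automatically is to derive it from the file content at build time. This is a sketch of the idea rather than the authors' build pipeline; the function name, the digest length and the naming pattern are assumptions:

```python
import hashlib
from pathlib import PurePosixPath


def hashed_asset_name(path: str, content: bytes, digest_length: int = 8) -> str:
    """Insert a short content hash before the file extension,
    e.g. js/app.js becomes js/app.<hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:digest_length]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


old_release = hashed_asset_name("js/app.js", b"console.log('v1');")
new_release = hashed_asset_name("js/app.js", b"console.log('v2');")

print(old_release)
print(new_release)
```

Because the name changes only when the content changes, old and new versions can be served side by side while the clouds run different releases, and caches never return a stale file under a new name.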

Asset Lifetime

Distributed CRON
• Do not directly configure CRON on servers
• Scheduling MUST be delegated to an independent system
• Determine your clients
• Handle VM “death”: you should be able to switch a job over quickly
• https://mesos.github.io/chronos/
• https://dkron.io

Other facts
• We had to upgrade MySQL twice within 6 months, 5.5 -> 5.7 -> 8.0
• 5.7 – GTID, Replication channels
• 8.0 – Replication Filter per Channel
• Use GTID (Global Transaction ID) for consistency
• Use AUTO_POSITION for replication (only with GTID)

Brief Summary
• Use UUID to avoid conflicts with Primary Key
• Determine what data needs to be synced
• Monitor your replication with all possible tools
• Use distributed CRON
• Monitor and log your JS/CSS
• Remember the CAP theorem

Traffic balancing
• DNS – weight-based with health checks
• WAF/CDN + Rules Engine (on CDN Edges)
• Location stickiness

Routing with cloud stickiness
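[Diagram: www.site.com is resolved by weight-based DNS to the WAF/CDN. The CDN edge checks whether a sticky cookie is present. If it is, the request goes to the cloud named in the cookie (aws.site.com or azure.site.com) while that cloud is alive. If there is no cookie, or that cloud has an outage, the request goes to balanced.site.com, where health-based DNS sends 90% of traffic to AWS and 10% to Azure. The response sets Set-Cookie: cloud=aws or Set-Cookie: cloud=azure.]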
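
The routing decision can be sketched as a small function. This is an illustration of the logic as read from the diagram labels (sticky cookie, health checks, 90%/10% weights), not the authors' CDN rules; `pick_cloud` and its return convention are assumptions:

```python
import random

# Traffic split shown on the slide.
WEIGHTS = {"aws": 90, "azure": 10}


def pick_cloud(cookie_cloud, alive, rng=random):
    """Return (cloud, set_cookie) for one request.

    cookie_cloud: value of the sticky cookie, or None if absent.
    alive: set of clouds currently passing health checks.
    set_cookie is None when the existing cookie can be kept.
    """
    if cookie_cloud in WEIGHTS and cookie_cloud in alive:
        return cookie_cloud, None              # stay on the same cloud
    candidates = [cloud for cloud in WEIGHTS if cloud in alive]
    if not candidates:
        raise RuntimeError("no cloud is alive")
    weights = [WEIGHTS[cloud] for cloud in candidates]
    cloud = rng.choices(candidates, weights=weights, k=1)[0]
    return cloud, f"cloud={cloud}"             # pin the customer to this cloud


print(pick_cloud("azure", {"aws", "azure"}))   # sticky: stays on Azure
print(pick_cloud("aws", {"azure"}))            # AWS outage: moved to Azure
```

Stickiness keeps a customer's consecutive requests in one cloud, so they are not affected by the sub-second replication delay between clouds.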

Summary
• Not everyone needs Multi-cloud
• You need to have clear reasons to go Multi-cloud
• Disaster recovery
• Speed (geo-based)
• It’s challenging and costly
• But doable even with basic tools/stack (PHP/MySQL)

Q&A
