This talk elaborates on how to detect and heal your MySQL topology with MySQL Orchestrator. It was delivered at the Mydbops Database Meetup on 27-04-2019 by Anil Yadav, Lead Database Engineer at OLA, and Krishna Ramanathan, Database Administrator III at OLA.
4. High availability objectives
● How much outage time can you tolerate?
● How reliable is crash detection? Can you tolerate false positives (premature
failovers)?
● How reliable is failover? Where can it fail?
● How well does the solution work cross-data-center? On low and high latency
networks?
● Can you afford data loss? To what extent?
9. The Chosen One
● MySQL Orchestrator
○ Pros
■ Adoption
■ Topology Awareness
■ Large Installations
● Booking.com
● GitHub
○ Cons
■ Needs GTID or MaxScale for healing
10. Building Blocks
● MySQL Orchestrator
● MaxScale Binlog Servers
● Semi Sync Replication
● NVMe Storage
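Of the building blocks above, semi-sync replication is the one configured inside MySQL itself. A minimal my.cnf sketch for enabling it via the semisync plugins (MySQL 5.7 names; the timeout value is illustrative, not from the talk):

```ini
# Load the semi-sync plugins (master and replica sides)
plugin-load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"

# On the master: wait for at least one replica ACK before a commit returns
rpl_semi_sync_master_enabled = 1
# Fall back to async replication if no ACK arrives within 1 second (illustrative)
rpl_semi_sync_master_timeout = 1000

# On replicas: acknowledge received transactions
rpl_semi_sync_slave_enabled = 1
```

With an ACK required before commit, a failover to a semi-sync replica bounds the data loss the earlier objectives slide asks about.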
12. orchestrator.conf.json
"FailureDetectionPeriodBlockMinutes": 5,
"RecoveryPeriodBlockSeconds": 1800,
"RecoveryIgnoreHostnameFilters": [‘slave’],
"RecoverMasterClusterFilters": ["orch-master"],
"RecoverIntermediateMasterClusterFilters": ["orch-master"],
"OnFailureDetectionProcesses": [
"echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}, We dont panic' >> /usr/local/orchestrator/recovery.log","/eni_modules/orch_sendmail.py 'Master {failedHost} detected for {failureType}'"
],
"PreFailoverProcesses": [
"echo 'Will recover from {failureType} on {failureCluster}, Failed Host is : {failedHost}' >> /usr/local/orchestrator/recovery.log","/eni_modules/eni_detach.sh {failedHost} {failureType}>> /usr/local/orchestrator/recovery.log"
],
"PostFailoverProcesses": [
"echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}, Recovered from faliure>> /usr/local/orchestrator/recovery.log"
],
"PostUnsuccessfulFailoverProcesses": [],
"PostMasterFailoverProcesses": [
"echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /usr/local/orchestrator/recovery.log","/eni_modules/eni_attach.sh {failedHost} {successorHost}
>>/usr/local/orchestrator/recovery.log"
],
"PostIntermediateMasterFailoverProcesses": [
"echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /usr/local/orchestrator/recovery.log"
],
13. Pre-Failover Process
● Read-only is set on the MySQL master.
● The ENI is detached from the master through the AWS CLI.
○ This prevents the chances of split-brain.
● Connections are killed.
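The eni_detach.sh hook itself is not shown in the deck, so the following is only a sketch of what the detach step could look like: it builds the standard `aws ec2 detach-network-interface` call (the attachment-ID lookup and the script's internals are assumptions, not from the talk):

```python
import shlex

def build_eni_detach_command(attachment_id: str, force: bool = True) -> str:
    """Build the AWS CLI call a pre-failover hook could run to detach
    the dead master's ENI. The attachment id would be looked up first,
    e.g. via `aws ec2 describe-network-interfaces`."""
    cmd = ["aws", "ec2", "detach-network-interface",
           "--attachment-id", attachment_id]
    if force:
        # --force detaches even when the instance is unreachable,
        # which is exactly the dead-master case
        cmd.append("--force")
    return " ".join(shlex.quote(part) for part in cmd)

# Hypothetical attachment id, for illustration only
print(build_eni_detach_command("eni-attach-0123456789abcdef0"))
```

Because clients connect to the master through this floating ENI, detaching it guarantees no application can still write to the old master while a new one is promoted.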
14. Healing
● The most ahead binlog server is chosen
● Other binlog servers are grouped under it
○ This makes the topology consistent
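Choosing the most-ahead binlog server boils down to comparing binlog coordinates: later file first, then higher position. A simplified sketch, assuming coordinates written as `mysql-bin.NNNNNN:position` (the hostnames and coordinates are made up for illustration):

```python
def binlog_coords(coord: str):
    """Split 'mysql-bin.000123:456' into a sortable (file_number, position) tuple."""
    file_name, pos = coord.rsplit(":", 1)
    # The numeric suffix of the binlog file orders the files themselves
    file_number = int(file_name.rsplit(".", 1)[1])
    return (file_number, int(pos))

def most_ahead(servers: dict) -> str:
    """Return the server whose coordinates are furthest ahead; the
    remaining binlog servers would then be regrouped under it."""
    return max(servers, key=lambda host: binlog_coords(servers[host]))

servers = {
    "binlog-1": "mysql-bin.000102:884",
    "binlog-2": "mysql-bin.000103:120",  # later file wins over a higher position
    "binlog-3": "mysql-bin.000102:990",
}
print(most_ahead(servers))  # → binlog-2
```

Grouping the others under the winner works because every binlog server carries an identical (if shorter) copy of the master's binlog stream, so the laggards can simply resume from their own coordinates.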
15. Healing
● The new candidate master is chosen
○ This happens through “PromotionIgnoreHostnameFilters” setting, eg :
"PromotionIgnoreHostnameFilters": ["slave","lytic","backup"]
● The new Master’s binlog is flushed and the binlog servers are pointed under it
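The effect of "PromotionIgnoreHostnameFilters" can be sketched as a filter over candidate hostnames. This is a simplification using regex search (orchestrator's exact matching semantics may differ; the candidate hostnames are invented for illustration):

```python
import re

PROMOTION_IGNORE_HOSTNAME_FILTERS = ["slave", "lytic", "backup"]

def is_promotable(hostname: str,
                  ignore_filters=PROMOTION_IGNORE_HOSTNAME_FILTERS) -> bool:
    """A host matching any ignore filter is skipped when picking the
    new candidate master (sketch of the config setting's intent)."""
    return not any(re.search(f, hostname) for f in ignore_filters)

candidates = ["db-master-2", "db-slave-1", "analytic-db-1", "backup-db-1"]
promotable = [h for h in candidates if is_promotable(h)]
print(promotable)  # → ['db-master-2']
```

This keeps delayed replicas, analytics hosts, and backup hosts from ever being promoted, even when they happen to be the most up to date.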
16. Post-Failover Process
● ENI is attached to the new master through AWS CLI.
● Connections can be seen on the new master at this point.
● This marks the end of the recovery process.
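As with the detach step, the attach hook (eni_attach.sh in the config) is not shown in the deck; a sketch of the AWS CLI call it could issue, using the standard `aws ec2 attach-network-interface` flags (the ENI and instance ids are placeholders):

```python
import shlex

def build_eni_attach_command(eni_id: str, instance_id: str,
                             device_index: int = 1) -> str:
    """Build the AWS CLI call a post-failover hook could run to attach
    the floating ENI to the newly promoted master, at which point
    client connections follow it to the new master."""
    cmd = ["aws", "ec2", "attach-network-interface",
           "--network-interface-id", eni_id,
           "--instance-id", instance_id,
           "--device-index", str(device_index)]
    return " ".join(shlex.quote(part) for part in cmd)

print(build_eni_attach_command("eni-0123456789abcdef0", "i-0fedcba9876543210"))
```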
17. Challenges
● Orchestrator’s upstream does not support MaxScale binlog servers
● Had to move to the previous version
○ https://github.com/outbrain/orchestrator
● A dead master caused by an EC2 failure can reach the state
“checkAndRecoverUnreachableMasterWithStaleSlaves”
● It was patched to arrive at the state “checkAndRecoverDeadMaster” instead
● Orchestrator’s forced takeover was failing, so it was patched to follow the same
path as a “DeadMaster”
● The forked branch with these changes is at -
https://github.com/varunarora123/orchestrator