This talk elaborates on how to detect and heal your MySQL topology with MySQL Orchestrator. It was delivered at the Mydbops Database Meetup on 27-04-2019 by Anil Yadav, Lead Database Engineer at OLA, and Krishna Ramanathan, Database Administrator III at OLA.
4. High availability objectives
● How much outage time can you tolerate?
● How reliable is crash detection? Can you tolerate false positives (premature
failovers)?
● How reliable is failover? Where can it fail?
● How well does the solution work cross-data-center? On low and high latency
networks?
● Can you afford data loss? To what extent?
9. The Chosen One
● MySQL Orchestrator
○ Pros
■ Adoption
■ Topology Awareness
■ Large Installations
● Booking.com
● GitHub
○ Cons
■ Needs GTID or MaxScale for healing
10. Building Blocks
● MySQL Orchestrator
● MaxScale Binlog Servers
● Semi Sync Replication
● NVMe Storage
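Of the building blocks above, semi-sync replication is the one configured inside MySQL itself. A minimal my.cnf sketch for enabling it via the semisync plugins (MySQL 5.7 names; the timeout value is illustrative, not from the talk):

```ini
# Load the semi-sync plugins (master and replica sides)
plugin-load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"

# On the master: wait for at least one replica ACK before a commit returns
rpl_semi_sync_master_enabled = 1
# Fall back to async replication if no ACK arrives within 1 second (illustrative)
rpl_semi_sync_master_timeout = 1000

# On replicas: acknowledge received transactions
rpl_semi_sync_slave_enabled = 1
```

With an ACK required before commit, a failover to a semi-sync replica bounds the data loss the earlier objectives slide asks about.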
12. orchestrator.conf.json
"FailureDetectionPeriodBlockMinutes": 5,
"RecoveryPeriodBlockSeconds": 1800,
"RecoveryIgnoreHostnameFilters": [‘slave’],
"RecoverMasterClusterFilters": ["orch-master"],
"RecoverIntermediateMasterClusterFilters": ["orch-master"],
"OnFailureDetectionProcesses": [
"echo 'Detected {failureType} on {failureCluster}. Affected replicas: {countSlaves}, We dont panic' >> /usr/local/orchestrator/recovery.log","/eni_modules/orch_sendmail.py 'Master {failedHost} detected for {failureType}'"
],
"PreFailoverProcesses": [
"echo 'Will recover from {failureType} on {failureCluster}, Failed Host is : {failedHost}' >> /usr/local/orchestrator/recovery.log","/eni_modules/eni_detach.sh {failedHost} {failureType}>> /usr/local/orchestrator/recovery.log"
],
"PostFailoverProcesses": [
"echo '(for all types) Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}, Recovered from faliure>> /usr/local/orchestrator/recovery.log"
],
"PostUnsuccessfulFailoverProcesses": [],
"PostMasterFailoverProcesses": [
"echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /usr/local/orchestrator/recovery.log","/eni_modules/eni_attach.sh {failedHost} {successorHost}
>>/usr/local/orchestrator/recovery.log"
],
"PostIntermediateMasterFailoverProcesses": [
"echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Successor: {successorHost}:{successorPort}' >> /usr/local/orchestrator/recovery.log"
],
13. Pre-Failover Process
● Read-only is set on the MySQL master.
● The ENI is detached from the master through the AWS CLI.
○ This prevents the chances of split-brain.
● Connections are killed.
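The eni_detach.sh hook itself is not shown in the deck, so the following is only a sketch of what the detach step could look like: it builds the standard `aws ec2 detach-network-interface` call (the attachment-ID lookup and the script's internals are assumptions, not from the talk):

```python
import shlex

def build_eni_detach_command(attachment_id: str, force: bool = True) -> str:
    """Build the AWS CLI call a pre-failover hook could run to detach
    the dead master's ENI. The attachment id would be looked up first,
    e.g. via `aws ec2 describe-network-interfaces`."""
    cmd = ["aws", "ec2", "detach-network-interface",
           "--attachment-id", attachment_id]
    if force:
        # --force detaches even when the instance is unreachable,
        # which is exactly the dead-master case
        cmd.append("--force")
    return " ".join(shlex.quote(part) for part in cmd)

# Hypothetical attachment id, for illustration only
print(build_eni_detach_command("eni-attach-0123456789abcdef0"))
```

Because clients connect to the master through this floating ENI, detaching it guarantees no application can still write to the old master while a new one is promoted.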
14. Healing
● The most ahead binlog server is chosen
● Other binlog servers are grouped under it
○ This makes the topology consistent
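Choosing the most-ahead binlog server boils down to comparing binlog coordinates: later file first, then higher position. A simplified sketch, assuming coordinates written as `mysql-bin.NNNNNN:position` (the hostnames and coordinates are made up for illustration):

```python
def binlog_coords(coord: str):
    """Split 'mysql-bin.000123:456' into a sortable (file_number, position) tuple."""
    file_name, pos = coord.rsplit(":", 1)
    # The numeric suffix of the binlog file orders the files themselves
    file_number = int(file_name.rsplit(".", 1)[1])
    return (file_number, int(pos))

def most_ahead(servers: dict) -> str:
    """Return the server whose coordinates are furthest ahead; the
    remaining binlog servers would then be regrouped under it."""
    return max(servers, key=lambda host: binlog_coords(servers[host]))

servers = {
    "binlog-1": "mysql-bin.000102:884",
    "binlog-2": "mysql-bin.000103:120",  # later file wins over a higher position
    "binlog-3": "mysql-bin.000102:990",
}
print(most_ahead(servers))  # → binlog-2
```

Grouping the others under the winner works because every binlog server carries an identical (if shorter) copy of the master's binlog stream, so the laggards can simply resume from their own coordinates.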
15. Healing
● The new candidate master is chosen
○ This happens through “PromotionIgnoreHostnameFilters” setting, eg :
"PromotionIgnoreHostnameFilters": ["slave","lytic","backup"]
● The new Master’s binlog is flushed and the binlog servers are pointed under it
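The effect of "PromotionIgnoreHostnameFilters" can be sketched as a filter over candidate hostnames. This is a simplification using regex search (orchestrator's exact matching semantics may differ; the candidate hostnames are invented for illustration):

```python
import re

PROMOTION_IGNORE_HOSTNAME_FILTERS = ["slave", "lytic", "backup"]

def is_promotable(hostname: str,
                  ignore_filters=PROMOTION_IGNORE_HOSTNAME_FILTERS) -> bool:
    """A host matching any ignore filter is skipped when picking the
    new candidate master (sketch of the config setting's intent)."""
    return not any(re.search(f, hostname) for f in ignore_filters)

candidates = ["db-master-2", "db-slave-1", "analytic-db-1", "backup-db-1"]
promotable = [h for h in candidates if is_promotable(h)]
print(promotable)  # → ['db-master-2']
```

This keeps delayed replicas, analytics hosts, and backup hosts from ever being promoted, even when they happen to be the most up to date.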
16. Post-Failover Process
● ENI is attached to the new master through AWS CLI.
● Connections can be seen on the new master at this point.
● This marks the end of the recovery process.
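As with the detach step, the attach hook (eni_attach.sh in the config) is not shown in the deck; a sketch of the AWS CLI call it could issue, using the standard `aws ec2 attach-network-interface` flags (the ENI and instance ids are placeholders):

```python
import shlex

def build_eni_attach_command(eni_id: str, instance_id: str,
                             device_index: int = 1) -> str:
    """Build the AWS CLI call a post-failover hook could run to attach
    the floating ENI to the newly promoted master, at which point
    client connections follow it to the new master."""
    cmd = ["aws", "ec2", "attach-network-interface",
           "--network-interface-id", eni_id,
           "--instance-id", instance_id,
           "--device-index", str(device_index)]
    return " ".join(shlex.quote(part) for part in cmd)

print(build_eni_attach_command("eni-0123456789abcdef0", "i-0fedcba9876543210"))
```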
17. Challenges
● Orchestrator’s upstream does not support MaxScale binlog servers
● Had to move to the previous version
○ https://github.com/outbrain/orchestrator
● A dead master caused by an EC2 failure can reach the state
“checkAndRecoverUnreachableMasterWithStaleSlaves”
● It was patched to arrive at the state “checkAndRecoverDeadMaster” instead
● Orchestrator’s forced takeover was failing, so it was patched to follow the same
path as a “DeadMaster”
● The forked branch with these changes is at -
https://github.com/varunarora123/orchestrator