A challenging architecture design and proof of concept on a real case study, using a synchronous replication solution.
The customer asked me to investigate and design a MySQL architecture to support his application serving shops around the globe.
Scale out and scale in according to the sales seasons.
Scaling with sync_replication using Galera and EC2
1. Scaling MySQL using multi-master synchronous replication
Marco “the Grinch” Tusa
Percona Live London 2013
2. About Me
Introduction
Marco “The Grinch”
• Former Pythian cluster technical leader
• Former MySQL AB PS (EMEA)
• Love programming
• History of religions
• Ski; snowboard; scuba diving; mountain trekking
3. Agenda
• Customer requirements
• Installation and initial setup
• Applying the customer scenario to solution
• Numbers, and considerations.
• Scaling out test and efforts
• Scaling in test and efforts
• Geographic distribution
4. Many Galera Talks
• Percona XtraDB Cluster in a Nutshell: Hands-on Tutorial (tutorial, Monday)
• Galera Cluster 3.0 New Features, Seppo Jaakola (presentation, Tuesday)
• How to Understand Galera Replication, Alexey Yurchenko (presentation, Tuesday)
5. A journey started 2 years ago
• First work done as a POC in November 2011
• First implementation in production in January 2012
• Many more after that
• Latest: 12 clusters of 5 nodes, with 18 to 24 application servers attached
6. Historical Real life case
The customer mentioned the need to scale for writes.
My first thought went to NDB.
The customer had specific constraints:
• Amazon EC2;
• No huge instances (medium preferred);
• The number of instances increases during peak seasons;
• The number of instances must be reduced during regular periods;
• The customer uses InnoDB as the storage engine in his current platform and will not change;
Customer requirements
7. Refine the customer requirements
A challenging architecture design and proof of concept on a real case study, using a synchronous replication solution.
The customer asked us to investigate and design a MySQL architecture to support his application serving shops around the globe.
Scale out and scale in according to the sales seasons. We will share our experience by presenting the results of our POC.
Customer numbers:
• Operations per second: from 20 to 30,000 (5,000 inserts/sec)
• Selects/inserts: 70/30 %
• Application servers: from 2 to ∞
• MySQL servers: from 2 to ∞
• Operation size: from 20 bytes to a maximum of 1 MB (text)
• Data set size: 40 GB (old data is archived every year)
• Geographic distribution (3 -> 5 zones), partial dataset
9. Scaling Up vs. Out
Scaling Up Model
• Requires more investment
• Not flexible and not a good fit with MySQL
Scaling Out Model
• Scales with small investments
• Flexible
• Fits the MySQL model (HA, load balancing etc.)
10. Scaling Reads vs Write
Read
• Easy way of doing it in MySQL if the % of writes is low
Write
• Replication is not working:
  • Single process
  • No data consistency check
• Parallel replication by schema is not the solution
• Semi-synchronous replication is not the solution as well
11. Synchronous Replication in MySQL
MySQL Cluster, I love NDB Cluster
• Really synchronous
• Data distribution and internal partitioning
• The only real solution giving you 99.999% availability (about 5 minutes of downtime per year at most)
• NDB Cluster is more than a simple storage engine (use the API if you can)
Galera replication
• Virtually synchronous
• No data consistency check (optimistic locking)
• Data replicated at commit
• Uses InnoDB
Options Overview
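As a quick check of the availability figure quoted for NDB Cluster above, the downtime budget implied by an availability percentage can be computed directly. This is only a sketch: the helper name downtime_minutes_per_year is made up for the example, and a 365-day year is assumed.

```shell
#!/bin/sh
# Hypothetical helper (not from the slides): minutes of downtime per year
# allowed by a given availability percentage, assuming a 365-day year.
downtime_minutes_per_year() {
    awk -v pct="$1" 'BEGIN { printf "%.2f\n", (1 - pct / 100) * 365 * 24 * 60 }'
}

downtime_minutes_per_year 99.999   # five nines
downtime_minutes_per_year 99.9     # three nines, for comparison
```

Five nines works out to a little over five minutes per year, which matches the "5 minutes" quoted on the slide.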
12. Choosing the solution
Did I say I love NDB Cluster?
– But it is not a good fit here because:
• EC2 instance size (1 CPU, 3.7 GB RAM);
• The customer does not want to change from InnoDB;
• The developers would need training to get the maximum out of it;
– Galera could be a better fit because:
• It fits in the EC2 instance size;
• It uses InnoDB;
• No additional knowledge is needed to develop the solution;
14. Architecture Design
• Application layer in the cloud
• Load balancer distributing requests in round robin (RR)
• Data layer in the cloud
• MySQL instances geographically distributed
• EC2 small instance
• EC2 medium instance
Architecture AWS blocks
15. Instances EC2
Web servers
• Small instance
• Local EBS
Data servers
• Medium instance, 1 CPU, 3.7 GB RAM
• 1 EBS volume for the OS
• 6 EBS volumes in RAID0 for data
Be ready to scale OUT
• Create an AMI
• Update the AMI on a regular basis
Architecture EC2 blocks
16. Why not ephemeral storage
• RAID0 across 6 EBS volumes performs faster;
• The RAID approach mitigates possible temporary degradation;
• Ephemeral is … ephemeral, all data will be lost;
Numbers from a rough comparison:
(ebs) Timing buffered disk reads: 768 MB in 3.09 seconds = 248.15 MB/sec
(eph) Timing buffered disk reads: 296 MB in 3.01 seconds = 98.38 MB/sec
(ebs) Timing O_DIRECT disk reads: 814 MB in 3.20 seconds = 254.29 MB/sec
(eph) Timing O_DIRECT disk reads: 2072 MB in 3.00 seconds = 689.71 MB/sec
Architecture Installation and numbers
20. Storage on EC2
Multiple EBS RAID0
Or
USE Provisioned IOPS
Amazon EBS Standard volumes:
• $0.10 per GB-month of provisioned storage
• $0.10 per 1 million I/O requests
Amazon EBS Provisioned IOPS volumes:
• $0.125 per GB-month of provisioned storage
• $0.10 per provisioned IOPS-month
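To compare the two pricing models for a given volume, the monthly cost can be estimated with a small calculation. The sketch below assumes the 2013 prices on the slide are read as: Standard at $0.10 per GB-month plus $0.10 per million I/O requests, Provisioned IOPS at $0.125 per GB-month plus $0.10 per provisioned IOPS-month. The function names and the example workload are illustrative assumptions, not figures from the POC.

```shell
#!/bin/sh
# Sketch only: monthly EBS cost using the 2013 list prices quoted on the slide.
# Function names and the example workload figures are assumptions.

# Standard volume: $0.10 per GB-month + $0.10 per 1 million I/O requests.
ebs_standard_cost() {   # args: size_gb  million_io_requests_per_month
    awk -v gb="$1" -v mio="$2" 'BEGIN { printf "%.2f\n", gb * 0.10 + mio * 0.10 }'
}

# Provisioned IOPS volume: $0.125 per GB-month + $0.10 per provisioned IOPS-month.
ebs_piops_cost() {      # args: size_gb  provisioned_iops
    awk -v gb="$1" -v iops="$2" 'BEGIN { printf "%.2f\n", gb * 0.125 + iops * 0.10 }'
}

ebs_standard_cost 100 500    # 100 GB, 500 million I/O requests in the month
ebs_piops_cost 100 1000      # 100 GB, 1000 provisioned IOPS
```

With Standard volumes the bill follows the I/O actually issued, while Provisioned IOPS is a fixed monthly charge for the reserved rate, so which one is cheaper depends on how busy the volume really is.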
21. Instances EC2
How we configured the EBS volumes:
• Use the Amazon EC2 API Tools (http://aws.amazon.com/developertools/351)
• Create 6 EBS volumes
• Attach them to the running instance
• Run mdadm as root:
    sudo mdadm --verbose --create /dev/md0 --level=0 --chunk=256 --raid-devices=6 /dev/xvdg1 /dev/xvdg2 /dev/xvdg3 /dev/xvdg4 /dev/xvdg5 /dev/xvdg6
    echo 'DEVICE /dev/xvdg1 /dev/xvdg2 /dev/xvdg3 /dev/xvdg4 /dev/xvdg5 /dev/xvdg6' | sudo tee -a /etc/mdadm.conf
    sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf
• Create an LVM volume on top, so the data size can easily be increased later
• Format using ext3 (no journaling)
• Mount it using noatime,nodiratime
• Run hdparm -t [--direct] <device> to check it works properly
Installation
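The last steps of the list above (LVM, file system, mount options, read test) could look roughly like the following. This is a dry-run sketch that only prints the commands instead of running them. The volume group name vg_data, the logical volume name lv_mysql and the mount point are placeholders that do not come from the slides, and the plain mkfs.ext3 call leaves the journaling choice mentioned above to the reader.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
# vg_data, lv_mysql and the mount point are hypothetical names.
plan_data_volume() {
    md_device="$1"      # the RAID0 array created with mdadm, e.g. /dev/md0
    mount_point="$2"    # where the MySQL data will live
    echo "pvcreate $md_device"
    echo "vgcreate vg_data $md_device"
    echo "lvcreate -l 100%FREE -n lv_mysql vg_data"
    echo "mkfs.ext3 /dev/vg_data/lv_mysql"
    echo "mount -o noatime,nodiratime /dev/vg_data/lv_mysql $mount_point"
    echo "hdparm -t /dev/vg_data/lv_mysql"
}

plan_data_volume /dev/md0 /var/lib/mysql
```

Reviewing the printed plan before running it as root is a cheap safeguard, since pvcreate and mkfs are destructive on the wrong device.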
22. Instances EC2 (cont.)
You can install MySQL using RPM or, if you want an easier life and faster upgrades (or downgrades), do the following:
• Create a directory like /opt/mysql_templates
• Get the MySQL binary distribution and expand it in /opt/mysql_templates
• Create the symbolic link /usr/local/mysql pointing to the version you want to use
• Create the symbolic links in the /usr/bin directory as well, e.g.:
    for bin in /usr/local/mysql/bin/*; do ln -s "$bin" "/usr/bin/$(basename "$bin")"; done
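The benefit of the template directory is that an upgrade or downgrade becomes a symlink switch. A minimal sketch of that switch follows. The function name and the version directory names used in the example are hypothetical, and MySQL still has to be stopped and restarted around the switch.

```shell
#!/bin/sh
# Sketch: repoint a "current version" symlink at one of the expanded
# binary distributions. Names are illustrative, not from the slides.
switch_mysql_version() {
    templates_dir="$1"   # e.g. /opt/mysql_templates
    version="$2"         # name of the expanded distribution directory
    link="$3"            # e.g. /usr/local/mysql
    target="$templates_dir/$version"
    if [ ! -d "$target" ]; then
        echo "no such version: $target" >&2
        return 1
    fi
    # -n replaces the link itself instead of descending into the old target
    ln -sfn "$target" "$link"
}

# Example (hypothetical directory name):
# switch_mysql_version /opt/mysql_templates mysql-5.5-galera /usr/local/mysql
```

Because /usr/bin links point at /usr/local/mysql/bin, they keep working after the switch without being recreated.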
23. Create the AMI
• Once I had the machines ready and standardized:
o Create an AMI for the MySQL–Galera data node;
o Create an AMI for the application node;
• The AMIs will be used for expanding the cluster and/or in case of crashes.
24. Problem in tuning - MySQL
MySQL optimal configuration for the environment
• Correct buffer pool and InnoDB log size;
• InnoDB flush log at TRX commit & concurrency;
• Dirty pages;
• InnoDB write/read threads;
• Binary logs (no binary logs unless you really need them);
• Doublewrite buffer;
Setup
25. Problem in tuning - Galera
Galera optimal configuration for the environment
• evs.send_window: maximum messages in replication at a time
• evs.user_send_window: maximum data messages in replication at a time
• wsrep_slave_threads: the number of threads used by Galera to commit the local queue
• gcache.size
• Flow Control
• Network/keep-alive settings and WAN replication
26. Applying the customer scenario
How I did the tests. What I have used.
Stresstool (my own development), written in Java:
• Multi-threaded approach (each thread a connection);
• Configurable number of master tables;
• Configurable number of child tables;
• Variable (random) number of tables in a join;
• Can set the ratio between R/W/D threads;
• Tables can have any combination of data types;
• Inserts can be done as simple or batch inserts;
• Reads can be done by RANGE, IN, or equality;
• Operations are sets of commands, not single SQL statements;
Test application
27. Applying the customer scenario (cont.)
How I did the tests.
• Application side
  • I gradually increased the number of threads per running stresstool instance, then increased the number of instances.
• Data layer
  • Start with 3 MySQL nodes;
  • Up to 7 nodes;
• Level of requests
  • From 2 application blocks to 12;
  • From 4 threads per application block;
  • To 64 threads per application block (12 × 64 = 768 threads in total);
29. Numbers in Galera replication
What happened to the replication?
Bad commit behavior
30. Changes in replication settings
The problem was in commit efficiency and Flow Control.
Reviewing the Galera documentation, I chose to change:
• evs.send_window=1024 (maximum packets in replication at a time);
• evs.user_send_window=1024 (maximum data packets in replication at a time);
• wsrep_slave_threads=48;
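For reference, the evs.* settings are Galera provider options, passed as one semicolon-separated string in wsrep_provider_options, while wsrep_slave_threads is a regular MySQL variable. The sketch below prints a my.cnf fragment with the values from this slide. The helper name is made up for the example, and the fragment would still have to be merged by hand with any provider options already in use.

```shell
#!/bin/sh
# Sketch: print a my.cnf fragment for the replication settings on this slide.
# galera_cnf_fragment is a hypothetical helper, not part of Galera.
galera_cnf_fragment() {
    slave_threads="$1"
    shift
    opts=""
    # Join the remaining key=value arguments with "; " as Galera expects.
    for kv in "$@"; do
        if [ -z "$opts" ]; then
            opts="$kv"
        else
            opts="$opts; $kv"
        fi
    done
    printf '[mysqld]\n'
    printf 'wsrep_slave_threads=%s\n' "$slave_threads"
    printf 'wsrep_provider_options="%s"\n' "$opts"
}

galera_cnf_fragment 48 evs.send_window=1024 evs.user_send_window=1024
```

The values are the ones that worked for this workload on these instances; other environments would need their own tests.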
31. Numbers After changes (cont.)
Table with numbers (writes) for 3-5-7 nodes and increase traffic
Using MySQL 5.5
38. Flow Control (FC) on real HW
From 4 to 92 threads
Tests & numbers Real HW
39. How to scale OUT
The effort to scale out is:
• Launch a new instance from the AMI (it syncs via IST if wsrep_local_cache_size is big enough, otherwise via SST);
• Always add nodes so that the cluster keeps an ODD number of nodes;
• Modify my.cnf to match the server ID and the IP of the master node;
• Start MySQL;
• Include the node IP in the list of active IPs on the load balancer;
• The whole operation normally takes less than 30 minutes.
Scaling
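The odd-number rule above follows from majority quorum: an even-sized cluster tolerates no more simultaneous failures than the odd size just below it. A small sketch of the arithmetic, assuming every node carries the same weight in the quorum calculation (the function names are illustrative):

```shell
#!/bin/sh
# Sketch, assuming every node has the same weight in the quorum calculation.

# Smallest odd cluster size that is greater than or equal to the wanted size.
next_odd_size() {
    n="$1"
    if [ $((n % 2)) -eq 0 ]; then
        n=$((n + 1))
    fi
    echo "$n"
}

# How many nodes can fail at once while the rest still form a majority.
tolerated_failures() {
    echo $(( ($1 - 1) / 2 ))
}

for size in 3 4 5 6 7; do
    echo "$size nodes: survives $(tolerated_failures "$size") failure(s)"
done
```

Going from 3 to 4 nodes adds capacity but no extra fault tolerance, which is why the scale-out steps jump 3, 5, 7.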
40. How to scale IN
The effort to scale IN is minimal:
• Remove the data node's IP from the load balancer (HAProxy);
• Stop MySQL;
• Stop/terminate the instance.
41. How to Backup:
If using provisioned IOPS and one single volume contains everything, a snapshot is fine.
Otherwise I like Jay's solution:
http://www.mysqlperformanceblog.com/2013/10/08/taking-backups-percona-xtradb-cluster-withoutstalls-flow-control/
Using wsrep_desync (ON while the backup runs, back to OFF afterwards)
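The linked approach relies on the wsrep_desync variable. A minimal sketch of the idea, assuming the variable is switched ON for the duration of the backup so the node does not trigger flow control, and back to OFF afterwards. The wrapper name and the example backup command are hypothetical, and credentials are assumed to come from the client's own configuration file.

```shell
#!/bin/sh
# Sketch, assuming a node that supports the wsrep_desync variable.
# MYSQL is the client binary to call; credentials are assumed to be
# picked up from the client's configuration file.
MYSQL="${MYSQL:-mysql}"

# Run the given backup command with the node desynced,
# and always switch wsrep_desync back off afterwards.
desynced_backup() {
    "$MYSQL" -e "SET GLOBAL wsrep_desync=ON" || return 1
    "$@"
    rc=$?
    "$MYSQL" -e "SET GLOBAL wsrep_desync=OFF"
    return "$rc"
}

# Example (hypothetical backup command and target directory):
# desynced_backup innobackupex /backups
```

In practice it is probably worth waiting for the node's receive queue to drain before switching wsrep_desync back to OFF, so the node does not stall the cluster while it catches up.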
42. Failover and HA
With MySQL and Galera, unless there is an issue, all the nodes should contain the same data.
Performing a failover of the whole service will not be necessary.
• Cluster in good health
• Cluster with failed node
So failover is mainly an operation at the load balancer (HAProxy works great), plus adding a new instance (from the AMI).
43. Geographic distribution
With Galera it is possible to set up the cluster to replicate across Amazon's zones.
I tested an implementation with 3 geographic locations:
• Master location (1 to 7 nodes);
• First distributed location (1 node, up to 3 on failover);
• Second distributed location (1 node, up to 3 on failover);
No significant delay was reported while the distributed nodes remained passive.
• Good to play with (evs.* provider options): keepalive_period, inactive_check_period, suspect_timeout, inactive_timeout, install_timeout
Geographic distribution
44. Problems with Galera
During the tests we faced the following issues:
• MySQL data node crash, auto restart, recovery (Galera in a loop);
• A node falling behind the cluster: replication is not fully synchronous, so the local queue becomes too long, slowing down the whole cluster;
• A node acting slowly after a restart, with no apparent issue and no way to get it to behave as it did before the clean shutdown; this happened randomly and may also be an issue due to Amazon.
The fastest solution is to build another node and attach it in place of the failing one.
Conclusions
45. Did we match the expectations?
Our numbers were:
• From 1,200 to ~10,000 (~3,000 in prod) inserts/sec
• 27,000 reads/sec with 7 nodes
• From 2 to 12 application servers (with 768 request/sec)
• EC2 medium, 1 CPU and 3.7 GB!!
o In prod: Large, 7.5 GB, 2 CPU.
I would say mission accomplished!
46. Consideration about the solution
Pro
• Flexible;
• Uses a well-known storage engine;
• Once tuned it is “stable” (if the cloud permits it);
Cons
• !WAS! a new technology not included in an official development cycle;
• Sometimes it fails without a clear indication of why, but it is getting better;
• Replication is still not fully synchronous (on write/commit);
47. Monitoring
Controlling what is going on is very important.
A few tools are currently available that make sense to me:
• Jay Janssen's myq_status: https://github.com/jayjanssen/myq_gadgets/blob/master/myq_status
• ClusterControl for MySQL: http://www.severalnines.com/resources/user-guide-clustercontrol-mysql
• Percona Cacti monitoring templates
49. Thank you
To contact me
marco.tusa@percona.com
marcotusa@tusacentral.net
To follow me
http://www.tusacentral.net/
https://www.facebook.com/marco.tusa.94
@marcotusa
http://it.linkedin.com/in/marcotusa/