1. MySQL for large scale social games
Yoshinori Matsunobu
Principal Infrastructure Architect, Oracle ACE Director at DeNA
Former APAC Lead MySQL Consultant at MySQL/Sun/Oracle
Yoshinori.Matsunobu@gmail.com, Twitter: @matsunobu
http://yoshinorimatsunobu.blogspot.com/
2. Table of contents
Easier maintenance and automating failover
Non-stop master migration
Automated master failover
New Open Source Software: “MySQL MHA”
Optimizing MySQL for faster H/W
3. Company Introduction: DeNA
One of the largest social game providers in Japan
Both social game platform and social games themselves
Subsidiary ngmoco:) in San Francisco
Games for Japan-localized phones, smartphones, and PCs
2-3 billion page views per day
25+ million users
1000+ MySQL servers, 150+ {master, slaves} pairs
$1.3B revenue in 2010
4. Games are expanding / shrinking rapidly
It is very difficult to predict social game workloads
Traffic is sometimes unexpectedly high, sometimes much lower than
expected
Traffic for each social game tends to go down after months or years
For expanding games
Adding slaves
Adding more shards
– It’s possible to add shards without stopping services
Scaling up master’s H/W
– More RAM, HDD->SSD/PCI-E SSD, Faster NW, etc
For shrinking games
Decreasing slaves
Migrating master to lower-spec machine
Consolidating a few masters/slaves on a single machine
5. Desire for Easier Operations
We want to move master servers more easily
Scaling-up: Increasing RAM, replacing with faster SSD
Upgrading MySQL: Results in 10 minutes or more of downtime while the
buffer pool refills
Scaling-down: Moving unpopular games to lower spec servers
Working around power outages: Moving games to a remote datacenter
With scheduled maintenance downtime this is easy, but we
cannot schedule downtime often
Announcing to users, coordinating with customer support, etc
Longer downtime reduces revenue
Operations staff will be exhausted by too much midnight work
Reducing maintenance time is important when managing hundreds
or thousands of MySQL servers
6. Switching master in seconds
If we can switch a master in less than 3 seconds, it is
acceptable in most of our cases
Stopping updates on the master
Waiting until at least one of the slaves (new master) has
synced with the current master
Enabling writes on the new master and assigning the virtual IP (etc.) to it
All remaining slaves start replicating from the new master
7. Blocking writes on master
MySQL provides several commands/solutions to block writes, but
not all of them are safe
FLUSH TABLES WITH READ LOCK
– Clients wait forever unless timeouts are set on the client side
– Running transactions are eventually aborted
“Updating master 1 -> updating master 2 -> committing master 1 -> getting an error on
committing master 2” results in data inconsistency
– Flushing all tables sometimes takes a very long time
Run “FLUSH NO_WRITE_TO_BINLOG TABLES” beforehand
SET GLOBAL read_only = 1
– Clients get errors immediately
– Running transactions are aborted
Dropping the MySQL user (used by applications)
– Applications cannot establish new MySQL connections
– Current sessions are NOT terminated until they disconnect
– Current sessions do not encounter errors
– Works with non-persistent connections only
8. Trade-off between safety and performance
What we currently do at DeNA:
Checking that there are no long-running updates
– An update that runs for 100 seconds on the master also takes 100 seconds on slaves
Dropping app user -- starting downtime
Waiting for a while (2 seconds maximum) until all active
application sessions are disconnected
– Ignoring replication threads, sessions sleeping 1 second or more (highly
likely daemon program or unused sessions, which can be killed safely)
– Not killing active sessions immediately
Executing FLUSH TABLES WITH READ LOCK when there
are no active sessions or 2 seconds have passed
Starting slave promotion -- ending downtime
At most 1 second is enough to complete all of these steps
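The session check in the procedure above could be sketched as follows. This is a minimal illustration, not the actual DeNA or MHA code: it assumes rows shaped like the output of SHOW PROCESSLIST (only the User, Command and Time columns are used), and the names `active_app_sessions` and `ready_to_lock` are hypothetical.

```python
from typing import Dict, List


def active_app_sessions(processlist: List[Dict], idle_seconds: int = 1) -> List[Dict]:
    """Return the sessions worth waiting for before taking the global lock."""
    active = []
    for row in processlist:
        # Replication threads are ignored.
        if row["Command"] == "Binlog Dump" or row["User"] == "system user":
            continue
        # Sessions sleeping for idle_seconds or more are treated as unused.
        if row["Command"] == "Sleep" and row["Time"] >= idle_seconds:
            continue
        active.append(row)
    return active


def ready_to_lock(processlist: List[Dict], waited_seconds: float,
                  max_wait: float = 2.0) -> bool:
    """True when FLUSH TABLES WITH READ LOCK may be issued:
    either no active sessions remain or the maximum wait has passed."""
    return not active_app_sessions(processlist) or waited_seconds >= max_wait
```

A caller would poll the process list in a short loop after dropping the application user and issue the lock as soon as `ready_to_lock` returns True.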
9. Our solution
Developing “MySQL-MHA: Master High Availability manager and tools”
http://code.google.com/p/mysql-master-ha
This is an automated failover tool, but it can also be used for fast online master switch
Switching the original master to the new master gracefully
We have switched 10+ masters so far. We could switch in 0.5 – 1 second of downtime
From:
host1 (current master)
+--host2 (backup)
+--host3 (slave)
+--host4 (slave)
+--host5 (remote)
To:
host2 (new master)
+--host3 (slave)
+--host4 (slave)
+--host5 (remote)
10. Master Failover: What makes it difficult?
MySQL replication is asynchronous.
It is likely that some (or all) of the slaves have not received all binary log events from the
crashed master.
It is also likely that only some slaves have received the latest events.
[Figure: the master behind the writer IP holds events id=99 to id=102; slave1, slave2 and slave3 have each received only part of them]
In the example figure, id=102 is not replicated to any slave.
slave 2 is the latest among the slaves, but slave 1 and slave 3 have lost some events.
It is necessary to do the following:
- Copy id=102 from master (if possible)
- Apply all differential events, otherwise data inconsistency happens.
Steps shown in the figure:
1. Save binlog events that exist on master only
2. Identify which events are not sent
3. Apply lost events
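The three steps can be illustrated with a small sketch. It is a simplification under stated assumptions: events are modelled as consecutive integer ids as in the example (real binary logs are addressed by file name and position), and the per-slave state below is made up so that slave2 is the latest. The function names are hypothetical.

```python
from typing import Dict, List


def master_only_events(master_last_id: int, slave_last_ids: Dict[str, int]) -> List[int]:
    """Step 1: events that reached no slave and must be saved from the master."""
    latest = max(slave_last_ids.values())
    return list(range(latest + 1, master_last_id + 1))


def lost_events(master_last_id: int, slave_last_ids: Dict[str, int]) -> Dict[str, List[int]]:
    """Steps 2 and 3: for every slave, the events it still has to apply."""
    return {
        name: list(range(last_id + 1, master_last_id + 1))
        for name, last_id in slave_last_ids.items()
    }


# Assumed state (not taken from the slide): last event id received per slave.
received = {"slave1": 99, "slave2": 101, "slave3": 100}
print(master_only_events(102, received))
print(lost_events(102, received))
```

If the crashed master is unreachable, the master-only events cannot be saved and the slaves can only be made consistent with the latest slave.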
11. Current stable HA solutions and issues
Pacemaker(Heartbeat) + DRBD (or shared disk)
Cost: Additional passive master server (not handling any application traffic)
Performance: To make HA really work in DRBD replication environments,
innodb-flush-log-at-trx-commit and sync-binlog must be 1, but these settings kill write performance
Otherwise necessary binlog events might be lost on the master. Then slaves can’t
continue replication, and data consistency issues happen
MySQL Cluster
MySQL Cluster is really Highly Available, but unfortunately we use InnoDB
Others
Unstable, too complex, too hard to operate/administer, wrong or missing documentation
Not working with standard MySQL (are you saying we have to migrate all 150+
applications to bleeding edge distributions?)
Not working with a remote datacenter, etc
12. Our solution: Developing MySQL-MHA
[Figure: a manager host monitoring master/slave groups, each a master with slave1, slave2 and slave3]
MySQL-MasterHA-Manager (on the manager host)
- masterha_manager
- other helper commands
MySQL-MasterHA-Node (on each MySQL server)
- save_binary_logs
- apply_diff_relay_logs
- purge_relay_logs
MySQL Master High Availability manager and tools
http://code.google.com/p/mysql-master-ha
The manager pings the master to check its availability
When it detects master failure, it promotes one of the slaves to be the new
master and fixes consistency issues between slaves
13. Internals: steps for recovery
[Figure: log positions on the dead master, the latest slave, and slave(i)]
Wait until the SQL thread executes all events (final Relay_Log_File, Relay_Log_Pos)
(i1) Partial transaction (up to Master_Log_File, Read_Master_Log_Pos)
(i2) Differential relay logs from each slave’s read pos to
the latest slave’s read pos
(X) Differential binary logs from the latest slave’s read pos
to the dead master’s tail of the binary log
On slave(i),
Wait until the SQL thread executes events
Apply i1 -> i2 -> X
– On the latest slave, i2 is empty
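The ordering of the differential ranges can be sketched as below. This is only an illustration: it assumes all positions are offsets in the same master binary log file, leaves out (i1), which depends on whether the relay log ends mid-transaction, and uses a hypothetical function name.

```python
from typing import List, Tuple

Step = Tuple[str, int, int]  # (label, start position, end position)


def recovery_steps(slave_read_pos: int, latest_read_pos: int,
                   master_tail_pos: int) -> List[Step]:
    """Ordered differential ranges to apply on slave(i) after (i1)."""
    steps: List[Step] = []
    if slave_read_pos < latest_read_pos:
        # (i2): taken from the latest slave's relay log
        steps.append(("i2", slave_read_pos, latest_read_pos))
    if latest_read_pos < master_tail_pos:
        # (X): taken from the dead master's binary log, if it can be read
        steps.append(("X", latest_read_pos, master_tail_pos))
    return steps
```

For the latest slave the first range is empty, so only (X) is returned.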
14. Advantages of MySQL MHA
Master failover and slave promotion can be done very quickly
Total downtime can be 10-30 seconds
Master crash does not result in data inconsistency
No need to modify current MySQL settings
We use MHA for 150+ normal MySQL 5.0/5.1/5.5 masters, without
modifying anything
Problems in MHA do not result in MySQL failure
You can install/uninstall/upgrade/downgrade/restart without stopping
MySQL
No need to add lots of servers
No performance penalty
Works with any storage engine
Can also be used for failback (fast online master switch)
15. MySQL MHA Project Info
Project top page
http://code.google.com/p/mysql-master-ha/
Documentation
http://code.google.com/p/mysql-master-ha/wiki/TableOfContents?tm=6
Source tarball and rpm package (stable release)
http://code.google.com/p/mysql-master-ha/downloads/list
The latest source repository (dev release)
https://github.com/yoshinorim/MySQL-MasterHA-Manager (Manager source)
https://github.com/yoshinorim/MySQL-MasterHA-Node (Per-MySQL server source)
SkySQL provides commercial support for MHA
16. Table of contents
Easier maintenance and automating failover
Non-stop master migration
Automated master failover
New Open Source Software: “MySQL MHA”
Optimizing MySQL for faster H/W
17. Per-server performance is important
To handle 1 million queries per second..
1000 queries/sec per server : 1000 servers in total
10000 queries/sec per server : 100 servers in total
An additional 900 servers will cost $10M initially and $1M
every year
If you can increase per server throughput, you can
reduce the total number of servers, which will decrease
TCO
Sharding is not everything
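The arithmetic on this slide can be written out as a short sketch. The function names are illustrative, and the three-year horizon is an assumption used only for the example; the throughput and cost figures are the ones quoted above.

```python
def servers_needed(total_qps: int, qps_per_server: int) -> int:
    """Smallest number of servers that covers total_qps (ceiling division)."""
    return -(-total_qps // qps_per_server)


def extra_cost(initial_cost: int, yearly_cost: int, years: int) -> int:
    """Total cost of the additional servers over the given number of years."""
    return initial_cost + yearly_cost * years


slow = servers_needed(1_000_000, 1_000)    # 1000 servers
fast = servers_needed(1_000_000, 10_000)   # 100 servers
print(slow - fast)                         # 900 additional servers
print(extra_cost(10_000_000, 1_000_000, years=3))  # 13000000
```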
18. History of MySQL performance improvements
H/W improvements
HDD RAID, Write Cache
Large RAM
SATA SSD, PCI-Express SSD
More CPU cores
Faster Network
S/W improvements
Improved algorithms (i/o scheduling, swap control, etc)
Much better concurrency
Avoiding stalls
Improved space efficiency (compression, etc)
19. 32bit Linux
[Figure: updates spread across three masters, each with 2GB RAM and HDD RAID (20GB), plus many slaves]
Random disk i/o speed (IOPS) on HDD is very slow
100-200/sec per drive
Database easily became disk i/o bound, regardless of disk size
Applications could not handle large data (e.g. 30GB+ per server)
Lots of database servers were needed
Per server traffic was not so high because both the number of users and data
volume per server were not so high
Backup and restore completed in a short time
MyISAM was widely used because it’s very space efficient and fast
20. 64bit Linux + large RAM + BBWC
[Figure: one master with 16GB RAM and HDD RAID (120GB), plus many slaves]
Memory prices went down, and 64bit Linux matured
It became common to deploy 16GB or more RAM on a single Linux machine
Memory hit ratio increased, much larger data could be stored
The number of database servers decreased (consolidated)
Per server traffic increased (the number of users per server increased)
“Transaction commit” overheads were greatly reduced thanks to battery backed
up write cache (BBWC)
From the database point of view,
InnoDB became faster than MyISAM (row level locks, etc)
Direct I/O became common
21. Side effect caused by fast server
[Figure: master and slave, both on HDD RAID]
After 16-32GB RAM became common, we could
run many more users and data per server
Write traffic per server also increased
4-8 RAID 5/10 also became common, which
improved concurrency a lot
On 6 HDD RAID 10, single thread IOPS is around
200, 100 threads IOPS is around 1000-2000
Good parallelism on both reads and writes on master
On slaves, there is only one writer thread (SQL
thread). No parallelism on writes
6 HDD RAID10 is as slow as single HDD for
writes
Slaves became performance bottleneck earlier than
master
Serious replication delay happened (10+ minutes at
peak time)
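A rough model shows how a single-threaded slave falls behind. It is a simplification under loud assumptions: every replicated write costs one random I/O, rates are constant during the peak, and the master write rate of 400 IOPS in the example is invented for illustration. Only the roughly 200 IOPS single-thread figure comes from the slide.

```python
def replication_delay_seconds(master_write_iops: float, slave_iops: float,
                              peak_seconds: float) -> float:
    """Seconds the slave needs to drain the backlog built up during the peak."""
    # Writes arriving faster than the SQL thread can apply them pile up.
    backlog = max(0.0, master_write_iops - slave_iops) * peak_seconds
    return backlog / slave_iops


# A 10-minute peak at 400 write IOPS against a 200 IOPS SQL thread
print(replication_delay_seconds(400, 200, 600))
```

Under these assumptions a ten-minute peak leaves the slave ten minutes behind, which is the order of magnitude of the delay described on the slide.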
22. Using SATA SSD on slaves
[Figure: master on HDD RAID, slave on SATA SSD]
IOPS differences between master (1000+) and slave
(100+) have caused serious replication delay
Is there any way to gain high enough IOPS from
single thread?
Read IOPS on SATA SSD is 3000+, which should
be enough (15 times better than HDD)
Just replacing HDD with SSD solved replication
delay
Overall read throughput became much better
Using SSD on master was still risky
Using SSD on slaves (IOPS: 100+ -> 3000+) was
more effective than using on master (IOPS: 1000+ ->
3000+)
We mainly deployed SSD on slaves
The number of slaves could be reduced
From MySQL point of view:
Good concurrency on HDD RAID has been required :
InnoDB Plugin
23. How about PCI-Express SSD?
Deploying on both master and slaves?
If PCI-E SSD is used on master, replication delay will happen again
– 10,000 IOPS from single thread, 40,000+ IOPS from 100 threads
10,000 IOPS from 100 threads can be achieved with SATA SSD
Parallel SQL threads should be implemented in MySQL
Deploying on only slaves?
If using HDD on master, SATA SSD should be enough to handle workloads
– PCI-Express SSD is much more expensive than SATA SSD
How about running multiple MySQL instances on single server?
– Virtualization is not fast
– Running multiple MySQL instances on single OS is more reasonable
Does PCI-E SSD have enough storage capacity to run multiple instances?
On HDD environments, typically only 100-200GB of database data can be stored
because of slow random IOPS on HDD
FusionIO SLC: 320GB Duo + 160GB = 480GB
FusionIO MLC: 1280GB Duo + 640GB = 1920GB
tachIOn SLC: 800GB x 2 = 1600GB
24. Running multiple slaves on single box
[Figure: before and after consolidation. Before: each master (M) and backup (B) has its own slave servers (S1, S2, S3). After: slave instances of several masters share a single box]
Running multiple slave instances on a single PCI-E SSD server
Master and Backup Server are still HDD based
Consolidating multiple slaves
Since slave’s SQL thread is single threaded, you can gain better
concurrency by running multiple instances
The number of instances is mainly restricted by capacity
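Since capacity is the main limit, a first estimate of how many instances fit on one box is a simple division. The per-instance data size of 130GB below is a made-up figure for illustration; the 960GB and 1920GB capacities correspond to the FusionIO MLC configurations mentioned in this deck.

```python
def max_instances(capacity_gb: int, data_gb_per_instance: int) -> int:
    """Number of MySQL instances whose data fits on the device."""
    return capacity_gb // data_gb_per_instance


print(max_instances(960, 130))
print(max_instances(1920, 130))
```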
25. Our environment
Machine
HP DL360G7 (1U), or Dell R610
PCI-E SSD
FusionIO MLC (640GB Duo + 320GB non-Duo)
tachIOn SLC (800GB x 2)
CPU
Two sockets, Nehalem 6-core per socket, HT enabled
– 24 logical CPU cores are visible
– Four socket machines are too expensive
RAM
60GB or more
Network
Broadcom BCM5709, Four ports
Using four network cables + bonding mode 4 + link aggregation
– BONDING_OPTS="miimon=100 mode=4 lacp_rate=1 xmit_hash_policy=1"
HDD
4-8 SAS RAID1+0
For backups, redo logs, relay logs, (optionally) doublewrite buffer
26. Benchmarks on our real workloads
Consolidating 7 instances on FusionIO (640GB MLC Duo + 320GB MLC)
Let half of SELECT queries go to these slaves
6GB innodb_buffer_pool_size
Peak QPS (total of 7 instances)
61683.7 query/s
37939.1 select/s
7861.1 update/s
1105 insert/s
1843 delete/s
3143.5 begin/s
CPU Utilization
%user 27.3%, %sys 11%(%soft 4%), %iowait 4%
C.f. SATA SSD: %user 4%, %sys 1%, %iowait 1%
Buffer pool hit ratio
99.4%
SATA SSD (single instance/server): 99.8%
No replication delay
No significant (100+ms) response time delay caused by SSD
27. CPU loads
22:10:57 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
22:11:57 all 27.13 0.00 6.58 4.06 0.14 3.70 0.00 58.40 56589.95
…
22:11:57 23 30.85 0.00 7.43 0.90 1.65 49.78 0.00 9.38 44031.82
CPU utilization was high, but should be able to handle more
%user 27.3%, %sys 11%(%soft 4%), %iowait 4%
Reached storage capacity limit (960GB). Using 1920GB MLC should be fine to handle more
instances
Network became the first bottleneck
Recv: 14.6MB/s, Send: 28.7MB/s
CentOS5 + bonding does not handle network requests well (only a single CPU core can handle
requests) (I got the above result when I tested with normal bond0)
We are now using link aggregation + bonding mode 4 with 4 network cables, and the CPU bottleneck
went away
28. Things to consider
To run multiple MySQL instances on a single server,
you need to allocate different IP addresses or port numbers
Administration tools are also affected
We allocated different (virtual) IP addresses because some of our existing
internal tools depend on “port=3306”
bind-address=“virtual ip address” in my.cnf
Creating separated directories and files
Socket files, data directories, InnoDB files, binary log files etc should
be stored in separate locations for each instance
Storing some files on HDD, others on SSD
Binary logs, Relay logs, Redo logs, error/slow logs, ibdata0 (files
where doublewrite buffer is written), backup files on HDD
Others on SSD
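The layout above could be generated per instance along the following lines. This is a sketch, not DeNA's actual configuration: the mount points `/ssd` and `/hdd`, the directory names, the instance names and the IP addresses are all hypothetical, while the option names are standard mysqld options.

```python
def instance_config(name: str, virtual_ip: str,
                    ssd_root: str = "/ssd", hdd_root: str = "/hdd") -> str:
    """Build a my.cnf [mysqld] section for one instance on a shared box."""
    ssd = f"{ssd_root}/{name}"   # hypothetical per-instance SSD directory
    hdd = f"{hdd_root}/{name}"   # hypothetical per-instance HDD directory
    lines = [
        "[mysqld]",
        f"bind-address={virtual_ip}",              # one virtual IP per instance
        "port=3306",                               # same port on every instance
        f"socket={ssd}/mysql.sock",
        f"datadir={ssd}/data",                     # table data on SSD
        "innodb_file_per_table=1",
        f"innodb_data_home_dir={hdd}/ibdata",      # system tablespace (doublewrite) on HDD
        f"innodb_log_group_home_dir={hdd}/redo",   # redo logs on HDD
        f"log-bin={hdd}/binlog/mysql-bin",         # binary logs on HDD
        f"relay-log={hdd}/relay/mysql-relay",      # relay logs on HDD
        f"log-error={hdd}/log/error.log",
    ]
    return "\n".join(lines) + "\n"


print(instance_config("game1_slave", "192.0.2.11"))
```

Keeping the port fixed and varying only the bind address means tools that hard-code port 3306 keep working unchanged.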
29. Optimizing for Social Game workloads
The user base can easily grow by millions of users in a few days
Database size grows rapidly
– Especially if PK is “user_id + xxx_id” (e.g. item_id)
– Growing by GBs per day is not uncommon
Scaling reads is not difficult
Adding slaves or adding caching servers
Scaling writes is not trivial
Sharding, scaling up
Solutions depend on what kinds of tables we’re using,
INSERT/UPDATE/DELETE workloads, etc
30. INSERT-mostly tables
History tables such as access logs, diary, battle history
INSERT and SELECT mostly
Secondary index is needed (user_id, etc)
Table size becomes huge (easily exceeding 1TB)
Locality (Most SELECTs go to recent data)
INSERT performance in general
Fast in InnoDB (Thanks to “Insert Buffering”. Much faster than MyISAM)
To modify index leaf blocks, they have to be in buffer pool
When index size becomes too large to fit in the buffer pool, disk reads
happen
In-memory workloads -> disk-bound workloads
– Suddenly suffering from serious performance slowdown
– UPDATE/DELETE/SELECT also getting much slower
No storage device, however fast, can compete with in-memory workloads
31. INSERT gets slower
[Figure: time in seconds (0 to 600) to insert 1 million records (InnoDB, HDD) against existing records (1 to 145 million), for sequential order and random order inserts. Annotations mark 10,000 rows/s, 2,000 rows/s, and the point where index size exceeded buffer pool size]
Secondary index size exceeded innodb buffer pool size at 73 million
records for random order test
Gradually taking more time because buffer pool hit ratio is getting worse
(more random disk reads are needed)
For sequential order inserts, insertion time did not change.
No random reads/writes
32. INSERT performance difference
In-memory INSERT throughput
15000+ insert/s from single thread on recent H/W
Exceeding buffer pool, starting disk reads
Degrading to 2000-4000 insert/s on HDD, single thread
6000-8000 insert/s on multi-threaded workloads
Serious replication delay often happens
Faster storage does not solve everything
At most 5000 insert/s on fastest SSDs such as tachIOn/FusionIO
– InnoDB actually uses CPU resources quite a lot for disk i/o bound inserts (e.g.
calculating checksums, malloc/free)
It is important to minimize index size so that INSERT can
complete in memory
33. Approach to complete INSERT in memory
[Figure: a single big physical table (index) compared with the same table split into Partition 1 to Partition 4]
Range partition by datetime
Available since MySQL 5.1
Index size per partition becomes total_index_size / number_of_partitions
INT or TIMESTAMP enables hourly based partitions
– TIMESTAMP does not support partition pruning
Old partitions can be dropped by ALTER TABLE .. DROP PARTITION
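A sketch of how the sizing rule and the partition maintenance could be scripted is shown below. It assumes the partitioning column is an INT holding a Unix timestamp; the table name `battle_history`, the `pYYYYMMDDHH` naming scheme and the function names are hypothetical. The statements only use the standard ALTER TABLE ... ADD PARTITION and DROP PARTITION forms.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def partitions_needed(total_index_bytes: int, memory_budget_bytes: int) -> int:
    """Partitions required so that one partition's index fits in the budget."""
    return max(1, -(-total_index_bytes // memory_budget_bytes))


def add_hourly_partitions(table: str, first_hour: datetime, hours: int) -> List[str]:
    """One ALTER TABLE statement per hourly partition."""
    statements = []
    for i in range(hours):
        start = first_hour + timedelta(hours=i)
        # Upper bound is the first second of the next hour.
        upper = int((start + timedelta(hours=1)).timestamp())
        name = "p" + start.strftime("%Y%m%d%H")
        statements.append(
            f"ALTER TABLE {table} ADD PARTITION "
            f"(PARTITION {name} VALUES LESS THAN ({upper}))"
        )
    return statements


def drop_partition(table: str, name: str) -> str:
    """Statement that discards an old partition."""
    return f"ALTER TABLE {table} DROP PARTITION {name}"


first = datetime(2011, 10, 1, 0, tzinfo=timezone.utc)
for stmt in add_hourly_partitions("battle_history", first, 2):
    print(stmt)
print(drop_partition("battle_history", "p2011100100"))
```

Dropping a partition removes its rows and index entries together, so purging old history this way avoids a large DELETE.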
34. Optimizing UPDATE, DELETE, SELECT
Using SSD is really, really helpful
IOPS difference is significant
– Updates in memory: 15,000/s
– On HDD : 300/s
– On SATA SSD: 1,800/s
– On PCI-E SSD : 4,000/s
We have used SATA SSD with RAID0 on slaves
Now we are gradually increasing PCI-E SSD (FusionIO and tachIOn),
consolidating 6-10 MySQL instances
If all data fits in memory and traffic is very high, using
NoSQL is helpful
We use HandlerSocket on user’s database (pk: user_id)
– Database size is less than InnoDB buffer pool size
Check Oracle’s memcached API project. Should be very easy to use
35. Large-HDD servers and SSD servers
“History Shard”
Putting history data (comments, logs, etc) here
Using range partitioning
Large enough HDD with RAID 10
– 900GB (10K RPM) x 8 or 300GB (15K RPM) x 10 HDD
Data size tends to be huge, but doesn’t matter so much
“Application Shard”
Middle range SSD (including SATA SSD), or PCI-E SSD
Data size matters a lot
36. Our near-future deployments
PCI-E or SATA/SAS SSD servers (Master, Slave/Backup)
Game1_shard1
Game1_shard2
Game1_shard3
Game1_shard4
Game2_shard1
Game2_shard2
Large HDD servers (Master, Slave/Backup)
Game1_history_shard1
Game1_history_shard2
Game1_history_shard3
Game1_history_shard4
By moving history tables, application data size can be decreased significantly
(less than 30%), so PCI-E servers can consolidate shards a lot
Mostly in-memory workloads on HDD servers, so they can consolidate a good
number of shards
A server crash causes multiple shards to fail
Automated failover is important
37. Summary
Automated master failover and easier master
maintenance are important to manage hundreds of
master servers
Scaling up, scaling down, version upgrades, etc
Using MHA will help a lot
– Configuring MHA does not require MySQL settings changes
– Master failover in 10-30 seconds, without passive server
– Moving master can be done in 0.5-2 seconds of downtime
Optimizing MySQL for faster H/W
Deploying history tables (insert-mostly tables, hundreds of
GBs) on HDD
Deploying application tables on PCI-E SSD
Consolidating multiple MySQL instances on single box