Online games pose a few interesting backend challenges: a single user generates one HTTP call every few seconds, and the balance between data reads and writes is close to 50/50, which makes write-through caches and other common scaling approaches less effective.
Starting from a rather classic Ruby on Rails application, we gradually changed it as traffic grew in order to meet the required performance. And when small changes were no longer enough, we turned parts of our data persistence layer inside out, migrating from SQL to NoSQL without taking downtimes longer than a few minutes.
Follow the problems we hit, how we diagnosed them, and how we worked around limitations. See which tools we found useful and which other lessons we learned by running the system with a team of just two developers, without a sysadmin or operations team as support.
21. We added a few application servers over time
lb
app app app app app app app app app
db db
22. 250K daily users and no problems
Life was good
<chart: daily active users over time>
23. Life was good and I went on a nice vacation
<picture: Jesper in slot canyon>
35. ActiveRecord’s checks caused 20% extra DB load
Checking connection state
MySQL process list full of ‘status’ calls
=> Fixed by 1 line of code
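The deck doesn’t show the patch itself, so here is a plain-Ruby simulation (all names are ours) of why the checks hurt: a pool that pings MySQL before every checkout pays one extra round trip per request, and those pings are exactly what fills the process list with ‘status’ calls. The one-line fix amounts to not verifying on every request.

```ruby
class FakeConnection
  attr_reader :pings

  def initialize
    @pings = 0
  end

  def ping  # stands in for the MySQL 'status' round trip
    @pings += 1
    true
  end
end

class Pool
  attr_reader :conn

  def initialize(conn, verify)
    @conn = conn
    @verify = verify
  end

  def checkout
    @conn.ping if @verify  # the "checking connection state" overhead
    @conn
  end
end

verifying = Pool.new(FakeConnection.new, true)
trusting  = Pool.new(FakeConnection.new, false)
1_000.times { verifying.checkout; trusting.checkout }
verifying.conn.pings  # => 1000 extra round trips
trusting.conn.pings   # => 0
```

With one ping per request, 20% extra DB load at our query mix is plausible; trusting pooled connections removes it entirely.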
36. I/O on MySQL masters still was the bottleneck
New Relic: 60% of all UPDATEs on ‘tiles’ table
37. Tiles are part of the core game loop
Core game loop
1) plant
2) wait
3) harvest
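A toy model (class and constant names are ours) of the loop above. Every plant and every harvest becomes an UPDATE on a row of the ‘tiles’ table, which is why this loop dominated our write traffic:

```ruby
class Tile
  GROW_TIME = 60  # seconds; illustrative value

  attr_reader :state

  def initialize
    @state = :empty
  end

  def plant(now)
    raise "tile not empty" unless @state == :empty
    @state = :growing            # 1) plant   -> UPDATE tiles ...
    @ready_at = now + GROW_TIME
  end

  def harvest(now)
    # 2) wait: harvesting before @ready_at fails
    raise "not ready" unless @state == :growing && now >= @ready_at
    @state = :empty              # 3) harvest -> UPDATE tiles ...
  end
end

tile = Tile.new
tile.plant(0)
tile.harvest(60)  # after waiting GROW_TIME the tile is empty again
```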
38. We started to shard on model, too
Adding new shards
old old
master slave
39. We started to shard on model, too
Adding new shards
1) Setup new masters as slaves of old ones
old old new
master slave master
40. We started to shard on model, too
Adding new shards
1) Setup new masters
old old new new
master slave master slave
41. We started to shard on model, too
Adding new shards
1) Setup new masters
2) Start using new masters
old old new new
master slave master slave
42. We started to shard on model, too
Adding new shards
1) Setup new masters
2) Start using new masters
3) Cut replication
old old new new
master slave master slave
43. We started to shard on model, too
Adding new shards
1) Setup new masters
2) Start using new masters
3) Cut replication
4) Truncate
old old new new
master slave master slave
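The four steps above can be written down as the MySQL statements involved (host and table names are made up; each statement runs on the host named in its comment, and step 2 is an application config change rather than SQL, so it appears as a comment line):

```ruby
def shard_split_plan(old_master, moved_table)
  [
    # 1) on the new master: replicate from the old one
    "CHANGE MASTER TO MASTER_HOST='#{old_master}'",
    "START SLAVE",
    # 2) in the app: point the moved models at the new master
    "-- switch application traffic for #{moved_table}",
    # 3) on the new master: cut replication
    "STOP SLAVE",
    # 4) on both sides: drop the rows the other side now owns
    "TRUNCATE #{moved_table}",
  ]
end

plan = shard_split_plan("old-master.example.com", "tiles")
```

Because the new master starts as an up-to-date slave, the switchover in step 2 is the only moment that needs coordination; everything before and after can run at leisure.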
44. 4 DB masters and a few more servers
lb
app app app app app app app app
app app app app app app app app
tiles tiles
db db
db db
45. Sharding by model brought us to 400K DAU
Shard by model
<chart: daily active users over time>
51. Sharding gem circumvented AR’s internal cache
ActiveRecord caches SQL queries...
... only in our development environment!
=> Fixed by 2 lines of code
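A plain-Ruby sketch (our names) of what ActiveRecord’s query cache does: within one request, identical SELECTs are served from a memo instead of the database. A sharding layer that builds its own connections can bypass this wrapper, so the cache silently only worked where the sharding gem wasn’t active, which for us was the development environment.

```ruby
class CachingConnection
  attr_reader :db_hits

  def initialize(&db)
    @db = db
    @cache = {}
    @db_hits = 0
  end

  def select(sql)
    @cache.fetch(sql) do
      @db_hits += 1
      @cache[sql] = @db.call(sql)
    end
  end

  def clear!  # Rails clears the query cache at the end of a request
    @cache.clear
  end
end

conn = CachingConnection.new { |sql| "rows for: #{sql}" }
3.times { conn.select("SELECT * FROM tiles WHERE user_id = 1") }
conn.db_hits  # => 1, not 3
```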
53. I/O still was not fast enough
If 2 + 2 is not enough, ...
… perhaps 4 + 4 masters will do?
54. It’s no fun to handle 8+8 MySQL DBs
lb
app app app app app app app app app
app app app app app app app app app
tiles tiles
db db
db db
55. It’s no fun to handle 8+8 MySQL DBs
lb
app app app app app app app app app
app app app app app app app app app
tiles tiles tiles tiles
db db db db
db db db db
56. At 500K DAU we were at a dead end
<chart: daily active users over time>
62. Redis is fast but goes beyond simple key/value
Redis is a key-value store
Hashes, Sets, Sorted Sets, Lists
Atomic operations like set, get, increment
50,000 transactions/s on EC2
Writes are as fast as reads
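A sketch of how game data maps onto those structures. To keep the example self-contained we use a tiny in-memory stand-in, but the calls mirror real Redis commands (HSET, HGET, INCR); the key names are made up for illustration:

```ruby
class MiniRedis
  def initialize
    @data = Hash.new { |h, k| h[k] = {} }
  end

  def hset(key, field, value)
    @data[key][field] = value
  end

  def hget(key, field)
    @data[key][field]
  end

  def incr(key)  # atomic in real Redis, even with concurrent writers
    @data[key][:counter] = (@data[key][:counter] || 0) + 1
  end
end

r = MiniRedis.new
r.hset("tile:42:7", "state", "growing")        # one hash per tile
r.hset("tile:42:7", "ready_at", "1300000000")
r.incr("stats:harvests")                       # cheap counters
```

The point of the hash layout is that a tile update is a single O(1) write, which is what makes the 50/50 read/write mix affordable.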
73. Migrate on the fly - and clean up later
1. Let migration run until everything cools down
2. Migrate the rest manually
3. Remove migration code
4. Wait until no fallback necessary
5. Remove SQL table
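A sketch (our names; plain Hashes stand in for MySQL and Redis) of the on-the-fly part in step 1: on access, serve from Redis if the user is already migrated, otherwise move their row over from SQL first. Active users migrate themselves through normal traffic; step 2 sweeps up the inactive rest.

```ruby
class TileStore
  def initialize(sql, redis)
    @sql = sql
    @redis = redis
  end

  def tiles_for(user_id)
    # fallback read: pull the row out of SQL and keep it in Redis
    @redis[user_id] ||= (@sql.delete(user_id) || [])
  end
end

sql   = { 1 => ["wheat", "corn"] }
redis = {}
store = TileStore.new(sql, redis)

store.tiles_for(1)  # first access moves user 1 from SQL to Redis
store.tiles_for(1)  # second access never touches SQL
```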
74. A journey to 1,000,000 daily users
Start of the journey
6 weeks of pain
Paradise (or not?)
Conclusion
75. Again: Tiles are part of the core game loop
Core game loop
1) plant
2) wait
3) harvest
77. Size matters for migrations
Migration check overload
Migration only on startup
Overlooked an edge case
Only migrate 1% of users
Continue if everything is ok
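The 1% gate can be as simple as a stable modulo on the user id (the helper name is ours, not from the deck): it selects the same cohort on every request, so that cohort can be watched before the percentage is raised.

```ruby
def in_rollout?(user_id, percent)
  # stable cohort: the same users are selected on every request
  user_id % 100 < percent
end

cohort = (1..10_000).count { |id| in_rollout?(id, 1) }
cohort  # => 100, i.e. 1% of users
```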
80. In-memory DBs don’t like to dump to disk
Dumping to disk
SAVE is blocking
BGSAVE needs free RAM
Latency increase by 100%
=> BGSAVE on slaves every 15 minutes
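The workaround above can be written down as configuration. The 15-minute schedule is from the slide; the exact files are our reconstruction:

```
# redis.conf on the master: disable automatic RDB snapshots, so the
# copy-on-write fork of BGSAVE never competes for RAM on the master
save ""

# crontab on a slave: dump to disk every 15 minutes instead
*/15 * * * * redis-cli BGSAVE
```

Because slaves hold the same data, a dump taken there is just as good for recovery, but the latency hit stays off the serving path.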
82. Redis replication starts with a BGSAVE
BGSAVE on master
Slave imports dumped file
=> No RAM means no new slaves
83. Redis had a memory fragmentation problem
<chart: memory usage grew from 24 GB to 44 GB in 8 days>
84. Redis had a memory fragmentation problem
<chart: memory usage grew from 24 GB to 38 GB in 3 days>
85. If MySQL is a truck
Fast enough
Disk based
Robust
86. If MySQL is a truck, Redis is a race car
Super fast
RAM based
Fragile
87. Big and static data in MySQL, rest goes to Redis
MySQL: 256 GB data, 10% writes
Redis: 60 GB data, 50% writes
http://www.flickr.com/photos/erix/245657047/
88. Lots of boxes, but automation helps a lot!
lb lb
app app app app app app app app app app app app app
app app app app app app app app app app app app app
app app app app app app app app app app app app app
db db db db db redis redis redis redis redis
89. We reached 1 million daily users!
1,000,000 - Big party!
<chart: daily active users over time>
90. We started archiving inactive users
50% DB reduction
<chart: daily active users over time>
91. We even survived a complete data center loss
EBS no more!
<chart: daily active users over time>
92. We improved our MySQL schema on-the-fly
30% DB reduction
<chart: daily active users over time>
93. Will we reach 2 million daily users?
<chart: daily active users over time>
94. A journey to 1,000,000 daily users
Start of the journey
6 weeks of pain
Paradise (or not?)
Conclusion