Presentation about the Spil Storage Platform (SSP), written in Erlang. This talk was first given at the Erlang User Group Netherlands in July 2012, hosted at Spil Games in Hilversum.
3. Background
Spil Games’ mission: “unite the world in play”
• localized social-gaming platforms
• focus on: teens, girls, and families
• many portals:
• girlsgogames.com
• agame.com
5. Background
• Over 200 countries, 15+ different languages
• On average 85 minutes per month per user
• Over 4000 online games
• 200 million unique users per month
6. Background
• Traditional LAMP stack
• Tweaked over time to keep up with growth
• Reaching limits of current system
• One of the largest problems is the database
7. Problems: the database
• Not all developers are DB experts
• security
• performance
• caching
• Changing requirements
• Difficult to shard the databases
8. Wish list
1. Transparent scalability
• Sharding data
• Scalable applications on top of sharded data
2. Multi-database transactions
• atomic operations across machines
3. Fast enough (low-ish latency, high throughput)
4. Highly available (central system)
5. Can handle large dataset
6. Offer flexibility (trade consistency for speed for instance)
7. Use MySQL (the in-house DB team’s experience)
8. Don’t expose SQL to devs, offer business-specific model
• Storage-specific security measures (character escaping)
9. Allow changes to storage layer without affecting business (versioning)
10. Centralize ownership of caching
10. Solution
• No matching Open Source projects
• So we want a massively scalable, soft real-time, highly available system
• Implement it ourselves: Erlang is the obvious candidate
Not the first to think of this:
• Amazon SimpleDB
• Riak
• Use Open Source where possible
11. Solution: mindset
1. Our system should be always on
2. No global locks
3. Inconsistencies are the norm
• Hardware breaks down (power failures, etc.)
• Version mismatches (upgrading the system is non-atomic)
• State mismatches (adding a new machine)
13. SSP : Overview
• A bucket is a list of records of a specific type: structured data! A bucket can map to one or several MySQL database tables and offers a CRUD-like interface (with filters)
• All data is identified by a unique GID (64 bit integer)
• All requests for a particular GID are handled by one Pipeline process (sequentially)
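As a minimal sketch of the idea (hypothetical names; the real system uses Pipeline Factories and a lookup ring, and avoids the spawn race this toy version has): a table maps each GID to its pipeline process, and every request for that GID is funnelled through that single process, so requests are handled sequentially per GID.

```erlang
-module(pipeline_sketch).
-export([start/0, request/2]).

%% Create the (public) lookup table mapping GID -> pipeline pid.
start() ->
    ets:new(pipelines, [named_table, public, set]),
    ok.

%% Run Fun inside the pipeline process that owns this GID; all requests
%% for one GID are therefore serialized.
request(Gid, Fun) ->
    Pid = case ets:lookup(pipelines, Gid) of
              [{Gid, P}] -> P;
              [] ->
                  P = spawn(fun loop/0),
                  ets:insert(pipelines, {Gid, P}),
                  P
          end,
    Ref = make_ref(),
    Pid ! {run, self(), Ref, Fun},
    receive {Ref, Result} -> Result end.

%% The pipeline mailbox is the serialization point: one request at a time.
loop() ->
    receive
        {run, From, Ref, Fun} ->
            From ! {Ref, Fun()},
            loop()
    end.
```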
15. SSP: Pipeline
• Why do we need Pipelines?
• Sequential = bottleneck !?!
• Don’t you guys know Erlang is about PARALLELIZING work?
16. SSP: Pipeline
• Drawbacks:
• For hotspots (a game with a gazillion users, say), sequential (read) access is indeed bad
• Optimization: allow dirty reads (try the local cache first, outside the pipeline); other solutions are possible
• Advantages:
• Facilitates scalability (no global locks, but per bucket/GID sync)
• Pipelines make multi-database consistency easier
• Requests to most GIDs (users) are evenly distributed
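The dirty-read optimization can be sketched as follows (hypothetical helper names; `cache_lookup/1` stands in for the local Memcached instance and `pipeline_get/1` for the consistent, serialized per-GID path):

```erlang
-module(dirty_read_sketch).
-export([get/1]).

%% Stand-ins for the real backends: a local cache that may miss, and a
%% consistent read routed through the GID's pipeline process.
cache_lookup(_Gid) -> miss.                 %% pretend the cache misses
pipeline_get(Gid) -> {ok, {record, Gid}}.   %% serialized, consistent path

%% Dirty read: try the local cache first, outside the pipeline. Only on
%% a miss do we pay for the sequential per-GID pipeline access.
get(Gid) ->
    case cache_lookup(Gid) of
        {ok, Value} -> {ok, Value};
        miss        -> pipeline_get(Gid)
    end.
```

The trade-off is explicit: a cache hit may return slightly stale data, which is acceptable for hot read paths.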
18. SSP: Bucket
• Each bucket is an OTP application
• Buckets are largely generated
• XML -> SQL + PIQI -> Erlang
– Using XSLT
– Piqic
19. Piqi?
• Piqi is
• a data definition language
• a cross-language data serialization system compatible with Protocol Buffers
• Piqi-RPC — an RPC-over-HTTP system for Erlang
• Would be better if transport was pluggable
• http://piqi.org/
23. SSP: bucket implementation
• bucketX.erl
– include_lib("…/bucketX_accessors.hrl")
– verify_record(R)
– start/0 and start_link/0
– init/1
– get_fun(Version), del_fun(V), insert_fun(V),…
• bucketX_v1.erl
– del, insert, … (Gid, Shard, Filters)
– get mysql pool
– build some SQL
– emysql:execute(Poolname, Sql)
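The `get_fun(Version)`-style dispatch from the listing above can be sketched like this (hypothetical module; in the real layout the per-version implementations are separate `bucketX_vN` modules that build SQL and call `emysql:execute/2`):

```erlang
-module(bucket_dispatch_sketch).
-export([get_fun/1, get/2]).

%% Stand-ins for per-version implementations (bucketX_v1, bucketX_v2).
get_v1(Gid) -> {v1, Gid}.
get_v2(Gid) -> {v2, Gid}.

%% bucketX.erl-style dispatch: resolve a version to the fun implementing it.
get_fun(1) -> fun get_v1/1;
get_fun(2) -> fun get_v2/1.

%% The stable interface: callers never see which version ran.
get(Version, Gid) ->
    (get_fun(Version))(Gid).
```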
24. SSP: Versions
1. A bucket is versioned. The interface of a bucket is stable, but the implementation can vary
2. We can go up or down a version, migration is automatic
• Mirror-mode is introduced so we can write to multiple versions (but read from only one version)
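Mirror-mode can be sketched as follows (hypothetical `store/3` and `fetch/2` helpers; the real buckets issue version-specific SQL via emysql): writes fan out to every mirrored version, reads come only from the active one.

```erlang
-module(mirror_mode_sketch).
-export([write/3, read/2]).

%% Mirror-mode write: the record is stored under every mirrored version,
%% so a later version switch finds the data already in place.
write(Gid, Record, Versions) ->
    [{V, store(V, Gid, Record)} || V <- Versions].

%% Reads only ever touch the single active version.
read(Gid, ActiveVersion) ->
    fetch(ActiveVersion, Gid).

%% Hypothetical per-version storage calls.
store(Version, Gid, Record) -> {stored, Version, Gid, Record}.
fetch(Version, Gid) -> {ok, Version, Gid}.
```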
25. SSP: Shards (storage level)
1. GIDs (e.g. users) are sharded automatically.
• Each version might have multiple shards
2. Redundancy (of data) is handled by MySQL
{bucket, GID} -> {Version, Shard} mapping
• Version default: taken from config
• Shard default: the rule GID % shards
• The actual version/shard per GID is stored in the DB (cached)
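The {bucket, GID} -> {Version, Shard} lookup can be sketched like this (the `Overrides` map is a hypothetical stand-in for the per-GID mapping the real system stores in the DB and caches):

```erlang
-module(shard_sketch).
-export([locate/3]).

%% Resolve the {Version, Shard} for a GID: use the stored per-GID
%% mapping if one exists, otherwise fall back to the defaults
%% (version from config, shard from the rule GID rem NumShards).
locate(Gid, {DefaultVersion, NumShards}, Overrides) ->
    case maps:find(Gid, Overrides) of
        {ok, {Version, Shard}} -> {Version, Shard};
        error -> {DefaultVersion, Gid rem NumShards}
    end.
```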
26. SSP: Cache
• Each node has a private Memcached instance
• We store all data for a GID/bucket in this cache
• Filters applied after retrieving data from cache
• Don’t change data in storage outside of the SSP!
29. Challenge: controlled shutdown node
How do we shut down a node without losing jobs?
• Shutdown bucketX application on a node
• stop pipeline factories on this node (for bucketX)
• hand over work to other PF (on other nodes)
– couple of mnesia ring reads
– move ETS table contents to new PF
– remember which PF took over (so we can forward)
• If we go to another node, clone Pipeline (gen2 pri)
• remove this node from the lookup ring
• all PFs fix their hash range based on ring
• Because there is a race condition when handing over many (non-contiguous) blocks to one PF:
• sleep a while (actually, wait for the pipeline handovers)
30. Note: shutdown application
• If you terminate an application, all processes that were started (even if not linked) are terminated!
• This is a bit hidden in the documentation of application:start/2 and stop/1
• so we need to explicitly set the group_leader to something that never shuts down:

init(#state{} = S) ->
    group_leader(whereis(init), self()),
    {ok, S}.
31. Challenge: shutdown pipeline
• The Pipeline process that we spawn per GID needs to shut down when done (to use less memory)
• When is it actually done?
• Work might be assigned to the Pipeline just when the Pipeline decides it is done: race conditions!
32. Challenge: shutdown pipeline (2)
• All requests for a GID are handled by a single Pipeline Factory
• The pipeline issues a ‘work done’ command to the PF with a ‘CommandCounter’
• The PF maintains an ETS table
• It looks up whether the registered CommandCounter for that GID matches the reported number
• If so: it tells the Pipeline to die
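The counter handshake can be sketched as follows (hypothetical names; only the comparison logic is shown, not the message flow): the PF counts commands per GID, and a pipeline reporting ‘work done’ is only allowed to die if no further commands were assigned in the meantime.

```erlang
-module(pf_counter_sketch).
-export([new/0, assign/2, work_done/3]).

%% The Pipeline Factory keeps a per-GID command counter in ETS.
new() ->
    ets:new(pf_counters, [set, public]).

%% Each command assigned to a GID's pipeline bumps its counter
%% (creating the entry on first use).
assign(Tab, Gid) ->
    ets:update_counter(Tab, Gid, 1, {Gid, 0}).

%% The pipeline reports the counter of the last command it executed.
%% If it still matches the registered counter, nothing raced in and the
%% pipeline may die; otherwise it must keep running.
work_done(Tab, Gid, ReportedCount) ->
    case ets:lookup(Tab, Gid) of
        [{Gid, ReportedCount}] -> die;
        _                      -> keep_running
    end.
```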
33. Challenge: high uptime
• We want continuous usage of SSP
– Even while upgrading bucket versions
– So there can be multiple versions running
simultaneously
• Take care when creating closures
• Atomic behavior per GID
36. Performance
• Currently we run SSP in ‘shadow’ mode, so no real data yet. Making realistic benchmarks is quite a lot of work.
• Latency (local machine):
– 6-26ms to do a GET request on a primary key (cache miss)
– 0.6ms with a cache hit
– Cache stores Erlang terms currently (term_to_binary)
• Always read from cache
– Does not detect changes in storage done outside SSP
37. Performance
• Requests (local):
– Getting from cache at about 13.5K req/sec
• elibs_benchmark:test_fun(gidlog_get, fun() -> gidlog:get(123456) end, 10, 10000).
– Getting from MySQL at about 615 req/sec (incl. cache miss)
• elibs_benchmark:test_fun(gidlog_get, fun() -> {_,_,C} = os:timestamp(), gidlog:get(C) end, 10, 100).
– ~2 SSP machines can saturate a MySQL machine
– 8K writes/sec for 2 MySQL + 4 SSP machines (old hardware)
39. Lessons learned (1)
• There are many good Open Source libraries
• Emysql: we have added transaction support
• Eep0018: fast JSON encoder/decoder (based on yajl)
• Estatsd: graphite-capable monitoring
• Poolboy: Erlang worker pool factory (used for memcached)
• Twig/Lager: logging (syslog)
40. Lessons learned (2)
• Mnesia is great to replicate state across machines
• Faster local lookups
• Less error prone
• Encapsulate all Mnesia usage in a module
• Adding nodes to Mnesia
• Use ram_copies
• Transactions are great
• We deploy an Erlang cluster (with Mnesia replication) only inside a single data center
• Not across unreliable connections!
41. Lessons learned (3)
• XML + XSD + XSLT are great to define API
• They might have a bad name, but work great
• Can transform in any other format
• Used to generate documentation
Todo:
• generate more code (Buckets)
• write a gen_bucket behaviour
• don’t start with generating code
42. Lessons learned (4)
• Rebar is great
• Compilation is pretty convenient, but the best part is the “dependencies”
• Also the worst part
• We have proposed two improvements:
• Allow different projects to share dependencies (major speedup for compiling)
• Smarter version conflict resolution (semantic versioning: [“>= 1.3.1”, “< 2.0.0”])
43. Lessons learned (5)
• We use #records{} for all APIs
– Piqi input/output
– Stable and well-defined
– Will move to ProtocolBuffers
• Use OTP applications everywhere
– Start/stop stuff
– See started apps: application:which_applications()
• Terminate on fatal errors
– Memcached down: terminate all buckets, don’t try to recover (prevents overloading the DB)