When we were preparing for the PlayStation 4 launch, we faced one hard problem: how to make our games and videos storage system super-fast, highly available, and fault tolerant. Moving from a relational database to Cassandra is not easy, and it is even harder if you want to support different search and query use cases. Join our talk to learn how we managed to build a highly available platform that supports tens of millions of active users and can execute multiple user-specific queries in less than a millisecond. And all of it without using Solr or ElasticSearch.
About the Speaker
Alexander Filipchik, Sony
Alex spent the last 4 years of his life building the next generation of PlayStation Network. He is honored to be a part of a small team of engineers who managed to build from scratch a platform that scaled from 0 users to 1 million PS4s in just 1 day, has been adding 1.5 million new devices per month since, and has now reached tens of millions of active users. He is passionate about technology, innovation, walking his dog, and building scalable software using Cassandra.
2. Who are we?
Alexander Filipchik (PSN: LaserToy)
Principal Software Engineer at Sony Interactive Entertainment
Dustin Pham (PSN: quibfan)
Principal Software Engineer at Sony Interactive Entertainment
3. The Rise of PlayStation4
PlayStation Network is big and growing.
– Over 65 million monthly active users.
– Hundreds of millions of users.
– A Lot of Services.
– More than 40M devices
4. PlayStation 4 growth
• Pre-warm – November 2013, a couple thousand PS4s for Taco Bell.
• Launch day – 1,000,000 PS4s several days later.
• Adding 1.3 million devices a month.
6.
Year  Unicorn’s Tech                      Our Tech
2009  MySql
2011  MongoDB/MySql
2012  Redis/MySql                         PS3: MySQL + Memcached, Solr
2013  Redis/Postgres                      MySQL + Memcached/Cassandra, Solr
2014  Redis/Shards For Postgres + MySql   MySQL + Memcached/Cassandra, Solr
2015  Riak/Shards For Postgres + MySql    MySQL + Memcached/Cassandra + Redis, Solr
2016  Who knows what/Cassandra            MySQL + Memcached/Cassandra + Redis, Solr
8. What is it?
• It is an online Games store for PlayStation
• To give you an idea:
– Revenue went from 800M per year 4 years ago to
almost 5B last year
– It is making more than all of Nintendo
• And it is not just eCommerce, it is a whole set of
services – Video Streaming, Game Streaming,
Social, etc
9. Some Challenges
• We are not Amazon, so content should be delivered
right away
• What you bought is not just a transactional record
that the user checks once in a while. Multiple services
need access to this information in real time
• Which means it should be
– highly available
– fast
– and easy to scale
10. The Problem
• The legacy system uses a well-known relational DB
to handle our transactions.
• It is state-of-the-art software that doesn’t scale
well in our circumstances.
• We wanted to allow clients to run any query
without consulting hundreds of DBAs first.
• Sharding sounds like a pain.
• Multiple regions should be easy.
16. Some observations
• For us most load comes from user-centric
activities
• So, we mostly query within a user’s dataset
• Which means we don’t need to join across
users often
18. So, we came up with a schema
Account1 Json 1 Json 2 …. Json n
Now it is horizontally scalable
We have in-row transactions
Reads are very fast – no joins
Now we need to propagate user purchases
from the relational DB to C*
And figure out how to support queries
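A minimal sketch of the row-per-account layout described on this slide, using an in-memory dictionary as a stand-in for the C* table. The class name `AccountStore`, the column names, and the sample documents are hypothetical and are not from the talk. It only illustrates why a single row read needs no joins, and what "fetch everything and process it" looks like for a user-level query.

```python
import json


class AccountStore:
    """Hypothetical stand-in for the C* table: one row per account,
    one JSON document per column."""

    def __init__(self):
        self._rows = {}  # account_id -> {column_name: JSON string}

    def add_purchase(self, account_id, column, doc):
        # A write touches exactly one row, so it stays local to one account.
        self._rows.setdefault(account_id, {})[column] = json.dumps(doc)

    def read_row(self, account_id):
        # A single row lookup returns every document for the user; no joins.
        return [json.loads(raw) for raw in self._rows.get(account_id, {}).values()]


def query(store, account_id, predicate, sort_key):
    """Naive user-level query: pull the whole row, then filter and sort in memory."""
    docs = store.read_row(account_id)
    return sorted((d for d in docs if predicate(d)), key=sort_key)


store = AccountStore()
store.add_purchase("Account1", "json-1", {"title": "Beta Racer", "type": "game", "price": 60})
store.add_purchase("Account1", "json-2", {"title": "Alpha Quest", "type": "game", "price": 20})
store.add_purchase("Account1", "json-3", {"title": "Some Movie", "type": "video", "price": 10})
store.add_purchase("Account2", "json-1", {"title": "Other Game", "type": "game", "price": 5})

games = query(store, "Account1", lambda d: d["type"] == "game", lambda d: d["title"])
titles = [d["title"] for d in games]
```

Every query here deserializes the full row, which is the cost the following slides try to reduce.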
19. Solving the Puzzle
• There are a number of ways to notify C* about
account-level changes in the source of truth -
let’s not talk about them for now.
• Let’s talk about queries.
20. Going deeper
• What client wants:
– Search, sort, filter
• What can we do:
– Use secondary Index
– Fetch everything in memory and process it
– How about…
21. Solr?
• Can we use it to support our flexible user level
query requirement?
• Not really:
– Data has high cardinality properties
– And it will not be very fast because Solr is optimized
for a different use case
– It will be another set of systems to support and scale
22. What can We Do?
• We can index, and writing an indexer sounds like
a lot of fun
• Wait, someone already had the fun and made:
23. Schema v2
Account1 Json 1 Json 2 …. Json n
Account1 Json 1 Json n Version
Now we can search on anything inside the row that represents the user
The index is small and it is fast to pull it from C*
But we are still pulling all these bytes all the time
And what if 2 servers write to the same row?
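One common way to handle two servers writing the same row is optimistic concurrency on the row's version: a write succeeds only if the version it read is still current. The sketch below assumes that approach; the talk does not say which mechanism was actually used. `VersionedRowStore`, `build_index`, and the sample documents are hypothetical, and the tiny inverted index stands in for whatever indexer the speakers chose.

```python
class VersionConflict(Exception):
    """Raised when a writer's expected version is stale."""


def build_index(docs):
    """Tiny inverted index: (field, value) -> list of document ids."""
    index = {}
    for doc_id, doc in docs.items():
        for field, value in doc.items():
            index.setdefault((field, value), []).append(doc_id)
    return index


class VersionedRowStore:
    """Hypothetical stand-in for a C* row holding documents, index, and version."""

    def __init__(self):
        self._rows = {}  # account_id -> (version, docs, index)

    def read(self, account_id):
        return self._rows.get(account_id, (0, {}, {}))

    def write(self, account_id, expected_version, docs, index):
        current_version, _, _ = self.read(account_id)
        if current_version != expected_version:
            raise VersionConflict(f"expected {expected_version}, found {current_version}")
        self._rows[account_id] = (expected_version + 1, docs, index)
        return expected_version + 1


store = VersionedRowStore()

# Two servers read the same row at version 0.
version_a, docs_a, _ = store.read("Account1")
version_b, docs_b, _ = store.read("Account1")

# Server A writes first and wins.
docs_a = {**docs_a, "game-1": {"type": "game", "title": "Alpha"}}
store.write("Account1", version_a, docs_a, build_index(docs_a))

# Server B's write is rejected; it re-reads, reapplies its change, and retries.
new_doc = {"type": "video", "title": "Beta"}
try:
    docs_b = {**docs_b, "video-1": new_doc}
    store.write("Account1", version_b, docs_b, build_index(docs_b))
    conflict = False
except VersionConflict:
    conflict = True
    version_b, latest_docs, _ = store.read("Account1")
    docs_b = {**latest_docs, "video-1": new_doc}
    store.write("Account1", version_b, docs_b, build_index(docs_b))

final_version, final_docs, final_index = store.read("Account1")
```

In Cassandra itself, a conditional update (a lightweight transaction with an `IF` clause on the version column) could play the role of the version check, though it adds latency to each write.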
24. Distributed Cache?
• It is nice to keep things as close to our MicroService as
possible
• In something that can do fast reads
• And we have a lot of RAM these days
• So we can have a beefy Memcached/Redis/Aerospike
deployment
• And still pay the network penalty and have to think about scaling it
• What if…
25. Soft State Pattern
• Cache lives inside the MicroService, so no network penalty
• Requests for the same user are processed on the same
instance, so we can save network roundtrip and also have
some optimizations done (sequencing)
• Changes to state are also replicated to the storage (C*) and
are identified with some version number
• If an instance goes down, the user session is moved to
another live instance automatically
• It is much easier to scale up Microservices than C*
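The pattern above can be sketched roughly as follows, assuming simple hash-based routing of accounts to instances. The talk does not describe the actual routing or failover mechanism, so `Router`, `Instance`, and `Storage` are hypothetical names, and modulo hashing is only the simplest possible choice.

```python
import hashlib


class Storage:
    """Hypothetical stand-in for C*: account_id -> (version, state)."""

    def __init__(self):
        self.rows = {}


class Instance:
    """A microservice instance that keeps user state in process memory."""

    def __init__(self, name, storage):
        self.name = name
        self.storage = storage
        self.cache = {}  # account_id -> (version, state)

    def get(self, account_id):
        # On a cache miss, rebuild the soft state from durable storage.
        if account_id not in self.cache:
            self.cache[account_id] = self.storage.rows.get(account_id, (0, {}))
        return self.cache[account_id]

    def update(self, account_id, key, value):
        version, state = self.get(account_id)
        updated = (version + 1, {**state, key: value})
        self.cache[account_id] = updated          # local soft state
        self.storage.rows[account_id] = updated   # replicated to storage with a version
        return updated


class Router:
    """Sends every request for an account to the same live instance."""

    def __init__(self, instances):
        self.instances = list(instances)

    def route(self, account_id):
        digest = hashlib.sha256(account_id.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % len(self.instances)
        return self.instances[slot]

    def remove(self, instance):
        # Simulates an instance going down.
        self.instances.remove(instance)


storage = Storage()
router = Router(Instance(f"instance-{i}", storage) for i in range(1, 4))

owner = router.route("Account1")
owner.update("Account1", "last_purchase", "game-1")
same_owner = router.route("Account1") is owner

router.remove(owner)                         # the owning instance dies
new_owner = router.route("Account1")         # session moves to a live instance
recovered = new_owner.get("Account1")        # state is reloaded from storage
```

Modulo routing remaps many accounts whenever the instance count changes; consistent hashing is a common way to limit that churn.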
26. Or in Other Words
[Diagram: Instance 1, Instance 2, and Instance 3 hold Account 1 through Account 6 in memory, each account with its Version. Cassandra stores one row per account (Account1 … Account n), each with its jsons and Version.]
27. But what if cross user data changes?
• Product was renamed
• Game image just got updated
• And so on…
29. Cross User Data sync
• A process that can detect a change in the data
and notify all the affected users
• Simple solution: a reverse lookup table from data
to users
• And you can optimize it
• Users don’t have to see updates at the same time
• It updates the account’s version, so reindexing can
be done lazily
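A minimal sketch of the reverse lookup and lazy reindexing idea, using in-memory dictionaries in place of the metadata tables. `DataSync`, `CachedAccount`, and their methods are hypothetical names chosen for illustration. A product change bumps the version of every affected account, and each account's index is rebuilt only when it is next used.

```python
class DataSync:
    """Hypothetical data-sync process with a reverse lookup from product to accounts."""

    def __init__(self):
        self.product_to_accounts = {}  # product_id -> set of account ids
        self.account_versions = {}     # account_id -> metadata version

    def record_purchase(self, account_id, product_id):
        self.product_to_accounts.setdefault(product_id, set()).add(account_id)
        self.account_versions.setdefault(account_id, 0)

    def product_changed(self, product_id):
        # Bump the version of every account that owns the changed product.
        affected = self.product_to_accounts.get(product_id, set())
        for account_id in affected:
            self.account_versions[account_id] += 1
        return set(affected)


class CachedAccount:
    """Per-account soft state that remembers which version it was indexed at."""

    def __init__(self, account_id):
        self.account_id = account_id
        self.indexed_version = 0

    def ensure_fresh(self, sync):
        """Reindex lazily: only when the metadata version has moved on."""
        latest = sync.account_versions.get(self.account_id, 0)
        if latest == self.indexed_version:
            return False
        # A real service would rebuild the account's index here.
        self.indexed_version = latest
        return True


sync = DataSync()
sync.record_purchase("Account1", "Product1")
sync.record_purchase("Account2", "Product2")
sync.record_purchase("Account3", "Product1")

cached1 = CachedAccount("Account1")
cached2 = CachedAccount("Account2")

affected = sync.product_changed("Product1")   # e.g. the product was renamed
first_check = cached1.ensure_fresh(sync)      # affected account reindexes once
second_check = cached1.ensure_fresh(sync)     # nothing to do the second time
untouched = cached2.ensure_fresh(sync)        # unaffected account never reindexes
```

Because the work happens on the next access rather than at change time, users see the update at different moments, which the slide accepts as a trade-off.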
30. High level
[Diagram: Accounts Cassandra stores one row per account (Account1 … Account n) with its jsons and Version. A MetaData Versions table maps each account (Account 1 … Account n) to a Version. MetaData Cassandra stores a reverse lookup from each product (Product 1 … Product n) to the accounts that own it. A Data-sync microservice receives MetaData Updates and issues account updates. Service instances hold accounts (Account 1, Account 6) with their Version and check whether the metadata was updated.]
31. Was Dustin Wrong?
• Tens of billions of documents
• Average API latency is below 10ms
• Actual search latency is in microseconds
• Hundreds of thousands of documents are indexed per second
• Another system based on the same idea indexes
millions of documents per second on 18 servers
• And most importantly:
– No major incidents in production.