PlayFab runs a LiveOps backend services platform that handles more than 35 million monthly active players, on more than 450 live games, from studios and publishers that include Miniclip, Rovio, Hyper Hippo, Capcom, Bandai-Namco, and Atari. Getting to that level of scalability hasn’t been easy, and this talk describes the times when PlayFab nearly went down – and what architecture changes we needed to make each time to reach the next level of growth. This talk also shares some of the unique challenges of operating a shared platform, where problems are often not PlayFab’s fault, but always PlayFab’s responsibility, including game bugs that look like DDoS attacks, platform partners who break their APIs, and the joys of cascading server failures.
9. Create and manage item catalog
• Items can have:
– Limited uses
– An expiration time
– Custom data
– Default prices in multiple currencies
– Tags to help organize
• Limited edition items have enforced scarcity
• Catalogs can be imported / exported as JSON data
• Update catalog from server dynamically at any time
1/18/17 PlayFab Confidential 9
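The catalog features on this slide can be sketched as a small data model with a JSON round trip. This is a minimal Python illustration, not PlayFab's actual catalog schema: every field name (item_id, uses, expires_after_seconds, prices, tags, custom_data) and both helper functions are assumptions made for the example.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List, Optional


@dataclass
class CatalogItem:
    # All field names are hypothetical, chosen for illustration only.
    item_id: str
    uses: Optional[int] = None                   # limited uses; None = unlimited
    expires_after_seconds: Optional[int] = None  # expiration time; None = never
    prices: Dict[str, int] = field(default_factory=dict)  # currency code -> price
    tags: List[str] = field(default_factory=list)
    custom_data: Dict[str, str] = field(default_factory=dict)


def export_catalog(items):
    """Serialize a list of catalog items to JSON text."""
    return json.dumps([asdict(item) for item in items], indent=2)


def import_catalog(text):
    """Rebuild catalog items from JSON text produced by export_catalog."""
    return [CatalogItem(**raw) for raw in json.loads(text)]


catalog = [
    CatalogItem("health_potion", uses=3, prices={"GO": 50, "GM": 1}, tags=["consumable"]),
    CatalogItem("xp_boost", expires_after_seconds=3600, prices={"GM": 5}),
]
round_tripped = import_catalog(export_catalog(catalog))
```

Keeping the catalog as plain data is what makes import, export, and server-side updates cheap: the game reads the latest definitions instead of shipping them in the client.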
11. Time-based leaderboards for tournaments
• Leaderboards can be reset on a fixed schedule (daily, weekly, monthly) or manually at any time
• When leaderboards reset, the list of players at the time of the reset is archived
• Use leaderboard standing at time of reset to issue prizes, determine tournament winners
{"PlayerId":"4AC350E4134A36C8","Value":620}
{"PlayerId":"3CC3A4D866D9580A","Value":620}
{"PlayerId":"D15EFFB805045CFA","Value":620}
{"PlayerId":"B8271B32A8035722","Value":620}
{"PlayerId":"B188B845940ED6D3","Value":620}
{"PlayerId":"321DBA3528144483","Value":500}
{"PlayerId":"EA141B9B63B53583","Value":500}
{"PlayerId":"DC01857A8D90B2F5","Value":500}
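The reset-and-archive behavior can be sketched in a few lines of Python. This is an illustration under assumptions, not PlayFab's implementation: the Leaderboard class, its method names, the "keep the best value" rule, and the tie-break by player ID are all hypothetical. Only the PlayerId/Value record shape comes from the archive sample above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Leaderboard:
    scores: Dict[str, int] = field(default_factory=dict)
    archives: List[List[dict]] = field(default_factory=list)

    def report(self, player_id, value):
        # Assumption: keep each player's best value for the current period.
        self.scores[player_id] = max(value, self.scores.get(player_id, value))

    def standings(self):
        # Highest value first; ties broken by player ID so output is stable.
        ordered = sorted(self.scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [{"PlayerId": pid, "Value": val} for pid, val in ordered]

    def reset(self):
        # Archive the standings as they were at reset time, then start fresh.
        archived = self.standings()
        self.archives.append(archived)
        self.scores = {}
        return archived


board = Leaderboard()
board.report("4AC350E4134A36C8", 620)
board.report("321DBA3528144483", 500)
board.report("321DBA3528144483", 450)  # lower than the stored best, so ignored
final = board.reset()
winner = final[0]["PlayerId"]
```

Prizes are issued from the archived list returned by reset, not from the live board, so late score reports cannot change the tournament result.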
12. Host session-based game servers
• Upload custom game server builds
• Configure multiplayer game modes
• Select regions where servers should be hosted
• Servers will scale automatically based on load
13. Server-hosted JavaScript
• Write server-based code without a dedicated game server
• Easy upload of JavaScript custom logic
• Make changes to your game behavior without requiring client updates
• Server authentication protects against client-side cheating
• Access the more powerful Server API (with features not available on the client)
• GitHub integration for easy revision control
Applications include:
• Granting player rewards
• Validating player actions
• Resolving interactions between players
• Managing asynchronous game turns
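As an illustration of "validating player actions", here is a minimal sketch of a server-side handler that refuses to trust the reward a client claims. It is written in Python rather than CloudScript's JavaScript, and the handler name, the state fields, and the MAX_COINS_PER_LEVEL rule are all hypothetical.

```python
MAX_COINS_PER_LEVEL = 100  # hypothetical game rule, enforced only on the server


def complete_level(player_state, level, claimed_coins):
    """Validate a 'level complete' request before granting the reward."""
    # Players must finish levels in order.
    if level != player_state["highest_level"] + 1:
        return {"ok": False, "reason": "level out of sequence"}
    # Reject rewards the game could never have produced.
    if not 0 <= claimed_coins <= MAX_COINS_PER_LEVEL:
        return {"ok": False, "reason": "implausible reward"}
    player_state["highest_level"] = level
    player_state["coins"] += claimed_coins
    return {"ok": True, "coins": player_state["coins"]}


state = {"highest_level": 2, "coins": 10}
honest = complete_level(state, 3, 40)
cheat = complete_level(state, 4, 1_000_000)
```

Because the rule lives on the server, it can be tightened or changed without shipping a client update.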
14. Trigger actions from real-time events
• Trigger actions in response to real-time events
• Events can come from client, server, or third party vendors
• Rich set of actions including running CloudScript or sending push notifications
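The event-to-action idea can be sketched as a tiny rule engine. This is a hedged Python illustration, not PlayFab's rules system: the RuleEngine class, the event field names, and the send_push action are assumptions invented for the example.

```python
from typing import Callable, Dict, List


class RuleEngine:
    """Maps event names to the actions that should run when they arrive."""

    def __init__(self):
        self._rules: Dict[str, List[Callable[[dict], None]]] = {}

    def on(self, event_name, action):
        self._rules.setdefault(event_name, []).append(action)

    def publish(self, event):
        # Run every action registered for this event; return how many fired.
        actions = self._rules.get(event["name"], [])
        for action in actions:
            action(event)
        return len(actions)


sent = []


def send_push(event):
    # Stand-in for a real push notification call.
    sent.append(f"push to {event['player_id']}: welcome back!")


engine = RuleEngine()
engine.on("player_logged_in", send_push)
fired = engine.publish({"name": "player_logged_in", "player_id": "P1", "source": "client"})
ignored = engine.publish({"name": "unknown_event", "player_id": "P1", "source": "server"})
```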
15. Scheduled jobs & bulk player actions
• Schedule jobs to run in the background
• Run once, or on a recurring basis
• Schedule now, or in the future
• Run tasks for each player in a segment, or for the title
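The scheduling options above (run once or recurring, now or later, per player in a segment) can be sketched as follows. This is a simplified Python model with hypothetical names (Job, run_due_jobs) and explicit timestamps in place of a real clock; it is not how PlayFab's scheduler is built.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Job:
    task: Callable[[str], None]       # called once per player ID
    next_run: float                   # timestamp in seconds
    interval: Optional[float] = None  # None = run once


def run_due_jobs(jobs, segment, now):
    """Run each due job for every player in the segment; return jobs still scheduled."""
    remaining = []
    for job in jobs:
        if job.next_run <= now:
            for player_id in segment:
                job.task(player_id)
            if job.interval is None:
                continue  # one-off job is finished, drop it
            job.next_run += job.interval
        remaining.append(job)
    return remaining


granted = []
jobs = [
    Job(task=granted.append, next_run=0.0),                    # once, now
    Job(task=granted.append, next_run=0.0, interval=86400.0),  # daily
    Job(task=granted.append, next_run=3600.0),                 # once, in the future
]
jobs = run_due_jobs(jobs, segment=["P1", "P2"], now=0.0)
```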
16. Full-text event search
• Filter and search through recent event history
• Zoom in on specific time period
• Look for specific players, event types, or error conditions
17. Remotely manage game configuration
• Store game configuration on the server to modify behavior over time
• Coming soon: change configuration based on player segment
19. Daily Active Players (2016)
[Chart: daily active players by week, Jan 1 to Dec 30, 2016; y-axis from 0 to 4,500,000]
21. Tips for running a live service with a small team
• Fully leverage the cloud and other SaaS services
• Continuous integration
• Frequent and automated deployment to live
• All engineers take turns being “on-call”
30. How the cloud has changed deployments
Scenario A: Successful deployment
[Diagram: deployment timelines for dedicated hardware vs. cloud, with segments labeled Build A, Build B, and Downtime]
31. How the cloud has changed deployments
Scenario B: Rollback needed
[Diagram: rollback timelines for dedicated hardware vs. cloud, with segments labeled Build A, Build B, Build A, and Downtime]
32. Thinking about #fails
• Not all failure is created equal
• #fails range from praiseworthy to blameworthy
• Types of failure:
– Failures in routine operations which can be prevented
– Failures in complex operations which can’t be avoided, but can be managed so they don’t turn into catastrophe
– Unwanted outcomes in research, which generate knowledge
• Goals with failure should be:
– Detect early
– Analyze deeply
– Design experiments or pilots to produce them
• Everyone on the team must feel safe admitting & reporting failures
Source: Strategies for Learning from Failure. Harvard Business Review. April 2011.
33. Our most common sources of failure
• Operator errors (e.g., mis-configuration)
• Design errors (e.g., cascading failures)
• Unexpected situations (e.g., surprising customer actions)
34. Misconfiguration failure
• Failure:
– Matchmaker server was down for 13 minutes
• Cause:
– We have a primary and a “hot” backup
– If the primary fails, traffic should switch to backup
– Route 53 was misconfigured to route traffic (correctly) to primary, but check health on the backup (incorrectly)
– When the primary did finally fail, traffic didn’t switch
• Solution:
– Short-term: Fix the configuration
– Long-term: Automate health-check integrity
[Diagram: Route 53 (DNS service) routes traffic to the Matchmaker Primary, while its health check points at the Matchmaker Backup (marked X)]
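One way to automate health-check integrity is a periodic audit that flags any record whose health check watches a different endpoint than the one receiving traffic. The sketch below is a hedged Python illustration: it works on plain dictionaries with made-up field names and addresses, not on the real Route 53 API, and a production version would load the records from the DNS provider.

```python
def find_health_check_mismatches(records):
    """Return names of records whose health check targets a different
    endpoint than the one traffic is routed to."""
    return [
        record["name"]
        for record in records
        if record["health_check_target"] != record["traffic_target"]
    ]


# Hypothetical configuration reproducing the misconfiguration described above:
# traffic goes to the primary, but its health check watches the backup.
records = [
    {"name": "matchmaker-primary", "traffic_target": "10.0.0.1", "health_check_target": "10.0.0.2"},
    {"name": "matchmaker-backup", "traffic_target": "10.0.0.2", "health_check_target": "10.0.0.2"},
]
mismatches = find_health_check_mismatches(records)
```

Run on a schedule, a check like this surfaces the error while the primary is still healthy, instead of at the moment failover is needed.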
35. Design failure
• Failure: 2-minute system-wide outage
• Cause:
– A game was running a test of item consumption
– Design issue: calling “consume” loaded entire inventory
– Result: 100+ requests for a 13K-item inventory in 1 minute
– This blocked API servers, waiting for the database
– DynamoDB then auto-scaled, so calls unblocked, and the API servers then pegged CPU at 100% processing the load
– This meant servers stopped responding to health checks
– Servers were all then auto-terminated
• Solution:
– Short-term: we rolled back, which redeployed servers
– Long-term: API throttles, paging data requests
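The two long-term fixes can be sketched together: a token-bucket throttle that caps how fast a caller can hit an API, and a paging helper that returns a large inventory in bounded chunks. This is a minimal Python sketch under assumptions: the class and function names, the limits, and the use of explicit timestamps are all hypothetical, and a token bucket is only one common throttling technique, not necessarily the one PlayFab adopted.

```python
class TokenBucket:
    """Allow short bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def page(items, page_size, start=0):
    """Return one page plus the index of the next page (None when done)."""
    chunk = items[start:start + page_size]
    next_start = start + page_size if start + page_size < len(items) else None
    return chunk, next_start


bucket = TokenBucket(capacity=2, refill_per_second=1)
decisions = [bucket.allow(now=0.0) for _ in range(3)]  # third call is throttled
later = bucket.allow(now=1.0)                          # one token has refilled

inventory = list(range(13000))  # stand-in for a 13K-item inventory
first_page, cursor = page(inventory, page_size=100)
last_page, end = page(inventory, page_size=100, start=12900)
```

With paging, a single request costs at most one page of database reads, so one heavy inventory can no longer stall every API server at once.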
36. Complexity failure
• Failure: ElasticSearch went down for 2 days
• Cause:
– AWS ElasticSearch was trying to scale our cluster
– Instead of adding nodes, it replaces them
– This requires moving all data from node to node
– They don’t throttle data moves, so CPUs went to 100%
– This triggered health check fails, and node termination
– New nodes trigger index rebalancing, but the cluster was already rebalancing because of the scaling
– At one point we were losing 4 nodes every 30 minutes
• Solution:
– Short-term #1: turn off writes to catch up; not enough
– Short-term #2: spin up new cluster, aim writes at new cluster, back-fill w/ data from Kinesis queue
– Long-term: Move off AWS ES onto our own ES cluster; customize configuration based on experience
[Charts: ES writes per second; ES delay in seconds]
37. Unexpected failure
• Failure:
– Sudden surge of traffic to our doc site, resembling a DDoS attack
• Cause:
– Customer was pinging our doc site repeatedly as a “health check”
– They ran a customer acquisition campaign so traffic spiked
– We saw these unusual queries, with a strange user agent string
– We assumed it was an attack, so we quarantined that user agent string
– This had the effect of taking down their game!
• Solution:
– Restore their traffic; explain to developer why this is a bad idea
38. Other common customer fails
• Not using receipt validation
• “We’ll wait, and if it’s successful, we’ll invest in tools”
• Not running events
• Launching without real-time data
39. Fake receipts are a big problem
[Chart: legit receipts vs. fake receipts; y-axis from 0 to 200,000]
41. Setting up a live event
• Move art/assets into Unity AssetBundle
• Move planet config to Google Sheets
• Export data to PlayFab catalogs
43. LiveOps depends on tools
How much can you do without writing code?
The Live Ops Tools Continuum
Writing SQL
Hacking game DB
Web tools
Modify game params