While the worlds of ecommerce, search, and application platforms might seem as far from the gaming industry as one might imagine, lessons learned in those environments are surprisingly applicable to online games. Real-time games in particular face many of the same challenges faced -- and solved -- by companies like eBay and Google. They are extremely latency-sensitive, are subject to unpredictable growth and scalability curves, and exhibit extremely spiky load profiles. The real-time player experience is critical to the success of a game -- if a game is down or slow, players will leave and never come back. This session will discuss how experiences with large-scale websites like eBay and Google have informed our approach to building, testing, and operating real-time games at KIXEYE.
This session tells several war stories from eBay and Google about Scaling Code, Scaling Infrastructure, Scaling Performance, and Scaling DevOps. It then connects those experiences with what we are now doing in our next-generation gaming platform at KIXEYE.
See also Part 1 of this topic, presented at QCon San Francisco 2013.
Everything I Learned About Scaling Online Games I Learned at Google and eBay [Part 2, QCon Beijing 2014]
1. Everything I Learned About Scaling Online Games I Learned at Google and eBay
Randy Shoup
@randyshoup
linkedin.com/in/randyshoup
2. Background
CTO at KIXEYE
• Real-time strategy games for web and mobile
Director of Engineering for Google App Engine
• World's largest Platform-as-a-Service
Chief Engineer at eBay
• Multiple generations of eBay's real-time search infrastructure
3. Real-Time Strategy Games are …
• Real-time
• Spiky
• Computationally intensive
• Constantly evolving
• Constantly pushing boundaries
→ Technically and operationally demanding
4. How to Scale
Scaling Code
Scaling Infrastructure
Scaling Performance
Scaling DevOps
5. How to Scale
Scaling Code
Scaling Infrastructure
Scaling Performance
Scaling DevOps
6. Embrace Open Source
Try someone else's code first
• Faster to get started, lower development cost
• Open source projects are often higher quality, more extensible, better tested
• Take advantage of talent outside your company
Avoid "Not-Invented-Here" Attitude
• (-) Google and eBay "exceptionalism"
• Default has been to write it in-house instead of reuse and contribute
7. Embrace Standard Data Formats
Use standard formats
• Well-tested and widely used
• Internationalization from the beginning
Time in UTC
• (-) eBay and Google use local US-Pacific time
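The "Time in UTC" rule can be sketched as follows. This is a minimal illustration, not code from eBay, Google, or KIXEYE: the function names `to_storage` and `for_display` are hypothetical, and a fixed UTC offset stands in for real time-zone rules to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone


def to_storage(dt: datetime) -> str:
    """Normalize a timezone-aware datetime to an ISO 8601 string in UTC."""
    if dt.tzinfo is None or dt.utcoffset() is None:
        # A naive datetime has no defined instant, so refuse to guess.
        raise ValueError("refusing to store a naive datetime")
    return dt.astimezone(timezone.utc).isoformat()


def for_display(stored: str, offset_hours: int) -> datetime:
    """Convert a stored UTC string to a viewer's offset at the last moment."""
    viewer_zone = timezone(timedelta(hours=offset_hours))
    return datetime.fromisoformat(stored).astimezone(viewer_zone)
```

The point of the split is that storage and comparison only ever see UTC, while local time appears only at the display edge.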
8. Embrace Standard Data Formats
Character set in UTF-8
• (-) 5+ years to convert eBay site from ISO-8859-1 (Western European only) to Unicode
Structured data format
• Explicit structure with associated schema
• (+) Google uses protocol buffers for schema, serialization, storage
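Both points on this slide can be illustrated with a small sketch. It is only an illustration under stated assumptions: the `Item` record and its fields are hypothetical, and a typed dataclass serialized as UTF-8 JSON stands in for a real schema language such as protocol buffers (this is not the protocol buffers API).

```python
import json
from dataclasses import asdict, dataclass


def fits_latin1(text: str) -> bool:
    """Report whether text survives encoding as ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


@dataclass(frozen=True)
class Item:
    """Hypothetical record with an explicit, checked structure."""
    item_id: int
    title: str

    def __post_init__(self):
        # Reject values that do not match the declared field types.
        if not isinstance(self.item_id, int) or isinstance(self.item_id, bool):
            raise TypeError("item_id must be an int")
        if not isinstance(self.title, str):
            raise TypeError("title must be a str")


def serialize(item: Item) -> bytes:
    """Serialize to UTF-8 bytes; every Unicode title is representable."""
    return json.dumps(asdict(item), ensure_ascii=False).encode("utf-8")


def deserialize(data: bytes) -> Item:
    """Decode UTF-8 bytes and re-validate against the declared structure."""
    return Item(**json.loads(data.decode("utf-8")))
```

A title such as "北京" fails `fits_latin1` but round-trips through `serialize` and `deserialize` unchanged, which is the kind of gap a late character-set migration has to close.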
9. Development Discipline
Quality, Reliability, Scalability are "Priority-0 features"
• Equally important to users as product features and engaging user experience
Developers responsible for
• Features
• Quality
• Performance
• Reliability
• Manageability
10. Development Discipline
Developers write tests and code together
• Continuous testing of features, performance, load
• Confidence to make risky changes
• Catch bugs earlier, fail faster
"Don't have time to do it right"?
• WRONG: Don't have time to do it twice (!)
• The more constrained you are on time and resources, the more important it is to do it solidly the first time
11. How to Scale
Scaling Code
Scaling Infrastructure
Scaling Performance
Scaling DevOps
15. Reactive Servers
Minimize request latency
• Respond as rapidly as possible to client
Functional Reactive + Actor model
• Highly asynchronous, never block (!)
• Queue events / messages for complex work
• Heavy use of Scala / Akka and RxJava at KIXEYE
• (-) eBay uses highly synchronous model
• (-) Google uses complicated callback-based asynchronous model
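The "never block, queue messages for complex work" idea can be sketched in a few lines. This is an assumption-laden stand-in, not KIXEYE's code: Python's `asyncio` replaces Scala / Akka, and the names `worker`, `handle_request`, and `main` are hypothetical.

```python
import asyncio


async def worker(inbox: asyncio.Queue, results: list) -> None:
    """Actor-like loop: one mailbox, one message processed at a time."""
    while True:
        message = await inbox.get()
        await asyncio.sleep(0)  # stands in for non-blocking I/O or computation
        results.append(message.upper())
        inbox.task_done()


async def handle_request(inbox: asyncio.Queue, message: str) -> str:
    """Reply immediately; the complex work is only enqueued here."""
    inbox.put_nowait(message)
    return "accepted"


async def main():
    inbox: asyncio.Queue = asyncio.Queue()
    results: list = []
    task = asyncio.create_task(worker(inbox, results))
    replies = [await handle_request(inbox, m) for m in ("move", "attack", "build")]
    await inbox.join()  # wait until the mailbox has been drained
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return replies, results
```

The request path never waits on the work itself, so response latency stays independent of how expensive each message is to process.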
16. Client Liveness
Default to background processing
• Refresh assets
• Save client state
Client continues seamlessly if disconnected
• Parallel simulation on client and server
• Gameplay more important than constant synchronization
17. How to Scale
Scaling Code
Scaling Infrastructure
Scaling Performance
Scaling DevOps
18. Scalability and Performance
Measure, Measure, Measure
• Instrument everything: client, services, network, DB
• Measurement beats intuition every time
• My own intuition is usually wrong
Attack the first bottleneck
• Theory of Constraints: attacking *any* other problem does not improve throughput of the system
Repeat until performance is good enough
• "When you solve problem one, problem two gets a promotion"
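One way to act on "measure, then attack the first bottleneck" is to time each stage and let the numbers pick the target. The sketch below is illustrative only; the `StageTimer` class and the stage names are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the timer itself can be tested
        self.totals = defaultdict(float)

    @contextmanager
    def measure(self, stage: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            self.totals[stage] += self._clock() - start

    def bottleneck(self) -> str:
        """Return the stage with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)
```

Usage is a `with timer.measure("db"):` block around each stage. After fixing whatever `bottleneck()` reports, measuring again shows which stage has been promoted to problem one.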
19. Small Details Matter
In the very large, the very small matters a *lot*
• Subatomic physics and cosmology are interrelated
• Particles and forces at the subatomic level controlled formation and evolution of the entire universe
Discipline is deciding *which* details matter (!)
20. eBay Search Index Compression
Search Engine constrained by index size
• Smaller index size reduces memory, CPU, I/O
• Smaller index means fewer nodes, fewer shards
Inverted Index
• "Posting List": all occurrences of [term] in documents
• Monotonically increasing series of integers, traversed in order
→ Delta compression + Variable-byte encoding
• Store deltas, not absolute numbers
• Encode deltas so smaller numbers use fewer bits
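Delta compression with variable-byte encoding can be sketched as below. The slides do not describe eBay's actual byte layout, so this uses one common form as an assumption: 7 data bits per byte, low-order group first, with the high bit marking "more bytes follow". The function names are hypothetical.

```python
def encode_posting_list(doc_ids):
    """Delta-encode a sorted list of ids, then variable-byte encode each delta."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        delta = doc_id - previous
        if delta < 0:
            raise ValueError("document ids must be in non-decreasing order")
        previous = doc_id
        # Emit 7 bits at a time; set the high bit while more bytes follow.
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_posting_list(data):
    """Inverse of encode_posting_list: rebuild absolute ids from the deltas."""
    doc_ids = []
    current = 0
    delta = 0
    shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # continuation bit set: keep accumulating this delta
        else:
            current += delta
            doc_ids.append(current)
            delta = 0
            shift = 0
    if shift:
        raise ValueError("input ends in the middle of a value")
    return doc_ids
```

Because posting lists are traversed in order, the decoder never needs random access, and small gaps between neighbouring ids cost a single byte each.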
21. TOME Combat Server
Scalability limits in TOME combat server
• Unable to push single server beyond several hundred simultaneous players
• All system and OS-level measurements OK
• CPU, memory usage, I/O, threads, locking
• Needed to use CPU-level analyzer (Intel VTune)
Bottleneck: memory cache contention
• Multiple cores contending on L2 cache memory
• 40% scalability increase from six characters …
• static Foo; → const static Foo;
22. Measurement and Distributions
Gaussian ("Normal") distribution is *not* normal
• Applies only to quantities constrained on both sides, clustered around a mean
• E.g., adult height and weight
• Applies only to near-homogeneous populations
• E.g., adult male height in North America, vs. female, vs. China, etc.
23. Measurement and Distributions
Power Law ("Long Tail") distribution *much* more common
• Latency and performance measurements
• Popularity, income, human connections, etc.
• Minimum is 0; maximum is infinite
• The more you have, the more you get
24. Measurement and Distributions
Mean and Standard Deviation often misleading
• Encourages you to remove outliers, even though outliers represent the real problems (!)
• Encourages you to concentrate on the average case, not the worst case
• "Mean is meaningless"
→ Use percentiles instead (!)
• Can reasonably characterize any distribution
• Measure 90%ile, 99%ile, 99.9%ile
• Highlight and focus on the *real* problems
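To make the contrast between the mean and percentiles concrete, here is a small sketch. It assumes the nearest-rank definition of a percentile (one of several in common use), and the function names are hypothetical.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Report the mean next to the percentiles that expose the tail."""
    return {
        "mean": statistics.mean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p90": percentile(latencies_ms, 90),
        "p99": percentile(latencies_ms, 99),
    }
```

With 98 requests at 10 ms plus one at 500 ms and one at 2000 ms, the mean describes no real request, the median and 90th percentile both sit at 10 ms, and only the 99th percentile surfaces the slow requests that players actually notice.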
25. How to Scale
Scaling Code
Scaling Infrastructure
Scaling Performance
Scaling DevOps
26. Automate Everything
Humans are always at a premium
• Humans are too valuable for repetitive tasks
• Machines will happily do things over and over
Automated operations
• Provisioning
• Deployment
• Alerting
• Self-healing
27. Autoscaling
Games are very spiky
• Very unpredictable
• Huge variability between peak and trough
• Hits are self-reinforcing
Services and clients have to "flex"
• Clients back off in response to latency
• Services grow / shrink based on load
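The client half of "flexing" could look something like the sketch below: a client that widens its polling interval when the server is slow and narrows it again when latency recovers. The policy, thresholds, and the name `next_interval` are all assumptions for illustration; the slides do not specify KIXEYE's actual back-off rule.

```python
def next_interval(current, observed_latency, *, target_latency=0.2,
                  min_interval=1.0, max_interval=60.0):
    """Return the next polling interval in seconds.

    Doubles the interval when the last request was slower than the target,
    halves it when the server is healthy, and clamps to the allowed range.
    """
    if observed_latency > target_latency:
        proposed = current * 2  # server is struggling: back off
    else:
        proposed = current / 2  # server is healthy: recover toward the minimum
    return max(min_interval, min(max_interval, proposed))
```

Because every client applies the same rule independently, load on a struggling service falls without any central coordination.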
28. App Engine Autoscaling
Autoscaling as part of the Platform
• Gracefully handle spiky application load
• Maximize utilization of the infrastructure
World-class application scheduler
• Consider request rate, processing time, max wait time
• Also instance startup time, application budget
• Predictive model pre-provisions and proactively scales
• Reactive autoscaling in response to load
• Instantaneous autoscaling on request: spin up new instance(s) *while a request is coming in*
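The slides do not describe the App Engine scheduler's internals, so the following is only a back-of-the-envelope sketch of how request rate and processing time translate into an instance count, using Little's law (requests in flight = arrival rate × time in system). Every parameter name here is hypothetical; `max_instances` stands in for an application budget.

```python
import math


def instances_needed(request_rate, processing_time, *,
                     concurrency_per_instance=1, target_utilization=0.75,
                     min_instances=1, max_instances=100):
    """Estimate how many instances a given load requires.

    request_rate is in requests per second, processing_time in seconds.
    """
    in_flight = request_rate * processing_time  # Little's law
    raw = in_flight / (concurrency_per_instance * target_utilization)
    # Round up, then clamp between the floor and the budget ceiling.
    return max(min_instances, min(max_instances, math.ceil(raw)))
```

A reactive scaler would re-evaluate this as measured load changes, while a predictive one would feed it a forecast rate early enough to cover instance startup time.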
29. Google and DevOps
Ops Support is a privilege, not a right
• Developers carry pager for first 6+ months
• Service "graduates" to SRE after intensive review of monitoring, reliability, resilience, etc.
• SRE collaborates with service to move forward
Everyone's incentives are aligned
• Everyone is responsible for production
• Everyone strongly motivated to have solid instrumentation and monitoring
30. Recap: How to Scale
Scaling Code
Scaling Infrastructure
Scaling Performance
Scaling DevOps