Just In Time Scalability Agile Methods To Support Massive Growth Presentation

Just-In-Time Scalability: Agile Methods to Support Massive Growth

Behind the scenes...

IMVU is LAMP, plus...
• Perlbal
• Memcached
• Solr
• MogileFS
plus...
• ADODB
• b2evolution
• Audiere
• BuildBot
• Coppermine
• Boost
• eAccelerator
• feed2js
• Cal3D
• Linux (Debian)
• FreeTag
• CFL
• memcached
• Incutio XML-RPC
• NSIS
• Nagios
• jrcache
• Pixomatic
• Perl
• JSON-PHP
• Python
• Roundup
• Magpie
• pywin32
• rrd
• osCommerce
• SCons
• Subversion
• phpBB
• wxPython
• Phorum
• SimpleTest
• Selenium

Before and After Architecture

Before
We started with a small site, a mess of open source, and a small team that didn't know much about scaling.

After
We ended with a large site, a medium sized team, and an architecture that has scaled.

We never stopped. We used a roadmap and a compass, made
weekly changes in direction, regularly shipped code on
Wednesday to handle the next weekend's capacity constraints,
and shipped new features the whole time.

Before and After Architecture

[Architecture diagrams: November, December, February, May]

Advanced planning vs. fast response

“Rocket ship”
• Figure out in advance what is going to go wrong
• Build a plan that prevents those things from happening
• Execute your plan
• Get feedback when done

“Driving”
• Continuously figure out what is going to go wrong soon
• Quickly fix it, without breaking something else
• Get feedback along the way

Questions to ask

“Rocket ship”
• Are you sure you know what is going to happen?
• Are you sure you can execute?
• Can you afford it?
• Do you need feedback?

“Driving”
• How do you know you will be able to fix the problem in time?
• How can you be sure you won't cause collateral damage?
• How can you be sure you won't code yourself into a corner?

Continuous Ship
• Deploy new software quickly
• At IMVU, time from check-in to production = 20 minutes

• Tell a good change from a bad change (quickly)

• Revert a bad change quickly

• Work in small batches
• At IMVU, a large batch = 3 days' worth of work

• Break large projects down into small batches

• Don't have the same problem twice – fix the root cause of each
class of problems

IMVU pushes code to production 20-30 times every day

Cluster Immune System
What it looks like to ship one piece of code to production:
• Run tests locally (SimpleTest, Selenium)
o Everyone has a complete sandbox

• Continuous Integration Server (BuildBot)
o All tests must pass or “shut down the line”
o Automatic feedback if the team is going too fast

• Incremental deploy
o Monitor cluster and business metrics in real-time
o Reject changes that move metrics out-of-bounds

• Alerting & Predictive monitoring (Nagios)
o Monitor all metrics that stakeholders care about
o If any metric goes out-of-bounds, wake somebody up
o Use historical trends to predict acceptable bounds

When customers see a failure:
o Fix the problem for customers
o Improve your defenses at each level
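The incremental deploy and predictive monitoring steps above could be sketched roughly as follows. This is a minimal Python illustration, not IMVU's implementation: the deck does not say how bounds are predicted, so the mean plus or minus k standard deviations rule, the rollout stages, the metric names, and every function name here are assumptions.

```python
from statistics import mean, stdev


def acceptable_bounds(history, k=3.0):
    """Predict (low, high) bounds from past samples.

    Assumption: mean +/- k sample standard deviations; the deck does not
    specify the actual rule.
    """
    centre = mean(history)
    spread = stdev(history)
    return centre - k * spread, centre + k * spread


def out_of_bounds(history, current, k=3.0):
    """True if the current reading falls outside the predicted bounds."""
    low, high = acceptable_bounds(history, k)
    return not (low <= current <= high)


def incremental_deploy(metric_history, read_metrics, stages=(0.05, 0.5, 1.0)):
    """Roll out in stages; stop at the first metric that leaves its bounds.

    read_metrics(fraction) is a hypothetical callback that returns the
    current value of each metric once `fraction` of the cluster runs the
    new code.
    """
    for fraction in stages:
        for name, value in read_metrics(fraction).items():
            if out_of_bounds(metric_history[name], value):
                return ("reverted", fraction, name)
    return ("deployed", 1.0, None)
```

A caller would pass historical samples per metric (for example signups and errors per minute) and revert the change whenever the result starts with "reverted".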

Case Study: Sharding

Problem: Spread write queries across multiple databases

Solution:
• Intercept and redirect queries based on SQL comments
• Move one table or sub-system at a time
• Our experience was one engineer horizontally partitions one table or
small sub-system in one week

• New engineers figure this out in about 5 minutes
db_query("INSERT INTO inventory (customers_id, products_id)
VALUES ($customer_id, $product_id)");

db_query("/*shard customer://$customer_id */
INSERT INTO inventory (customers_id, products_id)
VALUES ($customer_id, $product_id)");

• Learning: cross shard joins & transactions aren’t required
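One way the query interception could work is sketched below in Python. The deck only shows the comment syntax, so everything else is an assumption: the regular expression, the shard host names, the default database, and in particular the modulo mapping from customer id to shard are illustrative, not IMVU's actual scheme.

```python
import re

# Matches comments such as: /*shard customer://42 */
SHARD_COMMENT = re.compile(r"/\*\s*shard\s+(\w+)://(\d+)\s*\*/")

# Hypothetical database hosts, for illustration only.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]
DEFAULT_DB = "db-main"


def route_query(sql):
    """Pick a database for a query based on its shard comment.

    Queries without a shard comment go to the default database, which is
    what lets tables be moved one at a time.
    """
    match = SHARD_COMMENT.search(sql)
    if match is None:
        return DEFAULT_DB
    shard_key = int(match.group(2))
    # Assumption: a simple modulo mapping; a lookup table would also work.
    return SHARDS[shard_key % len(SHARDS)]
```

Because the routing hint lives in the SQL text itself, call sites only change by one comment line, which fits the claim that new engineers pick it up in minutes.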

Case Study: Caching
Problem: Cache frequently read data to memcached

Solution:
• Intercept and cache queries based on SQL comments
db_query_cache(BUDDY_CACHE_TIME,
"/*shard customer://$customer_id */
/*cache-class customer://$customer_id/buddies */
SELECT friend_id, buddy_order FROM customers_friends
WHERE customers_id=$customer_id");

-----------------

db_query("/*shard customer://$customer_id */
DELETE FROM customers_friends
WHERE customers_id = $customer_id
AND friend_id = $friend_id");
db_flush_cacheclass("customer://$customer_id/buddies");

• Learning: Flushing cache critical to users and performance
– When a customer spends $24.95, they want the benefits immediately

• Learning: Test the cache behavior for critical systems
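The cache-class idea above can be illustrated with a small Python sketch. It reuses the names db_query_cache and db_flush_cacheclass from the slide, but the bodies are guesses: an in-process dict stands in for memcached, and the run_query and now parameters are hypothetical additions so the example is self-contained.

```python
import re
import time

# Matches comments such as: /*cache-class customer://7/buddies */
CACHE_CLASS = re.compile(r"/\*\s*cache-class\s+(\S+)\s*\*/")

# Stand-in for memcached: cache class -> {sql: (expires_at, rows)}
_cache = {}


def db_query_cache(ttl_seconds, sql, run_query, now=time.time):
    """Return cached rows for sql, or run the query and cache the result.

    run_query is a hypothetical callable that executes the SQL.
    Queries without a cache-class comment are never cached.
    """
    match = CACHE_CLASS.search(sql)
    if match is None:
        return run_query(sql)
    entries = _cache.setdefault(match.group(1), {})
    hit = entries.get(sql)
    if hit is not None and hit[0] > now():
        return hit[1]
    rows = run_query(sql)
    entries[sql] = (now() + ttl_seconds, rows)
    return rows


def db_flush_cacheclass(cache_class):
    """Drop every cached result that belongs to the given cache class."""
    _cache.pop(cache_class, None)
```

Grouping entries by class means a write only has to name the class it invalidates, not every query that might have cached the affected rows.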

Case Study: Steering Data Design

Problem: Improve database schemas and data design to meet
scalability requirements without downtime

Solution:
• Measure to find the real problems (harder than it sounds)
• Migrate to new design that takes advantage of sharding and/or
caching

Problem: You can’t bulk move large, frequently accessed data
Solution:
• Copy on read
– Use when you are read bound
– Reads check the cache, then the new location, and copy to the new location if missing
– Writes go to the new location if the data has been migrated, otherwise to the old

• Copy on write
– Use when you are write bound
– Reads check the cache, then the new location, then the old location
– Writes go to the new location, copying to the new location if missing

• Copy all
– Use when the file system fills up
– Reads & writes go to the new location, falling back to the old location if missing
– Cron copies data a few records at a time
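The copy-on-read strategy above might look like the following Python sketch. It covers only that one strategy, uses plain dicts as stand-ins for the old location, the new location, and the cache, and the class and method names are hypothetical.

```python
class CopyOnReadStore:
    """Migrates records lazily: a record moves the first time it is read."""

    def __init__(self, old, new, cache=None):
        self.old = old      # old location (dict stand-in)
        self.new = new      # new location (dict stand-in)
        self.cache = {} if cache is None else cache

    def read(self, key):
        # 1. Check the cache first.
        if key in self.cache:
            return self.cache[key]
        # 2. Check the new location; copy from the old one if missing.
        if key not in self.new and key in self.old:
            self.new[key] = self.old[key]
        value = self.new.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        # Writes go to the new location only if the record has migrated.
        target = self.new if key in self.new else self.old
        target[key] = value
        # Flush the cached copy so readers see the write immediately.
        self.cache.pop(key, None)
```

Once a record has migrated, the old copy is no longer updated, so the old location can be retired after every record has been read at least once or swept by a separate copy job.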

“Thank You for Listening!”
