Rate Limits at Scale SANS AppSec Las Vegas.
Rate limit everything, all the time, using a quantized time system with Memcache or Redis. Use this to protect resources or discover anomalies.
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
1. Rate-Limiting
at Scale
SANS AppSec Las Vegas 2012
Nick Galbreath @ngalbreath nickg@etsy.com
2. Who is Etsy? Nick?
• “Marketplace for Small Creative
Businesses”
• Alexa says #51 for USA traffic
• > $500MM transaction volume last year
• Billions and Billions of page views
• Nick Galbreath Director of Engineering
focusing on Security, Fraud, and other fun
stuff
3. What’s a Rate Limit?
Maximum number of events
per (brief) period per user
after which the resource is denied.
e.g. “no more than 2 logins per minute”
5. Robots gone Wild
• Robots / Crawlers (not always an intended
DDoS)
• 20,000 items in shopping cart
• spam attack!
• Can crush sites very quickly, at almost no
cost, especially when the crawl generates load
or writes to the database
6. Humans are Resources too
• Rate limits needed for anything that gets
reviewed by humans such as customer
service requests.
• CRMs are typically bad at dealing with
spammy stuff
7. Anything Involving
Money
• Without rate limits on credit card
authorizations your site becomes a card
skimmer site.
• Using a website is much easier than going
to the gas station pump or other
anonymous card reader
9. Do Rate Limits Stop all
Fraud? No, but...
• Eliminates false positives and punks
• Allows you to focus on more sophisticated
attacks
• Protects against damaging bursts of activity
(malicious or not)
10. Rate Limits are needed
on anything that
depends on an external
resource
This is almost everything!
16. Ouch!
• At scale, this is really painful for databases
to handle.
• Constant binary-tree index churn
• Use in-memory database (or run off
ramdisk) if trying this out
17. Quantized Rate Limits
• Stores a count in a time-window or bucket.
• Map current time to a bucket
• (int) (NOW()/period), e.g.
(int) (NOW()/3600) gives the hour bucket.
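The mapping above can be sketched in a few lines of Python. The function name `bucket_id` is hypothetical, and the timestamp is passed in explicitly so the example is deterministic.

```python
import time

def bucket_id(period, now=None):
    """Map a Unix timestamp to the index of its fixed-size time bucket."""
    if now is None:
        now = time.time()
    return int(now // period)

# Two timestamps in the same hour share a bucket; the next hour gets a new one.
assert bucket_id(3600, now=7200) == bucket_id(3600, now=10799) == 2
assert bucket_id(3600, now=10800) == 3
```

Because the bucket index is computed from the clock alone, no lookup is needed to find out which counter an event belongs to.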
19. Direct Lookup
• Everything is a primary key lookup.
userid-event-period-bucketid
60min: “nickg-login-3600-5589007547”
10min: “nickg-login-600-33534045284”
• Multiple time-frames require multiple
buckets, which means multiple inserting and
checking.
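As a rough illustration of the key layout above, the following Python sketch builds one key per time frame. The helper names `bucket_key` and `keys_for` are made up for this example, and the timestamp is arbitrary.

```python
def bucket_key(user_id, event, period, now):
    """Build the primary key 'userid-event-period-bucketid' for one time frame."""
    return f"{user_id}-{event}-{period}-{int(now // period)}"

def keys_for(user_id, event, periods, now):
    """One key per time frame; each must be incremented and checked separately."""
    return [bucket_key(user_id, event, p, now) for p in periods]

print(keys_for("nickg", "login", [3600, 600], now=1_000_000))
# ['nickg-login-3600-277', 'nickg-login-600-1666']
```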
20. Quantized RL Accuracy
Not exact.
If you set N per Period, quantized rate-limits
may go as high as:
(N-1) x 2 per Period.
e.g. 10 per minute --> 18 per minute
Yikes. Maths!
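The worst case comes from a burst that straddles a bucket boundary. The small Python simulation below reproduces the "10 per minute --> 18 per minute" figure. It assumes an event is denied once its bucket count reaches the limit, which matches the (N-1) x 2 bound; with a strict "more than N" comparison the bound would be 2N instead. The function name `allowed_events` is hypothetical.

```python
from collections import defaultdict

def allowed_events(timestamps, period, limit):
    """Return the timestamps a quantized limiter lets through.

    Assumption: an event is denied once its bucket's count reaches `limit`,
    so each bucket admits at most limit - 1 events.
    """
    counts = defaultdict(int)
    allowed = []
    for t in timestamps:
        bucket = int(t // period)
        counts[bucket] += 1
        if counts[bucket] < limit:
            allowed.append(t)
    return allowed

# A burst of 20 events just before the minute boundary, 20 just after.
burst = [59.0 + i / 100 for i in range(20)] + [60.0 + i / 100 for i in range(20)]
passed = allowed_events(burst, period=60, limit=10)
assert len(passed) == 18                 # (10 - 1) x 2
assert max(passed) - min(passed) < 60    # all inside one 60-second span
```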
22. Rate-Limits at Scale
• We traded exact accuracy and flexibility for
scaling.
• Implementation using Memcache or Redis
(and perhaps SQL)
set nickg-login-60-212331231 += 1
• Well known sharding techniques
• Auto-expiration of old buckets
• Each set/get takes 1/10 of a millisecond
or less. Almost invisible.
23. Memory
• Say 256 bytes per bucket
• 10,000,000 buckets is a lot of buckets
• But that is only about 2.5 GB, and fixed
• This is easy on one machine.
25. Please write unit tests!
• Easy to get wrong, and consequences can
be unpleasant
• Edge cases and race conditions
• memcache doesn’t have an “insert or
increment” operation. You need to do
multiple steps and check error
conditions.
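One common way to get "insert or increment" out of separate add and increment commands is sketched below. `InMemoryCache` is a hypothetical in-process stand-in, not a real client library; it only mimics the usual memcache semantics that add stores a key only when it is absent and incr fails on a missing key. Real client method names and return values vary, so treat the calls here as assumptions.

```python
class InMemoryCache:
    """Stand-in for a memcache client (hypothetical, not a real library).

    add() stores only if the key is absent; incr() fails on a missing key.
    The expire argument is accepted but expiry is not simulated.
    """
    def __init__(self):
        self.data = {}

    def add(self, key, value, expire=0):
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def incr(self, key, delta=1):
        if key not in self.data:
            return None
        self.data[key] += delta
        return self.data[key]

def insert_or_increment(cache, key, expire):
    """Return the new count for key, creating it at 1 if needed."""
    count = cache.incr(key)
    if count is not None:
        return count
    if cache.add(key, 1, expire):
        return 1
    # Lost a race: another client created the key between our incr and add.
    count = cache.incr(key)
    if count is None:
        raise RuntimeError("bucket vanished between add and incr")
    return count
```

The second incr is the step that is easy to forget, and it is the kind of edge case a unit test should cover.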
26. Please make an API
• Make it simple for anyone to add rate
limiting to their code.
• Make it one line
// event, period, max events
if (rate_limit_exceed("signin", 60, 5)) {
// do something
}
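A minimal Python sketch of what could sit behind such a one-line call is shown below. It is not Etsy's driver: the module-level dict stands in for Memcache or Redis, the explicit `user_id` and `now` parameters are additions for the example, and treating "exceeded" as "count greater than max events" is an assumption.

```python
import time

_counts = {}  # in-process stand-in for Memcache or Redis

def rate_limit_exceeded(user_id, event, period, max_events, now=None):
    """Count this event and report whether the user is over the limit."""
    if now is None:
        now = time.time()
    key = f"{user_id}-{event}-{period}-{int(now // period)}"
    _counts[key] = _counts.get(key, 0) + 1
    return _counts[key] > max_events

# Five sign-ins within a minute pass; the sixth trips the limit.
results = [rate_limit_exceeded("nickg", "signin", 60, 5, now=120) for _ in range(6)]
assert results == [False] * 5 + [True]
```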
27. Rollout
• Once in production, start with guesstimates
on rate limits
• If rate limit is triggered, take no action and
only log/graph
• Does volume match expectations?
• Wash, Rinse, Repeat until tuned
appropriately
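The log-only phase can be as simple as a flag on whatever handles a triggered limit. The sketch below is one hypothetical shape for it in Python; the function name `handle_limit` and the `enforce` flag are assumptions for illustration.

```python
import logging

log = logging.getLogger("ratelimit")

def handle_limit(event, user_id, exceeded, enforce=False):
    """Return True if the request should be blocked.

    With enforce=False (the rollout default here) a triggered limit is
    only logged, so volumes can be compared against expectations.
    """
    if not exceeded:
        return False
    log.warning("rate limit triggered: event=%s user=%s enforce=%s",
                event, user_id, enforce)
    return enforce
```

Flipping `enforce` to True per event, once the logged volume looks right, keeps tuning and enforcement as separate steps.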
28. oh yeah, don’t forget
Put your
rate-limit
datastore
behind the
firewall
29. So a user hit a rate
limit. Now what?
A dialog with product, customer service and engineering
• Do you let them know? (visible indicator)
• Do you start CAPTCHA-ing?
• Do you black hole it? (silent)
Also keep logging and graphing. You’ll need these
to debug when things go awry.
31. I feel bad if I don’t use a
graph in a presentation
[Graphs: CAPTCHA, Etsy API]
32. How we do it
• We use Graphite for real-time graphing
http://graphite.wikidot.com/
• We use StatsD as our API
http://etsy.me/dQwVXi
https://github.com/etsy/statsd
• Our apps do this
StatsD::increment('signins');
UDP based -- can’t break the application
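To show why a UDP-based counter cannot block or break the caller, here is a rough Python sketch of a counter increment in the StatsD line format (`name:value|c`). The function name `statsd_increment` and the default host are assumptions; 8125 is the usual StatsD port.

```python
import socket

def statsd_increment(name, host="127.0.0.1", port=8125):
    """Send a StatsD counter increment as a fire-and-forget UDP datagram."""
    payload = f"{name}:1|c".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    except OSError:
        pass  # metrics must never break the application
    finally:
        sock.close()

statsd_increment("signins")
```

No reply is awaited, so a missing or overloaded collector costs the application nothing beyond one datagram.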
33. Division Built-in!
Combine, Mix and Match data in Graphite to
discover new insights.
Seasonal data:
hard to alert on.
But the ratio of the two is
nearly constant:
easy to alert on.
Who knew that “1 in 5 logins
is a failure” is universal?!
p.s. Holt-Winters exponential smoothing is also built in
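Graphite does the division itself; the plain-Python sketch below only illustrates why the ratio is easier to alert on than the raw counts. The function names, the 0.2 baseline (taken from the "1 in 5" remark), the tolerance, and all sample numbers are illustrative assumptions, not measured data.

```python
def failure_ratio(failed, total):
    """Ratio of failed to total logins for one time slice (None if no traffic)."""
    return failed / total if total else None

def ratio_alert(failed, total, expected=0.2, tolerance=0.1):
    """True when the ratio drifts outside expected +/- tolerance."""
    ratio = failure_ratio(failed, total)
    return ratio is not None and abs(ratio - expected) > tolerance

# Illustrative numbers only: traffic differs 10x between slices, the ratio does not.
assert not ratio_alert(failed=200, total=1000)    # quiet hour, ratio 0.2
assert not ratio_alert(failed=2000, total=10000)  # busy hour, ratio 0.2
assert ratio_alert(failed=6000, total=10000)      # ratio 0.6: worth a look
```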
35. Laddering
• Use laddering to do rate limits at different
time scales for the same event.
• Set a short period and high rate to prevent
bursts
• Then set a longer period with a lower rate to
prevent slow-crawling robots.
36. Ladder longer periods
to have a smaller rate
Negative example:
2 per Minute ( ~0.033 events per sec )
or 2x60 = 120 per Hour
so laddering with
300 per Hour (~ 0.083 events per sec)
does nothing, but
100 per Hour (~ 0.028) is good.
oh no! the maths again!
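The arithmetic above is easy to automate. The Python sketch below flags ladder rungs that can never trigger because a shorter rung already caps the per-second rate. The function name `ineffective_rungs` is hypothetical, and the check ignores the extra slack that quantization allows.

```python
def ineffective_rungs(ladder):
    """Return rungs whose per-second rate is not lower than a shorter rung's.

    ladder: list of (period_seconds, max_events) tuples.
    """
    useless = []
    best_rate = None  # lowest per-second rate among the shorter, useful rungs
    for period, max_events in sorted(ladder):
        rate = max_events / period
        if best_rate is not None and rate >= best_rate:
            useless.append((period, max_events))
        else:
            best_rate = rate
    return useless

# 2 per minute with 300 per hour: the hourly rung does nothing.
assert ineffective_rungs([(60, 2), (3600, 300)]) == [(3600, 300)]
# 2 per minute with 100 per hour: both rungs are useful.
assert ineffective_rungs([(60, 2), (3600, 100)]) == []
```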
37. In Pictures...
Rate limit of “3 per 1 box” - ok
Rate Limit 5 per 3 boxes -- alert! (good)
but, say, rate limit 100 per 3 boxes does nothing
and is impossible to trigger
39. Anonymous Users
• hash of (IP + appropriate HTTP headers)
• order of headers matters
different browsers order them differently
• Spoofed user agents don’t always get the
order right
Different type of
Anonymous User
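A rough Python sketch of such a fingerprint follows. Which headers count as "appropriate" is not specified in the slides, so the choice here (User-Agent, Accept-Language, and Accept-Encoding values, plus the order of all header names) is an assumption, as is the function name `anonymous_id`.

```python
import hashlib

def anonymous_id(ip, headers):
    """Hash the client IP with header names in the order they arrived.

    headers: list of (name, value) pairs in wire order.
    """
    chosen = {"user-agent", "accept-language", "accept-encoding"}  # assumed set
    parts = [ip, ",".join(name.lower() for name, _ in headers)]
    parts += [f"{n.lower()}={v}" for n, v in headers if n.lower() in chosen]
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()

browser = [("Host", "example.com"), ("User-Agent", "X/1.0"), ("Accept", "*/*")]
spoofed = [("User-Agent", "X/1.0"), ("Host", "example.com"), ("Accept", "*/*")]
# Same IP and same User-Agent, but a different header order gives a different ID.
assert anonymous_id("192.0.2.1", browser) != anonymous_id("192.0.2.1", spoofed)
```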
40. Rate Limit Every IP?
• Probably just Class C (only 16M of them)
• Maybe useful for just alerting
• Probably need whitelisting (e.g. AOL)
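Collapsing an address to its Class C (/24) network before building the bucket key can be done with Python's standard `ipaddress` module, as in this small sketch. The function name `class_c_key` is hypothetical and IPv6 is not handled.

```python
import ipaddress

def class_c_key(ip):
    """Collapse an IPv4 address to its /24 network for use in a bucket key."""
    net = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(net.network_address)

assert class_c_key("192.0.2.77") == "192.0.2.0"
assert class_c_key("192.0.2.200") == class_c_key("192.0.2.1")
```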
41. Rate Limit Datacenters
http://github.com/client9/ipcat
Datacenter / Rent-A-Slice / “hands not on
keyboard” / leaseable CPU and network
How much traffic is coming
from them?
43. • Almost every action on Etsy has a laddered
rate limit
• We learn the hard way what is not limited
• Virtually no performance impact at scale
• Should we open source the driver?