3. Braze empowers you to
humanize your brand’s
relationships with your
customers at scale.
1 Trillion DATA POINTS PROCESSED PER QUARTER
1B+ MESSAGES SENT DAILY
1.6B MONTHLY ACTIVE USERS
5. We saw high CPU utilization on an API layer for one of our clusters
6. Throughput sampled at ~33%, response time was ~5x normal
This computation was taking up most of the API call
7. Triage
• Our on-call engineer increased the API
autoscale group server count per runbook
• Starting with 57 c4.4xlarge servers, we
added capacity to try to restore Apdex
• Despite 123 more servers ($71,669/mo
additional cost!), Apdex did not recover
• Adding more servers made things worse
• API continued to throw errors
8. Braze in-app messaging architecture
• Braze SDKs have a business rule engine for when to
show in-app messages (“IAM”s)
• The client requests IAMs from the API when the
app opens, for that session
• The API reads possible IAMs from the database or
Memcached
• The API computes IAM target criteria against user
profile and stores calculated target criteria in
Memcached with a TTL of 90 seconds
• The API returns a set of possible IAMs to the client
device
[Diagram: Client Device (User 123) requests IAMs from the API Servers, which read from the Database and the Cache]
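The read path described above can be sketched as a cache-aside lookup with a 90-second TTL. This is a minimal illustration, not Braze's code: `TTLCache` is a hypothetical in-memory stand-in for Memcached, and `get_targeting`, the cache key, and the `compute` callback are assumed names.

```python
import time
from typing import Callable, Dict, Optional, Tuple

TARGETING_TTL_SECONDS = 90  # TTL stated on the slide


class TTLCache:
    """Hypothetical in-memory stand-in for Memcached (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str) -> Optional[object]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry has outlived its TTL: treat it as a miss.
            del self._data[key]
            return None
        return value

    def set(self, key: str, value: object, ttl: float) -> None:
        self._data[key] = (self._clock() + ttl, value)


def get_targeting(cache: TTLCache, key: str,
                  compute: Callable[[], object]) -> object:
    """Return the cached targeting result, recomputing it on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    # Cache miss: every caller that reaches this point pays the full
    # computation cost, which is what makes expiry dangerous under load.
    value = compute()
    cache.set(key, value, TARGETING_TTL_SECONDS)
    return value
```

Note that nothing here coordinates concurrent misses: if many requests arrive just after the entry expires, each one runs `compute()` on its own.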
9. 14 seconds to compute?!?!
Happening ~6k times every 90 seconds?!?!
10. What was going on?
• High volume of API requests (~20,000/second)
• The customer had added a lot of new IAMs with sophisticated targeting
rules
• Every 90 seconds, ~6,000 API calls took 14 seconds to complete
• Cache stampede (thundering herd): once the cache entry expired, ~6,000
requests immediately attempted to repopulate it
• Computation is CPU-intensive
• Of course this won’t scale!
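A rough back-of-the-envelope check shows why this could not scale. It assumes, pessimistically, that each of the ~6,000 stampeding calls keeps one core busy for its full 14 seconds (part of that time may really have been queueing under contention), and it uses the published c4.4xlarge size of 16 vCPUs. The variable names are illustrative.

```python
STAMPEDE_CALLS = 6_000      # calls that recompute on each expiry (from the slides)
SECONDS_PER_CALL = 14       # observed duration of each call (from the slides)
WINDOW_SECONDS = 90         # cache TTL, so the stampede repeats every 90 s
SERVERS = 57                # original fleet size (from the slides)
VCPUS_PER_SERVER = 16       # c4.4xlarge instance size

# Assumption: each call is CPU-bound for its whole duration.
cpu_seconds_per_window = STAMPEDE_CALLS * SECONDS_PER_CALL
cores_busy_on_average = cpu_seconds_per_window / WINDOW_SECONDS
fleet_vcpus = SERVERS * VCPUS_PER_SERVER

print(cpu_seconds_per_window)         # 84000
print(round(cores_busy_on_average))   # 933
print(fleet_vcpus)                    # 912
```

Under these assumptions the redundant recomputation alone demands about as many cores as the entire original fleet provides, before serving any of the regular ~20,000 requests per second.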
12. Redis cache control
• We used Redis to control a refresh of the cache using SETNX
locks
• We extended the Memcached TTL to 180 seconds, with one process
refreshing the cache every 90 seconds
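One way the lock-controlled refresh could look is sketched below. This is not the code from the linked repository: `FakeStore` is a hypothetical in-memory stand-in for both Redis and Memcached so the example runs without servers, and `fetch_targeting` and the `lock:` key prefix are assumed names. The lock call mirrors the shape of redis-py's `set(name, value, ex=..., nx=True)`, which is the atomic equivalent of SETNX plus an expiry.

```python
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 180        # extended Memcached TTL (from the slides)
REFRESH_INTERVAL_SECONDS = 90  # one refresh per 90 s (from the slides)


class FakeStore:
    """Hypothetical stand-in for Redis/Memcached: expiry and set-if-absent only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, object]] = {}

    def get(self, key: str) -> Optional[object]:
        entry = self._data.get(key)
        if entry is None or self._clock() >= entry[0]:
            self._data.pop(key, None)  # drop expired entries
            return None
        return entry[1]

    def set(self, key: str, value: object, ex: float,
            nx: bool = False) -> Optional[bool]:
        # With nx=True, behave like SET ... NX: refuse to overwrite a live key.
        if nx and self.get(key) is not None:
            return None
        self._data[key] = (self._clock() + ex, value)
        return True


def fetch_targeting(locks: FakeStore, cache: FakeStore, key: str,
                    compute: Callable[[], object]) -> Optional[object]:
    """Serve from cache; let only the lock winner recompute."""
    # The lock expires after 90 s, so at most one caller per window wins it.
    if locks.set(f"lock:{key}", "1", ex=REFRESH_INTERVAL_SECONDS, nx=True):
        value = compute()
        cache.set(key, value, ex=CACHE_TTL_SECONDS)
        return value
    # Everyone else reads the existing entry, which is still valid because
    # its 180 s TTL outlasts the 90 s refresh interval.
    # Assumption: on a completely cold cache this returns None.
    return cache.get(key)
```

Because the cached entry lives twice as long as the refresh interval, readers keep getting the previous value while the single lock holder performs the expensive recomputation, so the entry should not expire under steady traffic.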
Full code available at https://github.com/jonhyman/redisconf2019