In this talk, Appboy co-founder and CIO Jon Hyman will discuss various schemas that Appboy has evolved to use on MongoDB, remaining agile as Appboy has grown to massive scale. Jon will discuss topics such as random sampling of documents, multivariate testing and multi-arm bandit optimization of such tests, field tokenization, and how Appboy stores multi-dimensional data on an individual user basis to be able to quickly optimize for the best time to deliver messages to end users. Appboy is the global leader in Marketing Automation for Apps, helping clients such as Urban Outfitters, Shutterfly, Kixeye, PicsArt, USA Today Sports, and iHeartRadio increase engagement through automated messaging. Each month, Appboy collects tens of billions of data points from hundreds of millions of monthly active users.
5. • Prior to 2013, scaled vertically
• Sharded in Q2 2013
• Added write buffering with Redis (transactional)
• In 2014, started splitting out collections to more clusters
• By MongoDB World 2014, Appboy handled over 4 billion data points per month
Appboy’s growth on MongoDB
MongoDB World 2014 Recap
7. • Approximately 22 billion events per month
• Handling spikes of 2B+ events per day
• We anticipate tracking over 1B unique users in Q3
• 11 clusters, over 160 shards
Appboy’s growth on MongoDB
Appboy’s Growth in 2015
8. • Statistical analysis in read queries
• Random rate limiting and A/B testing
• Flexible schemas, tokenizing field names
• Schemas for data intensive algorithms at Appboy
Agenda
Today at MongoDB World 2015!
10. A group of users who match some set of filters.
User Segmentation
11. Appboy shows you segment membership in real-time as
you add/edit/remove filters.
How do we do it quickly?
We estimate the population sizes of segments when using
our web UI.
Counting Quickly
12. Goal: Quickly get the count() of an arbitrary query
Problem: MongoDB counts are slow, especially
unindexed ones
Counting Quickly
14. 10 million documents that represent people:
• How many people like blue?
• How many live in NYC and love pizza?
• How many men have a shoe size less than 10?
{
favorite_color: “blue”,
age: 29,
gender: “M”,
favorite_food: “pizza”,
city: “NYC”,
shoe_size: 11,
attractiveness: 10,
...
}
Counting Quickly
15. Big Question:
How do you estimate
counts?
Answer:
The same way news
networks do it.
With confidence.
16. Add a random number in a known range to each document.
Say, between 0 and 9999:
{
random: 4583,
favorite_color: “blue”,
age: 29,
gender: “M”,
favorite_food: “pizza”,
city: “NYC”,
shoe_size: 11,
attractiveness: 10,
...
}
Then add an index on the random number:
db.users.ensureIndex({random:1})
Counting Quickly
17. Step 1: Get a random sample
I have 10 million documents. Of my 10,000 random “buckets”,
I should expect each “bucket” to hold about 1,000 users.
E.g.,
db.users.find({random: 123}).count() == ~1000
db.users.find({random: 9043}).count() == ~1000
db.users.find({random: 4982}).count() == ~1000
Counting Quickly
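A minimal Python sketch (not Appboy's code) of why the buckets behave this way: assigning each user a uniform random value spreads the population almost evenly across buckets, so each bucket holds roughly population/10,000 users. Sizes here are scaled down from the slide's 10M for speed.

```python
import random

random.seed(42)  # deterministic for illustration

NUM_USERS = 1_000_000   # scaled down from the slide's 10 million
NUM_BUCKETS = 10_000    # "random" values 0..9999

# Assign each user a uniform random bucket, as the documents above do.
buckets = [0] * NUM_BUCKETS
for _ in range(NUM_USERS):
    buckets[random.randrange(NUM_BUCKETS)] += 1

expected = NUM_USERS / NUM_BUCKETS  # ~100 users per bucket at this scale
print(min(buckets), max(buckets), expected)
```

Every bucket lands close to the expected size, which is what makes any bucket (or run of buckets) a fair random sample.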
18. Step 1: Get a random sample
Let’s take a random 100,000 users. Grab a random range that
“holds” those users. These all work:
Tip: Limit $maxScan to 100,000 just to be safe
db.users.find({random: {$gt: 0, $lt: 101}})
db.users.find({random: {$gt: 503, $lt: 604}})
db.users.find({random: {$gt: 8938, $lt: 9039}})
db.users.find({$or: [
{random: {$gt: 9955}},
{random: {$lt: 56}}
]})
Counting Quickly
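One way to generate these range queries programmatically, a sketch assuming the slides' exclusive $gt/$lt bounds (the function name and shape are illustrative, not Appboy's code):

```python
def range_query(start, width=100, num_buckets=10_000):
    """MongoDB-style query selecting `width` buckets beginning at
    bucket `start`, wrapping past num_buckets - 1 with $or when the
    range runs off the end (as in the slide's last example)."""
    if start + width <= num_buckets:
        return {"random": {"$gt": start - 1, "$lt": start + width}}
    # Wraparound: a high tail plus a low head, e.g. {$gt: 9955} or {$lt: 56}
    overflow = start + width - num_buckets
    return {"$or": [{"random": {"$gt": start - 1}},
                    {"random": {"$lt": overflow}}]}

print(range_query(9956))
```

Any random `start` yields exactly 100 buckets, so every query above samples the same fraction of users.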
20. Step 3: Do the math
Population: 10,000,000
Sample size: 100,000
Num matches: 11,302
Percentage of users who matched: 11.3%
Estimated total count: 1,130,000 +/- 0.2% with
95% confidence
Counting Quickly
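The math on this slide is the standard normal approximation for a proportion; it can be checked in a few lines:

```python
import math

population = 10_000_000
sample_size = 100_000
num_matches = 11_302

p_hat = num_matches / sample_size              # 0.11302
estimate = round(p_hat * population)           # extrapolate to the population
# 95% confidence interval for a proportion (normal approximation)
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
print(f"{estimate:,} +/- {margin:.1%}")        # → 1,130,200 +/- 0.2%
```

The large sample (100k) is what drives the margin down to ±0.2%; a 10k sample would widen it to roughly ±0.6%.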
21. Step 4: Optimize
• Limit $maxScan to (100,000/numShards) to be even faster
• Cache the random range for a few hours (keep sample set warm)
• Add more RAM (or shards)
• Cache results to not hit the database for the same query
• Don’t use explain(). To get more than one count, run the
aggregation framework over the sample instead of issuing separate count queries
Counting Quickly
22. Counting Quickly
Goal is to handle scale,
do things that work for any size user base
Random sampling is a good way to do this
24. • Want to send different messages to users in a cohort and measure
against a control (a set of users in the cohort who do not receive any
message)
• Who receives the message should be random
• If you have 1M users and want to send a test to 50k, want to select a
random 50k (and another random 50k for control)
• If you target the same 1M user cohort with 50k test sizes, different users
should be in each test
• Generically, this is the same as “random rate limiting”
• If you wanted to limit delivery to 50k, who receives it should be random
A/B Testing
• Parallel workers process users
across different “random” ranges
• Be sure to handle all “random”
values (for apps with fewer than
10,000 users)
• Keep track of global rate limited
state to know when to stop
processing
• Users randomly receive variations
based on send probability (more on
this later), also randomly chosen to
be in control
Randomly scan and select users based on “random” value
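One way to hand out those "random" ranges to parallel workers, sketched in Python (the global rate-limit check the slide mentions is omitted; it would be a shared counter consulted between batches):

```python
def partition_buckets(num_workers, num_buckets=10_000):
    """Split the full 0..num_buckets-1 "random" space into contiguous
    chunks, one per parallel worker.  Covering every value matters for
    apps with fewer than 10,000 users, where most buckets are empty."""
    size, extra = divmod(num_buckets, num_workers)
    ranges, start = [], 0
    for i in range(num_workers):
        end = start + size + (1 if i < extra else 0)
        ranges.append(range(start, end))
        start = end
    return ranges

ranges = partition_buckets(3)
```

Each worker scans only its own range, so no user is processed twice and no bucket is skipped.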
29. • Use statistical analysis to look at random user samples based on
“random” value
• A/B tests send on random users based on “random” value
• Overloading one “random” value biases retargeting: store
another “random” value and use different ones for each use case
Statistical Sampling and A/B Testing
34. {
first_name: “Sherika”,
email: “sherika+demo@appboy.com”,
dob: 1994-10-24,
gender: “F”,
custom: {
brands_purchased: “Puma and Asics”,
credit_card_holder: true,
shoe_size: 11,
...
},
...
}
Custom attributes can go alongside other fields!
db.users.update(…, {$set: {“custom.loyalty_program”: true}})
Extensible User Profiles
35. • Easily extensible to add any number of fields
• Don’t need to worry about type (bool, string, integer, float, etc.):
MongoDB handles it all
• Can do atomic operations like $inc easily
• Easily queryable, no need to do complicated joins against the right
value column
• Can take up a lot of space
“this_is_my_really_long_custom_attribute_name_weeeeeee”
• Can end up with mismatched types across documents
{ visited_website: true }
{ visited_website: “yes” }
Pros
Cons
Extensible User Profiles
36. Space Concern
Tokenize values, use a field map:
{
first_name: “Sherika”,
email: “sherika+demo@appboy.com”,
dob: 1994-10-24,
gender: “F”,
custom: {
0: true,
1: 11,
2: “Alex & Ani”,
...
},
...
}
{
loyalty_program: 0,
shoe_size: 1,
favorite_brand: 2
}
You should also limit the length of values
Extensible User Profiles - How to Improve the Cons
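A minimal sketch of the tokenization above (class and method names are illustrative, not Appboy's implementation): long attribute names live once in the field map, and user documents carry only small integer keys.

```python
class FieldMap:
    """Maps long custom attribute names to small integer tokens."""
    def __init__(self):
        self.names = []          # index in this list is the token
        self.tokens = {}         # name -> token, for fast lookup

    def token_for(self, name):
        if name not in self.tokens:          # first sighting: assign next token
            self.tokens[name] = len(self.names)
            self.names.append(name)
        return self.tokens[name]

    def encode(self, custom):
        """Replace attribute names with tokens before writing a user doc."""
        return {self.token_for(k): v for k, v in custom.items()}

    def decode(self, tokenized):
        """Restore readable names when reading a user doc back."""
        return {self.names[t]: v for t, v in tokenized.items()}

fm = FieldMap()
doc = fm.encode({"loyalty_program": True, "shoe_size": 11,
                 "favorite_brand": "Alex & Ani"})
```

`doc` here is `{0: True, 1: 11, 2: "Alex & Ani"}`, matching the slide's example.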
37. Type Constraints
Handle in the client, store expected types in a map and
coerce/reject bad values
{
loyalty_program: Boolean,
shoe_size: Integer,
favorite_brand: String
}
(also need a map for display names of fields…)
Extensible User Profiles - How to Improve the Cons
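A client-side sketch of the coerce/reject step, under the assumption that the type map is keyed by attribute name (real code would also record a new field's type the first time it is seen):

```python
EXPECTED_TYPES = {           # hypothetical per-app type map
    "loyalty_program": bool,
    "visited_website": bool,
    "shoe_size": int,
    "favorite_brand": str,
}

def coerce(name, value):
    """Coerce a custom attribute to its expected type, or raise.
    Unknown attributes pass through untouched."""
    expected = EXPECTED_TYPES.get(name)
    if expected is None or isinstance(value, expected):
        return value
    if expected is bool:                     # common client mistakes
        if value in ("yes", "true", 1):
            return True
        if value in ("no", "false", 0):
            return False
    try:
        return expected(value)
    except (TypeError, ValueError):
        raise ValueError(f"{name}: cannot coerce {value!r} to {expected.__name__}")
```

This is exactly the mismatch from the earlier slide: `{visited_website: "yes"}` gets normalized to `{visited_website: true}` before it ever reaches the database.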
38. • Use arrays to store items in map, index in array is “token”
• 1+ document per customer that has array field list
• Atomically push new custom attribute to end of array, get
index (“token”) and cache value for fast retrieval later
Field Map
[“Loyalty Program”, “Shoe Size”, “Favorite Color”]
0 1 2
39. • Avoid document growing unbounded
• We cap how many array elements we store before
generating a new document (say, 100)
• Have a field least_value in each document that represents the
token value of index 0 in its “list”
• $push only if list.99 does not exist; if the push fails, use
findAndModify to create a new document atomically and retry the $push
Field Map
[“Loyalty Program”, “Shoe Size”, “Favorite Color”]
100 101 102
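The token arithmetic with capped documents is simple; a sketch assuming each field-map document holds at most 100 names, so document n has least_value n*100 (helper names are illustrative):

```python
def token_for_index(least_value, index):
    """Global token for the attribute at `index` in a field-map
    document whose first element carries token `least_value`."""
    return least_value + index

def locate(token, cap=100):
    """Inverse lookup: which field-map document holds `token`, and at
    which array index, when each document stores at most `cap` names."""
    return divmod(token, cap)  # -> (document number, index in its list)
```

So "Favorite Color" in the second document above is token 100 + 2 = 102, and token 102 resolves back to document 1, index 2.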
40. • Adds indirection and complexity, but worth it
• Small field name size in each document
• Compression in WiredTiger makes this not an issue anymore
from storage perspective, but still has benefits for field names
• Easy identifiers to pass around in code for custom attributes
Field Map Summary
43. • Appboy customers run multivariate tests of message
campaigns for a long duration
• Goal is to, in the shortest period of time, find the variation
which we are statistically certain provides the highest
conversion
• Customers check in on results and make determination
Multivariate Testing
45. Think of it like you are at a row of slot machines, each has a
random reward across a specific distribution not known in
advance. Need to maximize reward.
Multi-arm Bandit Multivariate Testing
"Las Vegas slot machines". Licensed under CC BY-SA 3.0 via Wikipedia
http://en.wikipedia.org/wiki/File:Las_Vegas_slot_machines.jpg#/media/File:Las_Vegas_slot_machines.jpg
46. “[The bandit problem] was formulated during the [second world]
war, and efforts to solve it so sapped the energies and minds of
Allied analysts that the suggestion was made that the problem
be dropped over Germany, as the ultimate instrument of
intellectual sabotage.”
Multi-arm Bandit Multivariate Testing
- Peter Whittle, 1979
47. Appboy inspired by a paper from U. Chicago Booth:
http://faculty.chicagobooth.edu/workshops/marketing/pdf/pdf/ExperimentsInTheServiceEconomy.pdf
“Multi-armed bandit experiments in the online service economy”
Steven L. Scott, Harvard PhD., Senior Economic Analyst at Google
Multi-arm Bandit Multivariate Testing
48. • Twice per day, Appboy will automatically go in and optimize
send distributions for each variation using algorithm
• Requires a lot of observed data
• For each variation:
• Unique recipients who received it
• Conversion rate
• Timeseries of this data
Multi-arm Bandit Multivariate Testing
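The Scott paper the previous slide cites uses Thompson sampling: draw a plausible conversion rate for each variation from its Beta posterior and allocate sends in proportion to how often each variation "wins" the draw. A sketch of that approach using exactly the observed data listed above (Appboy's production algorithm may differ in details):

```python
import random

random.seed(7)  # deterministic for illustration

def thompson_weights(variations, num_draws=10_000, rng=random):
    """Estimate each variation's probability of being the best arm.
    `variations` maps name -> (unique_recipients, conversions)."""
    wins = {name: 0 for name in variations}
    for _ in range(num_draws):
        best, best_rate = None, -1.0
        for name, (recipients, conversions) in variations.items():
            # Beta(1 + successes, 1 + failures) posterior on the rate
            rate = rng.betavariate(1 + conversions,
                                   1 + recipients - conversions)
            if rate > best_rate:
                best, best_rate = name, rate
        wins[best] += 1
    return {name: w / num_draws for name, w in wins.items()}

weights = thompson_weights({
    "variation_1": (100_000, 5_000),   # 5.0% conversion
    "variation_2": (100_000, 5_400),   # 5.4% conversion
})
```

With this much observed data the posteriors barely overlap, so nearly all future sends shift to variation 2; with thin data the weights stay spread out, which is the exploration half of the bandit.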
50. {
company_id: BSON::ObjectId,
campaign_id: BSON::ObjectId,
date: 2015-05-31,
message_variation_1: {
unique_recipient_count: 100000,
total_conversion_count: 5000,
total_open_rate: 8000,
hourly_breakdown: {
0: {
unique_recipient_count: 1000,
total_conversion_count: 40,
total_open_rate: 125,
...
},
...
},
...
},
message_variation_2: {
...
}
}
• Pre-aggregated stats let us
pull back the entirety of an
experiment extremely quickly
• Shard on company ID so we
can pull back all their
campaigns at once and
optimize together
• Pre-aggregated stats need
special care to build to avoid
write overload
Multi-arm Bandit Multivariate Testing
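The "special care" on writes usually means incrementing counters in place with dot-notation paths rather than rewriting the document. A sketch of building such an update spec against the daily stats schema above (the helper is illustrative):

```python
from datetime import datetime, timezone

def conversion_inc(variation, when):
    """Build the $inc spec for recording one conversion in the
    pre-aggregated daily stats document, bumping both the daily total
    and the matching hourly_breakdown bucket in a single update."""
    prefix = f"message_variation_{variation}"
    return {"$inc": {
        f"{prefix}.total_conversion_count": 1,
        f"{prefix}.hourly_breakdown.{when.hour}.total_conversion_count": 1,
    }}

spec = conversion_inc(1, datetime(2015, 5, 31, 14, 30, tzinfo=timezone.utc))
```

Each event is then one small, atomic in-place write, which is what keeps a hot stats document from becoming a write bottleneck.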
52. • Appboy analyzes the optimal time to
send a message to a user
• If Alice is more likely to engage at
night and Bob in the morning, they’ll
get notifications at those windows
“Comparing overall open rates before and after using it, we've seen over 100% improvement
in performance. Our one week retention campaigns targeted at male Urban On members
improved 138%. Additionally, engaging a particularly difficult segment, users who have been
inactive for three months, has improved 94%.”
- Jim Davis, Director of CRM and Interactive Marketing at Urban Outfitters
Intelligent Delivery
53. • Algorithm is data-intensive on a per-user basis
• Appboy Intelligent Delivery sends tens to hundreds of
millions of messages each day, need to compute optimal
time on a per-user basis quickly
Intelligent Delivery
55. {
_id: BSON::ObjectId of user,
dimension_1: [DateTime, DateTime, …],
dimension_2: [DateTime, DateTime, …],
dimension_3: [DateTime, DateTime, …],
dimension_4: [Float, Float, …],
dimension_5: […],
}
• When dimensional data for a user comes in, record a copy of it in a document
• Shard on {_id: “hashed”} for optimal distribution across shards and best write
throughput
• When needing to Intelligently Deliver to a user, query back one document to
get all the data to input into the algorithm. This is super fast.
• MongoDB’s flexible schemas make adding new dimensions trivial
Intelligent Delivery
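A deliberately naive stand-in for the per-user computation, assuming one of the DateTime dimension arrays from the document above (the real algorithm combines multiple dimensions, and the default hour here is an invented fallback):

```python
from collections import Counter
from datetime import datetime

def best_send_hour(engagement_times, default_hour=19):
    """Pick the hour of day at which this user engages most often,
    given one DateTime array from the per-user document above."""
    if not engagement_times:
        return default_hour      # no data yet: fall back to a default
    hours = Counter(t.hour for t in engagement_times)
    return hours.most_common(1)[0][0]

# A user like Alice, who engages in the evenings
alice = [datetime(2015, 5, d, 21, 0) for d in range(1, 6)]
```

Because all of a user's dimensional data lives in one document, this whole computation needs just one query per user, which is what makes it fast at send time.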
56. • Consolidating data for fast
retrieval is a huge win
• MongoDB’s flexible schemas
make this possible
• Choose the right shard key for
the document access pattern
• (Not a catch-all; be sure to still
store the raw, non-pre-aggregated data)
Data Intensive Algorithms Summary