In this talk, Appboy co-founder and CIO Jon Hyman will discuss various schemas that Appboy has evolved to use on MongoDB, remaining agile as Appboy has grown to massive scale. Jon will discuss topics such as random sampling of documents, multivariate testing and multi-arm bandit optimization of such tests, field tokenization, and how Appboy stores multi-dimensional data on an individual user basis to be able to quickly optimize for the best time to deliver messages to end users. Appboy is the global leader in Marketing Automation for Apps, helping clients such as Urban Outfitters, Shutterfly, Kixeye, PicsArt, USA Today Sports, and iHeartRadio increase engagement through automated messaging. Each month, Appboy collects tens of billions of data points from hundreds of millions of monthly active users.
5. • Prior to 2013, scaled vertically
• Sharded in Q2 2013
• Added write buffering with Redis (transactional)
• In 2014, started splitting out collections to more clusters
• By MongoDB World 2014, Appboy handled over 4 billion data points per month
Appboy’s growth on MongoDB
MongoDB World 2014 Recap
7. • Approximately 22 billion events per month
• Handling spikes of 2B+ events per day
• We anticipate tracking over 1B unique users in Q3
• 11 clusters, over 160 shards
Appboy’s growth on MongoDB
Appboy’s Growth in 2015
8. • Statistical analysis in read queries
• Random rate limiting and A/B testing
• Flexible schemas, tokenizing field names
• Schemas for data intensive algorithms at Appboy
Agenda
Today at MongoDB World 2015!
10. A group of users who match some set of filters.
User Segmentation
11. Appboy shows you segment membership in real-time as
you add/edit/remove filters.
How do we do it quickly?
We estimate the population sizes of segments when using
our web UI.
Counting Quickly
12. Goal: Quickly get the count() of an arbitrary query
Problem: MongoDB counts are slow, especially
unindexed ones
Counting Quickly
14. 10 million documents that represent people:
• How many people like blue?
• How many live in NYC and love pizza?
• How many men have a shoe size less than 10?
{
favorite_color: “blue”,
age: 29,
gender: “M”,
favorite_food: “pizza”,
city: “NYC”,
shoe_size: 11,
attractiveness: 10,
...
}
Counting Quickly
15. Big Question:
How do you estimate
counts?
Answer:
The same way news
networks do it.
With confidence.
16. Add a random number in a known range to each document.
Say, between 0 and 9999:
{
random: 4583,
favorite_color: “blue”,
age: 29,
gender: “M”,
favorite_food: “pizza”,
city: “NYC”,
shoe_size: 11,
attractiveness: 10,
...
}
Then add an index on the random number:
db.users.ensureIndex({random:1})
Counting Quickly
17. Step 1: Get a random sample
I have 10 million documents. Of my 10,000 random “buckets”,
I should expect each “bucket” to hold about 1,000 users.
E.g.,
db.users.find({random: 123}).count() == ~1000
db.users.find({random: 9043}).count() == ~1000
db.users.find({random: 4982}).count() == ~1000
Counting Quickly
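A minimal Python sketch (not Appboy's code) of why the buckets behave this way: assigning each user a uniform random value spreads the population almost evenly across buckets, so each bucket holds roughly population/10,000 users. Sizes here are scaled down from the slide's 10M for speed.

```python
import random

random.seed(42)  # deterministic for illustration

NUM_USERS = 1_000_000   # scaled down from the slide's 10 million
NUM_BUCKETS = 10_000    # "random" values 0..9999

# Assign each user a uniform random bucket, as the documents above do.
buckets = [0] * NUM_BUCKETS
for _ in range(NUM_USERS):
    buckets[random.randrange(NUM_BUCKETS)] += 1

expected = NUM_USERS / NUM_BUCKETS  # ~100 users per bucket at this scale
print(min(buckets), max(buckets), expected)
```

Every bucket lands close to the expected size, which is what makes any bucket (or run of buckets) a fair random sample.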
18. Step 1: Get a random sample
Let’s take a random 100,000 users. Grab a random range that
“holds” those users. These all work:
Tip: Limit $maxScan to 100,000 just to be safe
db.users.find({random: {$gt: 0, $lt: 101}})
db.users.find({random: {$gt: 503, $lt: 604}})
db.users.find({random: {$gt: 8938, $lt: 9039}})
db.users.find({$or: [
{random: {$gt: 9955}},
{random: {$lt: 56}}
]})
Counting Quickly
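One way to generate these range queries programmatically, a sketch assuming the slides' exclusive $gt/$lt bounds (the function name and shape are illustrative, not Appboy's code):

```python
def range_query(start, width=100, num_buckets=10_000):
    """MongoDB-style query selecting `width` buckets beginning at
    bucket `start`, wrapping past num_buckets - 1 with $or when the
    range runs off the end (as in the slide's last example)."""
    if start + width <= num_buckets:
        return {"random": {"$gt": start - 1, "$lt": start + width}}
    # Wraparound: a high tail plus a low head, e.g. {$gt: 9955} or {$lt: 56}
    overflow = start + width - num_buckets
    return {"$or": [{"random": {"$gt": start - 1}},
                    {"random": {"$lt": overflow}}]}

print(range_query(9956))
```

Any random `start` yields exactly 100 buckets, so every query above samples the same fraction of users.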
20. Step 3: Do the math
Population: 10,000,000
Sample size: 100,000
Num matches: 11,302
Percentage of users who matched: 11.3%
Estimated total count: 1,130,000 +/- 0.2% with
95% confidence
Counting Quickly
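The math on this slide is the standard normal approximation for a proportion; it can be checked in a few lines:

```python
import math

population = 10_000_000
sample_size = 100_000
num_matches = 11_302

p_hat = num_matches / sample_size              # 0.11302
estimate = round(p_hat * population)           # extrapolate to the population
# 95% confidence interval for a proportion (normal approximation)
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
print(f"{estimate:,} +/- {margin:.1%}")        # → 1,130,200 +/- 0.2%
```

The large sample (100k) is what drives the margin down to ±0.2%; a 10k sample would widen it to roughly ±0.6%.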
21. Step 4: Optimize
• Limit $maxScan to (100,000/numShards) to be even faster
• Cache the random range for a few hours (keep sample set warm)
• Add more RAM (or shards)
• Cache results to not hit the database for the same query
• Don’t use explain(). To get more than one count, run the
aggregation framework over the sample instead of issuing separate count queries
Counting Quickly
22. Counting Quickly
Goal is to handle scale,
do things that work for any size user base
Random sampling is a good way to do this
24. • Want to send different messages to users in a cohort and measure
against a control (a set of users in the cohort who do not receive any
message)
• Who receives the message should be random
• If you have 1M users and want to send a test to 50k, want to select a
random 50k (and another random 50k for control)
• If you target the same 1M user cohort with 50k test sizes, different users
should be in each test
• Generically, this is the same as “random rate limiting”
• If you wanted to limit delivery to 50k, who receives it should be random
A/B Testing
• Parallel workers process users
across different “random” ranges
• Be sure to handle all “random”
values (for apps with fewer than
10,000 users)
• Keep track of global rate limited
state to know when to stop
processing
• Users randomly receive variations
based on send probability (more on
this later), also randomly chosen to
be in control
Randomly scan and select users based on “random” value
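One way to hand out those "random" ranges to parallel workers, sketched in Python (the global rate-limit check the slide mentions is omitted; it would be a shared counter consulted between batches):

```python
def partition_buckets(num_workers, num_buckets=10_000):
    """Split the full 0..num_buckets-1 "random" space into contiguous
    chunks, one per parallel worker.  Covering every value matters for
    apps with fewer than 10,000 users, where most buckets are empty."""
    size, extra = divmod(num_buckets, num_workers)
    ranges, start = [], 0
    for i in range(num_workers):
        end = start + size + (1 if i < extra else 0)
        ranges.append(range(start, end))
        start = end
    return ranges

ranges = partition_buckets(3)
```

Each worker scans only its own range, so no user is processed twice and no bucket is skipped.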
29. • Use statistical analysis to look at random user samples based on
“random” value
• A/B tests send on random users based on “random” value
• Overloading one “random” value biases retargeting: store
another “random” value and use different ones for each use case
Statistical Sampling and A/B Testing
34. {
first_name: “Sherika”,
email: “sherika+demo@appboy.com”,
dob: 1994-10-24,
gender: “F”,
custom: {
brands_purchased: “Puma and Asics”,
credit_card_holder: true,
shoe_size: 11,
...
},
...
}
Custom attributes can go alongside other fields!
db.users.update(…, {$set: {“custom.loyalty_program”: true}})
Extensible User Profiles
35. • Easily extensible to add any number of fields
• Don’t need to worry about type (bool, string, integer, float, etc.):
MongoDB handles it all
• Can do atomic operations like $inc easily
• Easily queryable, no need to do complicated joins against the right
value column
• Can take up a lot of space
“this_is_my_really_long_custom_attribute_name_weeeeeee”
• Can end up with mismatched types across documents
{ visited_website: true }
{ visited_website: “yes” }
Pros
Cons
Extensible User Profiles
36. Space Concern
Tokenize values, use a field map:
{
first_name: “Sherika”,
email: “sherika+demo@appboy.com”,
dob: 1994-10-24,
gender: “F”,
custom: {
0: true,
1: 11,
2: “Alex & Ani”,
...
},
...
}
{
loyalty_program: 0,
shoe_size: 1,
favorite_brand: 2
}
You should also limit the length of values
Extensible User Profiles - How to Improve the Cons
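A minimal sketch of the tokenization above (class and method names are illustrative, not Appboy's implementation): long attribute names live once in the field map, and user documents carry only small integer keys.

```python
class FieldMap:
    """Maps long custom attribute names to small integer tokens."""
    def __init__(self):
        self.names = []          # index in this list is the token
        self.tokens = {}         # name -> token, for fast lookup

    def token_for(self, name):
        if name not in self.tokens:          # first sighting: assign next token
            self.tokens[name] = len(self.names)
            self.names.append(name)
        return self.tokens[name]

    def encode(self, custom):
        """Replace attribute names with tokens before writing a user doc."""
        return {self.token_for(k): v for k, v in custom.items()}

    def decode(self, tokenized):
        """Restore readable names when reading a user doc back."""
        return {self.names[t]: v for t, v in tokenized.items()}

fm = FieldMap()
doc = fm.encode({"loyalty_program": True, "shoe_size": 11,
                 "favorite_brand": "Alex & Ani"})
```

`doc` here is `{0: True, 1: 11, 2: "Alex & Ani"}`, matching the slide's example.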
37. Type Constraints
Handle in the client, store expected types in a map and
coerce/reject bad values
{
loyalty_program: Boolean,
shoe_size: Integer,
favorite_brand: String
}
(also need a map for display names of fields…)
Extensible User Profiles - How to Improve the Cons
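A client-side sketch of the coerce/reject step, under the assumption that the type map is keyed by attribute name (real code would also record a new field's type the first time it is seen):

```python
EXPECTED_TYPES = {           # hypothetical per-app type map
    "loyalty_program": bool,
    "visited_website": bool,
    "shoe_size": int,
    "favorite_brand": str,
}

def coerce(name, value):
    """Coerce a custom attribute to its expected type, or raise.
    Unknown attributes pass through untouched."""
    expected = EXPECTED_TYPES.get(name)
    if expected is None or isinstance(value, expected):
        return value
    if expected is bool:                     # common client mistakes
        if value in ("yes", "true", 1):
            return True
        if value in ("no", "false", 0):
            return False
    try:
        return expected(value)
    except (TypeError, ValueError):
        raise ValueError(f"{name}: cannot coerce {value!r} to {expected.__name__}")
```

This is exactly the mismatch from the earlier slide: `{visited_website: "yes"}` gets normalized to `{visited_website: true}` before it ever reaches the database.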
38. • Use arrays to store items in map, index in array is “token”
• 1+ document per customer that has array field list
• Atomically push new custom attribute to end of array, get
index (“token”) and cache value for fast retrieval later
Field Map
[“Loyalty Program”, “Shoe Size”, “Favorite Color”]
0 1 2
39. • Avoid document growing unbounded
• We cap how many array elements we store before
generating a new document (say, 100)
• Have a field least_value in each document that represents the
token value of index 0 in its “list”
• $push only if list.99 does not exist; if the push fails, use
findAndModify to create a new document atomically and retry the $push
Field Map
[“Loyalty Program”, “Shoe Size”, “Favorite Color”]
100 101 102
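The token arithmetic with capped documents is simple; a sketch assuming each field-map document holds at most 100 names, so document n has least_value n*100 (helper names are illustrative):

```python
def token_for_index(least_value, index):
    """Global token for the attribute at `index` in a field-map
    document whose first element carries token `least_value`."""
    return least_value + index

def locate(token, cap=100):
    """Inverse lookup: which field-map document holds `token`, and at
    which array index, when each document stores at most `cap` names."""
    return divmod(token, cap)  # -> (document number, index in its list)
```

So "Favorite Color" in the second document above is token 100 + 2 = 102, and token 102 resolves back to document 1, index 2.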
40. • Adds indirection and complexity, but worth it
• Small field name size in each document
• Compression in WiredTiger makes this not an issue anymore
from storage perspective, but still has benefits for field names
• Easy identifiers to pass around in code for custom attributes
Field Map Summary
43. • Appboy customers run multivariate tests of message
campaigns for a long duration
• Goal is to, in the shortest period of time, find the variation
which we are statistically certain provides the highest
conversion
• Customers check in on results and make determination
Multivariate Testing
45. Think of it like you are at a row of slot machines, each has a
random reward across a specific distribution not known in
advance. Need to maximize reward.
Multi-arm Bandit Multivariate Testing
"Las Vegas slot machines". Licensed under CC BY-SA 3.0 via Wikipedia
http://en.wikipedia.org/wiki/File:Las_Vegas_slot_machines.jpg#/media/File:Las_Vegas_slot_machines.jpg
46. “[The bandit problem] was formulated during the [second world]
war, and efforts to solve it so sapped the energies and minds of
Allied analysts that the suggestion was made that the problem
be dropped over Germany, as the ultimate instrument of
intellectual sabotage.”
Multi-arm Bandit Multivariate Testing
- Peter Whittle, 1979
47. Appboy inspired by a paper from U. Chicago Booth:
http://faculty.chicagobooth.edu/workshops/marketing/pdf/pdf/ExperimentsInTheServiceEconomy.pdf
“Multi-armed bandit experiments in the online service economy”
Steven L. Scott, Harvard PhD., Senior Economic Analyst at Google
Multi-arm Bandit Multivariate Testing
48. • Twice per day, Appboy will automatically go in and optimize
send distributions for each variation using algorithm
• Requires a lot of observed data
• For each variation:
• Unique recipients who received it
• Conversion rate
• Timeseries of this data
Multi-arm Bandit Multivariate Testing
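The Scott paper the previous slide cites uses Thompson sampling: draw a plausible conversion rate for each variation from its Beta posterior and allocate sends in proportion to how often each variation "wins" the draw. A sketch of that approach using exactly the observed data listed above (Appboy's production algorithm may differ in details):

```python
import random

random.seed(7)  # deterministic for illustration

def thompson_weights(variations, num_draws=10_000, rng=random):
    """Estimate each variation's probability of being the best arm.
    `variations` maps name -> (unique_recipients, conversions)."""
    wins = {name: 0 for name in variations}
    for _ in range(num_draws):
        best, best_rate = None, -1.0
        for name, (recipients, conversions) in variations.items():
            # Beta(1 + successes, 1 + failures) posterior on the rate
            rate = rng.betavariate(1 + conversions,
                                   1 + recipients - conversions)
            if rate > best_rate:
                best, best_rate = name, rate
        wins[best] += 1
    return {name: w / num_draws for name, w in wins.items()}

weights = thompson_weights({
    "variation_1": (100_000, 5_000),   # 5.0% conversion
    "variation_2": (100_000, 5_400),   # 5.4% conversion
})
```

With this much observed data the posteriors barely overlap, so nearly all future sends shift to variation 2; with thin data the weights stay spread out, which is the exploration half of the bandit.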
50. {
company_id: BSON::ObjectId,
campaign_id: BSON::ObjectId,
date: 2015-05-31,
message_variation_1: {
unique_recipient_count: 100000,
total_conversion_count: 5000,
total_open_rate: 8000,
hourly_breakdown: {
0: {
unique_recipient_count: 1000,
total_conversion_count: 40,
total_open_rate: 125,
...
},
...
},
...
},
message_variation_2: {
...
}
}
• Pre-aggregated stats let us
pull back the entirety of an
experiment extremely quickly
• Shard on company ID so we
can pull back all their
campaigns at once and
optimize together
• Pre-aggregated stats need
special care to build to avoid
write overload
Multi-arm Bandit Multivariate Testing
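The "special care" on writes usually means incrementing counters in place with dot-notation paths rather than rewriting the document. A sketch of building such an update spec against the daily stats schema above (the helper is illustrative):

```python
from datetime import datetime, timezone

def conversion_inc(variation, when):
    """Build the $inc spec for recording one conversion in the
    pre-aggregated daily stats document, bumping both the daily total
    and the matching hourly_breakdown bucket in a single update."""
    prefix = f"message_variation_{variation}"
    return {"$inc": {
        f"{prefix}.total_conversion_count": 1,
        f"{prefix}.hourly_breakdown.{when.hour}.total_conversion_count": 1,
    }}

spec = conversion_inc(1, datetime(2015, 5, 31, 14, 30, tzinfo=timezone.utc))
```

Each event is then one small, atomic in-place write, which is what keeps a hot stats document from becoming a write bottleneck.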
52. • Appboy analyzes the optimal time to
send a message to a user
• If Alice is more likely to engage at
night and Bob in the morning, they’ll
get notifications at those windows
“Comparing overall open rates before and after using it, we've seen over 100% improvement
in performance. Our one week retention campaigns targeted at male Urban On members
improved 138%. Additionally, engaging a particularly difficult segment, users who have been
inactive for three months, has improved 94%.”
- Jim Davis, Director of CRM and Interactive Marketing at Urban Outfitters
Intelligent Delivery
53. • Algorithm is data-intensive on a per-user basis
• Appboy Intelligent Delivery sends tens to hundreds of
millions of messages each day, need to compute optimal
time on a per-user basis quickly
Intelligent Delivery
55. {
_id: BSON::ObjectId of user,
dimension_1: [DateTime, DateTime, …],
dimension_2: [DateTime, DateTime, …],
dimension_3: [DateTime, DateTime, …],
dimension_4: [Float, Float, …],
dimension_5: […],
}
• When dimensional data for a user comes in, record a copy of it in a document
• Shard on {_id: “hashed”} for optimal distribution across shards and best write
throughput
• When needing to Intelligently Deliver to a user, query back one document to
get all the data to input into the algorithm. This is super fast.
• MongoDB’s flexible schemas make adding new dimensions trivial
Intelligent Delivery
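A deliberately naive stand-in for the per-user computation, assuming one of the DateTime dimension arrays from the document above (the real algorithm combines multiple dimensions, and the default hour here is an invented fallback):

```python
from collections import Counter
from datetime import datetime

def best_send_hour(engagement_times, default_hour=19):
    """Pick the hour of day at which this user engages most often,
    given one DateTime array from the per-user document above."""
    if not engagement_times:
        return default_hour      # no data yet: fall back to a default
    hours = Counter(t.hour for t in engagement_times)
    return hours.most_common(1)[0][0]

# A user like Alice, who engages in the evenings
alice = [datetime(2015, 5, d, 21, 0) for d in range(1, 6)]
```

Because all of a user's dimensional data lives in one document, this whole computation needs just one query per user, which is what makes it fast at send time.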
56. • Consolidating data for fast
retrieval is a huge win
• MongoDB’s flexible schemas
make this possible
• Choose the right shard key for
the document access pattern
• (Not a catch-all; be sure to still
store the raw, non-pre-aggregated data)
Data Intensive Algorithms Summary