Jon Hyman, Co-Founder & CIO
MongoDB World 2015
@appboy @jon_hyman
REMAINING AGILE WITH BILLIONS OF DOCUMENTS:
APPBOY’S CREATIVE MONGODB SCHEMAS
Appboy is a marketing
automation platform
for mobile apps
We work with top brands and apps
MongoDB World 2014 Recap
• Prior to 2013, scaled vertically
• Sharded in Q2 2013
• Added write buffering with Redis (transactional)
• In 2014, started splitting out collections to more clusters
• By MongoDB World 2014, Appboy handled over 4 billion data points per month
Appboy’s growth on MongoDB
MongoDB World 2014 Recap
Appboy’s Growth Today
• Approximately 22 billion events per month
• Handling spikes of 2B+ events per day
• We anticipate tracking over 1B unique
users in Q3
• 11 clusters, over 160 shards
Appboy’s growth on MongoDB
Appboy’s Growth in 2015
• Statistical analysis in read queries
• Random rate limiting and A/B testing
• Flexible schemas, tokenizing field names
• Schemas for data intensive algorithms at Appboy
Agenda
Today at MongoDB World 2015!
The importance of randomness:
STATISTICAL ANALYSIS
A group of users who match some set of filters.
User Segmentation
Appboy shows you segment membership in real-time as
you add/edit/remove filters.
How do we do it quickly?
We estimate the population sizes of segments when using
our web UI.
Counting Quickly
Goal: Quickly get the count() of an arbitrary query
Problem: MongoDB counts are slow, especially
unindexed ones
Counting Quickly
10 million documents that represent people:
Counting Quickly
{
favorite_color: “blue”,
age: 29,
gender: “M”,
favorite_food: “pizza”,
city: “NYC”,
shoe_size: 11,
attractiveness: 10,
...
}
10 million documents that represent people:
• How many people like blue?
• How many live in NYC and love pizza?
• How many men have a shoe size less than 10?
{
favorite_color: “blue”,
age: 29,
gender: “M”,
favorite_food: “pizza”,
city: “NYC”,
shoe_size: 11,
attractiveness: 10,
...
}
Counting Quickly
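For reference, the naive way to answer each of those questions is a straight count() over the collection, which scans every document unless a matching index exists. A quick sketch using field names from the example document (the users collection name matches the later slides):

// Naive counts: each of these scans the collection without a matching index
db.users.find({favorite_color: "blue"}).count()
db.users.find({city: "NYC", favorite_food: "pizza"}).count()
db.users.find({gender: "M", shoe_size: {$lt: 10}}).count()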
Big Question:
How do you estimate
counts?
Answer:
The same way news
networks do it.
With confidence.
Add a random number in a known range to each document.
Say, between 0 and 9999:
{
random: 4583,
favorite_color: “blue”,
age: 29,
gender: “M”,
favorite_food: “pizza”,
city: “NYC”,
shoe_size: 11,
attractiveness: 10,
...
}
Add an index on the random number:
db.users.ensureIndex({random:1})
Counting Quickly
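A minimal sketch of how the random bucket could be assigned as user documents are written; only the 0 to 9999 range and the index come from the slides, the insert itself is illustrative:

// Assign a random bucket in [0, 9999] when the user document is created
db.users.insert({
  random: Math.floor(Math.random() * 10000),
  favorite_color: "blue",
  age: 29
  // ... other profile fields
})

// Index the random field so range scans over buckets are cheap
db.users.ensureIndex({random: 1})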
Step 1: Get a random sample
I have 10 million documents. Of my 10,000 random “buckets”,
I should expect each “bucket” to hold about 1,000 users.
E.g.,
db.users.find({random: 123}).count() == ~1000
db.users.find({random: 9043}).count() == ~1000
db.users.find({random: 4982}).count() == ~1000
Counting Quickly
Step 1: Get a random sample
Let’s take a random 100,000 users. Grab a random range that
“holds” those users. These all work:
Tip: Limit $maxScan to 100,000 just to be safe
db.users.find({random: {$gt: 0, $lt: 101}})
db.users.find({random: {$gt: 503, $lt: 604}})
db.users.find({random: {$gt: 8938, $lt: 9039}})
db.users.find({$or: [
{random: {$gt: 9955}},
{random: {$lt: 56}}
]})
Counting Quickly
Step 2: Learn about that random sample
Explain Result:
db.users.find(
{
random: {$gt: 0, $lt: 101},
gender: “M”,
favorite_color: “blue”,
shoe_size: {$gt: 10}
}
)
._addSpecial(“$maxScan”, 100000)
.explain()
{
nscannedObjects: 100000,
n: 11302,
...
}
Counting Quickly
Step 3: Do the math
Population: 10,000,000
Sample size: 100,000
Num matches: 11,302
Percentage of users who matched: 11.3%
Estimated total count: 1,130,000 +/- 0.2% with
95% confidence
Counting Quickly
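A sketch of the arithmetic behind that estimate, using the normal approximation for a proportion with a 95% z-value of 1.96 (the numbers are the ones from the slide):

// Estimate the total count and its 95% confidence interval from the sample
var population = 10000000
var sampleSize = 100000
var matches = 11302

var p = matches / sampleSize                            // 0.113, i.e. 11.3%
var estimate = p * population                           // ~1,130,000
var margin = 1.96 * Math.sqrt(p * (1 - p) / sampleSize) // ~0.002, i.e. +/- 0.2 points
print(estimate + " +/- " + (margin * population))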
Step 4: Optimize
• Limit $maxScan to (100,000 / numShards), since it applies per shard, to be even faster
• Cache the random range for a few hours (keeps the sample set warm)
• Add more RAM (or shards)
• Cache results so the same query doesn’t hit the database twice
• Don’t use explain() when you need more than one count: run the
aggregation framework over the same sample instead (see the sketch below)
Counting Quickly
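One way to do that last point is to run an aggregation over the same random range and compute several conditional counts in one pass; a sketch (the specific filters are just examples):

// Several segment counts from one pass over the random sample
db.users.aggregate([
  {$match: {random: {$gt: 0, $lt: 101}}},
  {$group: {
    _id: null,
    sampled: {$sum: 1},
    likes_blue: {$sum: {$cond: [{$eq: ["$favorite_color", "blue"]}, 1, 0]}},
    nyc_pizza: {$sum: {$cond: [
      {$and: [{$eq: ["$city", "NYC"]}, {$eq: ["$favorite_food", "pizza"]}]}, 1, 0]}}
  }}
])
// Scale each conditional count by (population / sampled) to estimate totals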
Counting Quickly
The goal is to handle scale: do things that work for any size user base.
Random sampling is a good way to do this.
The importance of randomness:
RATE LIMITING AND A/B TESTING
• Want to send different messages to users in a cohort and measure
against a control (a set of users in the cohort who do not receive any
message)
• Who receives the message should be random
• If you have 1M users and want to send a test to 50k, you want to select a
random 50k (and another random 50k as the control)
• If you target the same 1M-user cohort again with 50k test sizes, different users
should be in each test
• Generically, this is the same as “random rate limiting”
• If you wanted to limit delivery to 50k, who receives it should be random
A/B Testing
Randomly scan and select users based on “random” value
• Parallel workers process users across different “random” ranges (see the sketch below)
• Be sure to handle all “random”
values (for apps with fewer than
10,000 users)
• Keep track of global rate limited
state to know when to stop
processing
• Users randomly receive variations
based on send probability (more on
this later), also randomly chosen to
be in control
Randomly scan and select users based on “random” value
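A minimal sketch of one worker's slice of that scan, assuming workers split the 0 to 9999 range and a shared counter document tracks the global rate limit (the campaign_send_state collection and its fields are hypothetical):

// One worker handles its assigned slice of the "random" range
var sliceStart = 0, sliceEnd = 2500
var cursor = db.users.find({random: {$gte: sliceStart, $lt: sliceEnd}})

while (cursor.hasNext()) {
  // Consult the shared campaign-wide counter (in practice, check periodically)
  var state = db.campaign_send_state.findOne({campaign_id: "example-campaign"})
  if (state && state.sent >= state.limit) break   // global rate limit reached

  var user = cursor.next()
  // Randomly pick a variation or the control for this user based on send
  // probability, enqueue the message, then count the send against the limit
  db.campaign_send_state.update(
    {campaign_id: "example-campaign"},
    {$inc: {sent: 1}},
    {upsert: true}
  )
}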
The importance of randomness:
NEED MORE RANDOMNESS
• Statistical analysis looks at random user samples based on the “random” value
• A/B tests send to random users based on the “random” value
• Overloading the same “random” value for both biases your retargeting: you need
another “random” value, and use different ones for each case (see the sketch below)
Statistical Sampling and A/B Testing
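A sketch of what that looks like on a user document: one value reserved for dashboard sampling and a separate one for test selection (the second field name, random_ab, is an assumption):

// Keep separate random values so sampling never biases test selection
{
  random: 4583,      // used for statistical sampling in the dashboard
  random_ab: 717,    // used only for A/B test and rate-limit selection
  favorite_color: "blue",
  ...
}

db.users.ensureIndex({random_ab: 1})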
Flexible Schemas:
EXTENSIBLE USER PROFILES
Appboy creates a rich
user profile on every
user who opens one of
our customers’ apps
Extensible User Profiles
We also let our customers add their own custom attributes
Extensible User Profiles
{
first_name: “Sherika”,
email: “sherika+demo@appboy.com”,
dob: 1994-10-24,
gender: “F”,
country: “DE”,
...
}
Let’s talk schema
Extensible User Profiles
{
first_name: “Sherika”,
email: “sherika+demo@appboy.com”,
dob: 1994-10-24,
gender: “F”,
custom: {
brands_purchased: “Puma and Asics”,
credit_card_holder: true,
shoe_size: 11,
...
},
...
}
Custom attributes can go alongside other fields!
db.users.update({…}, {$set: {“custom.loyalty_program”: true}})
Extensible User Profiles
Pros
• Easily extensible to add any number of fields
• Don’t need to worry about type (bool, string, integer, float, etc.):
MongoDB handles it all
• Can do atomic operations like $inc easily
• Easily queryable, no need to do complicated joins against the right
value column

Cons
• Can take up a lot of space:
“this_is_my_really_long_custom_attribute_name_weeeeeee”
• Can end up with mismatched types across documents:
{ visited_website: true }
{ visited_website: “yes” }
Extensible User Profiles
Space Concern
Tokenize field names, use a field map (see the sketch below):
{
first_name: “Sherika”,
email: “sherika+demo@appboy.com”,
dob: 1994-10-24,
gender: “F”,
custom: {
0: true,
1: 11,
2: “Alex & Ani”,
...
},
...
}
{
loyalty_program: 0,
shoe_size: 1,
favorite_brand: 2
}
You should also limit the length of values
Extensible User Profiles - How to Improve the Cons
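A minimal sketch of the client-side translation from attribute name to token, assuming the field map above has already been loaded into memory (helper and variable names are illustrative):

// Cached token map for this customer, e.g. built from the field map document
var fieldMap = {loyalty_program: 0, shoe_size: 1, favorite_brand: 2}

// Write a custom attribute using its token instead of the long name
function setCustomAttribute(userId, name, value) {
  var token = fieldMap[name]
  if (token === undefined) throw "unknown custom attribute: " + name
  var update = {$set: {}}
  update.$set["custom." + token] = value   // e.g. {$set: {"custom.1": 11}}
  db.users.update({_id: userId}, update)
}
// e.g. setCustomAttribute(user._id, "shoe_size", 11)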
Type Constraints
Handle this in the client: store expected types in a map and
coerce/reject bad values (see the sketch below)
{
loyalty_program: Boolean,
shoe_size: Integer,
favorite_brand: String
}
(also need a map for display names of fields…)
Extensible User Profiles - How to Improve the Cons
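A sketch of that client-side check, coercing obvious cases and rejecting the rest (the type map values and helper name are illustrative):

// Expected types per custom attribute, kept alongside the token map
var typeMap = {loyalty_program: "boolean", shoe_size: "number", favorite_brand: "string"}

// Coerce a value to its expected type, or reject it
function coerceValue(name, value) {
  var expected = typeMap[name]
  if (typeof value === expected) return value
  if (expected === "boolean") {
    if (value === "true" || value === "yes") return true
    if (value === "false" || value === "no") return false
  }
  if (expected === "number" && !isNaN(Number(value))) return Number(value)
  if (expected === "string") return String(value)
  throw "rejecting " + name + ": expected " + expected + ", got " + (typeof value)
}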
• Use arrays to store items in map, index in array is “token”
• 1+ document per customer that has array field list
• Atomically push new custom attribute to end of array, get
index (“token”) and cache value for fast retrieval later
Field Map
[“Loyalty Program”, “Shoe Size”, “Favorite Color”]
0 1 2
• Avoid documents growing unbounded
• We cap how many array elements we store before
generating a new document (say, 100)
• Each document has a field least_value that represents the
token value of index 0 in its “list”
• $push only if list.99 does not exist; if the document is full, use
findAndModify to create a new document atomically, then retry the $push
(see the sketch below)
Field Map
[“Loyalty Program”, “Shoe Size”, “Favorite Color”]
100 101 102
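A sketch of that capped append, keeping the least_value field and the 100-entry cap from the bullets above (the custom_attribute_maps collection name is an assumption):

var companyId = ObjectId()   // the customer's id

// Push the new attribute name only if the current map document still has room
var res = db.custom_attribute_maps.update(
  {company_id: companyId, "list.99": {$exists: false}},
  {$push: {list: "Favorite Brand"}}
)

// If nothing matched, the current document is full: atomically create the next
// one (its least_value is the previous document's least_value + 100), then retry
if (res.nModified === 0) {
  db.custom_attribute_maps.findAndModify({
    query: {company_id: companyId, "list.99": {$exists: false}},
    update: {$setOnInsert: {least_value: 100, list: []}},
    upsert: true,
    new: true
  })
  // ... retry the $push above; the token is least_value + index of the new entry
}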
• Adds indirection and complexity, but worth it
• Small field name size in each document
• WiredTiger compression makes long field names much less of a storage
concern, but tokenizing field names still has benefits
• Easy identifiers to pass around in code for custom attributes
Field Map Summary
Flexible Schemas:
FOR DATA INTENSIVE ALGORITHMS
Data Intensive Algorithms, Part 1:
MULTI-ARM BANDIT
MULTIVARIATE TESTING
• Appboy customers run multivariate tests of message
campaigns for a long duration
• Goal is to, in the shortest period of time, find the variation
which we are statistically certain provides the highest
conversion
• Customers check in on results and make determination
Multivariate Testing
Multivariate testing example:
Think of it like being at a row of slot machines, each of which has a
random reward drawn from a distribution that is not known in
advance. You need to maximize your reward.
Multi-arm Bandit Multivariate Testing
"Las Vegas slot machines". Licensed under CC BY-SA 3.0 via Wikipedia

http://en.wikipedia.org/wiki/File:Las_Vegas_slot_machines.jpg#/media/File:Las_Vegas_slot_machines.jpg
“[The bandit problem] was formulated during the [second world]
war, and efforts to solve it so sapped the energies and minds of
Allied analysts that the suggestion was made that the problem
be dropped over Germany, as the ultimate instrument of
intellectual sabotage.”
Multi-arm Bandit Multivariate Testing
- Peter Whittle, 1979
Appboy was inspired by a paper from U. Chicago Booth:
http://faculty.chicagobooth.edu/workshops/marketing/pdf/pdf/ExperimentsInTheServiceEconomy.pdf
“Multi-armed bandit experiments in the online service economy”
Steven L. Scott, Harvard PhD, Senior Economic Analyst at Google
Multi-arm Bandit Multivariate Testing
• Twice per day, Appboy automatically optimizes the send
distribution for each variation using the algorithm
• Requires a lot of observed data
• For each variation:
• Unique recipients who received it
• Conversion rate
• Timeseries of this data
Multi-arm Bandit Multivariate Testing
{
company_id: BSON::ObjectId,
campaign_id: BSON::ObjectId,
date: 2015-05-31,
message_variation_1: {
unique_recipient_count: 100000,
total_conversion_count: 5000,
total_open_rate: 8000,
hourly_breakdown: {
0: {
unique_recipient_count: 1000,
total_conversion_count: 40,
total_open_rate: 125,
...
},
...
},
...
},
message_variation_2: {
...
}
}
Multi-arm Bandit Multivariate Testing
(Same pre-aggregated document as above.)
• Pre-aggregated stat lets us
pull back entirety of
experiment extremely quickly
• Shard on company ID so we
can pull back all their
campaigns at once and
optimize together
• Pre-aggregated stats need special care when they are built,
to avoid write overload (see the sketch below)
Multi-arm Bandit Multivariate Testing
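A sketch of how one conversion event could be folded into that pre-aggregated document with a single atomic upsert (the campaign_stats collection name is an assumption; in practice writes like this are buffered upstream, as the last bullet warns):

var companyId = ObjectId()    // the customer's id
var campaignId = ObjectId()   // the campaign's id

// Fold one conversion for variation 1 at hour 0 into the daily document
db.campaign_stats.update(
  {company_id: companyId, campaign_id: campaignId, date: ISODate("2015-05-31")},
  {$inc: {
    "message_variation_1.total_conversion_count": 1,
    "message_variation_1.hourly_breakdown.0.total_conversion_count": 1
  }},
  {upsert: true}
)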
Data Intensive Algorithms, Part 2:
INTELLIGENT DELIVERY
• Appboy analyzes the optimal time to
send a message to a user
• If Alice is more likely to engage at
night and Bob in the morning, they’ll
each get notifications in those windows
“Comparing overall open rates before and after using it, we've seen over 100% improvement
in performance. Our one week retention campaigns targeted at male Urban On members
improved 138%. Additionally, engaging a particularly difficult segment, users who have been
inactive for three months, has improved 94%.”
- Jim Davis, Director of CRM and Interactive Marketing at Urban Outfitters
Intelligent Delivery
• The algorithm is data-intensive on a per-user basis
• Appboy Intelligent Delivery sends tens to hundreds of
millions of messages each day, so we need to compute the
optimal time on a per-user basis quickly
Intelligent Delivery
Pre-aggregate dimensions on a per-user basis
Intelligent Delivery
{
_id: BSON::ObjectId of user,
dimension_1: [DateTime, DateTime, …],
dimension_2: [DateTime, DateTime, …],
dimension_3: [DateTime, DateTime, …],
dimension_4: [Float, Float, …],
dimension_5: […],
}
• When dimensional data for a user comes in, record a copy of it in the user’s document (see the sketch below)
• Shard on {_id: “hashed”} for optimal distribution across shards and best write
throughput
• When needing to Intelligently Deliver to a user, query back one document to
get all the data to input into the algorithm. This is super fast.
• MongoDB’s flexible schemas make adding new dimensions trivial
Intelligent Delivery
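A sketch of those bullets: the hashed shard key, recording an incoming data point, and the single read at send time (the collection name, database name, and which dimension is being pushed are illustrative):

// Hashed _id shard key for even write distribution
sh.shardCollection("appboy.intelligent_delivery_profiles", {_id: "hashed"})

var userId = ObjectId()   // the user's id

// Record a copy of an incoming dimensional data point (e.g. a session timestamp)
db.intelligent_delivery_profiles.update(
  {_id: userId},
  {$push: {dimension_1: new Date()}},
  {upsert: true}
)

// At send time, one document read returns everything the algorithm needs
db.intelligent_delivery_profiles.findOne({_id: userId})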
• Consolidating data for fast
retrieval is a huge win
• MongoDB’s flexible schemas
make this possible
• Choose the right shard key for
the document access pattern
• (Not a catch-all: be sure to still
store the raw, non-pre-aggregated data)
Data Intensive Algorithms Summary
Thanks! Questions?
jon@appboy.com
@appboy @jon_hyman
