2. Agenda
• Why is schema design important
• A real world use case
– Social Inbox
– History
• Conclusions
3. Why is Schema Design important?
• Largest factor for a performant system
• Schema design with MongoDB is different
– RDBMS – "What answers do I have?"
– MongoDB – "What questions will I have?"
9. 3 Approaches (there are more)
• Fan out on Read
• Fan out on Write
• Fan out on Write with Bucketing
10. Fan out on read
// Shard on "from"
sh.shardCollection( "mongodbdays.inbox", { from: 1 } )
// Make sure we have an index to handle inbox reads
db.inbox.createIndex( { to: 1, sent: 1 } )
msg = {
    from: "Matias",
    to: [ "Bob", "Jane" ],
    sent: new Date(),
    message: "Hi!"
}
// Send a message
db.inbox.insertOne( msg )
// Read my inbox
db.inbox.find( { to: "Matias" } ).sort( { sent: -1 } )
Schema Design, Matias Cascallares
11. Fan out on read – IO
[Diagram: Send Message; Shard 1, Shard 2, Shard 3]
12. Fan out on read – IO
[Diagram: Read Inbox; Shard 1, Shard 2, Shard 3]
13. Considerations
• Write: one document per message sent
• Reading my inbox means finding all messages with my own name in the recipient field
• Read: requires scatter-gather on a sharded cluster
• Then a lot of random IO on a shard to find everything
14. Fan out on write
// Shard on "recipient" and "sent"
sh.shardCollection( "mongodbdays.inbox", { recipient: 1, sent: 1 } )
msg = {
    from: "Matias",
    to: [ "Bob", "Jane" ],
    sent: new Date(),
    message: "Hi!"
}
// Send a message: insert one copy per recipient
msg.to.forEach( function( recipient ) {
    db.inbox.insertOne( {
        from: msg.from, to: msg.to, recipient: recipient,
        sent: msg.sent, message: msg.message
    } )
} )
// Read my inbox
db.inbox.find( { recipient: "Matias" } ).sort( { sent: -1 } )
15. Fan out on write – IO
[Diagram: Send Message; Shard 1, Shard 2, Shard 3]
16. Fan out on write – IO
[Diagram: Read Inbox; Shard 1, Shard 2, Shard 3]
17. Considerations
• Write: one document per recipient
• Reading my inbox is just finding all of the messages with me as the recipient
• Can shard on recipient, so inbox reads hit one shard
• But still lots of random IO on the shard
18. Fan out on write with buckets
// Shard on "owner / sequence"
sh.shardCollection( "mongodbdays.inbox", { owner: 1, sequence: 1 } )
sh.shardCollection( "mongodbdays.users", { user_name: 1 } )
msg = {
    from: "Matias",
    to: [ "Bob", "Jane" ],
    sent: new Date(),
    message: "Hi!"
}
20. Fan out on write with buckets
• Each “inbox” document is an array of messages
• Append a message onto “inbox” of recipient
• Bucket inboxes so there are not too many messages per document
• Can shard on recipient, so inbox reads hit one shard
• 1 or 2 documents to read the whole inbox
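The bucketing described above can be sketched in memory, without a database. This is only an illustration under stated assumptions, not the presenter's code: `msgCounts`, `inboxBuckets`, `sendMessage` and `readInbox` are hypothetical names, the bucket size of 50 is taken from the speaker notes, and a real implementation would keep both the per-user counter and the buckets in MongoDB.

```javascript
// In-memory sketch of fan out on write with buckets (no MongoDB needed).
const BUCKET_SIZE = 50;          // max messages per inbox document
const msgCounts = new Map();     // hypothetical per-user message counter
const inboxBuckets = new Map();  // "owner/sequence" -> { owner, sequence, messages }

// Append the message to the current bucket of every recipient.
function sendMessage(msg) {
  for (const recipient of msg.to) {
    const count = msgCounts.get(recipient) || 0;
    msgCounts.set(recipient, count + 1);
    // Messages 0..49 go to bucket 0, 50..99 to bucket 1, and so on.
    const sequence = Math.floor(count / BUCKET_SIZE);
    const key = recipient + "/" + sequence;
    if (!inboxBuckets.has(key)) {
      inboxBuckets.set(key, { owner: recipient, sequence: sequence, messages: [] });
    }
    inboxBuckets.get(key).messages.push(msg);
  }
}

// Return the newest buckets for one owner, newest first.
function readInbox(owner, maxBuckets) {
  return Array.from(inboxBuckets.values())
    .filter((bucket) => bucket.owner === owner)
    .sort((a, b) => b.sequence - a.sequence)
    .slice(0, maxBuckets);
}

sendMessage({ from: "Matias", to: ["Bob", "Jane"], sent: new Date(), message: "Hi!" });
console.log(readInbox("Bob", 2).length);
```

Reading the newest one or two buckets is what keeps an inbox read down to a couple of documents, at the cost of extra bookkeeping on every send.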
21. Fan out on write with buckets - IO
[Diagram: Send Message; Shard 1, Shard 2, Shard 3]
22. Fan out on write with buckets - IO
[Diagram: Read Inbox; Shard 1, Shard 2, Shard 3]
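The trade-off between the three approaches can be summarised as rough document counts. The sketch below is a simplification based only on the considerations listed in these slides; `ioCost` and its field names are hypothetical, and it ignores index reads and any counter updates needed by the bucketed approach.

```javascript
// Rough document counts per operation for the three approaches.
// recipients: number of people in "to"; inboxSize: messages in one inbox;
// bucketSize: messages per bucket document (only used by the bucketed approach).
function ioCost(approach, recipients, inboxSize, bucketSize) {
  switch (approach) {
    case "fanOutOnRead":
      // One document per message; inbox reads scatter-gather across shards.
      return { docsWrittenPerSend: 1, docsReadPerInbox: inboxSize, readHitsOneShard: false };
    case "fanOutOnWrite":
      // One document per recipient; inbox reads go to the recipient's shard.
      return { docsWrittenPerSend: recipients, docsReadPerInbox: inboxSize, readHitsOneShard: true };
    case "fanOutOnWriteWithBuckets":
      // One bucket update per recipient; an inbox is a handful of buckets.
      return {
        docsWrittenPerSend: recipients,
        docsReadPerInbox: Math.ceil(inboxSize / bucketSize),
        readHitsOneShard: true
      };
    default:
      throw new Error("unknown approach: " + approach);
  }
}

console.log(ioCost("fanOutOnWriteWithBuckets", 2, 75, 50));
```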
25. Design Goals
• Need to retain a limited amount of history, e.g.
– Number of items
– Hours, days, weeks
– May be a legislative requirement (e.g. HIPAA, SOX, DPA)
• Need to query efficiently by
– Match
– Ranges
26. 3 Approaches (there are more)
• Bucket by number of messages
• Fixed size array
• Bucket by date + TTL Collections
27. Bucket by number of messages
db.inbox.find()
{ owner: "Matias", sequence: 25,
  messages: [
    { from: "Matias",
      to: [ "Bob", "Jane" ],
      sent: ISODate("2013-03-01T09:59:42.689Z"),
      message: "Hi!"
    },
    …
  ] }
// Query with a date range (returns the whole buckets that contain a match)
db.inbox.find( { owner: "Matias",
                 messages: { $elemMatch: { sent: { $gt: ISODate("…") } } } } )
// Remove elements based on a date, across all of the owner's buckets
db.inbox.update( { owner: "Matias" },
                 { $pull: { messages: { sent: { $lt: ISODate("…") } } } },
                 { multi: true } )
28. Considerations
• Shrinking documents: space can be reclaimed with
– db.runCommand( { compact: '<collection>' } )
• Remove the document after the last element in the array has been removed
– { "_id" : …, "messages" : [ ], "owner" : "Bob", "sequence" : 0 }
31. TTL Collections
// inbox: one doc per user per day
db.inbox.findOne()
{
    _id: 1,
    to: "Joe",
    sequence: ISODate("2013-02-04T00:00:00.392Z"),
    messages: [ ]
}
// Auto expires data after 31536000 seconds = 1 year
db.inbox.createIndex( { sequence: 1 },
                      { expireAfterSeconds: 31536000 } )
33. Summary
• Multiple ways to model a domain problem
• Understand the key use cases of your app
• Balance ease of query against ease of write
• Random IO should be avoided
• Scatter/gather should be avoided
You define your schema when saving documents and creating indexes. Functional goals. Performance goals. In RDBMS: implement your domain model in the canonical way, following normalization practices; afterwards, use relational mechanisms such as joins and GROUP BY to answer your queries. In MongoDB: first identify your queries and typical access patterns, then design your schema around them.
Let's go to our first example.
Social media applications. Chronological feeds. All those platforms provide some level of messaging among their users.
The message that I write here needs to be sent to hundreds or thousands of users. How do we structure this in MongoDB?
This feed is unique per user; it is 100% personalized.
The simplest approach: the first idea that comes to mind. We'll use the Mongo shell for our code samples. The 'to' field is an array; matching a value against an array field in MongoDB works much like the SQL IN operator. It is a really easy solution to implement.
- No need to touch more than one shard, great for horizontal scalability!
- Reading is close to the worst-case scenario; thank God we have an index.
Write is fast. Read is close to the worst case. For a very read-heavy application this is not a good approach. Retrieving all these documents when reading the inbox takes a lot of IO.
It's the opposite of the situation we faced in the first scenario.
Efficient when reading messages but less efficient when writing. Reading still involves a lot of random IO, since we don't control where MongoDB stores each document; this is where the third solution helps us.
This is not a common-sense solution. It is not going to be your first idea, unless you have a lot of experience with MongoDB.
Let's see this findAndModify in detail… Sequence takes the total count of messages, divides it by 50 and rounds it down; this is a pagination or bucketing algorithm where sequence is the page number. We push the message onto the end of the array; each document contains at most 50 messages. It seems like a lot of work for writing or sending a message.
Writing is the same amount of work as in the previous solution, actually a bit more.
Reading is much better in this case because I retrieve only one or two documents to build my inbox, using an index. For applications with really high read traffic this optimization is really important.
Tweets are an example of a history application. Read a time window of messages.
- Give me everything between 6 and 4 months ago.
Similar to our previous example, using sequence for pagination. The update operation is atomic at the document level.
Using the $pull operator we shrink documents and produce fragmentation. You can fix that by running compact periodically, maybe with a cron job. Compaction is slow and it locks, so a good alternative is to run it on secondaries. Remember to delete the document once you have removed all of its messages.
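The cleanup in these notes can be sketched as two operation documents: one that pulls expired messages out of an owner's buckets, and one that matches the buckets left empty afterwards. This is a minimal sketch, not the presenter's code; `buildPruneOps` is a hypothetical helper, and running its output through `db.inbox.updateMany` and `db.inbox.deleteMany` is an assumption about how you might apply it.

```javascript
// Hypothetical helper: builds the documents needed to prune one owner's buckets.
function buildPruneOps(owner, cutoff) {
  return {
    // Pull every message older than the cutoff out of the owner's buckets.
    pull: {
      filter: { owner: owner },
      update: { $pull: { messages: { sent: { $lt: cutoff } } } }
    },
    // Match buckets whose messages array is empty so they can be deleted.
    removeEmpty: { owner: owner, messages: { $size: 0 } }
  };
}

const ops = buildPruneOps("Matias", new Date("2013-01-01T00:00:00Z"));
// Assumed usage in the shell:
//   db.inbox.updateMany( ops.pull.filter, ops.pull.update )
//   db.inbox.deleteMany( ops.removeEmpty )
console.log(JSON.stringify(ops.removeEmpty));
```

Running the delete right after the pull avoids leaving behind empty bucket documents.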
With this approach, instead of deleting messages we keep only the latest messages when we insert them.
- We need to know the size of the array: adaptive, based on the user, or overestimate it.
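One way to keep only the latest messages at insert time is MongoDB's `$push` with the `$each` and `$slice` modifiers. The sketch below builds such an update document and mirrors its effect on a plain array; `cappedPushUpdate` and `applyCappedPush` are hypothetical helpers, and the slides in this section do not show which mechanism the presenter actually used.

```javascript
// Builds an update that appends msg and keeps only the last maxSize elements.
function cappedPushUpdate(msg, maxSize) {
  if (maxSize < 1) throw new Error("maxSize must be at least 1");
  return { $push: { messages: { $each: [msg], $slice: -maxSize } } };
}

// Plain-array equivalent of that update, for illustration only.
function applyCappedPush(messages, msg, maxSize) {
  if (maxSize < 1) throw new Error("maxSize must be at least 1");
  return messages.concat([msg]).slice(-maxSize);
}

let latest = [];
for (let i = 1; i <= 5; i++) {
  latest = applyCappedPush(latest, { message: "msg " + i }, 3);
}
// Assumed usage in the shell:
//   db.inbox.updateOne( { owner: "Matias" }, cappedPushUpdate( msg, 3 ) )
console.log(latest.map((m) => m.message));
```

The array never grows past `maxSize`, so no separate cleanup pass is needed, but older messages are dropped as soon as newer ones arrive.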
Another approach would be to set sequence to a datetime in the future and expireAfterSeconds to 0. TTL collections are quite popular for this kind of expiration.
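The "datetime in the future" variant can be sketched as follows. With a TTL index whose `expireAfterSeconds` is 0, a document expires at the date stored in the indexed field, so the expiry moment is computed when the document is written. `expiryDate` is a hypothetical helper, and reusing `sequence` as the expiry field follows the note above rather than any code shown in the slides.

```javascript
// Hypothetical helper: the date at which a document should expire.
function expiryDate(created, retentionSeconds) {
  return new Date(created.getTime() + retentionSeconds * 1000);
}

const ONE_YEAR_SECONDS = 31536000; // same retention as the TTL example in the slides

const bucket = {
  to: "Joe",
  // With expireAfterSeconds: 0 the document expires at this date.
  sequence: expiryDate(new Date("2013-02-04T00:00:00Z"), ONE_YEAR_SECONDS),
  messages: []
};

// Assumed matching index in the shell:
//   db.inbox.createIndex( { sequence: 1 }, { expireAfterSeconds: 0 } )
console.log(bucket.sequence.toISOString());
```

Storing the expiry date per document lets different users or buckets have different retention periods under a single index.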
Schema design in non-relational databases is not trivial. There is no single solution as in RDBMS. There is nothing mathematically proven like normal forms. Which solution is best depends on your users and how they use your application.