MongoDB's oplog is possibly its most underrated feature. The oplog is vital as the basis on which replication is built, but its value doesn't stop there. Unlike the MySQL binlog, which is poorly documented and not directly exposed to MySQL clients, the oplog is a well-documented, structured format for changes that is queryable through the same mechanisms as your data. This enables many kinds of powerful, application-driven streaming and transformation. At Stripe, we've used the MongoDB oplog to create PostgreSQL, HBase, and ElasticSearch mirrors of our data. We've built a simple real-time trigger mechanism for detecting new data. And we've even used it to recover data. In this talk, we'll show you how we use the MongoDB oplog, and how you can build powerful reactive streaming data applications on top of it.
If you'd like to see the presentation with presenter's notes, I've published my Google Docs presentation at https://docs.google.com/presentation/d/19NcoFI9BG7PwLoBV7zvidjs2VLgQWeVVcUd7Xc7NoV0/pub
Originally given at MongoDB World 2014 in New York
32. oplog.find({'op' => 'i', 'ns' => ns}) do |cursor|
      cursor.add_option(Mongo::Constants::OP_QUERY_TAILABLE)
      cursor.add_option(Mongo::Constants::OP_QUERY_AWAIT_DATA)
      loop do
        cursor.each do |op|
          puts op['o']['_id']
        end
      end
    end
42. start_entry = oplog.find_one({},
      {:sort => {'$natural' => -1}})
    start = start_entry['ts']
    oplog.find({'ts' => {'$gt' => start},
                'op' => 'i',
                'ns' => ns}) do |cursor|
      cursor.add_option(Mongo::Constants::OP_QUERY_TAILABLE)
      cursor.add_option(Mongo::Constants::OP_QUERY_AWAIT_DATA)
      loop do
        cursor.each do |op|
          puts op['o']['_id']
        end
      end
    end
52. cursor.each do |op|
      case op['op']
      when 'i'
        query = "INSERT INTO #{op['ns']} (" +
                op['o'].keys.join(', ') +
                ') VALUES (' +
                op['o'].values.map(&:inspect).join(', ') + ')'
      else
        # ¯\_(ツ)_/¯
      end
    end
56. cursor.each do |op|
      case op['op']
      when 'i'
        query = "INSERT INTO #{op['ns']} (" +
                op['o'].keys.join(', ') +
                ') VALUES (' +
                op['o'].values.map(&:inspect).join(', ') + ')'
      when 'd'
        query = "DELETE FROM #{op['ns']} WHERE _id=" +
                op['o']['_id'].inspect
      else
        # ¯\_(ツ)_/¯
      end
    end
60. # (fragment: continues the case statement from the previous slide)
      when 'd'
        query = "DELETE FROM #{op['ns']} WHERE _id=" +
                op['o']['_id'].inspect
      when 'u'
        query = "UPDATE #{op['ns']} SET"
        updates = op['o']['$set'] ? op['o']['$set'] : op['o']
        updates.each do |k, v|
          query += " #{k}=#{v.inspect}"
        end
        query += " WHERE _id="
        query += op['o2']['_id'].inspect
      else
        # ¯\_(ツ)_/¯
      end
    end
61. cursor.each do |op|
      case op['op']
      when 'i'
        query = "INSERT INTO #{op['ns']} (" +
                op['o'].keys.join(', ') + ') VALUES (' +
                op['o'].values.map(&:inspect).join(', ') + ')'
      when 'd'
        query = "DELETE FROM #{op['ns']} WHERE _id=" +
                op['o']['_id'].inspect
      when 'u'
        query = "UPDATE #{op['ns']} SET"
        updates = op['o']['$set'] ? op['o']['$set'] : op['o']
        updates.each do |k, v|
          query += " #{k}=#{v.inspect}"
        end
        query += " WHERE _id=" + op['o2']['_id'].inspect
      else
        # ¯\_(ツ)_/¯
      end
    end
74. new_oplog.find({'ts' => {'$gt' => start}}) do |cursor|
      cursor.add_option(Mongo::Constants::OP_QUERY_OPLOG_REPLAY)
      cursor.each do |op|
        if op['op'] == 'd' && op['ns'] == 'monsterdb.tasks'
          old_task = old_tasks.find_one({'_id' => op['o']['_id']})
          if old_task && old_task['finished'].nil?
            # found one!
            # save old_task to a file, and we'll re-queue it later
          end
        end
        old_connection['admin'].command({'applyOps' => [op]})
      end
    end
Hi folks
Today I’m going to talk about the MongoDB oplog, and how you can use the oplog as a tool for application development
This talk can basically be broken down into 2 parts
First, I’m going to explain what the oplog is.
We’ll bring everyone up to speed on the basics so that hopefully the rest of the talk will be accessible, even to people who have never looked at the oplog before.
Then, I’ve got 3 different oplog-based applications that we’ve built and used in production at Stripe.
We’ll recreate them here, and show the basic code structure that makes them tick.
My goal is that by the end, you’ll learn enough of the basics that you can take this knowledge and build your own oplog-based applications.
So first, let’s do a quick crash course on what the oplog is, and how MongoDB uses it.
Because you’re replicating your production data,
(...you are replicating your production data, right? Great)
then you all know, with MongoDB replication,
applications write changes to a node in the replset designated primary
and the primary replicates those changes to the secondaries.
If the primary fails, one of the secondaries takes over as the new primary, and all is right with the world.
You know that this happens, but you may not know how it happens. The answer is that it happens via the oplog.
When the MongoDB primary receives a write from the application,
it first updates its own copy of the data...
...but then adds a record of that write to its operations log, or oplog.
You can think of the oplog as a “list of things I’ve done”.
That’s not entirely accurate, but it’s a good starting point.
As the application continues to write data and make changes...
...those operations are added to the oplog...
...until eventually there’s a long history of changes to your data.
How does this help with replication?
As it turns out, a log of “things I’ve done to my data” is basically a TODO list for the secondary.
Secondaries look at that list, find things that they haven’t done, and, well, do them.
Really quickly, let’s cover a couple details that will become important later:
- The oplog is “row-based” replication, so multi-inserts, -updates, and -deletes become multiple oplog entries.
- The oplog represents actual changes to data, not the operation requested, so things like findAndModify will get drastically simplified. We’ll see more of that later
But that is, at a high level, how the oplog fits into replication.
But it’s still not very useful to you as an application developer,
so let’s dig a bit deeper into the format of the oplog.
Here’s what an oplog looks like.
This is from a replset that I had just created, so it’s pretty short - only 3 entries
Let’s zoom in on the last few operations.
If you look, you might recognize some of the information.
For instance, the first operation is...
...an insert...
...into the “test” collection of the “test” namespace...
...of an object with “_id” of 1 and the “a” property set to 2
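As a sketch, here's roughly what that first entry looks like when the Ruby driver hands it back as an ordinary document (the "ts" field, a BSON::Timestamp, is elided here):

```ruby
# An illustrative insert entry, as a plain Ruby hash.
# ('ts' omitted; the real driver returns it as a BSON::Timestamp.)
entry = {
  'op' => 'i',                      # operation type: 'i' = insert
  'ns' => 'test.test',              # namespace: "database.collection"
  'o'  => { '_id' => 1, 'a' => 2 }  # the document that was inserted
}

puts entry['o']['_id']  # => 1
```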
That is a self-contained instruction that a secondary could follow.
But the cool thing about the oplog
is that I didn’t have to fight with any internals of MongoDB to pull this up.
Unlike, say, MySQL, where you need direct disk access to read the binlog...
...the oplog is just a (mostly) normal collection
It’s accessible to any client, local or remote, using the normal MongoDB drivers
So when I describe the oplog entry as a self-contained instruction that a secondary could follow,
it’s really a self-contained instruction that *we* could follow.
And that’s basically what we’re going to do for the rest of this talk:
building applications that take the instructions in the oplog and follow them,
just maybe not in the way it was originally intended.
So let’s build some apps.
For the first example, we’re going to build a simple trigger system.
A trigger system is probably the simplest thing you can build using the oplog
If you haven’t used triggers, they are basically a piece of code executed in response to certain DB events
They’re useful for decoupling code which changes data and code that wants to respond to those changes.
A lot of SQL servers support triggers, but because they’re hosted on the SQL server,
the scalability story isn’t great, and sometimes triggers are synchronous, slowing down other operations.
MongoDB doesn’t have built-in triggers, but we can use the oplog to build our own.
Because they’re standalone applications and not built-in to the server, the scalability story is a lot better.
What kind of trigger system do we want to build?
Well, I modeled this example after a problem we have at Stripe.
We model a lot of our batch processing using a queue consumer model
with queue consumers that are asynchronous but close-to-realtime
We record events for a lot of actions within Stripe
Any time someone logs in, we create an event, and write it to MongoDB.
Any time someone makes a payment, we create an event, and write it to MongoDB.
In response to those events, we do things that don’t need to complete synchronously with the payment, or operation
For instance, we asynchronously...
...update the graphs and totals on the user’s dashboard,...
...or send emails to the customer that just purchased something.
Plus a host of other actions, including some internal applications which won’t mean much to you.
We call this event system...
...monster
And it’s proven very popular at Stripe.
Queues are, in general, a great model.
They’re easy for developers to think about.
As generalized abstractions go, they’re comparatively easy to manage and scale.
We have over 250 different types of events representing all sorts of automated and human-generated activity.
And over 150 consumers look at them.
We built monster using the oplog.
Now, monster is not a simple system, and we don’t have enough time to build a full pub-sub system.
So instead, I’m going to focus on the core oplog-driven system in monster.
So here’s the core problem: monster needs to detect when new events are inserted
We could poll every few seconds, but that introduces an almost guaranteed delay
(and it wouldn’t make for a good talk about the oplog)
So instead, we’ll use the oplog, which allows us to react more or less immediately.
Specifically, we’ll build the simplest thing I could come up with:
a script which watches the oplog for insertions
and prints the _id of the resulting record
The application is simple, but it lets us cover the basics of building against the oplog.
As a reminder before we start, here’s the kind of oplog record we’re going to be interacting with.
Conceptually, we want records with “op” set to “i”, “ns” set to our collection name
and then we’ll look at the “o” field and extract the “_id”
Before I show you any code, I want to be *very* clear that none of this code should be run as-is
Some of the code I show today will have extremely dangerous security vulnerabilities
or it might just take your system down
All of the code is for pedagogy, not production.
Ok, with that out of the way, let’s look at some code.
Remember, we want to find all inserts into a particular collection.
This is how we’d do it assuming the oplog was just a normal collection
I’ve written the code in Ruby, using the raw mongodb-ruby-driver,
but everything I do should be supported by all the official MongoDB drivers.
Hopefully this looks familiar. All we’re doing is...
...making a query...
...getting a cursor back for that query, and iterating over the results.
Because the oplog is just a normal collection,
this code will work as written.
But by accessing the oplog as a normal collection, you’ll only see things that have already happened
Once you get to the end of the oplog (i.e. the present time), the cursor returns.
That’s not actually what we want, so we’ll have to invoke a few unusual MongoDB features...
Specifically, we’re going to need to use query options.
Query options are flags you set on a cursor before you start pulling results
which modify the behavior of the query
Our first flag is the “tailable” flag, which only works with capped collections.
This tells MongoDB to keep the query cursor open and hold its place in the cursor,
even if it thinks there are no more results.
Later on, you can attempt to pull more results from the cursor,
and if any new results have come in, you’ll get them.
In general, cursor iterators will break when there are no more results,
even if the cursor is left open
So we have to wrap looping over the cursor in another loop.
The next flag is the “await data” flag
This tells MongoDB that, before returning, if you think there are no results,
wait a few seconds just in case any show up.
This means that instead of super aggressively polling your mongod
(and likely spiking CPU utilization on both the server and client)
you’ll get proactively notified when new data shows up.
With those options, we have a cursor that stays open indefinitely,
and continues to push notifications of insertions as they happen.
But there’s still one problem: where do we start?
None of these options has fundamentally changed how queries work
So we’ll still start at the beginning of time, every time
when what we actually want is to never show historical data,
and only show events that happen after our script starts
In order to do that, we’re going to have to explore some fields in the oplog we glossed over earlier
Specifically, the “ts”, or timestamp, field
The timestamp uses BSON’s, well, timestamp type
If you’ve ever read the docs about types in MongoDB, you might remember the timestamp type
from its note that timestamps are for internal MongoDB use only
Fortunately, we’re mucking around with MongoDB internals
When writing the oplog, the MongoDB primary assigns a timestamp to each operation
These timestamps are constructed to be monotonically increasing, so that entries in the oplog are in timestamp order.
The timestamp type is broken into two pieces...
...a UNIX timestamp, which ticks up once every second...
...and an “ordinal”, which resets to 1 at the beginning of the second,
and is incremented for every operation within that second.
This results in a timestamp that
(a) is monotonically increasing, and
(b) uniquely identifies a specific op
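As a toy model of that ordering (plain [seconds, ordinal] arrays standing in for real BSON timestamps, which compare the same way):

```ruby
# Model each oplog timestamp as [unix_seconds, ordinal].
# Ruby's Array#<=> compares element by element, which matches the
# oplog's ordering: later seconds win; within one second, the
# ordinal breaks ties.
a = [1401000000, 1]
b = [1401000000, 2]   # same second, next operation
c = [1401000001, 1]   # next second; the ordinal resets to 1

puts [b, c, a].sort == [a, b, c]  # => true
```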
Because it is, effectively, both an ordering and a unique identifier,
...we can use the timestamp as a placeholder
What I’ve done here is find the last entry in the oplog,
grab the timestamp from that entry,
and then only show oplog entries greater than that timestamp.
In general, instead of starting at the end of the oplog,
you might want to have your program store the timestamp for the last oplog entry it looked at,
and start from there next time.
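A minimal sketch of that checkpointing idea — the file name and the seconds/ordinal encoding here are my own, not anything from the talk:

```ruby
require 'json'

CHECKPOINT_FILE = 'oplog_checkpoint.json'  # hypothetical location

# Persist the timestamp of the last oplog entry we processed.
def save_checkpoint(seconds, ordinal)
  File.write(CHECKPOINT_FILE,
             JSON.dump('seconds' => seconds, 'ordinal' => ordinal))
end

# On startup, load the checkpoint; nil means "first run", in which
# case we'd fall back to starting from the end of the oplog.
def load_checkpoint
  return nil unless File.exist?(CHECKPOINT_FILE)
  JSON.parse(File.read(CHECKPOINT_FILE))
end

save_checkpoint(1401000000, 7)
puts load_checkpoint['seconds']  # => 1401000000
```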
But unfortunately, this code won’t work very well on a database with a long oplog - it’ll just hang
Because the oplog has no indexes, it has to do a full collection scan to find records that match our “$gt” query
In order to fix this, we need to add our last magic query option...
The “oplog replay” flag is something of a special case for, well, oplog replays
It tells the query planner to assume that entries are in oplog order,
and instead of scanning the whole collection, it starts at the end and scans backwards
until it finds the beginning of the range you were looking for
It makes the oplog scan much more efficient
(especially since oplog tailers generally start from near the end)
Once we put everything together,
we have a whole program that can tail the oplog and print when new records are inserted.
Again, this code isn’t really production ready. There are a bunch of things you’d want to do
- You might want to handle failures
- You might want to store progress so that you could resume rather than always skipping to the end.
But this is really about what all oplog-driven applications look like.
The only thing that really changes from app to app is...
...this bit - the actual action you take in response to an oplog entry.
The implementation here is about as simple as I could imagine.
I think the only thing simpler would be to do nothing.
But what if we allow ourselves some more complexity?
Can we solve more interesting problems?
Can I ask more leading questions?
As Stripe has grown, we’ve found that there are certain usage patterns that MongoDB is less well-suited for
Bulk analytics that you want to express with joins and aggregation is the most obvious use case.
While a lot of our MongoDB-driven applications depend on the reliability and uptime of replica sets and failovers,
analytics applications don’t usually have the same requirements.
We found ourselves saying, “what if we wanted all of our data from MongoDB, but in PostgreSQL instead”
Or HBase
Or ElasticSearch
In general, as long as you can tolerate some slight lag,
having multiple copies of your data in different datastores lets you trade off the advantages and disadvantages
depending on the needs of your application
So let’s say we want to get our entire MongoDB database into PostgreSQL.
How would we do that?
We could do an ETL process with nightly imports.
But you already know that’s not what this talk is about.
Instead, we’re going to follow the oplog,
and every time there’s a change in MongoDB,
we’ll make an equivalent change in Postgres.
We’ll start with the trigger code from the last example,
but we’ll clean up some of the stuff that’s not needed here...
...like the constraints on collection or op type.
And we are going to need some more space,
so I’m going to go ahead and clear a lot of this code out of the way
In every iteration of this loop, we get passed some change to the MongoDB database.
We want to make a corresponding change to our Postgres database.
Let’s look at inserts first, since those are simple...
When we get an insert in MongoDB…
...we do an insert in Postgres
It might look like this
There are a bunch of assumptions here:
- All data types are scalars (there are no embedded objects or arrays)
- Your schema is consistent (all records have the same keys, and all values for a key are of the same type)
- Someone figured out that schema and pre-created the Postgres table.
All of these are simplifying assumptions, but they’re not required.
This code is also grossly negligent about escaping. Like, it doesn’t do any
Please don’t write code that doesn’t do any escaping.
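One safer pattern is to build the statement with placeholders and ship the values separately, so the driver does the quoting. This is just a sketch — real code would hand the result to something like the pg gem's exec_params, and would still need to whitelist or quote the table and column names, since placeholders can't cover identifiers:

```ruby
# Build an INSERT with $1, $2, ... placeholders; the values never
# get spliced into the SQL string itself.
def insert_with_placeholders(table, doc)
  columns      = doc.keys.join(', ')
  placeholders = (1..doc.size).map { |i| "$#{i}" }.join(', ')
  ["INSERT INTO #{table} (#{columns}) VALUES (#{placeholders})",
   doc.values]
end

sql, params = insert_with_placeholders('tasks', '_id' => 1, 'a' => "O'Brien")
puts sql             # => INSERT INTO tasks (_id, a) VALUES ($1, $2)
puts params.inspect  # => [1, "O'Brien"]
```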
And don’t use this code - there’s better code coming up soon.
In any case, we can handle inserts.
What do other event types look like?
Here’s a new oplog type we’ve never seen before.
It’s a delete op, as evidenced by the letter “d”
Unlike insert ops, which included the full record being inserted,
delete ops include the _id of the object to delete.
Now, you might think this looks like a query
and in some sense it is
But MongoDB always uses the _id field in updates and deletes,
so even if you express a delete with a complex query,
MongoDB will record it in the oplog by the _id of the record that it finds.
This is important for idempotency - a complex query might not always match the same record if applied multiple times
But an _id query will.
Using that knowledge, we can write the logic to handle deletions.
This means that inserts and deletes will be propagated from our MongoDB database to our Postgres database.
That’s 2 of the 3 core data operations in the oplog.
Which finally brings us to updates.
They’re definitely the most complicated of the 3 basic operations,
mostly because they have the most variation.
First off, update ops have a field we haven’t seen before...
...the “o2” field, or the “selector”
This determines which record to update.
Like deletes, it always finds the record to modify by _id for idempotency.
The “o” field, which previously contained the record to insert, or the query to delete
now contains the updates to apply.
It looks like something you could pass to the 2nd argument of MongoDB’s update function.
If you’re saving a whole object, the object shows up in the “o” field.
If you’re instead doing partial updates with $set or $unset, you’ll see those fields in the “o” property.
If you use $inc, $push, $pop, or any other special operations, those will *also* show up as $set and $unset
Again, that’s to ensure the oplog is idempotent - applying the operation multiple times should have the same effect.
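A toy demonstration of why that rewriting matters, with plain hashes standing in for documents: replaying a $set-style entry is harmless, while replaying the original $inc would not be.

```ruby
# $set-style update: assign fields to fixed values.
def apply_set(doc, fields)
  doc.merge(fields)
end

# $inc-style update: add to whatever the current value is.
def apply_inc(doc, fields)
  fields.reduce(doc) { |d, (k, v)| d.merge(k => d.fetch(k, 0) + v) }
end

doc = { 'count' => 4 }

# The app asked for $inc => {'count' => 1}; the oplog records the
# result as $set => {'count' => 5}. Replaying that twice is a no-op:
replayed = apply_set(apply_set(doc, 'count' => 5), 'count' => 5)
puts replayed['count']  # => 5

# Replaying the raw $inc twice would double-count:
puts apply_inc(apply_inc(doc, 'count' => 1), 'count' => 1)['count']  # => 6
```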
This turns out to make our lives a lot easier as application developers, too.
Here’s a quick sketch of what we might do with an update op in order to apply it to our Postgres database
Again, this cuts some pretty significant corners, but hopefully it gets you the broad strokes.
Fundamentally, it shows how you can take a MongoDB oplog entry
and transform it into an operation on a different datastore.
Unfortunately, I have to shrink the font more to fit everything on here
So again, I’m sorry if you’re at the back and this is hard to read.
But still - this is 20 lines of code, and with it we have something that can
(at least in theory)
apply changes from a MongoDB database to a Postgres database in real time.
Now, if this is actually something you want to do...
...you should check out MoSQL, which we open-sourced last year.
It’s all of the logic I’ve described in the last few slides
plus, well, correctness
plus robustness
plus some configuration hooks
plus support for non-trivial schemas
But, again, the cool thing here isn’t just that you can take MongoDB and replicate it into Postgres
It’s that you can take MongoDB and replicate it into *anything*
Say, for instance...
HBase.
We’ve also open-sourced a project called zerowing
(so-called because “all your HBase are belong to us” - I’m sorry, it’s not my fault)
Zerowing uses the oplog to mirror MongoDB into HBase.
We also do the same thing with ElasticSearch,
and even though we haven’t open-sourced any of that code,
the concepts are all identical to what we’ve been doing.
The idea that you can transform your data
...that you can inject your own logic into the stream of changes to your database
is, to me, really powerful.
And to show how powerful it can be, I want to tell one last story about a time at Stripe where I used the oplog for...
...recovering from a bit of an operational disaster.
This involved the monster system I talked about earlier.
Like I mentioned, monster as we use it is a bit more complicated than the simple trigger system we designed.
There was a component that looked something like this.
We had a collection with a list of tasks,
and workers which would grab unfinished tasks (tasks with the “finished” field unset)
and mark them as finished when they were done.
For various reasons, we wanted to keep those task records around for a bit,
but clean them up eventually.
We had a sweeper that would periodically delete sufficiently old records
I rolled out a small change to that cleanup sweeper
that looked something like this.
It seems simple enough -
every 10 seconds, find and delete all tasks that finished more than 30 seconds ago.
Unfortunately, this code has a bug, and it took me 10 hours to spot.
Do you see it?
As it turns out, in MongoDB’s sorting order,
“null” is less than all integers...including the integer timestamp of “30 seconds ago”
So unfortunately, that sweeper went on a somewhat Cybermen-esque deletion spree
deleting *all* tasks, not just those that had been completed.
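The bug is easy to reproduce in a toy in-memory model. The comparison helper below encodes BSON's ordering rule that null sorts before all numbers; the collection contents are made up:

```ruby
# In BSON's comparison order, null (nil here) is less than every
# number, so a missing 'finished' field matches "finished < cutoff".
def bson_less_than(value, bound)
  return true if value.nil?
  value < bound
end

now = 1401000000
tasks = [
  { '_id' => 1, 'finished' => now - 60 },  # finished long ago: delete
  { '_id' => 2, 'finished' => now - 5 },   # just finished: keep
  { '_id' => 3 }                           # still UNFINISHED: should keep!
]

cutoff = now - 30
survivors = tasks.reject { |t| bson_less_than(t['finished'], cutoff) }

# The unfinished task (_id 3) gets swept along with the old one.
puts survivors.map { |t| t['_id'] }.inspect  # => [2]
```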
Now, as I mentioned, we use monster for a huge number of our batch processes.
Which meant that this bug made for...not a great day.
Fortunately, though, we had hourly backups of our database going back for a week.
How is that helpful?
Well, you mostly just need to think about replication a little differently.
Instead of thinking about primaries and secondaries, let’s think about new backups and old backups
The new backup has a bunch of changes that the old backup doesn’t
The new backup is like a primary, while the older backup is like a secondary.
In particular, the older backup doesn’t have the deletions.
So if we could “replicate” from the new backup to the old backup,
but, before applying any deletions, see whether we actually *wanted* to delete the record
we can find records that shouldn’t have been deleted.
Now, in order for this to work, we have to have the entire oplog for the outage
By default, MongoDB allocates 5% of your disk space to the oplog.
In our experience, that’s a lot of oplog (although every once in a while we’ll bump it)
I grabbed a snapshot from that time period to see how much time the oplog covered.
Here’s the bit that matters:
A single backup has about 4 days of oplog history
But we caught the bug in about 10 hours.
That’s more than enough time!
Here’s a slightly simplified version of the code I used to clean up.
We’re going to scan through events in the newer oplog (starting a little before the last oplog entry in the older snapshot),
and for each oplog entry,
if it’s a delete from the collection we care about,
we load the record from the *old* database
(which hasn’t “caught up” to the deletion yet)
and figure out whether that record was deleted incorrectly.
If it was, we save it somewhere
Either way, in order to keep the old database consistent,
you then need to apply that change (even if it was wrong) to the old database
To do that, we’re going to introduce one new command...
...the “applyOps” admin command
It does basically what it says on the tin:
you give it a list of oplog entries, and it applies them, in order, to its local copy of the database.
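As a toy model of what that amounts to — a hash of _id => document per namespace standing in for the database, with simplified insert and delete entries:

```ruby
# Apply simplified oplog entries, in order, to an in-memory "database".
def apply_ops(db, ops)
  ops.each do |op|
    collection = (db[op['ns']] ||= {})
    case op['op']
    when 'i' then collection[op['o']['_id']] = op['o']
    when 'd' then collection.delete(op['o']['_id'])
    end
  end
  db
end

db = {}
apply_ops(db, [
  { 'op' => 'i', 'ns' => 'monsterdb.tasks', 'o' => { '_id' => 1 } },
  { 'op' => 'i', 'ns' => 'monsterdb.tasks', 'o' => { '_id' => 2 } },
  { 'op' => 'd', 'ns' => 'monsterdb.tasks', 'o' => { '_id' => 1 } }
])
puts db['monsterdb.tasks'].keys.inspect  # => [2]
```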
So, again, this took remarkably little code - about 14 lines
But combining that with some good backups,
we were able to recover from what would have been a huge disaster
in about a weekend
In summary, the oplog was built to support replication
But because of the way it’s exposed to users, it’s not just limited to that
At Stripe, we’ve used the oplog in many of our internal applications, including...
...trigger systems...
...data transformations and replication into alternative data stores...
...and even data recovery.
I put this talk together because I think that the oplog is underappreciated
I hope that some of the examples will give other people ideas for new and exciting things to do with the oplog.
If I can share any other information, or if you have applications for the oplog,
let me know! I always like geeking out over this stuff.