MongoDB Aggregation Framework

Quick Overview of
Document-oriented
Schemaless
JSON-style documents
Rich Queries
Scales Horizontally
db.users.find({
last_name: 'Smith',
age: {$gt : 10}
});
SELECT * FROM users WHERE
last_name=‘Smith’ AND age > 10;

Computing Aggregations in
Databases
SQL-based
RDBMS
JOIN
GROUP BY
AVG(),
COUNT(),
SUM(), FIRST(),
LAST(),
etc.
MongoDB 2.0
MapReduce
MongoDB 2.2+
MapReduce
Aggregation Framework

MapReduce
var map = function()
{
...
emit(key, val);
}
var reduce = function(key, vals)
{
...
return resultVal;
}
Data
Map()
emit(k,v)
Sort(k)
Group(k)
Reduce(k,values)
k,v
Finalize(k,v)
k,v
MongoDB
map iterates on
documents
Document is $this
1 at time per shard
Input matches output
Can run multiple times

What’s wrong with just using
MapReduce?
Map/Reduce is very
powerful, but often overkill
Lots of users relying on it
for simple aggregation tasks
•
•

What’s wrong with just using
MapReduce?
Easy to screw up JavaScript
Debugging a M/R job sucks
Writing more JS for simple tasks should not be necessary
•
•
•
(ಠ︿ಠ)

Aggregation
Framework
Declarative (no need to write JS)
Implemented directly in C++
Expression Evaluation
Return computed values
Framework: We can extend it with new
ops
•
•
•
•
•

Input
Data
(collection)
Filter
Project
Unwind
Group
Sort
Limit
Result
(document)

db.article.aggregate(
{ $project : {author : 1,tags : 1}},
{ $unwind : "$tags" },
{ $group : {_id : “$tags”,
authors:{ $addToSet:"$author"}}
}
);
An aggregation command looks like:

{ $project : {author : 1, tags : 1}},
{ $group : {
_id : “$tags”,
authors : { $addToSet:"$author"}
}}
);
New Helper
Method:
.aggregate()
Operator
pipeline
db.runCommand({
aggregate : "article",
pipeline : [ {$op1, $op2, ...} ]
}

{
"result" : [
{ "_id" : "art", "authors" : [ "bill", "bob" ] },
{ "_id" : "sports", "authors" : [ "jane", "bob" ] },
{ "_id" : "food", "authors" : [ "jane", "bob" ] },
{ "_id" : "science", "authors" : [ "jane", "bill", "bob" ] }
],
"ok" : 1
}
Output Document Looks like this:
result: array of pipeline
output
ok: 1 for success, 0
otherwise

Pipeline
Input to the start of the pipeline is a collection
Series of operators - each one filters or transforms its
input
Passes output data to next operator in the pipeline
Output of the pipeline is the result document
•
•
•
•
ps -ax | tee processes.txt | more
Kind of like UNIX:

Let’s do:
1. Tour of the pipeline
operators
2. A couple examples based on
common SQL aggregation tasks
$match
$unwind
$group
$project
$skip $limit $sort

filters documents from pipeline with a query predicate
filtered with:
{$match: {author:”bob”}}
$match
{author: "bob", pageViews:5, title:"Lorem Ipsum..."}
{author: "bill", pageViews:3, title:"dolor sit amet..."}
{author: "joe", pageViews:52, title:"consectetur adipi..."}
{author: "jane", pageViews:51, title:"sed diam..."}
{author: "bob", pageViews:14, title:"magna aliquam..."}
{author: "bob", pageViews:53, title:"claritas est..."}
filtered with:
{$match: {pageViews:{$gt:50}}
{author:"bob",pageViews:5,title:"Lorem Ipsum..."}
{author:"bob",pageViews:14,title:"magna aliquam..."}
{author:"bob",pageViews:53,title:"claritas est..."}
{author: "joe", pageViews:52, title:"consectetur adipiscing..."}
{author: "jane", pageViews:51, title:"sed diam..."}
{author: "bob", pageViews:53, title:"claritas est..."}
Input:

$unwind
{
"_id" : ObjectId("4f...146"),
"author" : "bob",
"tags" :[ "fun","good","awesome"]
}
explode the “tags” array with:
{ $unwind : ”$tags” }
{ _id : ObjectId("4f...146"), author : "bob", tags:"fun"},
{ _id : ObjectId("4f...146"), author : "bob", tags:"good"},
{ _id : ObjectId("4f...146"), author : "bob", tags:"awesome"}
produces output:
Produce a new document for
each value in an input array

Bucket a subset of docs together,
calculate an aggregated output doc from the bucket
$sum
$max, $min
$avg
$first, $last
$addToSet
$push
{ $group : {
_id : "$author",
viewsPerAuthor : { $sum :
"$pageViews" }
}
}
);
$group
Output
Calculation
Operators:

{ $group : {
_id : "$author",
viewsPerAuthor : { $sum : "$pageViews" }
}
}
);
_id: selects a field to use as
bucket key for grouping
Output field name Operation used to calculate the
output value
($sum, $max, $avg, etc.)
$group (cont’d)
dot notation (nested fields)
a constant
a multi-key expression inside
{...}
•
•
•
also allowed here:

An example with $match and $group
SELECT SUM(price) FROM orders
WHERE customer_id = 4;
MongoDB:
SQL:
db.orders.aggregate(
{$match : {“$customer_id” : 4}},
{$group : { _id : null,
total: {$sum : “price”}})
English: Find the sum of all prices of the
orders placed by customer #4

An example with $unwind and $group
MongoDB:
SQL:
English:
db.posts.aggregate(
{ $group : {
_id : “$tags”,
authors : { $addToSet : "$author" }
}}
);
For all tags used in blog posts, produce a list of
authors that have posted under each tag
SELECT tag, author FROM post_tags LEFT
JOIN posts ON post_tags.post_id =
posts.id GROUP BY tag, author;

More operators - Controlling Pipeline Input
$skip
$limit
$sort
Similar to:
.skip()
.limit()
.sort()
in a regular Mongo query

$sort
specified the same way as index keys:
{ $sort : { name : 1, age: -1 } }
Must be used in order to take
advantage of $first/$last with
$group.
order input documents

$limit
limit the number of input documents
{$limit : 5}
$skip
skips over documents
{$skip : 5}

$project
Use for:
Add, Remove, Pull up, Push down, Rename
Fields
Building computed fields
Reshape a document

$project
(cont’d)
Include or exclude fields
{$project :
{ title : 1,
author : 1} }
Only pass on fields
“title” and “author”
{$project : { comments : 0}
Exclude
“comments” field,
keep everything
else

Moving + Renaming fields
{$project :
{ page_views : “$pageViews”,
catName : “$category.name”,
info : {
published : “$ctime”,
update : “$mtime”
}
}
}
Rename page_views to pageViews
Take nested field
“category.name”, move
it into top-level field
called “catName”
Populate a new
sub-document
into the output
$project
(cont’d)

{ $project : {
name : 1,
age_fixed : { $add:["$age", 2] }
}}
);
Building a Computed Field
Output
(computed field) Operands
Expression
$project
(cont’d)

Lots of Available
Expressions
$project
(cont’d)
Numeric $add $sub $mod $divide $multiply
Logical $eq $lte/$lt $gte/$gt $and $not $or $eq
Dates
$dayOfMonth $dayOfYear $dayOfWeek $second $minute
$hour $week $month $isoDate
Strings $substr $add $toLower $toUpper $strcasecmp

Example: $sort → $limit → $project→
$group
MongoDB:
SQL:
English: Of the most recent 1000 blog posts, how many
were posted within each calendar year?
SELECT YEAR(pub_time) as pub_year,
COUNT(*) FROM
(SELECT pub_time FROM posts ORDER BY
pub_time desc)
GROUP BY pub_year;
db.test.aggregate(
{$sort : {pub_time: -1}},
{$limit : 1000},
{$project:{pub_year:{$year:["$pub_time"]}}},
{$group: {_id:"$pub_year", num_year:{$sum:1}}}
)

Some Usage Notes
In BSON, order matters - so computed
fields always show up after regular fields
We use $ in front of field names to
distinguish fields from string literals
in expressions “$name”
“name”
vs.

Some Usage Notes
Use a $match,$sort and $limit
first in pipeline if possible
Cumulative Operators $group:
be aware of memory usage
Use $project to discard unneeded fields
Remember the 16MB output limit

Aggregation vs.
MapReduce
Framework is geared towards counting/accumulating
If you need something more exotic, use
MapReduce
No 16MB constraint on output size with
MapReduce
JS in M/R is not limited to any fixed set of expressions
•
•
•
•

thanks! ✌(-‿-)✌
questions?
$$$ BTW: we are hiring!
http://10gen.com/jobs $$$
@mpobrien
github.com/mpobrien
hit me up:

MongoDB Aggregation Framework

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (17)

Similar to MongoDB Aggregation Framework

Similar to MongoDB Aggregation Framework (20)

More from Caserta

More from Caserta (20)

Recently uploaded

Recently uploaded (20)

MongoDB Aggregation Framework