SlideShare ist ein Scribd-Unternehmen logo
1 von 51
Exploring the Aggregation Framework
Jay Runkel
Solutions Architect
jay.runkel@mongodb.com
@jayrunkel
Agenda
1. Analytics in MongoDB?
2. Aggregation Framework
3. Aggregation Framework in Action
– US Census Data
4. Aggregation Framework Options
Analytics in MongoDB?
Create
Read
Update
Delete
Analytics
?
Group
Count
Derive Values
Filter
Average
Sort
For Example: US Census Data
• Census data from 1990, 2000, 2010
• Question:
Which US Division has the fastest growing population density?
– We only want to include data states with more than 1M people
– We only want to include divisions larger than 100K square miles
Division = a group of US States
Population density = Area of division/# of people
Data is provided at the state level
US Regions and Divisions
How would we solve this in SQL?
• SELECT GROUP BY HAVING
What About MongoDB?
Aggregation Framework
What is an Aggregation Pipeline?
• A Series of Document Transformations
– Executed in stages
– Original input is a collection
– Output as a cursor or a collection
• Rich Library of Functions
– Filter, compute, group, and summarize data
– Output of one stage sent to input of next
– Operations executed in sequential order
Aggregation Pipeline
Pipeline Operators
• $match
Filter documents
• $project
Reshape documents
• $group
Summarize documents
• $unwind
Expand documents
• $sort
Order documents
• $limit/$skip
Paginate documents
• $redact
Restrict documents
• $geoNear
Proximity sort documents
• $let,$map
Define variables
Aggregation Framework in Action
(let’s play with the census data)
MongoDB State Collection
• Document For Each State
• Name
• Region
• Division
• Census Data For 1990, 2000, 2010
– Population
– Housing Units
– Occupied Housing Units
• Census Data is an array with three subdocuments
Document Model
{ "_id" : ObjectId("54e23c7b28099359f5661525"),
"name" : "California",
"region" : "West",
"data" : [
{"totalPop" : 33871648,
"totalHouse" : 12214549,
"occHouse" : 11502870,
"year" : 2000},
{"totalPop" : 37253956,
"totalHouse" : 13680081,
"occHouse" : 12577498,
"year" : 2010},
{"totalPop" : 29760021,
"totalHouse" : 11182882,
"occHouse" : 29008161,
"year" : 1990}
],
…
}
Count, Distinct
Total US Area
db.cData.aggregate([
{"$group" : {"_id" : null,
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"}}}])
$group
• Group documents by value
– Field reference, object, constant
– Other output fields are computed
• $max, $min, $avg, $sum
• $addToSet, $push
• $first, $last
– Processes all data in memory by
default
Area By Region
db.cData.aggregate([
{"$group" : {"_id" : "$region",
"totalArea" : {$sum : "$areaM"},
"avgArea" : {$avg : "$areaM"},
"numStates" : {$sum : 1},
"states" : {$push : "$name"}}}
])
Calculating Average StateArea By Region
{ $group: {
_id: "$region",
avgAreaM: {$avg:
”$areaM" }
}}
{
_id: ”North East",
avgAreaM: 154
}
{
_id: “West",
avgAreaM: 300
}
{
state: ”New York",
areaM: 218,
region: “North East"
}
{
state: ”New Jersey",
areaM: 90,
region: “North East”
}
{
state: “California",
areaM: 300,
region: “West"
}
Calculating Total Area and State Count
{ $group: {
_id: "$region",
totArea: {$sum:
”$areaM" },
sCount : {$sum : 1}}}
{
_id: ”North East",
totArea: 308
sCount: 2}
{
_id: “West",
totArea: 300,
sCount: 1}
{
state: ”New York",
areaM: 218,
region: “North East"
}
{
state: ”New Jersey",
areaM: 90,
region: “North East”
}
{
state: “California",
area: 300,
region: “West"
}
Total US Population By Year
db.cData.aggregate(
[{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {$sum : "$data.totalPop"}}},
{$sort : {"totalPop" : 1}}
])
$unwind
• Operate on an array field
– Create documents from array elements
• Array replaced by element value
• Missing/empty fields → no output
• Non-array fields → error
– Pipe to $group to aggregate
$unwind
{ $unwind: $census }
{ state: “New York,
census: 1990}
{
state: ”New York",
census: [1990, 2000,
2010]
}
{
state: ”New Jersey",
census: [1990, 2000]
}
{
state: “California",
census: [1980, 1990,
2000, 2010]
}
{
state: ”Delaware",
census: [1990, 2000]
}
{ state: “New York,
census: 2000}
{ state: “New York,
census: 2010}
{ state: “New Jersey,
census: 1990}
{ state: “New Jersey,
census: 2000}
…
Southern State Population By Year
db.cData.aggregate(
[{$match : {"region" : "South"}},
{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop” : {"$sum” :
"$data.totalPop"}}}])
$match
• Filter documents
– Uses existing query syntax
– No $where (server side Javascript)
$match
{ $match:
{ “region” : “West” }
}
{
state: ”New York",
areaM: 218,
region: “North East"
}
{
state: ”Oregon",
areaM: 245,
region: “West”
}
{
state: “California",
area: 300,
region: “West"
}
{
state: ”Oregon",
areaM: 245,
region: “West”
}
{
state: “California",
area: 300,
region: “West"
}
Population Delta By State from 1990 to 2010
db.cData.aggregate(
[{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"}}},
{$project : {"_id" : 0,
"name" : "$_id",
"delta" : {"$subtract" :
["$pop2010", "$pop1990"]},
"pop1990" : 1,
"pop2010” : 1}
}]
)
Population Delta By State from 1990 to 2010
db.cData.aggregate(
[{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"}}},
{$project : {"_id" : 0,
"name" : "$_id",
"delta" : {"$subtract" :
["$pop2010", "$pop1990"]},
"pop1990" : 1,
"pop2010” : 1}
}]
)
$sort, $limit, $skip
• Sort documents by one or more fields
– Same order syntax as cursors
– Waits for earlier pipeline operator to return
– In-memory unless early and indexed
• Limit and skip follow cursor behavior
Population Delta By State from 1990 to 2010
db.cData.aggregate(
[{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"}}},
{$project : {"_id" : 0,
"name" : "$_id",
"delta" : {"$subtract" :
["$pop2010", "$pop1990"]},
"pop1990" : 1,
"pop2010” : 1}
}]
)
$first, $last
• Collection operations like $push and $addToSet
• Must be used in $group
• $first and $last determined by document order
• Typically used with $sort to ensure ordering is
known
Population Delta By State from 1990 to 2010
db.cData.aggregate(
[{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"}}},
{$project : {"_id" : 0,
"name" : "$_id",
"delta" : {"$subtract" :
["$pop2010", "$pop1990"]},
"pop1990" : 1,
"pop2010” : 1}
}]
)
$project
• Reshape Documents
– Include, exclude or rename fields
– Inject computed fields
– Create sub-document fields
Including and Excluding Fields
{ $project:
{ “_id” : 0,
“pop1990” : 1,
“pop2010” : 1
}
{
"_id" : "Virginia”,
"pop1990" : 453588,
"pop2010" : 3725789
}
{
"_id" : "South Dakota",
"pop1990" : 453588,
"pop2010" : 3725789
} {
"pop1990" : 453588,
"pop2010" : 3725789
}
{
"pop1990" : 453588,
"pop2010" : 3725789
}
{
”name" : “South Dakota”,
”delta" : 118176
}
Renaming and Computing Fields
{ $project:
{ “_id” : 0,
“pop1990” : 0,
“pop2010” : 0,
“name” : “$_id”,
"delta" :
{"$subtract" :
["$pop2010",
"$pop1990"]}}
}
{
"_id" : "Virginia”,
"pop1990" : 6187358,
"pop2010" : 8001024
}
{
"_id" : "South Dakota",
"pop1990" : 696004,
"pop2010" : 814180
}
{
”name" : “Virginia”,
”delta" : 1813666
}
Compare number of people living within
500KM of Memphis, TN in 1990, 2000, 2010
Compare number of people living within
500KM of Memphis, TN in 1990, 2000, 2010
db.cData.aggregate([
{$geoNear : {
"near" : {"type" : "Point", "coordinates" : [90, 35]},
“distanceField” : "dist.calculated",
“maxDistance” : 500000,
“includeLocs” : "dist.location",
“spherical” : true }},
{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"},
"states" : {"$addToSet" : "$name"}}},
{$sort : {"_id" : 1}}
])
$geoNear
• Order/Filter Documents by Location
– Requires a geospatial index
– Output includes physical distance
– Must be first aggregation stage
{
"_id" : ”Tennessee",
"pop1990" : 4877185,
"pop2010" : 6346105,
“center” :
{“type” : “Point”,
“coordinates” :
[86.6, 37.8]}
}
{
"_id" : "Virginia”,
"pop1990" : 6187358,
"pop2010" : 8001024,
“center” :
{“type” : “Point”,
“coordinates” :
[78.6, 37.5]}
}
$geoNear
{$geoNear : {
"near”: {"type”: "Point",
"coordinates”:
[90, 35]},
maxDistance : 500000,
spherical : true }}
{
"_id" : ”Tennessee",
"pop1990" : 4877185,
"pop2010" : 6346105,
“center” :
{“type” : “Point”,
“coordinates” :
[86.6, 37.8]}
}
What if I want to save the results to a
collection?
db.cData.aggregate([
{$geoNear : {
"near" : {"type" : "Point", "coordinates" : [90, 35]},
“distanceField” : "dist.calculated",
“maxDistance” : 500000,
“includeLocs” : "dist.location",
“spherical” : true }},
{$unwind : "$data"},
{$group : {"_id" : "$data.year",
"totalPop" : {"$sum" : "$data.totalPop"},
"states" : {"$addToSet" : "$name"}}},
{$sort : {"_id" : 1}},
{$out : “peopleNearMemphis”}
])
$out
db.cData.aggregate([<pipeline stages>,
{“$out” : “resultsCollection”}])
• Save aggregation results to a new collection
• New aggregation uses:
• Transform documents - ETL
Back To The Original Question
• Which US Division has the fastest growing population density?
– We only want to include data states with more than 1M people
– We only want to include divisions larger than 100K square miles
Division with Fastest Growing Pop Density
db.cData.aggregate(
[{$match : {"data.totalPop" : {"$gt" : 1000000}}},
{$unwind : "$data"},
{$sort : {"data.year" : 1}},
{$group : {"_id" : "$name",
"pop1990" : {"$first" : "$data.totalPop"},
"pop2010" : {"$last" : "$data.totalPop"},
"areaM" : {"$first" : "$areaM"},
"division" : {"$first" : "$division"}}},
{$group : {"_id" : "$division",
"totalPop1990" : {"$sum" : "$pop1990"},
"totalPop2010" : {"$sum" : "$pop2010"},
"totalAreaM" : {"$sum" : "$areaM"}}},
{$match : {"totalAreaM" : {"$gt" : 100000}}},
{$project : {"_id" : 0,
"division" : "$_id",
"density1990" : {"$divide" : ["$totalPop1990", "$totalAreaM"]},
"density2010" : {"$divide" : ["$totalPop2010", "$totalAreaM"]},
"denDelta" : {"$subtract" : [{"$divide" : ["$totalPop2010",
"$totalAreaM"]},
{"$divide" : ["$totalPop1990”,
"$totalAreaM"]}]},
"totalAreaM" : 1,
"totalPop1990" : 1,
"totalPop2010" : 1}},
{$sort : {"denDelta" : -1}}])
Aggregate Options
Aggregate options
db.cData.aggregate([<pipeline stages>],
{‘explain’ : false
'allowDiskUse' : true,
'cursor' : {'batchSize' : 5}})
explain – similar to find().explain()
allowDiskUse – enable use of disk to store intermediate
results
cursor – specify the size of the initial result
Aggregation and Sharding
Sharding
• Workload split between shards
– Shards execute pipeline up
to a point
– Primary shard merges
cursors and continues
processing*
– Use explain to analyze
pipeline split
– Early $match may excuse
shards
– Potential CPU and memory
implications for primary
shard host
*Prior to v2.6 second stage pipeline processing was
done by mongos
Summary
Analytics in MongoDB?
Create
Read
Update
Deletet
Analytics
?
Group
Count
Derive Values
Filter
Average
Sort
YES!
Framework Use Cases
• Basic aggregation queries
• Ad-hoc reporting
• Real-time analytics
• Visualizing and reshaping data
Questions?
jay.runkel@mongodb.com
@jayrunkel

Weitere ähnliche Inhalte

Was ist angesagt?

Aggregation Framework
Aggregation FrameworkAggregation Framework
Aggregation Framework
MongoDB
 
Aggregation in MongoDB
Aggregation in MongoDBAggregation in MongoDB
Aggregation in MongoDB
Kishor Parkhe
 
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
MongoDB
 
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)""Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
MongoDB
 

Was ist angesagt? (20)

Aggregation Framework
Aggregation FrameworkAggregation Framework
Aggregation Framework
 
Aggregation in MongoDB
Aggregation in MongoDBAggregation in MongoDB
Aggregation in MongoDB
 
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
MongoDB Analytics: Learn Aggregation by Example - Exploratory Analytics and V...
 
MongoDB World 2016 : Advanced Aggregation
MongoDB World 2016 : Advanced AggregationMongoDB World 2016 : Advanced Aggregation
MongoDB World 2016 : Advanced Aggregation
 
Aggregation Framework in MongoDB Overview Part-1
Aggregation Framework in MongoDB Overview Part-1Aggregation Framework in MongoDB Overview Part-1
Aggregation Framework in MongoDB Overview Part-1
 
Mongodb Aggregation Pipeline
Mongodb Aggregation PipelineMongodb Aggregation Pipeline
Mongodb Aggregation Pipeline
 
Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework Beyond the Basics 2: Aggregation Framework
Beyond the Basics 2: Aggregation Framework
 
MongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB Europe 2016 - Graph Operations with MongoDBMongoDB Europe 2016 - Graph Operations with MongoDB
MongoDB Europe 2016 - Graph Operations with MongoDB
 
Analytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop ConnectorAnalytics with MongoDB Aggregation Framework and Hadoop Connector
Analytics with MongoDB Aggregation Framework and Hadoop Connector
 
Webinar: Working with Graph Data in MongoDB
Webinar: Working with Graph Data in MongoDBWebinar: Working with Graph Data in MongoDB
Webinar: Working with Graph Data in MongoDB
 
MongoDB - Aggregation Pipeline
MongoDB - Aggregation PipelineMongoDB - Aggregation Pipeline
MongoDB - Aggregation Pipeline
 
Joins and Other Aggregation Enhancements Coming in MongoDB 3.2
Joins and Other Aggregation Enhancements Coming in MongoDB 3.2Joins and Other Aggregation Enhancements Coming in MongoDB 3.2
Joins and Other Aggregation Enhancements Coming in MongoDB 3.2
 
MongoDB Aggregation
MongoDB Aggregation MongoDB Aggregation
MongoDB Aggregation
 
MongoDB Europe 2016 - Advanced MongoDB Aggregation Pipelines
MongoDB Europe 2016 - Advanced MongoDB Aggregation PipelinesMongoDB Europe 2016 - Advanced MongoDB Aggregation Pipelines
MongoDB Europe 2016 - Advanced MongoDB Aggregation Pipelines
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
MongoDB for Time Series Data Part 2: Analyzing Time Series Data Using the Agg...
 
Webinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation OptionsWebinar: Data Processing and Aggregation Options
Webinar: Data Processing and Aggregation Options
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)""Powerful Analysis with the Aggregation Pipeline (Tutorial)"
"Powerful Analysis with the Aggregation Pipeline (Tutorial)"
 
MongoDB Europe 2016 - Debugging MongoDB Performance
MongoDB Europe 2016 - Debugging MongoDB PerformanceMongoDB Europe 2016 - Debugging MongoDB Performance
MongoDB Europe 2016 - Debugging MongoDB Performance
 

Ähnlich wie Agg framework selectgroup feb2015 v2

Ähnlich wie Agg framework selectgroup feb2015 v2 (20)

MongoDB 3.2 - Analytics
MongoDB 3.2  - AnalyticsMongoDB 3.2  - Analytics
MongoDB 3.2 - Analytics
 
Starting out with MongoDB
Starting out with MongoDBStarting out with MongoDB
Starting out with MongoDB
 
Querying mongo db
Querying mongo dbQuerying mongo db
Querying mongo db
 
A Century Of Weather Data - Midwest.io
A Century Of Weather Data - Midwest.ioA Century Of Weather Data - Midwest.io
A Century Of Weather Data - Midwest.io
 
Schema Design by Chad Tindel, Solution Architect, 10gen
Schema Design  by Chad Tindel, Solution Architect, 10genSchema Design  by Chad Tindel, Solution Architect, 10gen
Schema Design by Chad Tindel, Solution Architect, 10gen
 
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right WayMongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
MongoDB Europe 2016 - ETL for Pros – Getting Data Into MongoDB The Right Way
 
Powerful Analysis with the Aggregation Pipeline
Powerful Analysis with the Aggregation PipelinePowerful Analysis with the Aggregation Pipeline
Powerful Analysis with the Aggregation Pipeline
 
Joins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation EnhancementsJoins and Other MongoDB 3.2 Aggregation Enhancements
Joins and Other MongoDB 3.2 Aggregation Enhancements
 
MongoDB Aggregations Indexing and Profiling
MongoDB Aggregations Indexing and ProfilingMongoDB Aggregations Indexing and Profiling
MongoDB Aggregations Indexing and Profiling
 
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018Couchbase Tutorial: Big data Open Source Systems: VLDB2018
Couchbase Tutorial: Big data Open Source Systems: VLDB2018
 
2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky2014 bigdatacamp asya_kamsky
2014 bigdatacamp asya_kamsky
 
Talk MongoDB - Amil
Talk MongoDB - AmilTalk MongoDB - Amil
Talk MongoDB - Amil
 
Query for json databases
Query for json databasesQuery for json databases
Query for json databases
 
MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB
MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDBMongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB
MongoDB.local DC 2018: Tutorial - Data Analytics with MongoDB
 
How to leverage what's new in MongoDB 3.6
How to leverage what's new in MongoDB 3.6How to leverage what's new in MongoDB 3.6
How to leverage what's new in MongoDB 3.6
 
Introduction to MongoDB for C# developers
Introduction to MongoDB for C# developersIntroduction to MongoDB for C# developers
Introduction to MongoDB for C# developers
 
Database api
Database apiDatabase api
Database api
 
Retail Reference Architecture Part 2: Real-Time, Geo Distributed Inventory
Retail Reference Architecture Part 2: Real-Time, Geo Distributed InventoryRetail Reference Architecture Part 2: Real-Time, Geo Distributed Inventory
Retail Reference Architecture Part 2: Real-Time, Geo Distributed Inventory
 
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
MongoDB .local San Francisco 2020: Aggregation Pipeline Power++
 
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
N1QL+GSI: Language and Performance Improvements in Couchbase 5.0 and 5.5
 

Mehr von MongoDB

Mehr von MongoDB (20)

MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
 
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
MongoDB SoCal 2020: Go on a Data Safari with MongoDB Charts!
 
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
MongoDB SoCal 2020: Using MongoDB Services in Kubernetes: Any Platform, Devel...
 
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDBMongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
MongoDB SoCal 2020: A Complete Methodology of Data Modeling for MongoDB
 
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
MongoDB SoCal 2020: From Pharmacist to Analyst: Leveraging MongoDB for Real-T...
 
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series DataMongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
MongoDB SoCal 2020: Best Practices for Working with IoT and Time-series Data
 
MongoDB SoCal 2020: MongoDB Atlas Jump Start
 MongoDB SoCal 2020: MongoDB Atlas Jump Start MongoDB SoCal 2020: MongoDB Atlas Jump Start
MongoDB SoCal 2020: MongoDB Atlas Jump Start
 
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
MongoDB .local San Francisco 2020: Powering the new age data demands [Infosys]
 
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
MongoDB .local San Francisco 2020: Using Client Side Encryption in MongoDB 4.2
 
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
MongoDB .local San Francisco 2020: Using MongoDB Services in Kubernetes: any ...
 
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
MongoDB .local San Francisco 2020: Go on a Data Safari with MongoDB Charts!
 
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your MindsetMongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
MongoDB .local San Francisco 2020: From SQL to NoSQL -- Changing Your Mindset
 
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas JumpstartMongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
MongoDB .local San Francisco 2020: MongoDB Atlas Jumpstart
 
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
MongoDB .local San Francisco 2020: Tips and Tricks++ for Querying and Indexin...
 
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
MongoDB .local San Francisco 2020: A Complete Methodology of Data Modeling fo...
 
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local San Francisco 2020: MongoDB Atlas Data Lake Technical Deep Dive
 
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & GolangMongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
MongoDB .local San Francisco 2020: Developing Alexa Skills with MongoDB & Golang
 
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
MongoDB .local Paris 2020: Realm : l'ingrédient secret pour de meilleures app...
 
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
MongoDB .local Paris 2020: Upply @MongoDB : Upply : Quand le Machine Learning...
 
MongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDB
MongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDBMongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDB
MongoDB .local Paris 2020: Les bonnes pratiques pour sécuriser MongoDB
 

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Agg framework selectgroup feb2015 v2

  • 1. Exploring the Aggregation Framework Jay Runkel Solutions Architect jay.runkel@mongodb.com @jayrunkel
  • 2. Agenda 1. Analytics in MongoDB? 2. Aggregation Framework 3. Aggregation Framework in Action – US Census Data 4. Aggregation Framework Options
  • 4. For Example: US Census Data • Census data from 1990, 2000, 2010 • Question: Which US Division has the fastest growing population density? – We only want to include data states with more than 1M people – We only want to include divisions larger than 100K square miles Division = a group of US States Population density = Area of division/# of people Data is provided at the state level
  • 5. US Regions and Divisions
  • 6. How would we solve this in SQL? • SELECT GROUP BY HAVING
  • 9. What is an Aggregation Pipeline? • A Series of Document Transformations – Executed in stages – Original input is a collection – Output as a cursor or a collection • Rich Library of Functions – Filter, compute, group, and summarize data – Output of one stage sent to input of next – Operations executed in sequential order
  • 11. Pipeline Operators • $match Filter documents • $project Reshape documents • $group Summarize documents • $unwind Expand documents • $sort Order documents • $limit/$skip Paginate documents • $redact Restrict documents • $geoNear Proximity sort documents • $let,$map Define variables
  • 12. Aggregation Framework in Action (let’s play with the census data)
  • 13. MongoDB State Collection • Document For Each State • Name • Region • Division • Census Data For 1990, 2000, 2010 – Population – Housing Units – Occupied Housing Units • Census Data is an array with three subdocuments
  • 14. Document Model { "_id" : ObjectId("54e23c7b28099359f5661525"), "name" : "California", "region" : "West", "data" : [ {"totalPop" : 33871648, "totalHouse" : 12214549, "occHouse" : 11502870, "year" : 2000}, {"totalPop" : 37253956, "totalHouse" : 13680081, "occHouse" : 12577498, "year" : 2010}, {"totalPop" : 29760021, "totalHouse" : 11182882, "occHouse" : 29008161, "year" : 1990} ], … }
  • 16. Total US Area db.cData.aggregate([ {"$group" : {"_id" : null, "totalArea" : {$sum : "$areaM"}, "avgArea" : {$avg : "$areaM"}}}])
  • 17. $group • Group documents by value – Field reference, object, constant – Other output fields are computed • $max, $min, $avg, $sum • $addToSet, $push • $first, $last – Processes all data in memory by default
  • 18. Area By Region db.cData.aggregate([ {"$group" : {"_id" : "$region", "totalArea" : {$sum : "$areaM"}, "avgArea" : {$avg : "$areaM"}, "numStates" : {$sum : 1}, "states" : {$push : "$name"}}} ])
  • 19. Calculating Average StateArea By Region { $group: { _id: "$region", avgAreaM: {$avg: ”$areaM" } }} { _id: ”North East", avgAreaM: 154 } { _id: “West", avgAreaM: 300 } { state: ”New York", areaM: 218, region: “North East" } { state: ”New Jersey", areaM: 90, region: “North East” } { state: “California", areaM: 300, region: “West" }
  • 20. Calculating Total Area and State Count { $group: { _id: "$region", totArea: {$sum: ”$areaM" }, sCount : {$sum : 1}}} { _id: ”North East", totArea: 308 sCount: 2} { _id: “West", totArea: 300, sCount: 1} { state: ”New York", areaM: 218, region: “North East" } { state: ”New Jersey", areaM: 90, region: “North East” } { state: “California", area: 300, region: “West" }
  • 21. Total US Population By Year db.cData.aggregate( [{$unwind : "$data"}, {$group : {"_id" : "$data.year", "totalPop" : {$sum : "$data.totalPop"}}}, {$sort : {"totalPop" : 1}} ])
  • 22. $unwind • Operate on an array field – Create documents from array elements • Array replaced by element value • Missing/empty fields → no output • Non-array fields → error – Pipe to $group to aggregate
  • 23. $unwind { $unwind: $census } { state: “New York, census: 1990} { state: ”New York", census: [1990, 2000, 2010] } { state: ”New Jersey", census: [1990, 2000] } { state: “California", census: [1980, 1990, 2000, 2010] } { state: ”Delaware", census: [1990, 2000] } { state: “New York, census: 2000} { state: “New York, census: 2010} { state: “New Jersey, census: 1990} { state: “New Jersey, census: 2000} …
  • 24. Southern State Population By Year db.cData.aggregate( [{$match : {"region" : "South"}}, {$unwind : "$data"}, {$group : {"_id" : "$data.year", "totalPop” : {"$sum” : "$data.totalPop"}}}])
  • 25. $match • Filter documents – Uses existing query syntax – No $where (server side Javascript)
  • 26. $match { $match: { “region” : “West” } } { state: ”New York", areaM: 218, region: “North East" } { state: ”Oregon", areaM: 245, region: “West” } { state: “California", area: 300, region: “West" } { state: ”Oregon", areaM: 245, region: “West” } { state: “California", area: 300, region: “West" }
  • 27. Population Delta By State from 1990 to 2010 db.cData.aggregate( [{$unwind : "$data"}, {$sort : {"data.year" : 1}}, {$group : {"_id" : "$name", "pop1990" : {"$first" : "$data.totalPop"}, "pop2010" : {"$last" : "$data.totalPop"}}}, {$project : {"_id" : 0, "name" : "$_id", "delta" : {"$subtract" : ["$pop2010", "$pop1990"]}, "pop1990" : 1, "pop2010” : 1} }] )
  • 28. Population Delta By State from 1990 to 2010 db.cData.aggregate( [{$unwind : "$data"}, {$sort : {"data.year" : 1}}, {$group : {"_id" : "$name", "pop1990" : {"$first" : "$data.totalPop"}, "pop2010" : {"$last" : "$data.totalPop"}}}, {$project : {"_id" : 0, "name" : "$_id", "delta" : {"$subtract" : ["$pop2010", "$pop1990"]}, "pop1990" : 1, "pop2010” : 1} }] )
  • 29. $sort, $limit, $skip • Sort documents by one or more fields – Same order syntax as cursors – Waits for earlier pipeline operator to return – In-memory unless early and indexed • Limit and skip follow cursor behavior
  • 30. Population Delta By State from 1990 to 2010 db.cData.aggregate( [{$unwind : "$data"}, {$sort : {"data.year" : 1}}, {$group : {"_id" : "$name", "pop1990" : {"$first" : "$data.totalPop"}, "pop2010" : {"$last" : "$data.totalPop"}}}, {$project : {"_id" : 0, "name" : "$_id", "delta" : {"$subtract" : ["$pop2010", "$pop1990"]}, "pop1990" : 1, "pop2010” : 1} }] )
  • 31. $first, $last • Collection operations like $push and $addToSet • Must be used in $group • $first and $last determined by document order • Typically used with $sort to ensure ordering is known
  • 32. Population Delta By State from 1990 to 2010 db.cData.aggregate( [{$unwind : "$data"}, {$sort : {"data.year" : 1}}, {$group : {"_id" : "$name", "pop1990" : {"$first" : "$data.totalPop"}, "pop2010" : {"$last" : "$data.totalPop"}}}, {$project : {"_id" : 0, "name" : "$_id", "delta" : {"$subtract" : ["$pop2010", "$pop1990"]}, "pop1990" : 1, "pop2010” : 1} }] )
  • 33. $project • Reshape Documents – Include, exclude or rename fields – Inject computed fields – Create sub-document fields
  • 34. Including and Excluding Fields { $project: { “_id” : 0, “pop1990” : 1, “pop2010” : 1 } { "_id" : "Virginia”, "pop1990" : 453588, "pop2010" : 3725789 } { "_id" : "South Dakota", "pop1990" : 453588, "pop2010" : 3725789 } { "pop1990" : 453588, "pop2010" : 3725789 } { "pop1990" : 453588, "pop2010" : 3725789 }
  • 35. { ”name" : “South Dakota”, ”delta" : 118176 } Renaming and Computing Fields { $project: { “_id” : 0, “pop1990” : 0, “pop2010” : 0, “name” : “$_id”, "delta" : {"$subtract" : ["$pop2010", "$pop1990"]}} } { "_id" : "Virginia”, "pop1990" : 6187358, "pop2010" : 8001024 } { "_id" : "South Dakota", "pop1990" : 696004, "pop2010" : 814180 } { ”name" : “Virginia”, ”delta" : 1813666 }
  • 36. Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010
  • 37. Compare number of people living within 500KM of Memphis, TN in 1990, 2000, 2010 db.cData.aggregate([ {$geoNear : { "near" : {"type" : "Point", "coordinates" : [90, 35]}, “distanceField” : "dist.calculated", “maxDistance” : 500000, “includeLocs” : "dist.location", “spherical” : true }}, {$unwind : "$data"}, {$group : {"_id" : "$data.year", "totalPop" : {"$sum" : "$data.totalPop"}, "states" : {"$addToSet" : "$name"}}}, {$sort : {"_id" : 1}} ])
  • 38. $geoNear • Order/Filter Documents by Location – Requires a geospatial index – Output includes physical distance – Must be first aggregation stage
  • 39. { "_id" : ”Tennessee", "pop1990" : 4877185, "pop2010" : 6346105, “center” : {“type” : “Point”, “coordinates” : [86.6, 37.8]} } { "_id" : "Virginia”, "pop1990" : 6187358, "pop2010" : 8001024, “center” : {“type” : “Point”, “coordinates” : [78.6, 37.5]} } $geoNear {$geoNear : { "near”: {"type”: "Point", "coordinates”: [90, 35]}, maxDistance : 500000, spherical : true }} { "_id" : ”Tennessee", "pop1990" : 4877185, "pop2010" : 6346105, “center” : {“type” : “Point”, “coordinates” : [86.6, 37.8]} }
  • 40. What if I want to save the results to a collection? db.cData.aggregate([ {$geoNear : { "near" : {"type" : "Point", "coordinates" : [90, 35]}, “distanceField” : "dist.calculated", “maxDistance” : 500000, “includeLocs” : "dist.location", “spherical” : true }}, {$unwind : "$data"}, {$group : {"_id" : "$data.year", "totalPop" : {"$sum" : "$data.totalPop"}, "states" : {"$addToSet" : "$name"}}}, {$sort : {"_id" : 1}}, {$out : “peopleNearMemphis”} ])
  • 41. $out db.cData.aggregate([<pipeline stages>, {“$out” : “resultsCollection”}]) • Save aggregation results to a new collection • New aggregation uses: • Transform documents - ETL
  • 42. Back To The Original Question • Which US Division has the fastest growing population density? – We only want to include data states with more than 1M people – We only want to include divisions larger than 100K square miles
  • 43. Division with Fastest Growing Pop Density db.cData.aggregate( [{$match : {"data.totalPop" : {"$gt" : 1000000}}}, {$unwind : "$data"}, {$sort : {"data.year" : 1}}, {$group : {"_id" : "$name", "pop1990" : {"$first" : "$data.totalPop"}, "pop2010" : {"$last" : "$data.totalPop"}, "areaM" : {"$first" : "$areaM"}, "division" : {"$first" : "$division"}}}, {$group : {"_id" : "$division", "totalPop1990" : {"$sum" : "$pop1990"}, "totalPop2010" : {"$sum" : "$pop2010"}, "totalAreaM" : {"$sum" : "$areaM"}}}, {$match : {"totalAreaM" : {"$gt" : 100000}}}, {$project : {"_id" : 0, "division" : "$_id", "density1990" : {"$divide" : ["$totalPop1990", "$totalAreaM"]}, "density2010" : {"$divide" : ["$totalPop2010", "$totalAreaM"]}, "denDelta" : {"$subtract" : [{"$divide" : ["$totalPop2010", "$totalAreaM"]}, {"$divide" : ["$totalPop1990”, "$totalAreaM"]}]}, "totalAreaM" : 1, "totalPop1990" : 1, "totalPop2010" : 1}}, {$sort : {"denDelta" : -1}}])
  • 45. Aggregate options db.cData.aggregate([<pipeline stages>], {‘explain’ : false 'allowDiskUse' : true, 'cursor' : {'batchSize' : 5}}) explain – similar to find().explain() allowDiskUse – enable use of disk to store intermediate results cursor – specify the size of the initial result
  • 47. Sharding • Workload split between shards – Shards execute pipeline up to a point – Primary shard merges cursors and continues processing* – Use explain to analyze pipeline split – Early $match may excuse shards – Potential CPU and memory implications for primary shard host *Prior to v2.6 second stage pipeline processing was done by mongos
  • 50. Framework Use Cases • Basic aggregation queries • Ad-hoc reporting • Real-time analytics • Visualizing and reshaping data