4. Big Data in MongoDB
Data Processing and Aggregation Options in MongoDB / Asya Kamsky
5. Big Data in MongoDB
• An ideal operational database
• High performance for storage and retrieval at large scale
• Robust query interface for intelligent operations
7. Big Data in MongoDB
Pre-aggregate in MongoDB for real-time queries
Process in MongoDB using Aggregation Framework
Process in MongoDB using Map/Reduce
Process outside MongoDB using Hadoop and other external tools
10. Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
11. Aggregation Framework
• Declared in JSON, executes in C++
• Flexible, functional, and simple
• Plays nice with sharding
12. Pipeline
ps ax | grep mongod | head -1
Piping command line operations
13. Pipeline
$match | $group | $sort
Piping aggregation operations
Stream of documents Result document
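The pipe analogy can be sketched in plain JavaScript. This is an in-memory imitation for illustration only, not how MongoDB executes a pipeline; the helpers `match`, `groupCount`, `sortDesc`, and `runPipeline`, as well as the sample `orders` documents, are hypothetical.

```javascript
// In-memory imitation of a document pipeline; not MongoDB code.
// Each stage is a function from an array of documents to an array of documents.
const match = (predicate) => (docs) => docs.filter(predicate);

// Group by one field and count the documents in each group.
const groupCount = (field) => (docs) => {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  return Array.from(counts, ([key, count]) => ({ _id: key, count }));
};

// Sort descending by a numeric field (copying so the input is untouched).
const sortDesc = (field) => (docs) =>
  [...docs].sort((a, b) => b[field] - a[field]);

// Feed the output of each stage into the next, like `|` in a shell.
const runPipeline = (docs, stages) =>
  stages.reduce((stream, stage) => stage(stream), docs);

// Hypothetical sample data.
const orders = [
  { status: "shipped", customer: "ann" },
  { status: "shipped", customer: "bob" },
  { status: "pending", customer: "ann" },
  { status: "shipped", customer: "ann" },
];

const result = runPipeline(orders, [
  match((doc) => doc.status === "shipped"),
  groupCount("customer"),
  sortDesc("count"),
]);
```

Against a real server, the corresponding pipeline would be written as an array of stage documents, for example `db.orders.aggregate([{ $match: { status: "shipped" } }, { $group: { _id: "$customer", count: { $sum: 1 } } }, { $sort: { count: -1 } }])`, where the `orders` collection is again an assumption.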
14. Pipeline Operators
• $match
• $project
• $group
• $unwind
• $sort/$skip/$limit
• $redact
• $geoNear
• $out
15. $match
• Filter documents
• Uses existing query syntax
• 2.4 added support for geospatial operations
• 2.6 added support for full text search indexes
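Because $match takes an ordinary query document, its filtering can be imitated in memory. The sketch below handles only equality and four comparison operators, far less than the real query language; the `matches` helper is hypothetical, and the sample documents reuse the zip code documents shown later in the deck (with `loc` omitted).

```javascript
// Minimal in-memory matcher for a small subset of MongoDB query syntax:
// equality plus the $gt, $gte, $lt, $lte comparison operators.
const comparators = {
  $gt: (value, bound) => value > bound,
  $gte: (value, bound) => value >= bound,
  $lt: (value, bound) => value < bound,
  $lte: (value, bound) => value <= bound,
};

function matches(doc, query) {
  return Object.entries(query).every(([field, condition]) => {
    const value = doc[field];
    if (condition !== null && typeof condition === "object") {
      // Operator form, e.g. { pop: { $gt: 10000, $lt: 30000 } }
      return Object.entries(condition).every(([op, bound]) => {
        if (!(op in comparators)) throw new Error("unsupported operator: " + op);
        return comparators[op](value, bound);
      });
    }
    // Plain form, e.g. { city: "NEW YORK" }
    return value === condition;
  });
}

const zips = [
  { _id: "94306", city: "PALO ALTO", pop: 24309 },
  { _id: "10280", city: "NEW YORK", pop: 5574 },
  { _id: "94124", city: "SAN FRANCISCO", pop: 27239 },
];

// Same shape as the argument of the stage { $match: { pop: { $gt: 10000 } } }
const query = { pop: { $gt: 10000 } };
const matched = zips.filter((doc) => matches(doc, query));
```

The point of the sketch is that the `query` object is identical to what would be passed to `find`, so existing queries can be moved into a pipeline unchanged.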
23. $group
• Group documents by an ID
– Field reference, object, constant
• Other output fields are computed
– $max, $min, $avg, $sum
– $addToSet, $push
– $first, $last
• Processes all data in memory
– Can utilize external disk-based sort in 2.6
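What $group computes can be illustrated with a plain JavaScript imitation. The stage being mimicked is `{ $group: { _id: "$state", totalPop: { $sum: "$pop" }, avgPop: { $avg: "$pop" }, cities: { $addToSet: "$city" } } }`; the `groupByState` helper, the `state` field, and the sample values are assumptions made for this sketch.

```javascript
// In-memory imitation of a $group stage keyed on the "state" field.
// It also shows why the stage holds state in memory: one accumulator
// object per distinct _id must be kept until all input is consumed.
function groupByState(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.state)) {
      groups.set(doc.state, { _id: doc.state, totalPop: 0, n: 0, cities: new Set() });
    }
    const g = groups.get(doc.state);
    g.totalPop += doc.pop;  // $sum
    g.n += 1;               // document count, needed to derive $avg
    g.cities.add(doc.city); // $addToSet keeps unique values only
  }
  return Array.from(groups.values(), (g) => ({
    _id: g._id,
    totalPop: g.totalPop,
    avgPop: g.totalPop / g.n,
    cities: Array.from(g.cities),
  }));
}

// Hypothetical zip code documents with a "state" field added.
const zipDocs = [
  { city: "PALO ALTO", state: "CA", pop: 24000 },
  { city: "PALO ALTO", state: "CA", pop: 16000 },
  { city: "SAN FRANCISCO", state: "CA", pop: 20000 },
  { city: "NEW YORK", state: "NY", pop: 5000 },
];

const grouped = groupByState(zipDocs);
```

Note that the order of elements produced by the real `$addToSet` is unspecified, whereas this sketch happens to keep insertion order.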
24. Find the smallest cities within twenty miles of San Francisco
{ _id: "94306",
city: "PALO ALTO",
loc: [ -122.127, 37.418],
pop: 24309 }
{ _id: "10280",
city: "NEW YORK",
loc: [ -74.016, 40.710],
pop: 5574 }
{ _id: "94124",
city: "SAN FRANCISCO",
loc: [-122.388, 37.73],
pop: 27239 }
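The slide with the actual pipeline for this question is not part of this transcript, so the following is not the author's query. It is a plain JavaScript sketch of the same idea (filter by distance, sum population per city, sort ascending), using a haversine distance in place of $geoNear. The `SF_CENTER` coordinates, the Earth radius constant, and both helper names are assumptions.

```javascript
// Great-circle (haversine) distance in miles between two
// [longitude, latitude] pairs, the order used in the sample documents.
function milesBetween([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const earthRadiusMiles = 3959; // approximate mean radius
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * earthRadiusMiles * Math.asin(Math.sqrt(a));
}

// Keep zip codes within maxMiles of center, total the population per city,
// then return the `limit` least populous cities.
function smallestCitiesNear(zipDocs, center, maxMiles, limit) {
  const popByCity = new Map();
  for (const doc of zipDocs) {
    if (milesBetween(doc.loc, center) > maxMiles) continue; // outside radius
    popByCity.set(doc.city, (popByCity.get(doc.city) || 0) + doc.pop);
  }
  return Array.from(popByCity, ([city, pop]) => ({ _id: city, pop }))
    .sort((a, b) => a.pop - b.pop)
    .slice(0, limit);
}

const SF_CENTER = [-122.42, 37.77]; // assumed reference point for San Francisco

const sampleZips = [
  { _id: "94306", city: "PALO ALTO", loc: [-122.127, 37.418], pop: 24309 },
  { _id: "10280", city: "NEW YORK", loc: [-74.016, 40.71], pop: 5574 },
  { _id: "94124", city: "SAN FRANCISCO", loc: [-122.388, 37.73], pop: 27239 },
];

const nearby = smallestCitiesNear(sampleZips, SF_CENTER, 20, 5);
```

With only the three sample documents and the assumed center, just the San Francisco zip code falls inside the twenty-mile radius; a full zip code collection would of course yield more cities.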
26. $unwind
• Operate on an array field
• Yield new documents for each array element
– Array replaced by element value
– Missing/empty fields → no output
– Non-array fields → error
• Pipe to $group to aggregate array values
27. $unwind
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: [
"Long Island",
"New York",
"1920s"
]
}
{ $unwind: "$subjects" }
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "Long Island"
}
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "New York"
}
{
title: "The Great Gatsby",
ISBN: "9781857150193",
subjects: "1920s"
}
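The $unwind behaviour described in this deck (one output document per array element, no output for missing or empty arrays, an error for non-array values) can be imitated in a few lines of plain JavaScript. The `unwind` helper is hypothetical, and treating `null` like a missing field is an assumption of this sketch.

```javascript
// In-memory imitation of { $unwind: "$subjects" } as described on the slides.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    const value = doc[field];
    // Missing field: no output (null is treated the same way here, by assumption).
    if (value === undefined || value === null) continue;
    if (!Array.isArray(value)) {
      throw new TypeError("$unwind: value of '" + field + "' is not an array");
    }
    // An empty array simply produces no iterations, hence no output.
    for (const element of value) {
      out.push({ ...doc, [field]: element }); // array replaced by element value
    }
  }
  return out;
}

const book = {
  title: "The Great Gatsby",
  ISBN: "9781857150193",
  subjects: ["Long Island", "New York", "1920s"],
};

const unwound = unwind([book], "subjects");
```

Each element of `unwound` carries the same `title` and `ISBN` as the input, which is what makes a following $group by `subjects` possible.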
28. 2.6 Improvements
• Returns a cursor (not a document)
– Just like a regular find
• New stages
– $redact
– $out
• New operators:
– Set expression operators
– $let and $map operators to allow for the use of variables
– $literal operator and $size operator
– $cond expression object
• Integrated $text search
• Performance improvements, "explain" and more
29. Advantages
• Runs on the server
– Uses indexes
– Uses shards
• Simple to build complex pipelines
• Easy to use from any driver
• Faster than other options
30. Limitations
• Pipeline operator memory limits
– 10% of total system RAM in 2.4 and earlier
– 100MB in 2.6 but can use disk for external sort
• Some data types not allowed
– Code, CodeWithScope, etc.
• Result size limited (in 2.4 and earlier)
– 2.6 returns a cursor or directs output to a new collection
No result size limit!
34. MapReduce
• Versatile, powerful
• Intended for complex data analysis
• Overkill for simple aggregations
43. Advantages
• Map and reduce code can be arbitrarily complex
– JavaScript, helper functions
• Results can be saved into a new collection
– Replace, merge or re-reduce
• Incremental MapReduce
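The map/reduce contract and the re-reduce idea behind incremental MapReduce can be sketched in plain JavaScript. This is an in-memory imitation, not MongoDB's engine: in MongoDB the map function receives the document as `this` and calls a global `emit`, whereas here both are passed as arguments. The helpers `mapReduce` and `reReduce` and the second book are hypothetical.

```javascript
// Run map over every document, collect emitted values per key, then reduce.
function mapReduce(docs, map, reduce) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  docs.forEach((doc) => map(doc, emit));
  const out = {};
  for (const [key, values] of emitted) {
    // Like MongoDB, skip reduce when a key has a single value.
    out[key] = values.length === 1 ? values[0] : reduce(key, values);
  }
  return out;
}

// Merge fresh results into saved results by reducing the old and new value
// for each key together. This only works if reduce accepts its own output.
function reReduce(saved, fresh, reduce) {
  const merged = { ...saved };
  for (const [key, value] of Object.entries(fresh)) {
    const seen = Object.prototype.hasOwnProperty.call(merged, key);
    merged[key] = seen ? reduce(key, [merged[key], value]) : value;
  }
  return merged;
}

// Count how many books carry each subject.
const countSubjects = (doc, emit) => doc.subjects.forEach((s) => emit(s, 1));
const sum = (key, values) => values.reduce((a, b) => a + b, 0);

const batch1 = [
  { title: "The Great Gatsby", subjects: ["Long Island", "New York", "1920s"] },
];
const batch2 = [
  { title: "Hypothetical Book", subjects: ["New York", "1920s"] },
];

const allAtOnce = mapReduce([...batch1, ...batch2], countSubjects, sum);
const incremental = reReduce(
  mapReduce(batch1, countSubjects, sum),
  mapReduce(batch2, countSubjects, sum),
  sum
);
```

Processing both batches at once and processing them separately followed by a re-reduce give the same counts, which is the property an incremental job relies on.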
44. Limitations
• Implemented with JavaScript
– Single-threaded
• Slower than Aggregation Framework
– Batch, not real time
• Harder to understand, implement, debug...
46. Hadoop
Framework that allows for the distributed processing
of large data sets across clusters of computers
47. Hadoop MongoDB Connector
• MongoDB or BSON files as input/output
• Source data can be filtered with queries
• Hadoop Streaming support
– For jobs written in Python, Ruby, Node.js
• Supports Hadoop tools such as Pig and Hive
48. Processing Big Data
• Data broken up into smaller pieces
• Process data across multiple nodes
[Diagram: multiple Hadoop nodes]
49. Input splits on Non-sharded Systems
[Diagram labels: Total Dataset; Single Map Reduce; multiple Hadoop nodes]
50. Advantages
• Processing decoupled from data store
• Parallel processing
• Leverage existing infrastructure
• Java has rich set of data processing libraries
– And other languages if using Hadoop Streaming
Disadvantages
• Batch processing
• Requires synchronization between data store and processor
• Adds complexity to infrastructure
53. Storm MongoDB connector
• Spout for MongoDB oplog or capped collections
– Filtering capabilities
– Threaded and non-blocking
• Output to new or existing documents
– Insert/update bolt
55. Internal Tools
• Storing pre-aggregated data
– An exercise in schema design
• Aggregation Framework
• MapReduce
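Pre-aggregation as a schema design exercise can be sketched as follows: each incoming event increments counters in one summary document, so a real-time query becomes a single lookup. This is a plain JavaScript imitation of an upsert with `$inc`; the `dailyStats` store, the `upsertInc` and `recordView` helpers, the `_id` format, and the sample dates are all hypothetical.

```javascript
// In-memory stand-in for a collection of pre-aggregated counters, keyed by _id.
const dailyStats = new Map();

// Imitates an upsert with $inc: create the document if missing, then
// increment each dotted path, e.g. { $inc: { total: 1, "hourly.14": 1 } }.
function upsertInc(collection, id, increments) {
  if (!collection.has(id)) collection.set(id, { _id: id });
  const doc = collection.get(id);
  for (const [path, amount] of Object.entries(increments)) {
    const parts = path.split(".");
    let target = doc;
    for (const part of parts.slice(0, -1)) {
      if (target[part] === undefined) target[part] = {};
      target = target[part];
    }
    const leaf = parts[parts.length - 1];
    target[leaf] = (target[leaf] || 0) + amount;
  }
  return doc;
}

// Record one page view: a single write updates daily and hourly totals together.
function recordView(page, date) {
  const day = date.toISOString().slice(0, 10); // e.g. "2014-06-24"
  const hour = date.getUTCHours();
  return upsertInc(dailyStats, page + "/" + day, {
    total: 1,
    ["hourly." + hour]: 1,
  });
}

recordView("/home", new Date("2014-06-24T14:05:00Z"));
recordView("/home", new Date("2014-06-24T14:50:00Z"));
recordView("/home", new Date("2014-06-24T15:10:00Z"));

// The "real-time query" is a single lookup by _id, with no scan of raw events.
const stats = dailyStats.get("/home/2014-06-24");
```

On a real server the write would be an update with the `upsert` option, for example `db.stats.update({ _id: "/home/2014-06-24" }, { $inc: { total: 1, "hourly.14": 1 } }, { upsert: true })`, where the `stats` collection name is an assumption. The trade-off is that the counters to maintain must be chosen when the schema is designed.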