Slides from Mortar co-founder Jeremy Karn's presentation at MongoSV 2012. Learn to process MongoDB data with Hadoop, specifically with Apache Pig.
Jeremy's presentation covered the steps needed to read JSON data from MongoDB into Pig, process it in parallel on Hadoop with sophisticated functions, and write the results back to MongoDB. The talk demonstrates these concepts with Mortar, which has contributed to the Mongo Hadoop connector, extending it to work with Pig.
2. Overview
OF THIS SESSION
Intro to Hadoop
Intro to Pig
Why MongoDB + Pig?
Demo: loading data from MongoDB into Pig
Demo: processing data with Pig
Demo: storing data from Pig back to MongoDB
13. Alternatives to Hadoop
MONGODB NATIVE MAPREDUCE
Write MapReduce in JavaScript
• JavaScript is not fast
• Has limited data types
• Hard to use complex analytic libs
Adds load to data store
14. Alternatives to Hadoop
MONGODB NATIVE MAPREDUCE
Hadoop has libs for
• Machine learning
• ETL
• Accessing any JVM analytic lib
And many organizations already use Hadoop
15. Alternatives to Hadoop
MONGODB AGGREGATION FRAMEWORK
Great when
• Doing SQL-style aggregation
• No external data libs are required
• Users will learn the framework
16. Alternatives to Hadoop
MONGODB AGGREGATION FRAMEWORK
But you may want Hadoop when
• Doing sophisticated aggregation
• External data libs are required
• Users are unwilling to learn the framework
• You need to move the workload off the datastore
21. MongoDB + Pig
MOTIVATIONS
Data storage and data processing are often separate concerns
Hadoop is built for scalable processing of large datasets
22. MongoDB, Pig
SIMILAR STANCE
Poly-structured data
• MongoDB: stores data, regardless of structure
• Pig: reads data, regardless of structure (Pig got its name because pigs are omnivorous)
23. MongoDB, Pig
JSON-PIG DATA TYPE MAPPING
JSON → Pig
string → chararray
integer → int
boolean → boolean
double → double
array → bag
object → map/tuple
null → null
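To make the mapping concrete: a hypothetical document
{ "name": "alice", "age": 30, "tags": ["a", "b"] }
would surface in Pig with a schema along the lines of
name:chararray, age:int, tags:{(chararray)}
(the document and types here are illustrative, not from the slides).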
24. MongoDB, Pig
MONGODB-PIG DATA TYPE MAPPING
MongoDB → Pig
date → datetime
ObjectId → chararray
binary data → bytearray
regexp → chararray
code → chararray
33. MongoDB, Pig
LOADING DATA
One requirement:
• Must specify the top-level fields to load from the MongoDB collection
Optional:
• Specify a subset of embedded fields
• Specify a data type for any/all fields (see the sketch below)
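A minimal load sketch, assuming a local mongod and a hypothetical enron.messages collection; the schema string, field names, and MongoLoader constructor arguments are illustrative and vary by connector version:

-- Load selected top-level fields; 'headers' comes in as a Pig map
emails = LOAD 'mongodb://localhost:27017/enron.messages'
    USING com.mongodb.hadoop.pig.MongoLoader(
        'body:chararray, mailbox:chararray, subFolder:chararray, headers:map[]');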
34. MongoDB, Pig
LOADING DATA - ENRON DATA
{
  "body": "the ... person...",
  "subFolder": "notes_inbox",
  "mailbox": "bass-e",
  "filename": "450.",
  "headers": {
    "From": "michael.simmons@enron.com",
    "To": "tim_belden@enron.com",
    "Subject": "Subject",
    "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)"
  }
}
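Because headers loads as a Pig map, embedded fields can be pulled out with the # operator. A minimal processing sketch over the emails relation loaded above (relation names and the per-sender count are illustrative):

-- Extract the sender from the headers map and count messages per sender
senders = FOREACH emails GENERATE headers#'From' AS sender;
grouped = GROUP senders BY sender;
counts  = FOREACH grouped GENERATE group AS sender, COUNT(senders) AS num_emails;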
36. MongoDB, Pig
STORE STATEMENT
The MongoStorage function takes an optional list of arguments of two types:
• A single set of keys to base updates on, with three options: none, update, or multi
• Multiple indexes to ensure, in the same format as db.col.ensureIndex() (see the sketch below)
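A minimal store sketch matching those two argument types: upsert on a single key and ensure one index. The collection name and key are illustrative, and the exact argument syntax may vary by connector version:

-- Upsert on 'sender' and ensure a unique index on it
STORE counts INTO 'mongodb://localhost:27017/enron.sender_counts'
    USING com.mongodb.hadoop.pig.MongoStorage(
        'update [sender]',
        '{sender : 1}, {unique : true}');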
39. Pig
USER-DEFINED FUNCTIONS (UDF)
Pig is like procedural SQL
UDFs enable rich data manipulation
UDFs are traditionally written in Java (or other JVM languages)
We made Pig work with CPython (NumPy, etc.)
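A hedged sketch of calling a CPython UDF from Pig; the streaming_python interpreter reflects Mortar's CPython mechanism, and the UDF file and function names are hypothetical:

-- text_utils.py and its normalize() function are hypothetical
REGISTER '../udfs/python/text_utils.py' USING streaming_python AS text_utils;
cleaned = FOREACH emails GENERATE text_utils.normalize(body) AS body;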
40. MongoDB + Pig
WITHOUT MORTAR
Get the mongo-hadoop connector:
http://github.com/mongodb/mongo-hadoop
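Without Mortar, the connector and driver jars also need to be registered in the Pig script before MongoLoader/MongoStorage can be used; jar names and paths below are illustrative:

REGISTER /path/to/mongo-java-driver.jar;
REGISTER /path/to/mongo-hadoop-core.jar;
REGISTER /path/to/mongo-hadoop-pig.jar;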
41. MongoDB + Pig
SUMMARY
Hadoop and friends are maturing
MongoDB and Pig are philosophically aligned
Reading from and writing to MongoDB with Pig is straightforward
Once in Pig (Hadoop)
• massive batch calcs / analytics are possible
• work is offloaded from the datastore
• external libraries are available