Slides from Mortar co-founder Jeremy Karn's presentation at MongoSV 2012. Learn to process MongoDB data with Hadoop, specifically with Apache Pig.
Jeremy's presentation covered the steps needed to read JSON data from MongoDB into Pig, process it in parallel on Hadoop with sophisticated functions, and write the results back to MongoDB. The talk demonstrates these concepts with Mortar, which has contributed to the Mongo Hadoop connector, extending it to work with Pig.
2. Overview
OF THIS SESSION
Intro to Hadoop
Intro to Pig
Why MongoDB + Pig?
Demo: loading data from MongoDB into Pig
Demo: processing data with Pig
Demo: storing data from Pig back to MongoDB
13. Alternatives to Hadoop
MONGODB NATIVE MAPREDUCE
Write MapReduce in JavaScript
• JavaScript is not fast
• Has limited data types
• Hard to use complex analytic libs
Adds load to data store
14. Alternatives to Hadoop
MONGODB NATIVE MAPREDUCE
Hadoop has libs for
• Machine learning
• ETL
• Accessing any JVM analytic lib
And many organizations already use Hadoop
15. Alternatives to Hadoop
MONGODB AGGREGATION FRAMEWORK
Great when
• Doing SQL-style aggregation
• No external data libs are required
• Users will learn the framework
16. Alternatives to Hadoop
MONGODB AGGREGATION FRAMEWORK
But you may want Hadoop when
• Doing sophisticated aggregation
• External data libs are required
• Users are unwilling to learn the framework
• You need to move the workload off the datastore
21. MongoDB + Pig
MOTIVATIONS
Data storage and data processing are often separate concerns
Hadoop is built for scalable processing of large datasets
22. MongoDB, Pig
SIMILAR STANCE
Poly-structured data
• MongoDB: stores data, regardless of structure
• Pig: reads data, regardless of structure (Pig got its name because pigs are omnivorous)
23. MongoDB, Pig
JSON-PIG DATA TYPE MAPPING
JSON → Pig
string → chararray
integer → int
boolean → boolean
double → double
array → bag
object → map/tuple
null → null
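To make the mapping concrete: a hypothetical document
{ "name": "alice", "age": 30, "tags": ["a", "b"] }
would surface in Pig with a schema along the lines of
name:chararray, age:int, tags:{(chararray)}
(the document and types here are illustrative, not from the slides).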
24. MongoDB, Pig
MONGODB-PIG DATA TYPE MAPPING
MongoDB → Pig
date → datetime
ObjectId → chararray
binary data → bytearray
regexp → chararray
code → chararray
33. MongoDB, Pig
LOADING DATA
One requirement:
• Must specify the top-level fields to load from the MongoDB collection
Optional:
• Specify a subset of embedded fields
• Specify a data type for any/all fields (see the sketch below)
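A minimal load sketch, assuming a local mongod and a hypothetical enron.messages collection; the schema string, field names, and MongoLoader constructor arguments are illustrative and vary by connector version:

-- Load selected top-level fields; 'headers' comes in as a Pig map
emails = LOAD 'mongodb://localhost:27017/enron.messages'
    USING com.mongodb.hadoop.pig.MongoLoader(
        'body:chararray, mailbox:chararray, subFolder:chararray, headers:map[]');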
34. MongoDB, Pig
LOADING DATA - ENRON DATA
{
  "body": "the ... person...",
  "subFolder": "notes_inbox",
  "mailbox": "bass-e",
  "filename": "450.",
  "headers": {
    "From": "michael.simmons@enron.com",
    "To": "tim_belden@enron.com",
    "Subject": "Subject",
    "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)"
  }
}
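Because headers loads as a Pig map, embedded fields can be pulled out with the # operator. A minimal processing sketch over the emails relation loaded above (relation names and the per-sender count are illustrative):

-- Extract the sender from the headers map and count messages per sender
senders = FOREACH emails GENERATE headers#'From' AS sender;
grouped = GROUP senders BY sender;
counts  = FOREACH grouped GENERATE group AS sender, COUNT(senders) AS num_emails;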
36. MongoDB, Pig
STORE STATEMENT
The MongoStorage function takes an optional list of arguments of two types:
• A single set of keys to base updates on, with three options: none, update, or multi
• Multiple indexes to ensure, in the same format as db.col.ensureIndex() (see the sketch below)
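A minimal store sketch matching those two argument types: upsert on a single key and ensure one index. The collection name and key are illustrative, and the exact argument syntax may vary by connector version:

-- Upsert on 'sender' and ensure a unique index on it
STORE counts INTO 'mongodb://localhost:27017/enron.sender_counts'
    USING com.mongodb.hadoop.pig.MongoStorage(
        'update [sender]',
        '{sender : 1}, {unique : true}');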
39. Pig
USER-DEFINED FUNCTIONS (UDF)
Pig is like procedural SQL
UDFs enable rich data manipulation
UDFs are traditionally written in Java (or other JVM languages)
We made Pig work with CPython (NumPy, etc.)
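A hedged sketch of calling a CPython UDF from Pig; the streaming_python interpreter reflects Mortar's CPython mechanism, and the UDF file and function names are hypothetical:

-- text_utils.py and its normalize() function are hypothetical
REGISTER '../udfs/python/text_utils.py' USING streaming_python AS text_utils;
cleaned = FOREACH emails GENERATE text_utils.normalize(body) AS body;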
40. MongoDB + Pig
WITHOUT MORTAR
Get the mongo-hadoop connector:
http://github.com/mongodb/mongo-hadoop
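Without Mortar, the connector and driver jars also need to be registered in the Pig script before MongoLoader/MongoStorage can be used; jar names and paths below are illustrative:

REGISTER /path/to/mongo-java-driver.jar;
REGISTER /path/to/mongo-hadoop-core.jar;
REGISTER /path/to/mongo-hadoop-pig.jar;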
41. MongoDB + Pig
SUMMARY
Hadoop and friends are maturing
MongoDB and Pig are philosophically aligned
Reading from and writing to MongoDB with Pig is straightforward
Once in Pig (Hadoop)
• massive batch calcs / analytics are possible
• work is offloaded from the datastore
• external libraries are available