5. Hadoop
“The Apache Hadoop software library is a framework
that allows for the distributed processing of large
data sets across clusters of computers using simple
programming models.”
• Terabyte and Petabyte datasets
• Data warehousing
• Advanced analytics
http://hadoop.apache.org
7. Operational vs. Analytical: Enrichment
Operational side: applications, interactions. Analytical side: warehouse, analytics.
8. Operational: MongoDB
[Diagram: spectrum of use cases from operational to analytical — First-Level Analytics, Internet of Things, Mobile Apps, Social, Product/Asset Catalog, Security & Fraud, Customer Data Management, Single View, Churn Analysis, Risk Modeling, Trade Surveillance, Sentiment Analysis, Recommender, Warehouse & ETL, Predictive Analytics, Ad Targeting — with MongoDB covering the operational end.]
9. Analytical: Hadoop
[Same use-case spectrum as the previous slide, with Hadoop covering the analytical end.]
10. Operational & Analytical: Lifecycle
[Same use-case spectrum, showing the data lifecycle spanning both MongoDB and Hadoop.]
12. Commerce
Applications (powered by MongoDB): products & inventory, recommended products, customer profile, session management.
Analysis (powered by Hadoop): elastic pricing, recommendation models, predictive analytics, clickstream history.
The two sides are linked by the MongoDB Connector for Hadoop.
13. Insurance
Applications (powered by MongoDB): customer profiles, insurance policies, session data, call center data.
Analysis (powered by Hadoop): customer action analysis, churn analysis, churn prediction, policy rates.
The two sides are linked by the MongoDB Connector for Hadoop.
14. Fraud Detection
Payments data and 3rd-party data sources feed a nightly analysis via the MongoDB Connector for Hadoop; results are written to a results cache, which the fraud-detection service accesses as query-only.
17. Connector Features and Functionality
• Computes splits to read data
- Single node, replica sets, sharded clusters
• Mappings for Pig and Hive
- MongoDB as a standard data source/destination
• Support for:
- Filtering data with MongoDB queries
- Authentication
- Reading from replica set tags
- Appending to existing collections
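As a sketch of how the query-filtering support is driven: the connector reads its settings from configuration keys (`mongo.input.uri` and `mongo.input.query` are real connector keys; the collection name and the timestamp filter below are illustrative):

```python
import json

# Sketch: configuration for the MongoDB Connector for Hadoop.
# "mongo.input.uri" points at the collection to read;
# "mongo.input.query" is a MongoDB query that filters which
# documents each split will see (here: only recent documents).
mongo_config = {
    "mongo.input.uri": "mongodb://localhost:27017/db.collection",
    "mongo.input.query": json.dumps({"ts": {"$gte": 1400000000}}),
}
```

Passing such a query means the mappers only pull the matching documents, instead of scanning the whole collection.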
19. Pig Mappings
• Input: BSONLoader and MongoLoader
data = LOAD 'mongodb://mydb:27017/db.collection'
    USING com.mongodb.hadoop.pig.MongoLoader;
• Output: BSONStorage and MongoInsertStorage
STORE records INTO 'hdfs:///output.bson'
    USING com.mongodb.hadoop.pig.BSONStorage;
20. Hive Support
• Access collections as Hive tables
• Use with MongoStorageHandler or BSONStorageHandler
CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users");
21. Spark
• Use with MapReduce input/output formats
• Create Configuration objects with input/output formats and data URI
• Load/save data using the SparkContext Hadoop file API
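The bullets above can be sketched in PySpark. Assuming the mongo-hadoop jar is on the classpath, a load/save pair might look like this (`newAPIHadoopRDD`/`saveAsNewAPIHadoopFile` and the `MongoInputFormat`/`MongoOutputFormat` class names are real; everything else is an illustrative sketch):

```python
def load_mongo_rdd(sc, uri):
    """Sketch: load a MongoDB collection as an RDD of (id, document)
    pairs via the Hadoop connector. `sc` is an existing SparkContext."""
    conf = {"mongo.input.uri": uri}
    return sc.newAPIHadoopRDD(
        "com.mongodb.hadoop.MongoInputFormat",  # connector's input format
        "org.apache.hadoop.io.Text",            # key class
        "org.apache.hadoop.io.MapWritable",     # value class
        conf=conf,
    )

def save_mongo_rdd(rdd, uri):
    """Sketch: write an RDD of (key, dict) pairs back to MongoDB."""
    rdd.saveAsNewAPIHadoopFile(
        "file:///unused",  # path is ignored; the URI below decides the target
        outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
        conf={"mongo.output.uri": uri},
    )
```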
22. Data Movement
Dynamic queries to MongoDB vs. BSON snapshots in HDFS:
• Dynamic queries return the most recent data, but put load on the operational database
• Snapshots move the processing load to Hadoop, and add only predictable load to MongoDB
25. MovieWeb Web Application
• Browse
- Top movies by ratings count
- Top genres by movie count
• Log in to
- See My Ratings
- Rate movies
• Recommendations
- Movies You May Like
26. MovieWeb Components
• MovieLens dataset
– 10M ratings, 10K movies, 70K users
– http://grouplens.org/datasets/movielens/
• Python web app to browse movies, recommendations
– Flask, PyMongo
• Spark app computes recommendations
– MLlib collaborative filtering
• Predicted ratings are exposed in web app
– New predictions collection
27. Spark Recommender
• Apache Hadoop (2.3)
- HDFS & YARN
• Spark (1.0)
- Executes within YARN
- Assigns executor resources
• Data
- From HDFS and MongoDB
- To MongoDB
28. MovieWeb Workflow
• Snapshot the database as BSON
• Store the BSON in HDFS
• Read the BSON into the Spark app
• Train the model from existing ratings
• Create user–movie pairings
• Predict ratings for all pairings
• Write predictions to a MongoDB collection
• Web application exposes the recommendations
• Repeat the process
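The "create user–movie pairing" step is a cartesian product of the two ID sets; a minimal pure-Python sketch with tiny hypothetical IDs in place of the real RDDs:

```python
# Hypothetical tiny ID sets standing in for the users and movies RDDs.
user_ids = [1, 2]
movie_ids = [10, 20, 30]

# In Spark this is users.cartesian(movies); here, a plain comprehension.
pairs = [(u, m) for u in user_ids for m in movie_ids]
# Every user is paired with every movie, so the model can predict a
# rating for all pairings, not just the ones already rated.
```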
31. Business First!
[Same use-case spectrum diagram, framed as What/Why (the business use cases) vs. How (the technology choice).]
32. The right tool for the task
• Dataset size
• Data processing complexity
• Continuous improvement
V1.0
33. The right tool for the task
• Dataset size
• Data processing complexity
• Continuous improvement
V2.0
34. Resources / Questions
• MongoDB Connector for Hadoop
- http://github.com/mongodb/mongo-hadoop
• Getting Started with MongoDB and Hadoop
- http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/
• MongoDB-Spark Demo
- https://github.com/crcsmnky/mongodb-hadoop-workshop
Apache's definition of Hadoop: a framework that enables many things.
One of its core components is the distributed file system (HDFS); another is MapReduce.
Today the core is really YARN, the resource manager, and MapReduce is just one type of job you can run on it.
MongoDB: gigabytes to terabytes. Hadoop: terabytes to petabytes.
You have two places where you deal with data, so you have to think about "enrichment":
MongoDB is there to enrich data that lives in Hadoop, and Hadoop is there to enrich data that lives in MongoDB.
Let's look at the different use cases for Operational vs. Analytical.
The first level of analytics can be done in MongoDB, against the data your application is talking to.
Hadoop is there to analyze the bigger problem and do heavier processing; we talk about Hadoop when there are petabytes of data.
We solve the bigger problem by connecting the two technologies where it makes sense.
The connector splits the data when reading (one split per mapper).
It also supports filtering queries, for example to take only the data from a specific timestamp onward.
To reduce the load on your cluster, read from replica set tags.
A new feature that people asked for: appending results to an existing collection.
Spark is a newer data-processing engine that works mostly in memory.
It can take advantage of all the power of the connector:
open a new "Hadoop file" that is loaded into an RDD (Resilient Distributed Dataset).
Load data:
- Read users from MongoDB (users collection)
- Read movies from BSON (HDFS)
- Read ratings from MongoDB (ratings collection)
Data processing:
- Generate (user, movie) pairs: users.cartesian(movies)
- Train a collaborative filter: ALS.train(ratings.rdd(), 10, 10, 0.01);
Predict/recommend, then save the results into the MongoDB predictions collection.
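Putting these notes together, a hedged PySpark sketch of the recommender pipeline (MLlib's `ALS.train` and `predictAll` are real APIs; the function name, argument names, and hyperparameters here are illustrative, matching the rank=10, iterations=10, lambda=0.01 values above):

```python
def build_recommendations(sc, ratings_rdd, users_rdd, movies_rdd):
    """Sketch of the recommender pipeline from the notes.

    ratings_rdd: RDD of (user_id, movie_id, rating) tuples
    users_rdd / movies_rdd: RDDs of user ids and movie ids
    Returns an RDD of Rating(user, product, predicted_rating).
    """
    from pyspark.mllib.recommendation import ALS

    # Train a collaborative filter: rank=10, iterations=10, lambda=0.01.
    model = ALS.train(ratings_rdd, 10, 10, 0.01)

    # Generate every (user, movie) pairing and predict a rating for each;
    # the result would then be saved to the MongoDB predictions collection.
    pairings = users_rdd.cartesian(movies_rdd)
    return model.predictAll(pairings)
```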