Using MongoDB + Hadoop Together
1. #MongoDB
Using MongoDB and Hadoop
Together For Success
Buzz Moschetti
buzz.moschetti@mongodb.com
Enterprise Architect, MongoDB
2. Who is your Presenter?
• Yes, I use “Buzz” on my business cards
• Former Investment Bank Chief Architect at
JPMorganChase and Bear Stearns before that
• Over 25 years of designing and building systems
• Big and small
• Super-specialized to broadly useful in any vertical
• “Traditional” to completely disruptive
• Advocate of language leverage and strong factoring
• Still programming – using emacs, of course
3. Agenda
• (Occasionally) Brutal Truths about Big Data
• The Key To Success in Large Scale Data Management
• Review of Directed Content Business Architecture
• Technical Implementation Examples
• Recommendation Capability
• Realtime Trade / Position Risk
• Q & A
4. Truths
• Clear definition of Big Data still maturing
• Efficiently operationalizing Big Data is non-trivial
• Developing, debugging, understanding MapReduce
• Cluster monitoring & management, job scheduling/recovery
• If you thought regular ETL Hell was bad….
• Big Data is not about math/set accuracy
• The last 25,000 items in a 25,497,612-item set “don’t matter”
• Big Data questions are best asked periodically
• “Are we there yet?”
• Realtime means … realtime
5. It’s About The Functions, not the Terms
DON’T ASK:
• Is this an operations or an analytics problem?
• Is this online or offline?
• What query language should we use?
• What is my integration strategy across tools?
ASK INSTEAD:
• Am I incrementally addressing data (esp. writes)?
• Am I computing a precise answer or a trend?
• Do I need to operate on this data in realtime?
• What is my holistic architecture?
6. Success in Big Data: MongoDB + Hadoop
• Efficient Operationalization
• Robust data movements
• Clarity and fidelity of data movements
• Designing for change
• Analysis Feedback
• Data computed in Hadoop integrated back into
MongoDB
7. What We’re Going to “Build” today
Realtime Directed Content System
• Based on what users click, “recommended”
content is returned in addition to the target
• The example is sector (manufacturing, financial
services, retail) neutral
• System dynamically updates behavior in response
to user activity
8. The Participants and Their Roles
[Diagram: Customers interact with the Directed Content System, which is supported by four roles]
• Analysts/Data Scientists: operate on data to identify trends and develop tag domains
• Content Creators: generate and tag content from a known domain of tags
• Management/Strategy: make decisions based on trends and other summarized data
• Developers/ProdOps: bring it all together: apps, SDLC, integration, etc.
9. Priority #1: Maximizing User Value
Considerations/Requirements:
• Maximize realtime user value and experience
• Provide management reporting and trend analysis
• Engineer for Day 2 agility on recommendation engine
• Provide scrubbed click history for customer
• Permit low-cost horizontal scaling
• Minimize technical integration
• Minimize technical footprint
• Use conventional and/or approved tools
• Provide a RESTful service layer
• …
11. Complementary Strengths
App(s) + MongoDB:
• Standard design paradigm (objects, tools, 3rd party products, IDEs, test drivers, skill pool, etc.)
• Language flexibility (Java, C#, C++, Python, Scala, …)
• Webscale deployment model: appservers, DMZ, monitoring
• High performance rich shape CRUD
Hadoop MapReduce:
• MapReduce design paradigm
• Node deployment model
• Very large set operations
• Computationally intensive, longer duration
• Read-dominated workload
12. “Legacy” Approach: Somewhat unidirectional
ETL
App(s) MongoDB Hadoop MapReduce
• Extract data from MongoDB and other sources nightly (or weekly)
• Generate reports for people to read
• Same pains as existing ETL: reconciliation, transformation, change management …
13. Somewhat better approach
ETL
App(s) MongoDB Hadoop MapReduce
ETL
• Extract data from MongoDB and other sources nightly (or weekly)
• Generate reports for people to read
• Move important summary data back to MongoDB for consumption by apps
• Still in an ETL-dominated landscape
14. …but the overall problem remains:
• How do we integrate and operate in realtime on both periodically generated
data and current realtime data?
• Lackluster integration between OLTP and Hadoop
• It’s not just about the database: you need a
realtime profile and profile update function
15. The legacy problem in pseudocode
onContentClick() {
    String[] tags = content.getTags();
    Resource[] r = f1(database, tags);
}
• Realtime intraday state not well-handled
• Baselining is a different problem than click
handling
16. The Right Approach
• Users have a specific Profile entity
• The Profile captures trend analytics as baselining
information
• The Profile has per-tag “counters” that are updated with
each interaction / click
• Counters plus baselining are passed to fetch function
• The fetch function itself could be dynamic!
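As a concrete illustration of this approach, here is a minimal Java sketch of a Profile carrying per-tag counters plus baseline values, and a click handler that passes both to the fetch function. The class and field names (TagStats, baselineWeight, etc.) are hypothetical, introduced only to illustrate the idea; they are not from the original slides.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-user Profile: realtime counters plus the baseline
// computed overnight in Hadoop, kept side by side for every tag.
class TagStats {
    int clickCount;                              // incremented on every click (realtime)
    List<Long> clickTimes = new ArrayList<>();   // recent click timestamps
    double baselineWeight;                       // written back nightly by the batch job
}

class Profile {
    String user;
    Map<String, TagStats> tags = new HashMap<>();
}

class Recommender {
    // Counters plus baseline travel together into the fetch function;
    // the fetch function itself could be swapped out dynamically.
    List<String> onContentClick(Profile p, String tag, long now) {
        TagStats s = p.tags.computeIfAbsent(tag, t -> new TagStats());
        s.clickCount++;
        s.clickTimes.add(now);
        return fetchRecommendedContent(p, tag);
    }

    List<String> fetchRecommendedContent(Profile p, String tag) {
        // Placeholder: combine realtime counters and baseline into a content filter
        return new ArrayList<>();
    }
}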
17. 24 hours in the life of The System
• Assume some content has been created and tagged
• Two systematized tags: Pets & PowerTools
18. Monday, 1:30AM EST
App(s) MongoDB Hadoop MapReduce
• Fetch all user Profiles from MongoDB; load into Hadoop
• Or skip if using the MongoDB-Hadoop connector!
19. MongoDB-Hadoop MapReduce Example
import java.io.IOException;
import java.util.Date;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.bson.BSONObject;

public class ProfileMapper
    extends Mapper<Object, BSONObject, IntWritable, IntWritable>
{
    @Override
    public void map(final Object pKey,
                    final BSONObject pValue,
                    final Context pContext)
        throws IOException, InterruptedException {

        String user = (String) pValue.get("user");        // read but unused in this simple example
        Date d1 = (Date) pValue.get("lastUpdate");

        // Sum the click-history lengths across all of this profile's tags
        int count = 0;
        BSONObject tags = (BSONObject) pValue.get("tags");
        Set<String> keys = tags.keySet();
        for (String tag : keys) {
            BSONObject tagEntry = (BSONObject) tags.get(tag);
            count += ((List<?>) tagEntry.get("hist")).size();
        }
        int avg = keys.isEmpty() ? 0 : count / keys.size();
        pContext.write(new IntWritable(count), new IntWritable(avg));
    }
}
20. MongoDB-Hadoop v1 (today)
Hadoop
MR Mapper
v1
MongoDB-Hadoop
✓ V1 adapter draws data directly from MongoDB
✓ No ETL, scripts, change management, etc.
✓ Storage optimized: NO data copies
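For orientation, a minimal job-driver sketch along these lines might look as follows. It assumes the v1 mongo-hadoop connector's MongoInputFormat and MongoConfigUtil, plus a hypothetical demo.profiles collection and output path; check the connector documentation for the exact class names and options in your version.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class ProfileJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read Profile documents straight out of MongoDB -- no ETL step
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/demo.profiles");

        Job job = Job.getInstance(conf, "profile-baseline");
        job.setJarByClass(ProfileJob.class);
        job.setMapperClass(ProfileMapper.class);          // mapper from the previous slide
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/profile-baseline-out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}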
21. MongoDB-Hadoop v2 (soon)
Hadoop
MR Mapper
HDFS
✓ V2 flows data directly into HDFS via a special MongoDB secondary
✓ No ETL, scripts, change management, etc.
✓ Data is copied – but still one data fabric
✓ Realtime data with snapshotting as an option
22. Monday, 1:45AM EST
App(s) MongoDB Hadoop MapReduce
• Grind through all content data and user Profile data to produce:
• Tags based on feature extraction (vs. creator-applied tags)
• Trend baseline per user for tags Pets and PowerTools
• Load Profiles with new baseline back into MongoDB
23. Monday, 8AM EST
App(s) MongoDB Hadoop MapReduce
• User Bob logs in and Profile retrieved from MongoDB
• Bob clicks on Content X which is already tagged as “Pets”
• Bob has clicked on Pets tagged content many times
• Adjust Profile for tag “Pets” and save back to MongoDB
• Analysis = f(Profile)
• Analysis can be “anything”; it is simply a result. It could trigger
an ad, a compliance alert, etc.
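A minimal sketch of that click-handling update using the MongoDB Java driver; the profiles collection name and the field layout (tags.Pets.count, tags.Pets.hist) are assumptions for illustration, not the presenter's schema.

import java.util.Date;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ClickHandler {
    public static void main(String[] args) {
        MongoCollection<Document> profiles = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("demo").getCollection("profiles");

        // Bob clicked content tagged "Pets": bump the counter and record the click time
        profiles.updateOne(
                Filters.eq("user", "bob"),
                Updates.combine(
                        Updates.inc("tags.Pets.count", 1),
                        Updates.push("tags.Pets.hist", new Date())));

        // Analysis = f(Profile): re-read the adjusted Profile and hand it to the analysis step
        Document profile = profiles.find(Filters.eq("user", "bob")).first();
    }
}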
24. Monday, 8:02AM EST
App(s) MongoDB Hadoop MapReduce
• Bob clicks on Content Y which is already tagged as “Spices”
• “Spices” is a new tag for Bob
• Adjust Profile for tag “Spices” and save back to MongoDB
• Analysis = f(Profile)
26. Tag-based algorithm detail
getRecommendedContent(profile, ["PETS", other]) {
    if algo for a tag available {
        filter = algo(profile, tag);
    }
    fetch N recommendations(filter);
}

A4(profile, tag) {
    weight = get tag ("PETS") global weighting;
    adjustForPersonalBaseline(weight, "PETS" baseline);
    if "PETS" clicked more than 2 times in past 10 mins
        then weight += 10;
    if "PETS" clicked more than 10 times in past 2 days
        then weight += 3;

    return new filter({"PETS", weight}, globals)
}
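The time-window checks in A4 could be implemented against the per-tag click history held in the Profile roughly as follows. This is a sketch with assumed field shapes (hist as a list of click timestamps) and an assumed form of the baseline adjustment, not the presenter's actual code.

import java.util.Date;
import java.util.List;

public class TagWeighting {
    // Count clicks on a tag whose timestamps fall inside a trailing time window
    static long clicksWithin(List<Date> hist, long windowMillis, long now) {
        return hist.stream().filter(d -> now - d.getTime() <= windowMillis).count();
    }

    // Mirror of the A4 pseudocode: start from the global tag weight adjusted by the
    // user's personal baseline, then boost it for bursts of recent activity on the tag.
    static double weightFor(List<Date> hist, double globalWeight, double baseline, long now) {
        double weight = globalWeight - baseline;   // assumed baseline adjustment
        if (clicksWithin(hist, 10L * 60 * 1000, now) > 2) {
            weight += 10;
        }
        if (clicksWithin(hist, 2L * 24 * 60 * 60 * 1000, now) > 10) {
            weight += 3;
        }
        return weight;
    }
}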
27. Tuesday, 1AM EST
App(s) MongoDB Hadoop MapReduce
• Fetch all user Profiles from MongoDB; load into Hadoop
• Or skip if using the MongoDB-Hadoop connector!
28. Tuesday, 1:30AM EST
App(s) MongoDB Hadoop MapReduce
• Grind through all content data and user profile data to produce:
• Tags based on feature extraction (vs. creator-applied tags)
• Trend baseline for Pets, PowerTools, and Spices
• Data can be specific to individual or by group
• Load new baselines back into MongoDB
30. Tuesday, 1:35AM EST
App(s) MongoDB Hadoop MapReduce
• Perform maintenance on user Profiles
• Click history trimming (variety of algorithms)
• “Dead tag” removal
• Update of auxiliary reference data
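One way the click-history trimming could be expressed (a sketch assuming the hist arrays from earlier and a simple keep-last-N policy; many other algorithms are possible, as the slide notes) is a $push/$slice update, with an $unset for dead tags:

import java.util.Collections;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.PushOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class ProfileMaintenance {
    public static void main(String[] args) {
        MongoCollection<Document> profiles = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("demo").getCollection("profiles");

        // Trim each Pets click history to its most recent 1000 entries
        profiles.updateMany(
                Filters.exists("tags.Pets.hist"),
                Updates.pushEach("tags.Pets.hist", Collections.<Object>emptyList(),
                                 new PushOptions().slice(-1000)));

        // "Dead tag" removal: drop a tag sub-document no longer in the tag domain
        profiles.updateMany(
                Filters.exists("tags.PowerTools"),
                Updates.unset("tags.PowerTools"));
    }
}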
32. Feel free to run the baselining more frequently
App(s) MongoDB Hadoop MapReduce
… but avoid “Are We There Yet?”
33. Nearterm / Realtime Questions & Actions
With respect to the Customer:
• What has Bob done over the past 24 hours?
• Given an input, make a logic decision in 100ms or less
With respect to the Provider:
• What are all current users doing or looking at?
• Can we correlate single events to shifts in behavior in the near term?
34. Longterm/ Not Realtime Questions & Actions
With respect to the Customer:
• Any way to explain historic performance / actions?
• What are recommendations for the future?
With respect to the Provider:
• Can we correlate multiple events from multiple sources
over a long period of time to identify trends?
• What is my entire customer base doing over 2 years?
• Show me a time vs. aggregate tag hit chart
• Slice and dice and aggregate tags vs. XYZ
• What tags are trending up or down?
35. Another Example: Realtime Risk
[Architecture diagram: Applications and Trade Processing log trade activities to MongoDB and query trades; a Risk Service queries risk; Risk Calculation (Spark) computes risk from the trade data plus other HDFS data; Analysis/Reporting (Impala) runs over the same HDFS data; Risk Params are administered into the calculation.]
36. Recording a trade
Applications
Trade Processing
1. Bank makes a trade
2. Trade sent to Trade Processing
3. Trade Processing writes trade to MongoDB
4. Realtime replicate trade to Hadoop/HDFS
Non-functional notes:
• High volume of data ingestion (10,000s or more
events per second)
• Durable storage of trade data
• Store trade events across all asset classes
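A minimal sketch of step 3 (Trade Processing writing the trade to MongoDB) with the Java driver; the trades collection and the trade fields shown are assumptions for illustration.

import java.util.Date;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class TradeWriter {
    public static void main(String[] args) {
        MongoCollection<Document> trades = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("trading").getCollection("trades");

        // Durable write of one trade event; in practice this runs at 10,000s of events/sec
        trades.insertOne(new Document("dealId", "D-0001")
                .append("counterparty", "ACME")
                .append("assetClass", "IRS")
                .append("notional", 10_000_000)
                .append("book", "RATES-NY")
                .append("terminationDate", new Date())
                .append("ts", new Date()));
    }
}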
37. Querying deal / trade / event data
1. Query on deal attributes (id, counterparty, asset
class, termination date, notional amount, book)
2. MongoDB performs index-optimized query and
Trade Processing assembles Deal/Trade/Event data
into response packet
3. Return response packet to caller
Non-functional notes:
• System can support very high volume (10,000s
or more queries per second)
• Millisecond response times
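A sketch of that index-optimized lookup in the Java driver, assuming the same hypothetical trades collection and a compound index on the queried deal attributes:

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class TradeQuery {
    public static void main(String[] args) {
        MongoCollection<Document> trades = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("trading").getCollection("trades");

        // Index the attributes used by the query path (done once, at setup time)
        trades.createIndex(Indexes.ascending("counterparty", "assetClass", "book"));

        // Index-optimized query; Trade Processing assembles results into a response packet
        for (Document deal : trades.find(Filters.and(
                Filters.eq("counterparty", "ACME"),
                Filters.eq("assetClass", "IRS"),
                Filters.eq("book", "RATES-NY")))) {
            System.out.println(deal.toJson());
        }
    }
}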
38. Updating intra-day risk data
1. Mirror of trade data already stored in HDFS
Trade data partitioned into time windows
2. Signal/timer kicks off a “run”
3. Spark ingests new partition of trade data as RDD
and calculates and merges risk data based on
latest trade data
4. Risk data written directly to MongoDB and indexed
and available for online queries / aggregations /
applications logic
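A rough sketch of steps 3 and 4 in Spark's Java API, aggregating one time-window partition of mirrored trades by book and writing the merged risk back to MongoDB for online queries. The HDFS path, the line layout (book,exposure) and the risk collection are all assumptions for illustration.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.Document;
import scala.Tuple2;

public class IntradayRisk {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("intraday-risk"));

        // Ingest the newest time-window partition of mirrored trade data from HDFS
        JavaPairRDD<String, Double> riskByBook = sc.textFile("hdfs:///trades/window=latest/")
                .mapToPair(line -> {
                    String[] f = line.split(",");           // assumed layout: book,exposure
                    return new Tuple2<>(f[0], Double.parseDouble(f[1]));
                })
                .reduceByKey(Double::sum);                  // merge risk per book

        // Write risk straight into MongoDB so it is indexed and queryable online
        riskByBook.foreachPartition(iter -> {
            try (MongoClient mc = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> risk = mc.getDatabase("trading").getCollection("risk");
                while (iter.hasNext()) {
                    Tuple2<String, Double> r = iter.next();
                    risk.replaceOne(Filters.eq("book", r._1()),
                            new Document("book", r._1()).append("exposure", r._2()),
                            new ReplaceOptions().upsert(true));
                }
            }
        });
        sc.stop();
    }
}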
39. Querying detail & aggregated risk on demand
1. Applications can use full MongoDB query API to
access risk data and trade data
2. Risk data can be indexed on multiple fields for fast
access by multiple dimensions
3. Hadoop jobs periodically apply incremental
updates to risk data with no down time
4. Interpolated / matrix risk can be computed on-the-fly
Non-functional notes
• System can support very high volume (10,000s
or more queries per second)
• Millisecond response times
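As an illustration of points 1 and 2, a sketch of a multi-field index plus an on-demand aggregation over the assumed risk collection (the book and assetClass fields are assumptions for illustration):

import java.util.Arrays;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class RiskQuery {
    public static void main(String[] args) {
        MongoCollection<Document> risk = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("trading").getCollection("risk");

        // Index risk on several fields so it can be sliced by multiple dimensions
        risk.createIndex(Indexes.ascending("book", "assetClass"));

        // Aggregate detail risk on demand, e.g. total exposure per asset class for one book
        for (Document row : risk.aggregate(Arrays.asList(
                Aggregates.match(Filters.eq("book", "RATES-NY")),
                Aggregates.group("$assetClass", Accumulators.sum("exposure", "$exposure"))))) {
            System.out.println(row.toJson());
        }
    }
}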
40. Trade Analytics & Reporting
1. Impala provides full SQL access to all content in
Hadoop
2. Dashboards and Reporting frameworks deliver
periodic information to consumers
3. Breadth of data discovery / ad-hoc analysis tools
can be brought to bear on all data in Hadoop
Non-functional notes:
• Lower query frequency
• Full SQL query flexibility
• Most queries / analysis yield value accessing large
volumes of data (e.g. all events in the last 30 days
– or 30 months)
41. The Key To Success: It is One System
[Diagram: App(s), MongoDB, and Hadoop MapReduce operating as one system]