1. The document discusses using MongoDB and data lakes for enterprise data management. It outlines current issues with relational databases and how MongoDB addresses challenges around flexibility, scalability, and performance.
2. Various architectures for enterprise data management with MongoDB are presented, including using it for raw, transformed, and aggregated data stores.
3. The benefits of combining MongoDB and Hadoop in a data lake are greater agility, insight from handling different data structures, scalability, and low latency for real-time decisions.
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
1. Enterprise Data Management in the Era of MongoDB & Data Lakes
Matt Kalan
Sr. Solution Architect
matt.kalan@mongodb.com
@matthewkalan
2. Agenda
1. EDM Pipeline Overview
2. Current issues
3. Quick MongoDB Overview
4. Each Stage of EDM Pipeline
5. Future State EDM Architecture
6. Case Study & Scenarios
7. Data Lake Lessons Learned
3. Enterprise Data Management Pipeline
[Diagram: siloed source databases, external feeds (batch), and streams feed the pipeline (store raw data → transform → aggregate → analyze) via pub-sub, ETL, and file imports, with stream processing alongside; outputs serve users and other systems]
Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png
4. Data Management Requirements in Today's World
Data: Volume, Velocity, Variety
Time: Iterative, Agile, Short Cycles
Risk: Always On, Scale, Global
Cost: Open-Source, Cloud, Commodity
6. More specifically
• Unnecessary joins cause poor performance
• Expensive to scale up or horizontally
• Rigid schemas make consolidation difficult
• Poor fit with variably structured or unstructured data
• Common models cause differences in records to be removed when aggregating
• The process often takes many hours overnight
• Data is too stale for intraday decisions and engagement
7. Ways to Improve
What data management technology improves on these requirements?
[Pipeline diagram: store raw data → transform → aggregate → analyze]
11. Documents Enable Dynamic Schema & Optimal Performance
Relational vs. MongoDB:
{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones : [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }
  ]
}
| Customer ID | First Name | Last Name | City |
|-------------|------------|-----------|------|
| 0 | John | Doe | New York |
| 1 | Mark | Smith | San Francisco |
| 2 | Jay | Black | Newark |
| 3 | Meagan | White | London |
| 4 | Edward | Daniels | Boston |

| Phone Number | Type | DNC | Customer ID |
|--------------|------|-----|-------------|
| 1-212-555-1212 | home | T | 0 |
| 1-212-555-1213 | home | T | 0 |
| 1-212-555-1214 | cell | F | 0 |
| 1-212-777-1212 | home | T | 1 |
| 1-212-777-1213 | cell | (null) | 1 |
| 1-212-888-1212 | home | F | 2 |
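To make the contrast concrete, here is a minimal pymongo sketch (not from the deck; the database and collection names are assumptions) of querying inside the embedded phones array, something the relational layout above would need a join for:

```python
# Minimal sketch: querying the embedded phones array with pymongo.
# Database/collection names ("edm", "customers") are illustrative assumptions.
from pymongo import MongoClient

customers = MongoClient()["edm"]["customers"]

# Customers with a home number on the do-not-call list: one query,
# no join against a separate phones table.
for c in customers.find({"phones": {"$elemMatch": {"type": "home", "dnc": True}}}):
    print(c["first_name"], c["last_name"])
```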
12. Document Model Benefits
Agility and flexibility
- Data model supports business change
- Rapidly iterate to meet new requirements
Intuitive, natural data representation
- Eliminates the ORM layer
- Developers are more productive
Reduces the need for joins and disk seeks
- Programming is simpler
- Performance delivered at scale
{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones : [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }
  ]
}
15. Performance: Relational vs. MongoDB
Better Data Locality · In-Memory Caching · Flexible Indexes
16. Scale
Performance: 250M ticks/sec; 300K+ ops/sec; 500K+ ops/sec (Fed Agency)
Cluster: 1,400 servers; 1,000+ servers; 250+ servers (Entertainment Co.)
Data: Petabytes; 10s of billions of objects; 13B documents (Asian Internet Co.)
17. 3.2 Features Relevant for EDM
1. WiredTiger as default storage engine
2. In-memory storage engine
3. Encryption at rest
4. Document Validation Rules
5. Compass (data viewer & query builder)
6. Connector for BI (Visualization)
7. $lookup (left outer join)
18. Data Governance with Document Validation
Implement data governance without sacrificing the agility that comes from dynamic schemas
• Enforce data quality across multiple teams and applications
• Use familiar MongoDB expressions to control document structure
• Validation is optional and can range from a single field to every field, covering existence, data types, and regular expressions (see the sketch below)
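As a hedged illustration, a MongoDB 3.2-style validator set from pymongo might look like the following; the collection name, fields, and rules are assumptions for the sketch, not the presenter's example:

```python
# Minimal sketch of MongoDB 3.2 document validation via pymongo.
# All names and rules here are illustrative assumptions.
from pymongo import MongoClient

db = MongoClient()["edm"]

db.create_collection(
    "customers",
    validator={
        "customer_id": {"$exists": True},    # existence check
        "first_name": {"$type": 2},          # data-type check (BSON type 2 = string)
        "phones.number": {"$regex": "^1-"},  # regular-expression check
    },
    validationLevel="strict",  # validate all inserts and updates
    validationAction="error",  # reject documents that fail
)
```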
19. MongoDB Compass
For fast schema discovery and visual construction of ad-hoc queries
• Visualize schema
  – Frequency of fields
  – Frequency of types
  – Determine validator rules
• View documents
• Graphically build queries
• Authenticated access
20. MongoDB Connector for BI
Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following:
• Provides the BI tool with the schema of the MongoDB collection to be visualized
• Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing
• Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements
21. Dynamic Lookup
Combine data from multiple collections with left outer joins for richer analytics & more flexibility in data modeling
• Blend data from multiple sources for analysis
• Higher-performance analytics with less application-side code and less effort from your developers
• Executed via the new $lookup operator, a stage in the MongoDB Aggregation Framework pipeline
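For concreteness, a minimal pymongo sketch of a $lookup stage; the orders/customers collections and field names are illustrative assumptions:

```python
# Minimal sketch of the $lookup stage (MongoDB 3.2+) via pymongo.
# Collections and fields are illustrative assumptions.
from pymongo import MongoClient

db = MongoClient()["edm"]

cursor = db.orders.aggregate([
    {"$match": {"status": "shipped"}},     # filter early to keep the join small
    {"$lookup": {
        "from": "customers",               # collection to join against
        "localField": "customer_id",       # field in the orders documents
        "foreignField": "customer_id",     # field in the customers documents
        "as": "customer",                  # matches land here as an array
    }},
])
for doc in cursor:
    print(doc["_id"], len(doc["customer"]))  # left outer: zero matches still yields the doc
```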
22. Aggregation Framework: Pipelined Analysis
Start with the original collection; each record (document) contains a number of shapes (keys), each with a particular color (value)
• $match filters out documents that don't contain a red diamond
• $project adds a new "square" attribute with a value computed from the values (colors) of the snowflake and triangle attributes
• $lookup performs a left outer join with another collection, with the star as the comparison key
• Finally, the $group stage groups the data by the color of the square and produces statistics for each group
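Rendered as a concrete (hypothetical) pipeline, with the shape/color fields as stand-ins for the slide's visual:

```python
# Hypothetical pipeline mirroring the shape/color stages described above.
# All field and collection names are illustrative stand-ins.
from pymongo import MongoClient

db = MongoClient()["edm"]

pipeline = [
    {"$match": {"diamond": "red"}},                        # drop docs without a red diamond
    {"$project": {
        "star": 1,
        "square": {"$concat": ["$snowflake", "-", "$triangle"]},  # new computed attribute
    }},
    {"$lookup": {"from": "other_shapes", "localField": "star",
                 "foreignField": "star", "as": "joined"}},  # left outer join on star
    {"$group": {"_id": "$square", "count": {"$sum": 1}}},   # statistics per square value
]
print(list(db.shapes.aggregate(pipeline)))
```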
24. 4th Most Popular, Fastest Growing

| RANK | DBMS | MODEL | SCORE | GROWTH (20 MO) |
|------|------|-------|-------|----------------|
| 1. | Oracle | Relational DBMS | 1,442 | -5% |
| 2. | MySQL | Relational DBMS | 1,294 | 2% |
| 3. | Microsoft SQL Server | Relational DBMS | 1,131 | -10% |
| 4. | MongoDB | Document Store | 277 | 172% |
| 5. | PostgreSQL | Relational DBMS | 273 | 40% |
| 6. | DB2 | Relational DBMS | 201 | 11% |
| 7. | Microsoft Access | Relational DBMS | 146 | -26% |
| 8. | Cassandra | Wide Column | 107 | 87% |
| 9. | SQLite | Relational DBMS | 105 | 19% |

Source: DB-Engines database popularity rankings, May 2015
Only non-relational database in the top 5; 2.5x ahead of the nearest NoSQL competitor
25. Partner Ecosystem (500+)
* BI Connector (ODBC driver) and $lookup (left outer join) are planned for release with v3.2 in Q4
26. MongoDB Architecture Patterns
1. Operational Data Store (ODS)
2. Enterprise Data Service
3. Datamart/Cache
4. Master Data Distribution
5. Single Operational View
6. Operationalizing Hadoop
System of Record → System of Engagement
27. MongoDB Hadoop/Spark Connector
MongoDB:
• Sub-second latency
• Expressive querying
• Flexible indexing
• Aggregations in the database
• Great for any subset of data
Hadoop:
• Longer jobs
• Batch analytics
• Append-only files
• Great for scanning all data, or large subsets, in files
Distributed processing/analytics:
- MongoDB Hadoop Connector
- Spark-mongodb
Both provide: schema-on-read, low TCO, horizontal scale (see the pyspark sketch below)
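A minimal pyspark sketch of reading a MongoDB collection into Spark through the mongo-hadoop connector; the URI, database, and field names are assumptions, and the linked tutorial at the end of the deck covers the full setup:

```python
# Minimal sketch: reading a MongoDB collection into Spark via the
# mongo-hadoop connector. URI and field names are illustrative assumptions.
# Requires the mongo-hadoop connector jar on the Spark classpath.
from pyspark import SparkContext

sc = SparkContext(appName="edm-mongo-example")

# Each record arrives as a (document _id, document dict) pair.
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf={"mongo.input.uri": "mongodb://localhost:27017/edm.raw_events"},
)

# Batch analytics on the full collection, e.g. record counts per source system.
counts = rdd.map(lambda kv: (kv[1].get("source"), 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())
```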
28. How to Choose Data Management Products in the EDM Pipeline
29. Enterprise Data Management Pipeline
[Pipeline diagram repeated from slide 3]
30. How to choose the data management layer for each or all stages?
MongoDB, when you want:
1. Secondary indexes
2. Sub-second latency
3. Aggregations in the DB
4. Updates of data
Distributed file system (e.g. HDFS), for:
1. Scanning files
2. When indexes are not needed
Wide column store (e.g. HBase), for:
1. Primary-key queries
2. When multiple indexes & slices are not needed
3. Workloads optimized for writing, not reading
31. Data Store for Raw Dataset
[Pipeline diagram with the "store raw data" stage highlighted]
Writes:
- Typically just writing record-by-record from the source data
- Usually just need high write volumes; all three options handle that
Transform read requirements:
- Benefits to reading multiple datasets sorted [by index], e.g. to do a merge
- Might want to look up across tables with indexes (and the join functionality in MongoDB 3.2)
- Want high read performance while writes are happening
Interactive querying on the raw data can use indexes with MongoDB (see the sketch below).
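A minimal pymongo sketch of that raw-store pattern under assumed names: batch the record-by-record writes, and index so transform jobs can read each dataset sorted while writes continue:

```python
# Minimal sketch of the raw-store pattern described above; names are assumptions.
from pymongo import MongoClient, ASCENDING

raw = MongoClient()["edm"]["raw_events"]

# Index so a transform job can read each dataset in sorted order (e.g. for a merge).
raw.create_index([("source", ASCENDING), ("event_time", ASCENDING)])

# Sustain high write volume by inserting in batches.
raw.insert_many(
    [{"source": "feed_a", "event_time": i, "payload": "..."} for i in range(1000)]
)

# Sorted, indexed read of one dataset while writes continue elsewhere.
cursor = raw.find({"source": "feed_a"}).sort("event_time", ASCENDING)
```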
32. Data Store for Transformed Dataset
[Pipeline diagram with the "transform" output stage highlighted]
- Often benefits to updating data while merging multiple datasets (see the upsert sketch below)
- Dashboards & reports can have sub-second latency with indexes
Aggregate read requirements:
- Benefits to using indexes for grouping
- Aggregations natively in the DB would help
- With indexes, can do aggregations on slices of data
- Might want to look up across tables with indexes to aggregate
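As a hedged sketch of updating while merging datasets, an upsert lets a second feed enrich an existing transformed document in place (names are assumptions):

```python
# Minimal sketch: merging a second dataset into transformed documents via upserts.
# Collection and field names are illustrative assumptions.
from pymongo import MongoClient

transformed = MongoClient()["edm"]["customers_transformed"]

# Enrich the existing customer document with a field from another feed;
# insert a stub document if this customer has not been seen yet.
transformed.update_one(
    {"customer_id": 1},
    {"$set": {"risk_score": 0.42}},
    upsert=True,
)
```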
33. Data Store for Aggregated Dataset
[Pipeline diagram with the "aggregate" output stage highlighted]
- Dashboards & reports can have sub-second latency with indexes
Analytics read requirements:
- Often want to analyze a slice of data (using indexes); see the sketch below
- For scanning all of the data, any data store could serve
- Querying on slices is best in MongoDB
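A minimal sketch of slice analytics with an index rather than a full scan (field names are assumptions):

```python
# Minimal sketch: aggregating an indexed slice instead of scanning all data.
# Field names are illustrative assumptions.
from pymongo import MongoClient, ASCENDING

agg = MongoClient()["edm"]["aggregated"]
agg.create_index([("region", ASCENDING)])

result = agg.aggregate([
    {"$match": {"region": "EMEA"}},  # indexed slice, not a full scan
    {"$group": {"_id": "$product", "total": {"$sum": "$amount"}}},
])
print(list(result))
```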
34. Data Store for Last Dataset
[Pipeline diagram with the "analyze" output highlighted, feeding users and other systems]
- At the last step, there are many consuming systems and users
- Need expressive querying with secondary indexes
- MongoDB is the best option for the publication or distribution of analytical results and the operationalization of data
- Dashboards & reports can have sub-second latency with indexes
- Often digital applications: high scale, expressive querying, JSON preferred
- Often RESTful services and APIs (see the sketch below)
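A hypothetical sketch of the RESTful consumption pattern, using Flask plus pymongo with a secondary index to keep lookups sub-second; the route, collection, and field names are assumptions:

```python
# Hypothetical REST endpoint publishing analytical results out of MongoDB.
# Route, collection, and field names are illustrative assumptions.
from flask import Flask, jsonify
from pymongo import MongoClient, ASCENDING

app = Flask(__name__)
results = MongoClient()["edm"]["analytic_results"]
results.create_index([("customer_id", ASCENDING)])  # secondary index for fast lookups

@app.route("/customers/<int:customer_id>/results")
def customer_results(customer_id):
    # Indexed point lookup; project away the internal _id for the JSON response.
    doc = results.find_one({"customer_id": customer_id}, {"_id": 0})
    return jsonify(doc or {})

if __name__ == "__main__":
    app.run()
```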
36. More Complete EDM Architecture & Data Lake
[Diagram: siloed source databases, external feeds (batch), and streams feed a data processing pipeline and data lake (distributed processing) via pub-sub, ETL, and file imports plus stream processing; drivers & stacks serve operational applications & reporting: single CSR application, unified digital apps, operational reporting, analytic reporting, customer clustering, churn analysis, predictive analytics, and downstream systems]
- Governance to choose where to load and process data
- Optimal location for providing operational response times & slices
- Can run processing on all data or slices
37. Example scenarios
1. Single Customer View
a. Operational
b. Analytics on customer segments
c. Analytics on all customers
2. Customer profiles & clustering
3. Presenting churn analytics on high value customers
38. Top 10 Bank Case Study
Unified real-time monitoring platform for customer-facing channels via Stratio's Big Data Platform
Problem:
- Wanted a high quality of service across online channels
- Many untapped data sources & streams (logs, clicks, social, etc.)
- Want to monitor service response times & provide root-cause analysis
Solution (why MongoDB):
- Used Flume for log data, MongoDB for persistence and KPIs, and Spark for analysis
- Flexible data model supported a wide variety of machine data
- Linear scalability made it easy to handle the additional load from each data source
Results:
- Solution impacts infrastructure for 31 countries and 51 million customers
- Can now adhere to strict SLAs across the infrastructure
- Improved response times are driving higher customer satisfaction and increased revenue
39. Data Lake Lessons Learned
1. Define the objectives
2. Design the future state
3. Consider the full data lifecycle towards operationalizing
4. Have a plan for managing metadata to avoid a data swamp
5. Deliver business value incrementally towards the future state
6. Decide on the data management layer based on how you will use the data (esp. read requirements)
7. MongoDB fills common gaps with low latency & indexes
40. Benefits of MongoDB & Hadoop Combined Data Lake
• Low TCO from commodity hardware
• Greater agility and faster time-to-market from schema-on-read
• Greater insight, as differences in the data remain present for drill-down
• Can scale cheaply to meet any SLAs
• Data can be current for intraday decision making
• Low-latency response times
• Optimal use of resources with indexing
• Overall greater insight and business impact
41. For More Information

| Resource | Location |
|----------|----------|
| Tutorial for Operationalizing Spark with MongoDB | www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb |
| Using MongoDB with Hadoop & Spark | www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup |
| Scalability Benchmarks | www.mongodb.com/collateral/scalability-benchmarking-mongodb-and-nosql-systems-report |
| Case Studies | mongodb.com/customers |
| Presentations | mongodb.com/presentations |
| Free Online Training | education.mongodb.com |
| Webinars and Events | mongodb.com/events |
| Documentation | docs.mongodb.org |
| MongoDB Downloads | mongodb.com/download |
Editor's Notes
Stream processing is often a separate processing layer from batch processing, but its output can be stored in the data stores at various stages
Now that we understand some of the challenges you’re facing and where you’d like to get, perhaps I can tell you a bit about why MongoDB exists and where we might be able to help.
Our founders observed some technological and business changes in the market. We built MongoDB to address the way the world is changing…
Data [tie back to what you’ve heard from customer if possible]
90% data created in last 2 years
80% enterprise data is unstructured
Unstructured data growing 2X rate of structured data
Time [tie back to what you’ve heard from customer if possible]
Development methods shifted from waterfall (12-24 months) to iterative
Leading edge companies like Facebook + Etsy shipping code multiple times a day
Risk [tie back to what you’ve heard from customer if possible]
User bases shifted from internal (thousands) to external (millions)
Can’t go down
All across the globe
Cost [tie back to what you’ve heard from customer if possible]
Shift to open-source and SaaS business models to pay for value over time
Ability to leverage cloud and commodity architectures to lower infrastructure costs
Looking at the other technologies in the market…
Relational databases laid the foundation for what you’d want out of your database
Rich and fast access to the data, using an expressive query language and secondary indexes
Strong consistency, so you know you’re always getting the most up to date version of the data
But they weren’t built for the world we just talked about
Built for waterfall dev cycles, structured data
Built for internal users, not large numbers of users all across the globe
(From vendors who want large license fees upfront)
--> So what they have in data access and consistency, they lack in flexibility, scalability and performance
Could make more visual
MongoDB was built to address the way the world has changed while preserving the core database capabilities required to build functional apps
MongoDB is the only database that harnesses the innovations of NoSQL and maintains the foundation of relational databases
More info: http://www.mongodb.com/mongodb-scale
Kernel 3.2 Scope tracking: https://docs.google.com/spreadsheets/d/1L1EbbWoshUIHXBzCh5e3sALtAFxm_dJ52SRPR6GzeAY/edit#gid=0
Release notes for 3.1.6: http://docs.mongodb.org/manual/release-notes/3.1-dev-series/
Determine validator rules: You can use the tool to figure out what you want to set as validation rules
$lookup – this creates new documents which contain everything from the previous stage but augmented with data from any document from the second collection containing a matching colored star (i.e., the blue and yellow stars had matching lookup values, whereas the red star had none)
In terms of reporting, a number of Business Intelligence (BI) vendors have developed connectors to integrate MongoDB as a data source into their suites, alongside traditional relational databases. This integration provides reporting, visualization, and dashboarding of MongoDB data
Just a logical diagram. Processing could be on same physical servers as storage nodes to minimize data movement