1. The document discusses using MongoDB and data lakes for enterprise data management. It outlines current issues with relational databases and how MongoDB addresses challenges around flexibility, scalability, and performance.
2. Various architectures for enterprise data management with MongoDB are presented, including using it for raw, transformed, and aggregated data stores.
3. The benefits of combining MongoDB and Hadoop in a data lake are greater agility, insight from handling different data structures, scalability, and low latency for real-time decisions.
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes
1. Enterprise Data Management in the Era of MongoDB & Data Lakes
Matt Kalan
Sr. Solution Architect
matt.kalan@mongodb.com
@matthewkalan
2. Agenda
1. EDM Pipeline Overview
2. Current issues
3. Quick MongoDB Overview
4. Each Stage of EDM Pipeline
5. Future State EDM Architecture
6. Case Study & Scenarios
7. Data Lake Lessons Learned
3. Enterprise Data Management Pipeline
[Diagram: siloed source databases, external feeds (batch), and streams feed the pipeline (store raw data → transform → aggregate → analyze) via pub-sub, ETL, and file imports, with stream processing alongside; outputs serve users and other systems]
Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png
4. Data Management Requirements in Today's World
Data: Volume, Velocity, Variety
Time: Iterative, Agile, Short Cycles
Risk: Always On, Scale, Global
Cost: Open-Source, Cloud, Commodity
6. More specifically
• Unnecessary joins cause poor performance
• Expensive to scale up or horizontally
• Rigid schemas make consolidation difficult
• Poor fit with variably structured or unstructured data
• Common models cause differences in records to be removed when aggregating
• The process often takes many hours overnight
• Data is too stale for intraday decisions and engagement
7. Ways to Improve
What data management technology improves on these requirements?
[Pipeline diagram: store raw data → transform → aggregate → analyze]
11. Documents Enable Dynamic Schema & Optimal Performance
Relational vs. MongoDB:
{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones : [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }
  ]
}
| Customer ID | First Name | Last Name | City |
|-------------|------------|-----------|------|
| 0 | John | Doe | New York |
| 1 | Mark | Smith | San Francisco |
| 2 | Jay | Black | Newark |
| 3 | Meagan | White | London |
| 4 | Edward | Daniels | Boston |

| Phone Number | Type | DNC | Customer ID |
|--------------|------|-----|-------------|
| 1-212-555-1212 | home | T | 0 |
| 1-212-555-1213 | home | T | 0 |
| 1-212-555-1214 | cell | F | 0 |
| 1-212-777-1212 | home | T | 1 |
| 1-212-777-1213 | cell | (null) | 1 |
| 1-212-888-1212 | home | F | 2 |
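To make the contrast concrete, here is a minimal pymongo sketch (not from the deck; the database and collection names are assumptions) of querying inside the embedded phones array, something the relational layout above would need a join for:

```python
# Minimal sketch: querying the embedded phones array with pymongo.
# Database/collection names ("edm", "customers") are illustrative assumptions.
from pymongo import MongoClient

customers = MongoClient()["edm"]["customers"]

# Customers with a home number on the do-not-call list: one query,
# no join against a separate phones table.
for c in customers.find({"phones": {"$elemMatch": {"type": "home", "dnc": True}}}):
    print(c["first_name"], c["last_name"])
```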
12. Document Model Benefits
Agility and flexibility
- Data model supports business change
- Rapidly iterate to meet new requirements
Intuitive, natural data representation
- Eliminates the ORM layer
- Developers are more productive
Reduces the need for joins and disk seeks
- Programming is simpler
- Performance delivered at scale
{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones : [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }
  ]
}
15. Performance: Relational vs. MongoDB
Better Data Locality · In-Memory Caching · Flexible Indexes
16. Scale
Performance: 250M ticks/sec; 300K+ ops/sec; 500K+ ops/sec (Fed Agency)
Cluster: 1,400 servers; 1,000+ servers; 250+ servers (Entertainment Co.)
Data: Petabytes; 10s of billions of objects; 13B documents (Asian Internet Co.)
17. 3.2 Features Relevant for EDM
1. WiredTiger as default storage engine
2. In-memory storage engine
3. Encryption at rest
4. Document Validation Rules
5. Compass (data viewer & query builder)
6. Connector for BI (Visualization)
7. $lookup (left outer join)
18. Data Governance with Document Validation
Implement data governance without sacrificing the agility that comes from dynamic schemas
• Enforce data quality across multiple teams and applications
• Use familiar MongoDB expressions to control document structure
• Validation is optional and can range from a single field to every field, covering existence, data types, and regular expressions (see the sketch below)
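As a hedged illustration, a MongoDB 3.2-style validator set from pymongo might look like the following; the collection name, fields, and rules are assumptions for the sketch, not the presenter's example:

```python
# Minimal sketch of MongoDB 3.2 document validation via pymongo.
# All names and rules here are illustrative assumptions.
from pymongo import MongoClient

db = MongoClient()["edm"]

db.create_collection(
    "customers",
    validator={
        "customer_id": {"$exists": True},    # existence check
        "first_name": {"$type": 2},          # data-type check (BSON type 2 = string)
        "phones.number": {"$regex": "^1-"},  # regular-expression check
    },
    validationLevel="strict",  # validate all inserts and updates
    validationAction="error",  # reject documents that fail
)
```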
19. MongoDB Compass
For fast schema discovery and visual construction of ad-hoc queries
• Visualize schema
  – Frequency of fields
  – Frequency of types
  – Determine validator rules
• View documents
• Graphically build queries
• Authenticated access
20. MongoDB Connector for BI
Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following:
• Provides the BI tool with the schema of the MongoDB collection to be visualized
• Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing
• Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements
21. Dynamic Lookup
Combine data from multiple collections with left outer joins for richer analytics & more flexibility in data modeling
• Blend data from multiple sources for analysis
• Higher-performance analytics with less application-side code and less effort from your developers
• Executed via the new $lookup operator, a stage in the MongoDB Aggregation Framework pipeline
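For concreteness, a minimal pymongo sketch of a $lookup stage; the orders/customers collections and field names are illustrative assumptions:

```python
# Minimal sketch of the $lookup stage (MongoDB 3.2+) via pymongo.
# Collections and fields are illustrative assumptions.
from pymongo import MongoClient

db = MongoClient()["edm"]

cursor = db.orders.aggregate([
    {"$match": {"status": "shipped"}},     # filter early to keep the join small
    {"$lookup": {
        "from": "customers",               # collection to join against
        "localField": "customer_id",       # field in the orders documents
        "foreignField": "customer_id",     # field in the customers documents
        "as": "customer",                  # matches land here as an array
    }},
])
for doc in cursor:
    print(doc["_id"], len(doc["customer"]))  # left outer: zero matches still yields the doc
```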
22. Aggregation Framework: Pipelined Analysis
Start with the original collection; each record (document) contains a number of shapes (keys), each with a particular color (value)
• $match filters out documents that don't contain a red diamond
• $project adds a new "square" attribute with a value computed from the values (colors) of the snowflake and triangle attributes
• $lookup performs a left outer join with another collection, with the star as the comparison key
• Finally, the $group stage groups the data by the color of the square and produces statistics for each group
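Rendered as a concrete (hypothetical) pipeline, with the shape/color fields as stand-ins for the slide's visual:

```python
# Hypothetical pipeline mirroring the shape/color stages described above.
# All field and collection names are illustrative stand-ins.
from pymongo import MongoClient

db = MongoClient()["edm"]

pipeline = [
    {"$match": {"diamond": "red"}},                        # drop docs without a red diamond
    {"$project": {
        "star": 1,
        "square": {"$concat": ["$snowflake", "-", "$triangle"]},  # new computed attribute
    }},
    {"$lookup": {"from": "other_shapes", "localField": "star",
                 "foreignField": "star", "as": "joined"}},  # left outer join on star
    {"$group": {"_id": "$square", "count": {"$sum": 1}}},   # statistics per square value
]
print(list(db.shapes.aggregate(pipeline)))
```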
24. 4th Most Popular, Fastest Growing

| RANK | DBMS | MODEL | SCORE | GROWTH (20 MO) |
|------|------|-------|-------|----------------|
| 1. | Oracle | Relational DBMS | 1,442 | -5% |
| 2. | MySQL | Relational DBMS | 1,294 | 2% |
| 3. | Microsoft SQL Server | Relational DBMS | 1,131 | -10% |
| 4. | MongoDB | Document Store | 277 | 172% |
| 5. | PostgreSQL | Relational DBMS | 273 | 40% |
| 6. | DB2 | Relational DBMS | 201 | 11% |
| 7. | Microsoft Access | Relational DBMS | 146 | -26% |
| 8. | Cassandra | Wide Column | 107 | 87% |
| 9. | SQLite | Relational DBMS | 105 | 19% |

Source: DB-Engines database popularity rankings, May 2015
Only non-relational database in the top 5; 2.5x ahead of the nearest NoSQL competitor
25. Partner Ecosystem (500+)
* BI Connector (ODBC driver) and $lookup (left outer join) are planned for release with v3.2 in Q4
26. MongoDB Architecture Patterns
1. Operational Data Store (ODS)
2. Enterprise Data Service
3. Datamart/Cache
4. Master Data Distribution
5. Single Operational View
6. Operationalizing Hadoop
System of Record → System of Engagement
27. MongoDB Hadoop/Spark Connector
MongoDB:
• Sub-second latency
• Expressive querying
• Flexible indexing
• Aggregations in the database
• Great for any subset of data
Hadoop:
• Longer jobs
• Batch analytics
• Append-only files
• Great for scanning all data, or large subsets, in files
Distributed processing/analytics:
- MongoDB Hadoop Connector
- Spark-mongodb
Both provide: schema-on-read, low TCO, horizontal scale (see the pyspark sketch below)
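A minimal pyspark sketch of reading a MongoDB collection into Spark through the mongo-hadoop connector; the URI, database, and field names are assumptions, and the linked tutorial at the end of the deck covers the full setup:

```python
# Minimal sketch: reading a MongoDB collection into Spark via the
# mongo-hadoop connector. URI and field names are illustrative assumptions.
# Requires the mongo-hadoop connector jar on the Spark classpath.
from pyspark import SparkContext

sc = SparkContext(appName="edm-mongo-example")

# Each record arrives as a (document _id, document dict) pair.
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="com.mongodb.hadoop.MongoInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf={"mongo.input.uri": "mongodb://localhost:27017/edm.raw_events"},
)

# Batch analytics on the full collection, e.g. record counts per source system.
counts = rdd.map(lambda kv: (kv[1].get("source"), 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())
```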
28. How to Choose Data Management Products in the EDM Pipeline
29. Enterprise Data Management Pipeline
[Pipeline diagram repeated from slide 3]
30. How to choose the data management layer for each or all stages?
MongoDB, when you want:
1. Secondary indexes
2. Sub-second latency
3. Aggregations in the DB
4. Updates of data
Distributed file system (e.g. HDFS), for:
1. Scanning files
2. When indexes are not needed
Wide column store (e.g. HBase), for:
1. Primary-key queries
2. When multiple indexes & slices are not needed
3. Workloads optimized for writing, not reading
31. Data Store for Raw Dataset
[Pipeline diagram with the "store raw data" stage highlighted]
Writes:
- Typically just writing record-by-record from the source data
- Usually just need high write volumes; all three options handle that
Transform read requirements:
- Benefits to reading multiple datasets sorted [by index], e.g. to do a merge
- Might want to look up across tables with indexes (and the join functionality in MongoDB 3.2)
- Want high read performance while writes are happening
Interactive querying on the raw data can use indexes with MongoDB (see the sketch below).
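A minimal pymongo sketch of that raw-store pattern under assumed names: batch the record-by-record writes, and index so transform jobs can read each dataset sorted while writes continue:

```python
# Minimal sketch of the raw-store pattern described above; names are assumptions.
from pymongo import MongoClient, ASCENDING

raw = MongoClient()["edm"]["raw_events"]

# Index so a transform job can read each dataset in sorted order (e.g. for a merge).
raw.create_index([("source", ASCENDING), ("event_time", ASCENDING)])

# Sustain high write volume by inserting in batches.
raw.insert_many(
    [{"source": "feed_a", "event_time": i, "payload": "..."} for i in range(1000)]
)

# Sorted, indexed read of one dataset while writes continue elsewhere.
cursor = raw.find({"source": "feed_a"}).sort("event_time", ASCENDING)
```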
32. Data Store for Transformed Dataset
[Pipeline diagram with the "transform" output stage highlighted]
- Often benefits to updating data while merging multiple datasets (see the upsert sketch below)
- Dashboards & reports can have sub-second latency with indexes
Aggregate read requirements:
- Benefits to using indexes for grouping
- Aggregations natively in the DB would help
- With indexes, can do aggregations on slices of data
- Might want to look up across tables with indexes to aggregate
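As a hedged sketch of updating while merging datasets, an upsert lets a second feed enrich an existing transformed document in place (names are assumptions):

```python
# Minimal sketch: merging a second dataset into transformed documents via upserts.
# Collection and field names are illustrative assumptions.
from pymongo import MongoClient

transformed = MongoClient()["edm"]["customers_transformed"]

# Enrich the existing customer document with a field from another feed;
# insert a stub document if this customer has not been seen yet.
transformed.update_one(
    {"customer_id": 1},
    {"$set": {"risk_score": 0.42}},
    upsert=True,
)
```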
33. Data Store for Aggregated Dataset
[Pipeline diagram with the "aggregate" output stage highlighted]
- Dashboards & reports can have sub-second latency with indexes
Analytics read requirements:
- Often want to analyze a slice of data (using indexes); see the sketch below
- For scanning all of the data, any data store could serve
- Querying on slices is best in MongoDB
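A minimal sketch of slice analytics with an index rather than a full scan (field names are assumptions):

```python
# Minimal sketch: aggregating an indexed slice instead of scanning all data.
# Field names are illustrative assumptions.
from pymongo import MongoClient, ASCENDING

agg = MongoClient()["edm"]["aggregated"]
agg.create_index([("region", ASCENDING)])

result = agg.aggregate([
    {"$match": {"region": "EMEA"}},  # indexed slice, not a full scan
    {"$group": {"_id": "$product", "total": {"$sum": "$amount"}}},
])
print(list(result))
```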
34. Data Store for Last Dataset
[Pipeline diagram with the "analyze" output highlighted, feeding users and other systems]
- At the last step, there are many consuming systems and users
- Need expressive querying with secondary indexes
- MongoDB is the best option for the publication or distribution of analytical results and the operationalization of data
- Dashboards & reports can have sub-second latency with indexes
- Often digital applications: high scale, expressive querying, JSON preferred
- Often RESTful services and APIs (see the sketch below)
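A hypothetical sketch of the RESTful consumption pattern, using Flask plus pymongo with a secondary index to keep lookups sub-second; the route, collection, and field names are assumptions:

```python
# Hypothetical REST endpoint publishing analytical results out of MongoDB.
# Route, collection, and field names are illustrative assumptions.
from flask import Flask, jsonify
from pymongo import MongoClient, ASCENDING

app = Flask(__name__)
results = MongoClient()["edm"]["analytic_results"]
results.create_index([("customer_id", ASCENDING)])  # secondary index for fast lookups

@app.route("/customers/<int:customer_id>/results")
def customer_results(customer_id):
    # Indexed point lookup; project away the internal _id for the JSON response.
    doc = results.find_one({"customer_id": customer_id}, {"_id": 0})
    return jsonify(doc or {})

if __name__ == "__main__":
    app.run()
```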
36. More Complete EDM Architecture & Data Lake
[Diagram: siloed source databases, external feeds (batch), and streams feed a data processing pipeline and data lake (distributed processing) via pub-sub, ETL, and file imports plus stream processing; drivers & stacks serve operational applications & reporting: single CSR application, unified digital apps, operational reporting, analytic reporting, customer clustering, churn analysis, predictive analytics, and downstream systems]
- Governance to choose where to load and process data
- Optimal location for providing operational response times & slices
- Can run processing on all data or slices
37. Example scenarios
1. Single Customer View
a. Operational
b. Analytics on customer segments
c. Analytics on all customers
2. Customer profiles & clustering
3. Presenting churn analytics on high value customers
38. Top 10 Bank Case Study
Unified real-time monitoring platform for customer-facing channels via Stratio's Big Data Platform
Problem:
- Wanted a high quality of service across online channels
- Many untapped data sources & streams (logs, clicks, social, etc.)
- Want to monitor service response times & provide root-cause analysis
Solution (why MongoDB):
- Used Flume for log data, MongoDB for persistence and KPIs, and Spark for analysis
- Flexible data model supported a wide variety of machine data
- Linear scalability made it easy to handle the additional load from each data source
Results:
- Solution impacts infrastructure for 31 countries and 51 million customers
- Can now adhere to strict SLAs across the infrastructure
- Improved response times are driving higher customer satisfaction and increased revenue
39. Data Lake Lessons Learned
1. Define the objectives
2. Design the future state
3. Consider the full data lifecycle towards operationalizing
4. Have a plan for managing metadata to avoid a data swamp
5. Deliver business value incrementally towards the future state
6. Decide on the data management layer based on how you will use the data (esp. read requirements)
7. MongoDB fills common gaps with low latency & indexes
40. Benefits of MongoDB & Hadoop Combined Data Lake
• Low TCO from commodity hardware
• Greater agility and faster time-to-market from schema-on-read
• Greater insight, as differences in the data remain present for drill-down
• Can scale cheaply to meet any SLAs
• Data can be current for intraday decision making
• Low-latency response times
• Optimal use of resources with indexing
• Overall greater insight and business impact
41. For More Information

| Resource | Location |
|----------|----------|
| Tutorial for Operationalizing Spark with MongoDB | www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb |
| Using MongoDB with Hadoop & Spark | www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup |
| Scalability Benchmarks | www.mongodb.com/collateral/scalability-benchmarking-mongodb-and-nosql-systems-report |
| Case Studies | mongodb.com/customers |
| Presentations | mongodb.com/presentations |
| Free Online Training | education.mongodb.com |
| Webinars and Events | mongodb.com/events |
| Documentation | docs.mongodb.org |
| MongoDB Downloads | mongodb.com/download |
Editor's Notes
Stream processing is often a separate processing layer from batch processing, but its output can be stored in the data stores at various stages
Now that we understand some of the challenges you’re facing and where you’d like to get, perhaps I can tell you a bit about why MongoDB exists and where we might be able to help.
Our founders observed some technological and business changes in the market. We built MongoDB to address the way the world is changing…
Data [tie back to what you’ve heard from customer if possible]
90% data created in last 2 years
80% enterprise data is unstructured
Unstructured data growing 2X rate of structured data
Time [tie back to what you’ve heard from customer if possible]
Development methods shifted from waterfall (12-24 months) to iterative
Leading edge companies like Facebook + Etsy shipping code multiple times a day
Risk [tie back to what you’ve heard from customer if possible]
User bases shifted from internal (thousands) to external (millions)
Can’t go down
All across the globe
Cost [tie back to what you’ve heard from customer if possible]
Shift to open-source and SaaS business models to pay for value over time
Ability to leverage cloud and commodity architectures to lower infrastructure costs
Looking at the other technologies in the market…
Relational databases laid the foundation for what you’d want out of your database
Rich and fast access to the data, using an expressive query language and secondary indexes
Strong consistency, so you know you’re always getting the most up to date version of the data
But they weren’t built for the world we just talked about
Built for waterfall dev cycles, structured data
Built for internal users, not large numbers of users all across the globe
(From vendors who want large license fees upfront)
--> So what they have in data access and consistency, they lack in flexibility, scalability and performance
Could make more visual
MongoDB was built to address the way the world has changed while preserving the core database capabilities required to build functional apps
MongoDB is the only database that harnesses the innovations of NoSQL and maintains the foundation of relational databases
More info: http://www.mongodb.com/mongodb-scale
Kernel 3.2 Scope tracking: https://docs.google.com/spreadsheets/d/1L1EbbWoshUIHXBzCh5e3sALtAFxm_dJ52SRPR6GzeAY/edit#gid=0
Release notes for 3.1.6: http://docs.mongodb.org/manual/release-notes/3.1-dev-series/
Determine validator rules: You can use the tool to figure out what you want to set as validation rules
$lookup – this creates new documents which contain everything from the previous stage but augmented with data from any document from the second collection containing a matching colored star (i.e., the blue and yellow stars had matching lookup values, whereas the red star had none)
In terms of reporting, a number of Business Intelligence (BI) vendors have developed connectors to integrate MongoDB as a data source into their suites, alongside traditional relational databases. This integration provides reporting, visualization, and dashboarding of MongoDB data
Just a logical diagram. Processing could be on same physical servers as storage nodes to minimize data movement