Enterprise Data Management in the Era of
MongoDB & Data Lakes
Matt Kalan
Sr. Solution Architect
matt.kalan@mongodb.com
@matthewkalan
Agenda
1. EDM Pipeline Overview
2. Current issues
3. Quick MongoDB Overview
4. Each Stage of EDM Pipeline
5. Future State EDM Architecture
6. Case Study & Scenarios
7. Data Lake Lessons Learned
Enterprise Data Management Pipeline
[Diagram: siloed source databases, external feeds (batch), and streams are ingested through pub-sub, ETL, file imports, and stream processing into a four-stage pipeline (store raw data, transform, aggregate, analyze) that serves users and other systems.]
Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png
Data Management Requirements in
Today’s World
Data
• Volume
• Velocity
• Variety
Time
• Iterative
• Agile
• Short Cycles
Risk
• Always On
• Scale
• Global
Cost
• Open-Source
• Cloud
• Commodity
Challenges for Relational Data Management Technologies
[Diagram: Expressive Query Language, Strong Consistency, Secondary Indexes, Flexibility, Scalability, Performance]
More specifically
• Unnecessary joins cause bad performance
• Expensive to scale up or out
• Rigid schemas make consolidation difficult
• Poor fit with variably structured or unstructured data
• Common models cause differences in records to be removed when aggregating
• Process often takes many hours overnight
• Data is too stale for intraday decisions and engagement
Ways to Improve
What data management technology improves on these requirements?
[Diagram: pipeline stages: store raw data, transform, aggregate, analyze]
First, Quick MongoDB Background
[Diagram: Relational: Expressive Query Language, Strong Consistency, Secondary Indexes. NoSQL: Flexibility, Scalability, Performance.]
Nexus Architecture
[Diagram: Expressive Query Language, Strong Consistency, Secondary Indexes, Flexibility, Scalability, Performance]
Documents Enable Dynamic Schema & Optimal
Performance
Relational MongoDB
{ customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones: [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }]
}
Customer ID  First Name  Last Name  City
0            John        Doe        New York
1            Mark        Smith      San Francisco
2            Jay         Black      Newark
3            Meagan      White      London
4            Edward      Daniels    Boston

Phone Number    Type  DNC     Customer ID
1-212-555-1212  home  T       0
1-212-555-1213  home  T       0
1-212-555-1214  cell  F       0
1-212-777-1212  home  T       1
1-212-777-1213  cell  (null)  1
1-212-888-1212  home  F       2
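The contrast between the two models can be sketched in plain JavaScript. This is a minimal illustration, not MongoDB code: the rows are copied from the tables on the slide, and `toDocument` is a hypothetical helper that performs in application code the join that the embedded document makes unnecessary.

```javascript
// Rows copied from the two relational tables on the slide (customer 1 only).
const customers = [
  { customerId: 1, firstName: "Mark", lastName: "Smith", city: "San Francisco" },
];
const phones = [
  { number: "1-212-777-1212", type: "home", dnc: true, customerId: 1 },
  { number: "1-212-777-1213", type: "cell", dnc: null, customerId: 1 },
];

// Hypothetical helper: joins the phone rows to their customer and embeds
// them as an array. A null column is simply omitted from the document,
// which is what the dynamic schema allows.
function toDocument(customer, phoneRows) {
  return {
    customer_id: customer.customerId,
    first_name: customer.firstName,
    last_name: customer.lastName,
    city: customer.city,
    phones: phoneRows
      .filter((p) => p.customerId === customer.customerId)
      .map(({ number, type, dnc }) =>
        dnc === null ? { number, type } : { number, dnc, type }
      ),
  };
}

const markDoc = toDocument(customers[0], phones);
console.log(JSON.stringify(markDoc, null, 2));
```

Once the data is stored in this shape, a single read returns the customer together with all phone numbers, with no join at query time.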
Document Model Benefits
Agility and flexibility
• Data model supports business change
• Rapidly iterate to meet new requirements
Intuitive, natural data representation
• Eliminates ORM layer
• Developers are more productive
Reduces the need for joins, disk seeks
• Programming is simpler
• Performance delivered at scale
{
  customer_id : 1,
  first_name : "Mark",
  last_name : "Smith",
  city : "San Francisco",
  phones: [
    {
      number : "1-212-777-1212",
      dnc : true,
      type : "home"
    },
    {
      number : "1-212-777-1213",
      type : "cell"
    }]
}
MongoDB Technical Capabilities
[Diagram: an application uses a driver to talk to mongos, which routes to Shard 1 through Shard N; each shard is a replica set with one primary and two secondaries.]
db.customer.insert({…})
db.customer.find({ name: "John Smith" })
1. Dynamic document schema
{ name: "John Smith",
  date: "2013-08-01",
  address: "10 3rd St.",
  phone: {
    home: 1234567890,
    mobile: 1234568138 }
}
2. Native language drivers
3. High availability
- Replica sets
4. Workload isolation
- Reading from secondaries
5. High performance
- Data locality
- Indexes
- RAM
6. Horizontal scalability
- Sharding
Drivers & Ecosystem
Support for the most popular languages and frameworks
Java, Python, Perl, Ruby
Morphia
MEAN Stack
Performance: Relational vs. MongoDB
• Better data locality
• In-memory caching
• Flexible indexes
Scale
Performance: 250M ticks/sec, 300K+ ops/sec, 500K+ ops/sec
Cluster: 1,400 servers, 1,000+ servers, 250+ servers
Data: petabytes, 10s of billions of objects, 13B documents
Organizations named on the slide: Fed Agency, Entertainment Co., Asian Internet Co.
MongoDB 3.2 Features Relevant for EDM
1. WiredTiger as default storage engine
2. In-memory storage engine
3. Encryption at rest
4. Document Validation Rules
5. Compass (data viewer & query builder)
6. Connector for BI (Visualization)
7. $lookup (left outer join)
Data Governance with Document Validation
Implement data governance without sacrificing the agility that comes from dynamic schema
• Enforce data quality across multiple teams and applications
• Use familiar MongoDB expressions to control document structure
• Validation is optional and can be as simple as a single field, all the way to every field, including existence, data types, and regular expressions
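The three kinds of rule named above (existence, data types, regular expressions) can be sketched as follows. The `customerValidator` object uses the real query operators `$exists`, `$type`, and `$regex`; in MongoDB 3.2 such an object would be supplied as the `validator` option when creating the collection. The specific rules are assumptions made for illustration, and `matchesRule` and `isValid` are hypothetical stand-ins that mimic only these three operators in memory (and only the "string" type) so the example runs without a server.

```javascript
// Illustrative rules, one per kind mentioned on the slide.
const customerValidator = {
  customer_id: { $exists: true },    // existence
  last_name: { $type: "string" },    // data type
  first_name: { $regex: /^[A-Z]/ },  // regular expression (assumed rule)
};

// Hypothetical in-memory check for a single field. Real validation is
// performed by the server; this handles only the operators used above.
function matchesRule(value, rule) {
  if ("$exists" in rule && (value !== undefined) !== rule.$exists) return false;
  if ("$type" in rule && typeof value !== rule.$type) return false;
  if ("$regex" in rule && !(typeof value === "string" && rule.$regex.test(value))) {
    return false;
  }
  return true;
}

// A document is valid when every listed field satisfies its rule.
// Fields not mentioned in the validator stay unconstrained.
function isValid(doc, validator) {
  return Object.entries(validator).every(([field, rule]) =>
    matchesRule(doc[field], rule)
  );
}

const goodDoc = { customer_id: 1, first_name: "Mark", last_name: "Smith", city: "San Francisco" };
const badDoc = { first_name: "Mark", last_name: "Smith" }; // customer_id missing

console.log(isValid(goodDoc, customerValidator));
console.log(isValid(badDoc, customerValidator));
```

Because only the listed fields are checked, teams can still add new fields freely, which is the balance between governance and agility the slide describes.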
MongoDB Compass
For fast schema discovery and visual construction of ad-hoc queries
• Visualize schema
– Frequency of fields
– Frequency of types
– Determine validator rules
• View documents
• Graphically build queries
• Authenticated access
MongoDB Connector for BI
Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following:
• Provides the BI tool with the schema of the MongoDB collection to be visualized
• Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing
• Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements
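To make the last step concrete, the sketch below shows one way a nested document could be turned into flat rows. It is an illustration of the idea only, not the connector's actual algorithm: `flattenToRows` and the dotted column names are assumptions.

```javascript
// Sample document in the shape used earlier in the deck.
const customerDoc = {
  customer_id: 1,
  first_name: "Mark",
  city: "San Francisco",
  phones: [
    { number: "1-212-777-1212", type: "home" },
    { number: "1-212-777-1213", type: "cell" },
  ],
};

// Hypothetical helper: emits one row per element of the named array field,
// repeating the parent fields on each row. A document whose array is
// missing or empty produces no rows in this simplified version.
function flattenToRows(doc, arrayField) {
  const { [arrayField]: items = [], ...parent } = doc;
  return items.map((item) => {
    const row = { ...parent };
    for (const [key, value] of Object.entries(item)) {
      row[`${arrayField}.${key}`] = value;
    }
    return row;
  });
}

const rows = flattenToRows(customerDoc, "phones");
console.table(rows);
```

Each row now has only scalar columns, which is the shape a SQL-based BI tool expects.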
Dynamic Lookup
Combine data from multiple collections with left outer joins for richer analytics & more flexibility in data modeling
• Blend data from multiple sources for analysis
• Higher performance analytics with less application-side code and less effort from your developers
• Executed via the new $lookup operator, a stage in the MongoDB Aggregation Framework pipeline
Aggregation Framework – Pipelined Analysis
Start with the original collection; each record (document) contains a number of shapes (keys), each with a particular color (value)
• $match filters out documents that don’t contain a red diamond
• $project adds a new “square” attribute with a value computed from the value (color) of the snowflake and triangle attributes
• $lookup performs a left outer join with another collection, with the star being the comparison key
• Finally, the $group stage groups the data by the color of the square and produces statistics for each group
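The four stages described above can be mimicked in memory to show how documents flow through a pipeline. This is a sketch in plain JavaScript, not the Aggregation Framework itself: the sample data, the rule for computing `square`, the `starInfo` collection, and the statistics are all assumptions made for illustration.

```javascript
// Sample "shapes" collection: keys are shapes, values are colors.
const shapes = [
  { _id: 1, diamond: "red", snowflake: "blue", triangle: "blue", star: "gold" },
  { _id: 2, diamond: "green", snowflake: "blue", triangle: "blue", star: "gold" },
  { _id: 3, diamond: "red", snowflake: "blue", triangle: "white", star: "silver" },
  { _id: 4, diamond: "red", snowflake: "blue", triangle: "blue", star: "bronze" },
];
// Second collection joined on the star value (hypothetical).
const starInfo = [
  { star: "gold", points: 3 },
  { star: "silver", points: 2 },
];

// Each stage is a function from an array of documents to an array of documents.
const match = (predicate) => (docs) => docs.filter(predicate);
const project = (fn) => (docs) => docs.map(fn);
// Left outer join: documents without a match keep an empty array.
const lookup = (from, localField, foreignField, as) => (docs) =>
  docs.map((d) => ({
    ...d,
    [as]: from.filter((f) => f[foreignField] === d[localField]),
  }));
// Groups by key and computes a count and a sum of joined points.
const group = (keyFn) => (docs) => {
  const groups = new Map();
  for (const d of docs) {
    const key = keyFn(d);
    const g = groups.get(key) || { _id: key, count: 0, totalPoints: 0 };
    g.count += 1;
    g.totalPoints += d.starDetails.reduce((sum, s) => sum + s.points, 0);
    groups.set(key, g);
  }
  return [...groups.values()];
};

const stages = [
  match((d) => d.diamond === "red"),
  project((d) => ({
    ...d,
    square: d.snowflake === d.triangle ? d.snowflake : "mixed",
  })),
  lookup(starInfo, "star", "star", "starDetails"),
  group((d) => d.square),
];

// Run the stages in order, feeding each stage's output to the next.
const result = stages.reduce((docs, stage) => stage(docs), shapes);
console.log(result);
```

Document 4 has no matching star, yet it still reaches the group stage with an empty `starDetails` array; that is the left outer join behavior the slide attributes to $lookup.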
DB & Partner Ecosystem
4th Most Popular, Fastest Growing
RANK  DBMS                  MODEL            SCORE  GROWTH (20 MO)
1.    Oracle                Relational DBMS  1,442  -5%
2.    MySQL                 Relational DBMS  1,294  2%
3.    Microsoft SQL Server  Relational DBMS  1,131  -10%
4.    MongoDB               Document Store   277    172%
5.    PostgreSQL            Relational DBMS  273    40%
6.    DB2                   Relational DBMS  201    11%
7.    Microsoft Access      Relational DBMS  146    -26%
8.    Cassandra             Wide Column      107    87%
9.    SQLite                Relational DBMS  105    19%
Source: DB-Engines database popularity rankings; May 2015
Only non-relational in the top 5; 2.5x ahead of nearest NoSQL competitor
Partner Ecosystem (500+)
* BI Connector (ODBC driver) and $lookup (left outer join) are planned to be released with v3.2 in Q4
MongoDB Architecture Patterns
1. Operational Data Store (ODS)
2. Enterprise Data Service
3. Datamart/Cache
4. Master Data Distribution
5. Single Operational View
6. Operationalizing Hadoop
Roles: System of Record, System of Engagement
MongoDB Hadoop/Spark Connector
MongoDB
• Sub-second latency
• Expressive querying
• Flexible indexing
• Aggregations in database
• Great for any subset of data
Hadoop
• Longer jobs
• Batch analytics
• Append-only files
• Great for scanning all data or large subsets in files
Distributed processing/analytics
- MongoDB Hadoop Connector
- Spark-mongodb
Both provide:
• Schema-on-read
• Low TCO
• Horizontal scale
How to Choose Data Management
Products in the EDM Pipeline
[Diagram: the Enterprise Data Management Pipeline shown earlier: sources are ingested into the stages store raw data, transform, aggregate, and analyze, which serve users and other systems.]
How to choose the data management layer for each or all stages?
Processing layer: ?
MongoDB, when you want:
1. Secondary indexes
2. Sub-second latency
3. Aggregations in DB
4. Updates of data
Hadoop, for:
1. Scanning files
2. When indexes are not needed
Wide column store (e.g. HBase), for:
1. Primary key queries
2. When multiple indexes & slices are not needed
3. Workloads optimized for writing, not reading
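The selection criteria above can be written down as a small decision function. This is a sketch under assumptions: it reads the first list as describing MongoDB and the second as describing Hadoop files, as the earlier comparison of the two suggests, and `suggestStore` with its flag names is hypothetical. A real decision would weigh more than these flags.

```javascript
// Hypothetical helper: maps read/write requirements to the option the
// slide associates with them. Flags default to false.
function suggestStore({
  secondaryIndexes = false,
  subSecondLatency = false,
  inDbAggregation = false,
  updates = false,
  primaryKeyOnly = false,
  writeOptimized = false,
} = {}) {
  // Any of the four "when you want" criteria points to MongoDB.
  if (secondaryIndexes || subSecondLatency || inDbAggregation || updates) {
    return "MongoDB";
  }
  // Primary-key access or write-heavy loads without indexes or slices.
  if (primaryKeyOnly || writeOptimized) {
    return "Wide column store (e.g. HBase)";
  }
  // Otherwise the workload is scanning files with no need for indexes.
  return "Hadoop files";
}

console.log(suggestStore({ subSecondLatency: true }));
console.log(suggestStore({ primaryKeyOnly: true }));
console.log(suggestStore());
```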
Data Store for Raw Dataset
[Diagram: the EDM pipeline, focused on the raw data store, which is written from the sources, read by the Transform stage, and queried by users.]
- Typically just writing record-by-record from source data
- Usually just need high write volumes
- All 3 options handle that
Transform read requirements
- Benefits to reading multiple datasets sorted [by index], e.g. to do a merge
- Might want to look up across tables with indexes (and join functionality in MongoDB v3.2)
- Want high read performance while writes are happening
Interactive querying on the raw data could use indexes with MongoDB
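The merge mentioned under the Transform read requirements is cheap when both inputs arrive already sorted, which is what an index-ordered read provides. A minimal sketch, assuming two arrays sorted by the same key; `mergeSortedByKey` and the sample records are hypothetical.

```javascript
// Merges two arrays that are each already sorted by `key` into one sorted
// array in a single pass, without re-sorting.
function mergeSortedByKey(left, right, key) {
  const merged = [];
  let i = 0;
  let j = 0;
  while (i < left.length && j < right.length) {
    // Take the smaller head; ties favor the left input to keep order stable.
    if (left[i][key] <= right[j][key]) merged.push(left[i++]);
    else merged.push(right[j++]);
  }
  // Append whatever remains from either input.
  while (i < left.length) merged.push(left[i++]);
  while (j < right.length) merged.push(right[j++]);
  return merged;
}

// Two hypothetical source feeds, each read in customer_id order.
const feedA = [{ customer_id: 1 }, { customer_id: 3 }, { customer_id: 5 }];
const feedB = [{ customer_id: 2 }, { customer_id: 4 }];

const mergedFeed = mergeSortedByKey(feedA, feedB, "customer_id");
console.log(mergedFeed.map((r) => r.customer_id));
```

Without sorted input, the same result would require sorting the combined data first, which is the extra cost the slide's point about reading by index avoids.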
Data Store for Transformed Dataset
[Diagram: the EDM pipeline, focused on the data store between the Transform and Aggregate stages, which users also query.]
- Updating data is often beneficial when merging multiple datasets
- Dashboards & reports can have sub-second latency with indexes
Aggregate read requirements
- Benefits to using indexes for grouping
- Aggregations natively in the DB would help
- With indexes, can do aggregations on slices of data
- Might want to look up across tables with indexes to aggregate
Data Store for Aggregated Dataset
[Diagram: the EDM pipeline, focused on the data store between the Aggregate and Analyze stages, which users also query.]
- Dashboards & reports can have sub-second latency with indexes
Analytics read requirements
- Often want to analyze a slice of data (using indexes)
- For scanning all of the data, it could be in any data store
- Querying on slices is best in MongoDB
Data Store for Last Dataset
[Diagram: the EDM pipeline, focused on the data store written by the Analyze stage and read by users and other systems.]
- Dashboards & reports can have sub-second latency with indexes
- At the last step, there are many consuming systems and users
- Need expressive querying with secondary indexes
- MongoDB is the best option for the publication or distribution of analytical results and operationalization of data
Other systems are often digital applications
- High scale
- Expressive querying
- JSON preferred
Often RESTful services, APIs
Future State EDM Architecture
More Complete EDM Architecture & Data Lake
[Diagram: siloed source databases, external feeds (batch), and streams are ingested through pub-sub, ETL, file imports, and stream processing into a data processing pipeline within the data lake, which also feeds downstream systems.]
Layers shown: Drivers & Stacks, Distributed Processing, Operational Applications & Reporting, Data Lake
Applications and workloads shown: Single CSR Application, Unified Digital Apps, Operational Reporting, Analytic Reporting, Customer Clustering, Churn Analysis, Predictive Analytics
Annotations
- Governance to choose where to load and process data
- Optimal location for providing operational response times & slices
- Can run processing on all data or slices
Example scenarios
1. Single Customer View
a. Operational
b. Analytics on customer segments
c. Analytics on all customers
2. Customer profiles & clustering
3. Presenting churn analytics on high value customers
Case Study: Top 10 Bank
Unified real-time monitoring platform for customer-facing channels via Stratio’s Big Data Platform
Problem
• Wanted a high quality of service across online channels
• Many untapped data sources & streams (logs, clicks, social, etc.)
• Wanted to be able to monitor service response times & provide root cause analysis
Solution (why MongoDB)
• Used Flume for log data, MongoDB for persistence and KPIs, and Spark for analysis
• Flexible data model provided support for wide variety of machine data
• Linear scalability made it easy to handle additional load for each data source
Results
• Solution impacts infrastructure for 31 countries and 51 million customers
• Can now adhere to strict SLAs across infrastructure
• Improved response times are driving higher customer satisfaction and increased revenue
Data Lake Lessons Learned
1. Define the objectives
2. Design the future state
3. Consider the full data lifecycle towards operationalizing
4. Have a plan for managing metadata to avoid data swamp
5. Deliver business value incrementally towards future state
6. Decide data management layer based on how you will use the data (esp. read requirements)
7. MongoDB fills common gaps with low latency & indexes
Benefits of MongoDB & Hadoop Combined Data Lake
• Low TCO from commodity hardware
• Greater agility and faster time-to-market from schema-on-read
• Greater insight as differences in data are present for drill down
• Can scale cheaply to meet any SLAs
• Data can be current for intraday decision making
• Low latency response times
• Optimal use of resources with indexing
• Overall greater insight and business impact
For More Information
Resource: Location
Tutorial for Operationalizing Spark with MongoDB: www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
Using MongoDB with Hadoop & Spark: www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
Scalability Benchmarks: www.mongodb.com/collateral/scalability-benchmarking-mongodb-and-nosql-systems-report
Case Studies: mongodb.com/customers
Presentations: mongodb.com/presentations
Free Online Training: education.mongodb.com
Webinars and Events: mongodb.com/events
Documentation: docs.mongodb.org
MongoDB Downloads: mongodb.com/download
Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes

A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Webinar: Enterprise Data Management in the Era of MongoDB and Data Lakes

  • 1. Enterprise Data Management in the Era of MongoDB & Data Lakes Matt Kalan Sr. Solution Architect matt.kalan@mongodb.com @matthewkalan
  • 2. Agenda 1. EDM Pipeline Overview 2. Current issues 3. Quick MongoDB Overview 4. Each Stage of EDM Pipeline 5. Future State EDM Architecture 6. Case Study & Scenarios 7. Data Lake Lessons Learned
  • 3. Enterprise Data Management Pipeline [Diagram: siloed source databases, external feeds (batch), and streams feed the pipeline via pub-sub, ETL, and file imports, with stream processing alongside. Pipeline stages: Store raw data, Transform, Aggregate, Analyze. Consumers: users and other systems. Stream icon from: https://en.wikipedia.org/wiki/File:Activity_Streams_icon.png]
  • 4. Data Management Requirements in Today’s World Data: Volume, Velocity, Variety. Time: Iterative, Agile, Short Cycles. Risk: Always On, Scale, Global. Cost: Open-Source, Cloud, Commodity.
  • 6. More specifically • Unnecessary joins cause bad performance • Expensive to scale up or out • Rigid schemas make consolidation difficult • Poor fit with variably structured or unstructured data • Common models cause differences in records to be removed when aggregating • Process often takes many hours overnight • Data is too stale for intraday decisions and engagement
  • 7. Ways to Improve What data management technology improves on these requirements? [Diagram: Store raw data, Transform, Aggregate, Analyze]
  • 8. First, Quick MongoDB Background
  • 11. Documents Enable Dynamic Schema & Optimal Performance
    MongoDB: { customer_id : 1, first_name : "Mark", last_name : "Smith", city : "San Francisco", phones: [ { number : "1-212-777-1212", dnc : true, type : "home" }, { number : "1-212-777-1213", type : "cell" } ] }
    Relational, customer table:
    Customer ID | First Name | Last Name | City
    0 | John | Doe | New York
    1 | Mark | Smith | San Francisco
    2 | Jay | Black | Newark
    3 | Meagan | White | London
    4 | Edward | Daniels | Boston
    Relational, phone table:
    Phone Number | Type | DNC | Customer ID
    1-212-555-1212 | home | T | 0
    1-212-555-1213 | home | T | 0
    1-212-555-1214 | cell | F | 0
    1-212-777-1212 | home | T | 1
    1-212-777-1213 | cell | (null) | 1
    1-212-888-1212 | home | F | 2
  • 12. Document Model Benefits Agility and flexibility: data model supports business change; rapidly iterate to meet new requirements. Intuitive, natural data representation: eliminates ORM layer; developers are more productive. Reduces the need for joins and disk seeks: programming is simpler; performance delivered at scale. Example document: { customer_id : 1, first_name : "Mark", last_name : "Smith", city : "San Francisco", phones: [ { number : "1-212-777-1212", dnc : true, type : "home" }, { number : "1-212-777-1213", type : "cell" } ] }
  • 13. MongoDB Technical Capabilities [Diagram: Application, Driver, Mongos, then Shard 1, Shard 2, … Shard N, each shard shown as one Primary and two Secondaries] Example operations: db.customer.insert({…}) and db.customer.find({ name: "John Smith" }) 1. Dynamic document schema, e.g. { name: "John Smith", date: "2013-08-01", address: "10 3rd St.", phone: { home: 1234567890, mobile: 1234568138 } } 2. Native language drivers 3. High availability: replica sets 4. Workload isolation: reading from secondaries 5. High performance: data locality, indexes, RAM 6. Horizontal scalability: sharding
  • 14. Drivers & Ecosystem Support for the most popular languages and frameworks: Java, Python, Perl, Ruby, Morphia, MEAN Stack
  • 15. Performance • Better data locality • In-memory caching • Flexible indexes [Diagram compares Relational vs. MongoDB]
  • 16. Scale Performance: 250M Ticks/Sec, 300K+ Ops/Sec, 500K+ Ops/Sec. Cluster: 1,400 Servers, 1,000+ Servers, 250+ Servers. Data: Petabytes, 10s of billions of objects, 13B documents. Organizations named on the slide: Fed Agency, Entertainment Co., Asian Internet Co.
  • 17. 3.2 Features Relevant for EDM 1. WiredTiger as default storage engine 2. In-memory storage engine 3. Encryption at rest 4. Document Validation Rules 5. Compass (data viewer & query builder) 6. Connector for BI (Visualization) 7. $lookup (left outer join)
  • 18. Data Governance with Document Validation Implement data governance without sacrificing agility that comes from dynamic schema • Enforce data quality across multiple teams and applications • Use familiar MongoDB expressions to control document structure • Validation is optional and can be as simple as a single field, all the way to every field, including existence, data types, and regular expressions
  • 19. MongoDB Compass For fast schema discovery and visual construction of ad-hoc queries • Visualize schema – Frequency of fields – Frequency of types – Determine validator rules • View Documents • Graphically build queries • Authenticated access
  • 20. MongoDB Connector for BI Visualize and explore multi-dimensional documents using SQL-based BI tools. The connector does the following: • Provides the BI tool with the schema of the MongoDB collection to be visualized • Translates SQL statements issued by the BI tool into equivalent MongoDB queries that are sent to MongoDB for processing • Converts the results into the tabular format expected by the BI tool, which can then visualize the data based on user requirements
  • 21. Dynamic Lookup Combine data from multiple collections with left outer joins for richer analytics & more flexibility in data modeling • Blend data from multiple sources for analysis • Higher performance analytics with less application-side code and less effort from your developers • Executed via the new $lookup operator, a stage in the MongoDB Aggregation Framework pipeline
  • 22. Aggregation Framework – Pipelined Analysis Start with the original collection; each record (document) contains a number of shapes (keys), each with a particular color (value) • $match filters out documents that don’t contain a red diamond • $project adds a new “square” attribute with a value computed from the value (color) of the snowflake and triangle attributes • $lookup performs a left outer join with another collection, with the star being the comparison key • Finally, the $group stage groups the data by the color of the square and produces statistics for each group
  • 23. DB & Partner Ecosystem
  • 24. 4th Most Popular, Fastest Growing
    Rank | DBMS | Model | Score | Growth (20 mo)
    1 | Oracle | Relational DBMS | 1,442 | -5%
    2 | MySQL | Relational DBMS | 1,294 | 2%
    3 | Microsoft SQL Server | Relational DBMS | 1,131 | -10%
    4 | MongoDB | Document Store | 277 | 172%
    5 | PostgreSQL | Relational DBMS | 273 | 40%
    6 | DB2 | Relational DBMS | 201 | 11%
    7 | Microsoft Access | Relational DBMS | 146 | -26%
    8 | Cassandra | Wide Column | 107 | 87%
    9 | SQLite | Relational DBMS | 105 | 19%
    Source: DB-engines database popularity rankings; May 2015. Only non-relational in the top 5; 2.5x ahead of nearest NoSQL competitor.
  • 25. Partner Ecosystem (500+) * BI Connector (ODBC driver) and $lookup (left outer join) are planned to be released with v3.2 in Q4
  • 26. MongoDB Architecture Patterns 1. Operational Data Store (ODS) 2. Enterprise Data Service 3. Datamart/Cache 4. Master Data Distribution 5. Single Operational View 6. Operationalizing Hadoop. Labels on the slide: System of Record, System of Engagement
  • 27. MongoDB Hadoop/Spark Connector MongoDB: • Sub-second latency • Expressive querying • Flexible indexing • Aggregations in database • Great for any subset of data. Hadoop/Spark: • Longer jobs • Batch analytics • Append-only files • Great for scanning all data or large subsets in files. Distributed processing/analytics connectors: MongoDB Hadoop Connector, Spark-mongodb. Both provide: • Schema-on-read • Low TCO • Horizontal scale
  • 28. How to Choose Data Management Products in the EDM Pipeline
  • 29. Enterprise Data Management Pipeline [Same pipeline diagram as slide 3: sources, then Store raw data, Transform, Aggregate, Analyze, then users and other systems]
  • 30. How to choose the data management layer for each or all stages? [Diagram: three data store options beneath a processing layer] MongoDB, when you want: 1. Secondary indexes 2. Sub-second latency 3. Aggregations in DB 4. Updates of data. Files in Hadoop, for: 1. Scanning files 2. When indexes not needed. Wide column store (e.g. HBase), for: 1. Primary key queries 2. If multiple indexes & slices not needed 3. Optimized for writing, not reading
  • 31. Data Store for Raw Dataset [Pipeline diagram with the "Store raw data" stage highlighted] Storing raw data: - Typically just writing record-by-record from source data - Usually just need high write volumes - All 3 options handle that. Transform read requirements: - Benefits to reading multiple datasets sorted [by index], e.g. to do a merge - Might want to look up across tables with indexes (and join functionality in MDB v3.2) - Want high read performance while writes are happening. Interactive querying on the raw data could use indexes with MongoDB.
  • 32. Data Store for Transformed Dataset [Pipeline diagram with the "Transform" stage highlighted] Updating data is often beneficial when merging multiple datasets. Dashboards & reports can have sub-second latency with indexes. Aggregate read requirements: - Benefits to using indexes for grouping - Aggregations natively in the DB would help - With indexes, can do aggregations on slices of data - Might want to look up across tables with indexes to aggregate
  • 33. Data Store for Aggregated Dataset [Pipeline diagram with the "Aggregate" stage highlighted] Dashboards & reports can have sub-second latency with indexes. Analytics read requirements: - Often want to analyze a slice of data (using indexes) - For scanning all of the data, it could be in any data store - Querying on slices is best in MongoDB
  • 34. Data Store for Last Dataset [Pipeline diagram with the "Analyze" stage highlighted] Dashboards & reports can have sub-second latency with indexes. - At the last step, there are many consuming systems and users - Need expressive querying with secondary indexes - MongoDB is the best option for the publication or distribution of analytical results and operationalization of data. Other systems: often digital applications (high scale, expressive querying, JSON preferred); often RESTful services, APIs
  • 35. Future State EDM Architecture
  • 36. More Complete EDM Architecture & Data Lake [Diagram: siloed source databases, external feeds (batch), and streams enter via pub-sub, ETL, file imports, and stream processing into the data lake's data processing pipeline, which feeds downstream systems. Operational applications & reporting (via drivers & stacks): Single CSR Application, Unified Digital Apps, Operational Reporting, Analytic Reporting. Distributed processing: Customer Clustering, Churn Analysis, Predictive Analytics.] Callouts: Governance to choose where to load and process data. Optimal location for providing operational response times & slices. Can run processing on all data or slices.
  • 37. Example scenarios 1. Single Customer View a. Operational b. Analytics on customer segments c. Analytics on all customers 2. Customer profiles & clustering 3. Presenting churn analytics on high value customers
  • 38. Top 10 Bank Case Study Unified real-time monitoring platform for customer-facing channels via Stratio’s Big Data Platform. Problem: wanted a high quality of service across online channels; many untapped data sources & streams (logs, clicks, social, etc.); want to be able to monitor service response times & provide root cause analysis. Solution: used Flume for log data, MongoDB for persisting and KPIs, and Spark for analysis; flexible data model provided support for a wide variety of machine data; linear scalability made it easy to handle additional load for each data source. Results: solution impacts infrastructure for 31 countries and 51 million customers; can now adhere to strict SLAs across infrastructure; improved response times are driving higher customer satisfaction and increased revenue.
  • 39. Data Lake Lessons Learned 1. Define the objectives 2. Design the future state 3. Consider the full data lifecycle towards operationalizing 4. Have a plan for managing metadata to avoid data swamp 5. Deliver business value incrementally towards future state 6. Decide data management layer based on how you will use the data (esp. read requirements) 7. MongoDB fills common gaps with low latency & indexes
  • 40. Benefits of MongoDB & Hadoop Combined Data Lake • Low TCO from commodity hardware • Greater agility and faster time-to-market from schema-on-read • Greater insight as differences in data are present for drill down • Can scale cheaply to meet any SLAs • Data can be current for intraday decision making • Low latency response times • Optimal use of resources with indexing • Overall greater insight and business impact
  • 41. For More Information
    Resource | Location
    Tutorial for Operationalizing Spark with MongoDB | www.mongodb.com/blog/post/tutorial-for-operationalizing-spark-with-mongodb
    Using MongoDB with Hadoop & Spark | www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
    Scalability Benchmarks | www.mongodb.com/collateral/scalability-benchmarking-mongodb-and-nosql-systems-report
    Case Studies | mongodb.com/customers
    Presentations | mongodb.com/presentations
    Free Online Training | education.mongodb.com
    Webinars and Events | mongodb.com/events
    Documentation | docs.mongodb.org
    MongoDB Downloads | mongodb.com/download
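
The stage sequence described on slides 21 and 22 ($match, $project, $lookup, $group) could be written roughly as follows in mongo shell JavaScript. This is a minimal sketch rather than code from the presentation: the collection names `orders` and `customers` and the fields `status`, `amount`, `qty`, `customer_id`, and `city` are hypothetical, and the `$unwind` stage is an addition that flattens the joined array before grouping.

```javascript
// Hypothetical collections (assumptions for illustration only):
//   orders:    { status, amount, qty, customer_id }
//   customers: { customer_id, city }
const pipeline = [
  // $match: keep only the documents of interest
  { $match: { status: "shipped" } },
  // $project: compute a new field from existing ones
  { $project: { customer_id: 1, total: { $multiply: ["$amount", "$qty"] } } },
  // $lookup: left outer join against a second collection
  { $lookup: {
      from: "customers",
      localField: "customer_id",
      foreignField: "customer_id",
      as: "customer"
  } },
  // $unwind: one document per joined customer; orders without a match are kept
  { $unwind: { path: "$customer", preserveNullAndEmptyArrays: true } },
  // $group: statistics per group (here, per customer city)
  { $group: {
      _id: "$customer.city",
      orders: { $sum: 1 },
      revenue: { $sum: "$total" }
  } }
];

// Inside the mongo shell `db` is defined, so the pipeline runs in the database.
// Outside the shell this block is skipped and only the pipeline is built.
if (typeof db !== "undefined") {
  db.orders.aggregate(pipeline).forEach(doc => printjson(doc));
}
```

Because the join and grouping run inside the database, the application only receives the grouped results, which is the "less application-side code" point made on slide 21.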

Editor's notes

  1. Stream processing is often a separate processing layer from batch processing, but its output can be stored in the data stores at various stages
  2. Now that we understand some of the challenges you’re facing and where you’d like to get, perhaps I can tell you a bit about why MongoDB exists and where we might be able to help. Our founders observed some technological and business changes in the market. We built MongoDB to address the way the world is changing. Data [tie back to what you’ve heard from customer if possible]: 90% of data was created in the last 2 years; 80% of enterprise data is unstructured; unstructured data is growing at 2x the rate of structured data. Time [tie back to what you’ve heard from customer if possible]: development methods shifted from waterfall (12-24 months) to iterative; leading-edge companies like Facebook and Etsy ship code multiple times a day. Risk [tie back to what you’ve heard from customer if possible]: user bases shifted from internal (thousands) to external (millions); can’t go down; all across the globe. Cost [tie back to what you’ve heard from customer if possible]: shift to open-source and SaaS business models to pay for value over time; ability to leverage cloud and commodity architectures to lower infrastructure costs.
  3. Looking at the other technologies in the market: relational databases laid the foundation for what you’d want out of your database. Rich and fast access to the data, using an expressive query language and secondary indexes. Strong consistency, so you know you’re always getting the most up-to-date version of the data. But they weren’t built for the world we just talked about. They were built for waterfall dev cycles and structured data, and for internal users, not large numbers of users all across the globe (from vendors who want large license fees upfront). So what they have in data access and consistency, they lack in flexibility, scalability and performance.
  4. Could make more visual
  5. MongoDB was built to address the way the world has changed while preserving the core database capabilities required to build functional apps. MongoDB is the only database that harnesses the innovations of NoSQL and maintains the foundation of relational databases.
  6. MongoDB was built to address the way the world has changed while preserving the core database capabilities required to build functional apps. MongoDB is the only database that harnesses the innovations of NoSQL and maintains the foundation of relational databases.
  7. More info: http://www.mongodb.com/mongodb-scale
  8. Kernel 3.2 Scope tracking: https://docs.google.com/spreadsheets/d/1L1EbbWoshUIHXBzCh5e3sALtAFxm_dJ52SRPR6GzeAY/edit#gid=0 Release notes for 3.1.6: http://docs.mongodb.org/manual/release-notes/3.1-dev-series/
  9. Determine validator rules: You can use the tool to figure out what you want to set as validation rules
  10. $lookup – this creates new documents which contain everything from the previous stage but augmented with data from any document from the second collection containing a matching colored star (i.e., the blue and yellow stars had matching lookup values, whereas the red star had none)
  11. In terms of reporting, a number of Business Intelligence (BI) vendors have developed connectors to integrate MongoDB as a data source with their suites, alongside traditional relational databases. This integration provides reporting, visualizations, and dashboarding of MongoDB data
  12. Stream processing is often a separate processing layer from batch processing, but its output can be stored in the data stores at various stages
  13. Just a logical diagram. Processing could be on the same physical servers as the storage nodes to minimize data movement
  14. Could make more visual
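
The validator rules mentioned in note 9 and on slide 18 (existence, data types, and regular expressions) might look like the following in mongo shell JavaScript. This is a minimal sketch under stated assumptions: the `customer` collection and its fields follow the example document on slide 11, while the specific rules and the phone-number pattern are hypothetical and not taken from the presentation.

```javascript
// Hypothetical rules for the customer documents shown on slide 11.
const customerValidator = {
  $and: [
    // existence: every customer needs an id
    { customer_id: { $exists: true } },
    // data type: last_name must be a string
    { last_name: { $type: "string" } },
    // regular expression: at least one phone number shaped like 1-212-777-1212
    { "phones.number": { $regex: /^\d-\d{3}-\d{3}-\d{4}$/ } }
  ]
};

const collectionOptions = {
  validator: customerValidator,
  validationLevel: "strict",  // check all inserts and updates
  validationAction: "error"   // reject documents that fail the rules
};

// Inside the mongo shell `db` is defined and the collection is created with
// the validator attached. Outside the shell this block is skipped.
if (typeof db !== "undefined") {
  db.createCollection("customer", collectionOptions);
}
```

Because the validator is an ordinary query expression, it can start with a single field and grow as governance requirements firm up, which matches the "as simple as a single field, all the way to every field" point on slide 18.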