Training:
MongoDB for Coder
Uwe Seiler
uweseiler
About me

Big Data Nerd

Hadoop Trainer MongoDB Author

Photography Enthusiast

Travelpirate
About us
is a bunch of…

Big Data Nerds

Agile Ninjas

Continuous Delivery Gurus

Join us!
Enterprise Java Specialists Per...
Agenda I
1. Introduction to NoSQL & MongoDB
2. Data manipulation: Learn how to CRUD

with MongoDB
3. Indexing: Speed up yo...
Agenda
5. Aggregation Framework: Data

aggregation done the MongoDB way
6. Replication: High Availability with

MongoDB
7....
Ingredients

•

Slides

•

Live Coding

•

Discussion

•

Labs on your own computer
And please…

If you have
questions, please
share them with us!
And now start your downloads…

Lab files:
http://bit.ly/1aT8RXY
Buzzword Bingo
NoSQL
Classification of NoSQL
Key-Value Stores
Column Stores
Graph Databases
...
Big Data
My favorite definition
The classic definition
•

The 3 V’s of Big Data

Volume, Velocity, Variety
«Big Data» != Hadoop
Horizontal
Scaling
Vertical Scaling
[Figure: one server with RAM, CPU, Storage]
Horizontal Scaling
[Figure: many servers, each with RAM, CPU, Storage]
The problem
with
distributed
data
The CAP Theorem
Availability
a guarantee
that every
request
receives a
response

Consistency
all nodes see
the same data
a...
Overview of NoSQL systems

Availability
a guarantee
that every
request
receives a
response

Consistency

Partition

Toler...
The problem
with
consistency
ACID

vs.
BASE
ACID vs. BASE

1983

Atomicity RDBMS
Consistency
Isolation
Durability
ACID vs. BASE

ACID is a good
concept but it is not
a written law!
ACID vs. BASE

Basically Available
Soft State
2008

NoSQL

Eventually consistent
ACID vs. BASE
ACID

BASE

-

-

Strong consistency
Isolation & Transactions
Two-Phase-Commit
Complex Development
More reli...
Overview of MongoDB
MongoDB is a…
•

document

•

open source

•

highly performant

•

flexible

•

scalable

•

highly available

•

feature...
Document Database
•

Not PDF, Word, etc. … JSON!
Open Source Database
•

MongoDB is an open source project

•

Available on GitHub
– https://github.com/mongodb/mongo

•

Us...
Performance

Data
locality

In-Memory
Caching

In-Place
Updates
Flexible Schema
RDBMS

MongoDB
{
_id :
ObjectId("4c4ba5e5e8aabf3"),
employee_name: "Dunham, Justin",
department : "Marketi...
Scalability
Auto Sharding

• Increase capacity as you go
• Commodity and cloud architectures
• Improved operational simpli...
High Availability

• Automated replication and failover
• Multi-data center support
• Improved operational simplicity (e.g...
MongoDB Architecture
Rich Query Language
Aggregation Framework
Map/Reduce
MongoDB

Data

Map()
emit(k,v)

Group(k)

Shard 1
Sort(k)
Shard 2

…
Shard
n

Reduce(k, values)

Finalize(k, v)
Geo Information
Driver & Shell
Drivers are available
for almost all popular
programming
languages and
frameworks

Java

JavaScript

Python...
NoSQL Trends
Google Search

LinkedIn Job Skills
MongoDB
Competitor 1
Competitor 2
Competitor 3
Competitor 4
Competitor 5

...
Data manipulation
Terminology
RDBMS
Table / View
Row
Index
Join
Foreign Key
Partition

MongoDB
➜
➜
➜
➜
➜
➜

Collection
Document
Index
Embedd...
Example: Simple blog model
MongoDB Collections
•

User

•

Article

•

Tag

•

Category
Schema design for the blog
Let’s have a look…
Create a database
// Show all databases
> show dbs
digg 0.078125GB
enron 1.49951171875GB
// Switch to a database
> use blo...
Create a collection I
// Show all collections
> show collections
// Insert a user
> db.user.insert(
{ name : "Sheldon",
ma...
Create a collection II
// Show all collections
> show collections
system.indexes
user
// Show all databases
> show dbs
blo...
Read from a collection
// Show the first document
> db.user.findOne()
{
"_id" : ObjectId("516684a32f391f3c2fcb80ed"),
"nam...
Find documents
// Find a specific document
> db.user.find( { name : "Penny" } )
{
"_id" : ObjectId("5166a9dc2f391f3c2fcb80...
_id
•

_id is the primary key in MongoDB

•

_id is created automatically

•

If not specified otherwise, its type is

...
ObjectId
•

An ObjectId is a special 12-byte value

•

Its uniqueness across the whole cluster is
guaranteed as follows:
Obj...
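As a rough illustration of this layout, the following plain-JavaScript sketch splits the example ObjectId from the slide into its four parts and decodes the leading timestamp. It runs without a database. The helper name `splitObjectId` is made up for this example, and the 4/3/2/3 byte split is assumed from the ts/mac/pid/inc diagram on the slide.

```javascript
// Hypothetical helper: split a 24-character ObjectId hex string into the
// four parts shown on the slide (ts = 4 bytes, mac = 3, pid = 2, inc = 3).
function splitObjectId(hex) {
  if (!/^[0-9a-f]{24}$/i.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    ts: parseInt(hex.slice(0, 8), 16),    // seconds since the Unix epoch
    mac: hex.slice(8, 14),                // machine identifier
    pid: hex.slice(14, 18),               // process id
    inc: parseInt(hex.slice(18, 24), 16)  // counter
  };
}

const parts = splitObjectId("50804d0bd94ccab2da652599");
const createdAt = new Date(parts.ts * 1000);
console.log(parts, createdAt.toISOString());
```

In the mongo shell, `ObjectId("...").getTimestamp()` should give the same creation time without any manual parsing.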
Cursor
// Use a cursor with find()
> var myCursor = db.user.find( )
// Get the next document
> var myDocument =
myCursor.h...
Logical operators
// Find documents using OR
> db.user.find(
{$or : [ { name : "Sheldon" },
{ mail : "amy@bigbang.com" }
]
}...
Manipulating results
// Sort documents
> db.user.find().sort( { name : 1 } ) // Ascending
> db.user.find().sort( { name ...
Updating documents I
// Updating only the mail address (how not to do it…)
> db.user.update( { name : "Sheldon" },
{ mail : "...
Deleting documents
// Deleting a document
> db.user.remove(
{ mail : "sheldon@howimetyourmother.com" }
)
// Deleting all d...
Updating documents II
// Updating only the mail address (This time for real)
> db.user.update( { name : "Sheldon" },
{ $se...
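The difference between the two update variants can be mimicked in plain JavaScript. This is only a sketch of the semantics, not MongoDB code: `replaceDocument` and `applySet` are hypothetical helpers, and the `_id` value is a placeholder.

```javascript
// Update without an operator: the stored document is replaced by the
// update document; only _id survives.
function replaceDocument(doc, replacement) {
  return Object.assign({ _id: doc._id }, replacement);
}

// Update with $set: only the listed fields change, everything else is kept.
function applySet(doc, fields) {
  return Object.assign({}, doc, fields);
}

const sheldon = { _id: 1, name: "Sheldon", mail: "sheldon@bigbang.com" };
const newMail = { mail: "sheldon@howimetyourmother.com" };

const replaced = replaceDocument(sheldon, newMail); // name is lost
const updated = applySet(sheldon, newMail);         // name is kept
console.log(replaced, updated);
```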
Adding to arrays
// Adding an array
> db.user.update( {name : "Sheldon" },
{ $set : {enemies :
[ { name : "Wil Wheaton" },
...
Deleting from arrays
// Deleting a value from an array
> db.user.update( { name : "Sheldon" },
{$pull : {enemies :
{name :...
Adding a subdocument
// Adding a subdocument to an existing document
> db.user.update( { name : "Sheldon"}, {
$set : { mot...
Querying subdocuments
// Finding out the name of the mother
> db.user.find( { name : "Sheldon"},
{"mother.name" : 1 } )
{
...
Overview of all update operators
For fields:
$inc
$rename
$set
$unset
Bitwise operation:
$bit
Isolation:
$isolated

For ar...
Documentation
Create
http://docs.mongodb.org/manual/core/create/

Read
http://docs.mongodb.org/manual/core/read/

Update
h...
Lab time!

Lab Nr. 02
Time box:
20 min
Indexing
What is an index?
Linked lists
[Figure: linked list with the nodes 1 to 7]
Find No. 7 in the linked list!
[Figure: tree with the nodes 1 to 7]
Find No. 7 in a tree!
Indices in MongoDB are B-Trees
Find, Insert and Delete Operations:

O(log(n))
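To make the difference concrete, here is a small sketch in plain JavaScript that counts how many nodes are visited when looking for 7 in a linked (chained) list and in a balanced binary search tree holding 1 to 7. A binary tree is used as a simplified stand-in for a B-tree, and the function names are invented for this example.

```javascript
// Linear scan, as in a linked list: visit nodes one by one.
function stepsInList(values, target) {
  let steps = 0;
  for (const v of values) {
    steps++;
    if (v === target) return steps;
  }
  return steps;
}

// Build a balanced binary search tree from a sorted array.
function buildTree(sorted, lo = 0, hi = sorted.length - 1) {
  if (lo > hi) return null;
  const mid = Math.floor((lo + hi) / 2);
  return {
    value: sorted[mid],
    left: buildTree(sorted, lo, mid - 1),
    right: buildTree(sorted, mid + 1, hi)
  };
}

// Walk down the tree, halving the search space at each node.
function stepsInTree(node, target) {
  let steps = 0;
  while (node) {
    steps++;
    if (target === node.value) return steps;
    node = target < node.value ? node.left : node.right;
  }
  return steps;
}

const values = [1, 2, 3, 4, 5, 6, 7];
const listSteps = stepsInList(values, 7);
const treeSteps = stepsInTree(buildTree(values), 7);
console.log(listSteps, treeSteps);
```

With seven values the gap is small, but the list grows linearly with the number of entries while the tree grows only with its height.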
Missing or non-optimal
indices are the single most avoidable
performance issue
How do I create an index?
// Create a non-existing index for a field
> db.recipes.createIndex({ main_ingredient: 1 })

// ...
What can be indexed?
// Multiple fields (Compound Key Indexes)
> db.recipes.ensureIndex({
main_ingredient: 1,
calories: -1...
What can be indexed?
// Subdocuments
{
name : 'Apple Pie',
contributor: {
name: 'Joe American',
id: 'joea123'
}
}
db.recip...
How to maintain indices?
// List all indices of a collection
> db.recipes.getIndexes()
> db.recipes.getIndexKeys()

// Dro...
More options
•

Unique Index
– Allows only unique values in the indexed field(s)

•

Sparse Index
– For fields that are no...
Unique Index
// Make sure the name of a recipe is unique
> db.recipes.ensureIndex( { name: 1 }, { unique: true } )

// For...
Sparse Index
// Only documents with the field calories will be indexed
> db.recipes.ensureIndex(
{ calories: -1 },
{ spars...
Geospatial Index
// Add longitude and latitude (longitude first)
{
name: 'codecentric Frankfurt',
loc: [ 8.67206, 50.11678 ]
}
// Index the 2...
TTL Collections
// Documents need a field of BSON date type (UTC datetime)
{ 'submitted_date' : ISODate('2012-10-12T05:24:07.211Z'), … }...
Limitations of indices
•

Collections can‘t have more than 64 indices

•

Index keys are not allowed to be larger than 102...
Optimizing indices
Best practice
1. Identify slow queries
2. Find out more about the slow queries

using explain()
3. Create appropriate indi...
1. Identify slow queries
> db.setProfilingLevel( n, slowms )   // slowms in milliseconds, default 100

n=0: Profiler off
n=1: Log all operations slower than...
2. Usage of explain()
> db.recipes.find( { calories:
{ $lt : 40 } }
).explain( )
{
"cursor" : "BasicCursor" ,
"n" : 42,
"n...
2. Metrics of the execution plan I
• Cursor
– The type of the cursor: BasicCursor means no index

has been used

• n
– The ...
2. Metrics of the execution plan II
• millis
– Execution time of the query

• Complete reference can be found here
– http:...
3. Create appropriate indices
on the fields being queried
4. Optimize queries taking the
available indices into account
// Using the following index…
> db.collection.ensureIndex({ ...
4. Optimize queries taking the
available indices into account
// Using the following index…
> db.collection.ensureIndex({ ...
4. Optimize queries taking the
available indices into account
// Using the following index…
> db.recipes.ensureIndex({ mai...
Use specific indices
// Tell MongoDB explicitly which index to use
> db.recipes.find({
calories: { $lt: 1000 } }
).hint({ ...
Caveats using indices
Using multiple indices
// MongoDB can only use one index per query!
> db.collection.ensureIndex({ a: 1 })
> db.collection....
Compound indices
// Compound indices are often very efficient!
> db.collection.ensureIndex({ a: 1, b: 1, c: 1 })

// But o...
Indices with low selectivity
// The following field has only few distinct values
> db.collection.distinct('status')
[ 'new...
Regular expressions & Indices
> db.users.ensureIndex({ username: 1 })

// Left-anchored regular expressions can make use of...
Negations & Indices
// Negations cannot make use of indices
> db.things.ensureIndex({ x: 1 })
// e.g. queries using not e...
Lab time!

Lab Nr. 03
Time box:
20 min
Map/Reduce
What is Map/Reduce?
•

Programming model coming from
functional languages

•

Framework for
– parallel processing
– of big...
Basics
•

Not specific to MongoDB
– Hadoop
– Disco
– Amazon Elastic MapReduce
– …

•

Based on key-value-pair...
The „Hello world“ of
Map/Reduce: Word Count
Word Count: Problem
INPUT
{
MongoDB
uses
MapReduce
}

{
There is a
map phase
}

{
There is a
reduce
phase
}

MAPPER

GROUP...
Word Count: Mapping
INPUT
{
MongoDB
uses
MapReduce
}

{
There is a
map phase
}

{
There is a
reduce
phase
}

MAPPER

GROUP...
Word Count: Group/Sort
INPUT
{
MongoDB
uses
MapReduce
}

MAPPER

GROUP/SORT

REDUCER

a-l
(doc1,
“…“)

m-q
{
There is a
ma...
Word Count: Reduce
INPUT
{
MongoDB
uses
MapReduce
}

MAPPER

GROUP/SORT

REDUCER

(doc1,
“…“)

(a, [1, 1])
(is, [1, 1])
(m...
Word Count: Result
INPUT
{
MongoDB
uses
MapReduce
}

MAPPER

GROUP/SORT

REDUCER

OUTPUT

(doc1,
“…“)

(a, [1, 1])
(is, [1...
Word Count: In a nutshell
INPUT
{
MongoDB
uses
MapReduce
}

MAPPER

GROUP/SORT

(doc1,
“…“)

REDUCER

(a, [1, 1])
(is, [1,...
Map/Reduce: Overview
MongoDB

Data

group(k)

map()
emit(k,v)

Shard 1

Iterates all
documents

sort(k)

Shard 2

…
Shard ...
Word Count: Tweets
// Example: Twitter database with tweets
> db.tweets.findOne()
{
"_id" : ObjectId("4fb9fb91d066d657de8d...
Word Count: map()
// Map function with simple data cleansing
map = function() {
this.text.split(' ').forEach(function(word...
Word Count: reduce()
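// Reduce function: sums the counts, so the result stays correct
// when MongoDB calls reduce again on partial results
reduce = function(key, values) {
return Array.sum(values);
};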
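A plain-JavaScript sketch of the word count can help to check a reduce function before running it on the server. It assumes that map emits `(word, 1)` for every word, as in the diagrams above; `wordCount` and `reduceSum` are names made up here. Because MongoDB may call reduce again on partial results for the same key, the sketch sums the values rather than counting them.

```javascript
// Reduce must give the same result when applied to partial results again,
// so it sums the counts instead of returning values.length.
function reduceSum(key, values) {
  return values.reduce((a, b) => a + b, 0);
}

// Simulated map + group + reduce over an array of texts.
function wordCount(texts, reduce) {
  const groups = new Map();
  for (const text of texts) {
    for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      if (!groups.has(word)) groups.set(word, []);
      groups.get(word).push(1); // corresponds to emit(word, 1)
    }
  }
  const result = {};
  for (const [word, ones] of groups) result[word] = reduce(word, ones);
  return result;
}

const counts = wordCount(
  ["MongoDB uses MapReduce", "There is a map phase", "There is a reduce phase"],
  reduceSum
);

// Re-reduce: combining two partial results for the same key.
const reReduced = reduceSum("is", [reduceSum("is", [1, 1]), reduceSum("is", [1])]);
const wrong = [2, 1].length; // what values.length would return in that case
console.log(counts, reReduced, wrong);
```

The last two values show the pitfall: three occurrences are reported correctly by the sum, while counting the array entries would report two.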
Word Count: Call
// Show the results using the console
> db.tweets.mapReduce(map, reduce, { out : { inline : 1 } } );
// S...
Word Count: Result
// Top-10 of most common words in tweets
> db.tweets_word_count.find().sort({"value" : -1}).limit(10)
{...
Recommendation
Typical use cases
•

Counting, Aggregating & Summing up
– Analyzing log entries & Generating log reports
– Generating an in...
Summary
•

The Map/Reduce framework is very
versatile & powerful

•

Is implemented in JavaScript
– Necessity to write own...
Map/Reduce should be used as
a last resort!
Lab time!

Lab Nr. 04
Time box:
20 min
Aggregation Framework
Why?
SELECT customer_id, SUM(price)
FROM orders
WHERE active=true
GROUP BY customer_id
That‘s why!
SELECT customer_id, SUM(price)    -- Calculation of fields
FROM orders
WHERE active=true
GROUP BY customer_id
Groupi...
The Aggregation Framework
Has been introduced to allow 90% of real-world aggregation use cases without using
the „big hamme...
The Aggregation Pipeline

Pipeline
Operator
Pipeline
Operator
Pipeline
Operator

{
document
}

Result
{
sum: 337
avg: 24,5...
The Aggregation Pipeline
•

Processes a stream of documents
– Input is a complete collection
– Output is a document contai...
Call
db.tweets.aggregate(
{ $pipeline_operator_1
{ $pipeline_operator_2
{ $pipeline_operator_3
{ $pipeline_operator_4
...
...
Pipeline Operators
// Old friends*
$match
$sort
$limit
$skip
* from the query functionality

// New friends
$project
$grou...
Example: Tweets
// Example: Twitter database with tweets
> db.tweets.findOne()
{
"_id" : ObjectId("4fb9fb91d066d657de8d6f3...
$match
// Show all German users
> db.tweets.aggregate(
{ $match : {"user.lang" : "de"}},
);
// Show all users with 0 to 10...
$sort
// Sorting using one field
> db.tweets.aggregate(
{ $sort : {"user.friends_count" : -1} },
);
// Sorting using multi...
$limit
// Limit the number of resulting documents to 3
> db.tweets.aggregate(
{ $sort : {"user.friends_count" : -1} },
{ $...
$skip
// Get the No. 4 Twitter user by number of friends
> db.tweets.aggregate(
{ $sort : {"user.friends_count" : -1...
$project I
// Limit the result document to only one field
> db.tweets.aggregate(
{ $project : {text : 1} },
);
// Remove _...
$project II
// Rename a field
> db.tweets.aggregate(
{ $project : {_id: 0, content_of_tweet : "$text"} },
);
// Add a calc...
$project III
// Add a subdocument
> db.tweets.aggregate(
{ $project : {_id: 0,
content_of_tweet : "$text",
user : {
name :...
$group I
// Grouping using a single field
> db.tweets.aggregate(
{ $group : {
_id : "$user.lang",
anzahl_tweets : {$sum : ...
$group II
// Grouping using multiple fields
> db.tweets.aggregate(
{ $group : {
_id : { background_image:
"$user.profile_u...
$group III
// Grouping with multiple calculated fields
> db.tweets.aggregate(
{ $group : {
_id : "$user.lang",
number_of_t...
Group Aggregation Functions

$min
$max
$avg
$sum

$addToSet
$first
$last
$push
$unwind I
// Unwind an array
> db.tweets.aggregate(
{ $project : {_id: 0, content_of_tweet : "$text",
mentioned_users : "$...
$unwind II
// Resulting document without $unwind
{
"content_of_tweet" : "RT @Philanthropy: How should
nonprofit groups mea...
$unwind III
// Resulting documents with $unwind
{
"content_of_tweet" : "RT @Philanthropy: How should
nonprofit groups me...
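What `$unwind` does to a single document can be sketched in plain JavaScript. The helper `unwind`, the tweet text and the user names are illustrative assumptions and are not taken from the tweet data set.

```javascript
// Emit one copy of the document per array element, replacing the array
// field with that single element. Missing or empty arrays yield no output
// in this sketch.
function unwind(doc, field) {
  const values = Array.isArray(doc[field]) ? doc[field] : [];
  return values.map(value => Object.assign({}, doc, { [field]: value }));
}

const tweet = {
  content_of_tweet: "example tweet text",            // placeholder text
  mentioned_users: ["Philanthropy", "SomeoneElse"]   // placeholder names
};

const unwound = unwind(tweet, "mentioned_users");
console.log(unwound);
```

Each resulting document carries the full tweet text plus exactly one mentioned user, which is what makes a later `$group` per user possible.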
Best Practices
Place $match at the beginning of
the pipeline to reduce the
number of documents as soon as
possible!

Best Practice #1
Use $project to remove unneeded
fields from the documents
as soon as possible!

Best Practice #2
When placed at the beginning of the pipeline, these
operators can make use of indices:

$match
$sort
$limit
$skip
The...
Mapping of MongoDB
to SQL
Mapping
SQL         MongoDB Aggregation

WHERE       $match
GROUP BY    $group
HAVING      $match
SELECT      $project
ORDER BY    $sort

LIMI...
Example: Online shopping
{
cust_id: "sheldon1",
ord_date:
ISODate("2013-04-18T19:38:11.102Z"),
status: 'purchased',
price...
Count all orders
SQL

MongoDB Aggregation

SELECT COUNT(*) AS
count FROM orders

db.orders.aggregate( [ {
$group: { _id: n...
Total order price per customer
SQL

MongoDB Aggregation

SELECT cust_id, SUM(price)
AS total FROM orders
GROUP BY cust_i...
Sum up all orders over $250
SQL

MongoDB Aggregation

SELECT cust_id, SUM(price) as db.orders.aggregate( [ {
$match: { sta...
More examples
http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
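The mapping can also be traced without a database. The sketch below imitates a `$match` followed by a `$group` with `$sum` on an in-memory array, which corresponds to `WHERE` and `GROUP BY` with `SUM` in SQL. The helper names and the sample orders are invented for illustration; only the field names follow the orders example.

```javascript
// Sample data; field names follow the orders example, values are made up.
const orders = [
  { cust_id: "sheldon1", status: "purchased", price: 105 },
  { cust_id: "sheldon1", status: "purchased", price: 200 },
  { cust_id: "penny1", status: "purchased", price: 50 },
  { cust_id: "penny1", status: "cancelled", price: 999 }
];

// $match / WHERE: keep documents whose fields equal the given values.
function match(docs, criteria) {
  return docs.filter(d => Object.keys(criteria).every(k => d[k] === criteria[k]));
}

// $group with $sum / GROUP BY with SUM: total one field per key.
function groupSum(docs, keyField, sumField) {
  const totals = new Map();
  for (const d of docs) {
    totals.set(d[keyField], (totals.get(d[keyField]) || 0) + d[sumField]);
  }
  return [...totals].map(([_id, total]) => ({ _id, total }));
}

const totals = groupSum(match(orders, { status: "purchased" }), "cust_id", "price");
console.log(totals);
```

Filtering before grouping mirrors the advice to place `$match` early: the cancelled order never reaches the grouping step.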
Lab time!

Lab Nr. 05
Time box:
20 min
Replication: High
Availability with MongoDB
Why do we need replication?
•

Hardware is unreliable and is doomed to
fail!

•

Do you want to be the person being called...
Life cycle of a replica set
Replica set – Create
Replica set – Initializing
Replica set – Node down
Replica set – Failover
Replica set – Recovery
Replica set – Back to normal
Roles & Configuration
Replica sets - Roles
Configuration I
> conf = {
_id : "mySet",
members : [
{_id : 0, host : "A", priority : 3},
{_id : 1, host : "B", priority ...
Configuration II
> conf = {
_id : "mySet",
members : [

Primary data center

{_id : 0, host : "A", priority : 3},
{_id : 1...
Configuration III
> conf = {
_id : "mySet",
members : [

Secondary data center
(Default priority = 1)

{_id : 0, host : "A...
Configuration IV
> conf = {
_id : "mySet",
members : [
{_id : 0, host : "A", priority : 3},
{_id : 1, host : "B", priority...
Configuration V
> conf = {
_id : "mySet",
members : [
{_id : 0, host : "A", priority : 3},
{_id : 1, host : "B", priority ...
Data consistency
Strong consistency
Eventual consistency
Write Concern
• Different levels of data consistency
• Acknowledged by
– Network
– MongoDB
– Journal
– Secondaries
– Taggi...
Acknowledged by network
„Fire and forget“
Acknowledged by MongoDB
Wait for Error
Acknowledged by Journal
Wait for Journal Sync
Acknowledged by Secondaries
Wait for Replication
Tagging while writing data
•

Available since 2.0

•

Allows for fine-grained control

•

Each node can have multiple tag...
Tagging - Example
{
_id : "mySet",
members : [
{_id : 0, host : "A", tags : {"dc": "ny"}},
{_id : 1, host : "B", tags : {"...
Acknowledged by Tagging
Wait for Replication (Tagging)
Configure the Write Concern
// Wait for network acknowledgement
> db.runCommand( { getLastError: 1, w: 0 } )
// Wait for e...
Read Preferences
•

Only primary

(primary)

•

Primary preferred

(primaryPreferred)

•

Only secondaries

(secondary)

•

S...
[Figures: which nodes serve reads for each mode]
Only primary (primary)
Primary preferred (primaryPreferred)
Only secondaries (secondary)
Secondaries preferred (secondaryPreferred)
Nearest node (nearest)
Tagging while reading data
•

Allows for more fine-grained control over
where data is read from
– e.g. { "disk": "ssd",...
Configure the Read Preference
// Only primary
> cursor.setReadPref( "primary" )
// Primary preferred
> cursor.setReadPref( “p...
MongoDB Operation
Maintenance & Upgrades
•

Zero downtime

•

Rolling upgrades and maintenance
–
–
–
–

•

Start with all secondaries
Step d...
Replica set – 1 data center
•

One
– Data center
– Switch
– Power Supply

•

Possible errors:
– Failure of 2 nodes
– Power...
Replica set – 2 data centers
•

Additional node for
data recovery

•

No writing to both
data centers since
only one node in...
Replica set – 3 data centers
•

Can recover from a
complete data center
failure

•

Allows for usage of
w= { dc : 2 } to
gu...
Commands
•

Administration of the nodes
–
–
–
–
–

•

rs.conf()
rs.initiate(<conf>) & rs.reconfig(<conf>)
rs.add(host:<por...
Best Practices
Best Practices
•

Uneven number of nodes

•

Adapt the write concern to your use case

•

Read from primary except for
– G...
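One reason for an uneven number of nodes is that electing a primary requires a majority of the voting members. The following arithmetic sketch, with function names chosen for this example, suggests that a fourth node does not buy any additional fault tolerance over three.

```javascript
// Votes needed to elect a primary in a set of n voting members.
function majority(n) {
  return Math.floor(n / 2) + 1;
}

// Number of members that can fail while a majority is still reachable.
function faultTolerance(n) {
  return n - majority(n);
}

for (const n of [3, 4, 5]) {
  console.log(n, majority(n), faultTolerance(n));
}
```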
Lab time!

Lab Nr. 06
Time box:
20 min
Sharding: Scaling with
MongoDB
Visual representation of vertical scaling

1970 - 2000: Vertical Scaling
„Scale up“
Visual representation of horizontal scaling

Since 2000: Horizontal Scaling
„Scale out“
When to use Sharding?
Not enough disk space
The working set doesn't fit
into memory
The required read/write throughput
is higher than the I/O capabilities
Sharding MongoDB
Partitioning of data
•

The user needs to define a shard key

•

The shard key defines the distribution of
data across the...
Partitioning of data into chunks
•

Initially all data is in one chunk

•

Maximum chunk size: 64 MB

•

MongoDB divides a...
One chunk contains data of a
certain value range
Chunks & Shards
•

A shard is one node in the cluster

•

A shard can be one single mongod or a
replica set
Metadata Management
•

Config Server
– Stores the value ranges of the chunks and their

location
– Number of config server...
Balancing & Routing Service
•

mongos balances the data
in the cluster

•

mongos distributes data to
new nodes

•

mongos...
Automatic Balancing

Balancing will be automatically done once
the number of chunks between shards hits a
certain threshol...
Splitting of a chunk

•

Once a chunk hits the maximum size it will be split

•

Splitting is only a logical operation, no...
Sharding Infrastructure
MongoDB Auto Sharding
•

Minimal effort
– Usage of the same interfaces for mongod and

mongos

•

Easy configuration
– Ena...
Configuration example
Example of a very simple cluster

•

Never use this in production!
– Only one config server (No fault tolerance)
– Shard i...
Start the config server

// Start the config server (Default port 27019)
> mongod --configsvr
Start the mongos routing service

// Start the mongos router (Default port 27017)
> mongos --configdb <hostname>:27019

//...
Start the shard

// Start a shard with one mongod (Default port 27018)
> mongod --shardsvr

// Shard is not yet added to t...
Add the shard

// Connect to mongos and add the shard
> mongo
> sh.addShard("<host>:27018")
// When adding a replica set, ...
Check configuration

// Check if the shard has been added
> db.runCommand({ listShards:1 })
{ "shards" :
[ { "_id": "shard...
Configure sharding
// Enable the sharding for a database
> sh.enableSharding("<dbname>")

// Shard a collection using a sh...
Shard Key
Shard Key
•

The shard key can not be changed

•

The values of a shard key can not be
changed

•

The shard key needs to ...
Considerations for the shard key
•

Cardinality of data
– The value range needs to be rather large. For example sharding

...
Choices for the shard key
•

Single field
– If the value range is big enough and data is distributed almost

equally

•

C...
Example: User
{
_id: 346,
username: "sheldinator",
password: "238b8be8bd133b86d1e2ba191a94f549",
first_name: "Sheldon",
las...
Example: Log data
{
log_type: "error",

// Possible values: "error", "warn", "info"

application: "JBoss v. 4.2.3",
message: “...
Routing of queries
Possible types of queries
•

Exact queries
– Data is exactly on one shard

•

Distributed query
– Data is distributed on d...
Exact queries
1. mongos receives the query
from the client
2. Query is routed to the shard
with the data
3. Shard returns the data
4. mongos returns the data to
the client
Distributed queries
1. mongos receives the query
from the client
2. mongos routes the query to
all shards
3. Shards return the data
4. mongos returns the data to
the client
Distributed queries with sorting
1. mongos receives the query
from the client
2. mongos routes the query to
all shards
3. Execute the query and local
sorting
4. Shards return sorted data
5. mongos sorts the data
globally
6. mongos returns the sorted
data to the client
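Steps 3 to 5 amount to merging several lists that are already sorted. A minimal sketch in plain JavaScript, assuming each shard returns its documents sorted ascending by the same key; `mergeSorted` is a hypothetical helper and not part of mongos, and the sample names are made up.

```javascript
// Merge per-shard results that are each sorted ascending by `key`
// into one globally sorted list.
function mergeSorted(shardResults, key) {
  const positions = shardResults.map(() => 0);
  const merged = [];
  for (;;) {
    let best = -1;
    for (let i = 0; i < shardResults.length; i++) {
      const p = positions[i];
      if (p >= shardResults[i].length) continue; // this shard is exhausted
      if (best === -1 ||
          shardResults[i][p][key] < shardResults[best][positions[best]][key]) {
        best = i;
      }
    }
    if (best === -1) return merged; // all shards exhausted
    merged.push(shardResults[best][positions[best]++]);
  }
}

const shard1 = [{ name: "Amy" }, { name: "Penny" }];
const shard2 = [{ name: "Howard" }, { name: "Sheldon" }];
const shard3 = [{ name: "Leonard" }];
const sorted = mergeSorted([shard1, shard2, shard3], "name");
console.log(sorted.map(d => d.name));
```

Because every shard has already sorted its own part, the router only has to compare the current head of each list instead of sorting everything again.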
Lab time!

Lab Nr. 07
Time box:
20 min
Still want moar?

https://education.mongodb.com

MongoDB for Coder Training (Coding Serbia 2013)


Slides of my MongoDB Training given at Coding Serbia Conference on 18.10.2013

Agenda:
1. Introduction to NoSQL & MongoDB
2. Data manipulation: Learn how to CRUD with MongoDB
3. Indexing: Speed up your queries with MongoDB
4. MapReduce: Data aggregation with MongoDB
5. Aggregation Framework: Data aggregation done the MongoDB way
6. Replication: High Availability with MongoDB
7. Sharding: Scaling with MongoDB


  1. 1. Training: MongoDB for Coder Uwe Seiler uweseiler
  2. 2. About me Big Data Nerd Hadoop Trainer MongoDB Author Photography Enthusiast Travelpirate
  3. 3. About us is a bunch of… Big Data Nerds Agile Ninjas Continuous Delivery Gurus Join us! Enterprise Java Specialists Performance Geeks
  4. 4. Agenda I 1. Introduction to NoSQL & MongoDB 2. Data manipulation: Learn how to CRUD with MongoDB 3. Indexing: Speed up your queries with MongoDB 4. MapReduce: Data aggregation with MongoDB
  5. 5. Agenda 5. Aggregation Framework: Data aggregation done the MongoDB way 6. Replication: High Availability with MongoDB 7. Sharding: Scaling with MongoDB
  6. 6. Ingredients • Slides • Live Coding • Discussion • Labs on your own computer
  7. 7. And please… If you have questions, please share them with us!
  8. 8. And now start your downloads… Lab files: http://bit.ly/1aT8RXY
  9. 9. Buzzword Bingo
  10. 10. NoSQL
  11. 11. Classification of NoSQL Key-Value Stores K V K V K V K 1 V K Column Stores V Graph Databases 1 1 1 1 1 1 1 1 1 1 Document Stores _id _id _id
  12. 12. Big Data
  13. 13. My favorite definition
  14. 14. The classic definition • The 3 V’s of Big Data: Volume, Velocity, Variety
  15. 15. «Big Data» != Hadoop
  16. 16. Horizontal Scaling
  17. 17. Vertical Scaling RAM CPU Storage
  18. 18. Vertical Scaling RAM CPU Storage
  19. 19. Vertical Scaling RAM CPU Storage
  20. 20. Horizontal Scaling RAM CPU Storage
  21. 21. Horizontal Scaling RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage
  22. 22. Horizontal Scaling RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage RAM CPU Storage
  23. 23. The problem with distributed data
  24. 24. The CAP Theorem Availability: a guarantee that every request receives a response. Consistency: all nodes see the same data at the same time. Partition Tolerance: failure of single nodes doesn't affect the overall system.
  25. 25. Overview of NoSQL systems Availability: a guarantee that every request receives a response. Consistency: all nodes see the same data at the same time. Partition Tolerance: failure of single nodes doesn't affect the overall system.
  26. 26. The problem with consistency
  27. 27. ACID vs. BASE
  28. 28. ACID vs. BASE 1983 Atomicity RDBMS Consistency Isolation Durability
  29. 29. ACID vs. BASE ACID is a good concept but it is not a written law!
  30. 30. ACID vs. BASE Basically Available Soft State 2008 NoSQL Eventually consistent
  31. 31. ACID vs. BASE ACID: Strong consistency, Isolation & Transactions, Two-Phase-Commit, Complex Development, More reliable. BASE: Eventual consistency, Highly Available, "Fire-and-forget", Eases development, Faster.
  32. 32. Overview of MongoDB
  33. 33. MongoDB is a… • document • open source • highly performant • flexible • scalable • highly available • feature-rich …database
  34. 34. Document Database • Not PDF, Word, etc. … JSON!
  35. 35. Open Source Database • MongoDB is an open source project • Available on GitHub – https://github.com/mongodb/mongo • Uses the AGPL license • Started and sponsored by MongoDB Inc. (formerly 10gen) • Commercial version and support available • Join the crowd! – https://jira.mongodb.org
  36. 36. Performance Data locality In-Memory Caching In-Place Updates
  37. 37. Flexible Schema RDBMS MongoDB { _id : ObjectId("4c4ba5e5e8aabf3"), employee_name: "Dunham, Justin", department : "Marketing", title : "Product Manager, Web", report_up: "Neray, Graham", pay_band: “C", benefits : [ { type : "Health", plan : "PPO Plus" }, { type : "Dental", plan : "Standard" } ] }
  38. 38. Scalability Auto Sharding • Increase capacity as you go • Commodity and cloud architectures • Improved operational simplicity and cost visibility
  39. 39. High Availability • Automated replication and failover • Multi-data center support • Improved operational simplicity (e.g., HW swaps) • Data durability and consistency
  40. 40. MongoDB Architecture
  41. 41. Rich Query Language
  42. 42. Aggregation Framework
  43. 43. Map/Reduce MongoDB Data Map() emit(k,v) Group(k) Shard 1 Sort(k) Shard 2 … Shard n Reduce(k, values) Finalize(k, v)
  44. 44. Geo Information
  45. 45. Driver & Shell Drivers are available for almost all popular programming languages and frameworks Java JavaScript Python Shell to interact with the database Ruby Perl Haskell > db.collection.insert({product:“MongoDB”, type:“Document Database”}) > > db.collection.findOne() { “_id” : ObjectId(“5106c1c2fc629bfe52792e86”), “product” : “MongoDB” “type” : “Document Database” }
  46. 46. NoSQL Trends Google Search LinkedIn Job Skills MongoDB Competitor 1 Competitor 2 Competitor 3 Competitor 4 Competitor 5 MongoDB Competitor 2 Competitor 1 Competitor 4 Competitor 3 All Others Jaspersoft Big Data Index Indeed.com Trends Top Job Trends Direct Real-Time Downloads MongoDB Competitor 1 Competitor 2 Competitor 3 1.HTML 5 2.MongoDB 3.iOS 4.Android 5.Mobile Apps 6.Puppet 7.Hadoop 8.jQuery 9.PaaS 10.Social Media
  47. 47. Data manipulation
  48. 48. Terminology RDBMS ➜ MongoDB: Table / View ➜ Collection, Row ➜ Document, Index ➜ Index, Join ➜ Embedded document, Foreign Key ➜ Referenced document, Partition ➜ Shard
  49. 49. Example: Simple blog model
  50. 50. MongoDB Collections • User • Article • Tag • Category
  51. 51. Schema design for the blog
  52. 52. Let’s have a look…
  53. 53. Create a database // Show all databases > show dbs digg 0.078125GB enron 1.49951171875GB // Switch to a database > use blog // Show all databases again > show dbs digg 0.078125GB enron 1.49951171875GB
  54. 54. Create a collection I // Show all collections > show collections // Insert a user > db.user.insert( { name : “Sheldon“, mail : “sheldon@bigbang.com“ } ) No feedback about the result of the insert, use: db.runCommand( { getLastError: 1} )
  55. 55. Create a collection II // Show all collections > show collections system.indexes user // Show all databases > show dbs blog 0.0625GB digg 0.078125GB enron 1.49951171875GB Databases and collections are automatically created during the first insert operation!
  56. 56. Read from a collection // Show the first document > db.user.findOne() { "_id" : ObjectId("516684a32f391f3c2fcb80ed"), "name" : "Sheldon", "mail" : "sheldon@bigbang.com" } // Show all documents of a collection > db.user.find() { "_id" : ObjectId("516684a32f391f3c2fcb80ed"), "name" : "Sheldon", "mail" : "sheldon@bigbang.com" }
  57. 57. Find documents // Find a specific document > db.user.find( { name : "Penny" } ) { "_id" : ObjectId("5166a9dc2f391f3c2fcb80f1"), "name" : "Penny", "mail" : "penny@bigbang.com" } // Show only certain fields of the document > db.user.find( { name : "Penny" }, {_id: 0, mail : 1} ) { "mail" : "penny@bigbang.com" }
  58. 58. _id • _id is the primary key in MongoDB • _id is created automatically • If not specified otherwise, its type is ObjectId • _id can be specified by the user during the insert of documents, but needs to be unique (and cannot be edited afterwards)
  59. 59. ObjectId • An ObjectId is a special 12-byte value • Its uniqueness across the whole cluster is guaranteed as follows: ObjectId("50804d0bd94ccab2da652599") |-------------||---------||-----||----------| ts mac pid inc
  60. 60. Cursor // Use a cursor with find() > var myCursor = db.user.find( ) // Get the next document > var myDocument = myCursor.hasNext() ? myCursor.next() : null; > if (myDocument) { printjson(myDocument.mail); } // Show all other documents > myCursor.forEach(printjson); By default the shell displays 20 documents
  61. 61. Logical operators // Find documents using OR > db.user.find( {$or : [ { name : "Sheldon" }, { mail : "amy@bigbang.com" } ] }) // Find documents using AND > db.user.find( {$and : [ { name : "Sheldon" }, { mail : "amy@bigbang.com" } ] })
  62. 62. Manipulating results // Sort documents > db.user.find().sort( { name : 1 } ) // Ascending > db.user.find().sort( { name : -1 } ) // Descending // Limit the number of documents > db.user.find().limit(3) // Skip documents > db.user.find().skip(2) // Combination of both methods > db.user.find().skip(2).limit(3)
  63. 63. Updating documents I // Updating only the mail address (how not to do it…) > db.user.update( { name : "Sheldon" }, { mail : "sheldon@howimetyourmother.com" } ) // Result of the update operation > db.user.findOne() { "_id" : ObjectId("516684a32f391f3c2fcb80ed"), "mail" : "sheldon@howimetyourmother.com" } Be careful when updating documents!
  64. 64. Deleting documents // Deleting a document > db.user.remove( { mail : “sheldon@howimetyourmother.com“ } ) // Deleting all documents in a collection > db.user.remove() // Use a condition to delete documents > db.user.remove( { mail : /.*mother.com$/ } ) // Delete only the first document using a condition > db.user.remove( { mail : /.*.com$/ }, true )
  65. 65. Updating documents II // Updating only the mail address (This time for real) > db.user.update( { name : “Sheldon“ }, { $set : { mail : “sheldon@howimetyourmother.com“ }}) // Show the result of the update operation db.user.find(name : “Sheldon“) { "_id" : ObjectId("5166ba122f391f3c2fcb80f5"), "mail" : "sheldon@howimetyourmother.com", "name" : "Sheldon" }
  66. 66. Adding to arrays // Adding an array > db.user.update( { name : "Sheldon" }, { $set : { enemies : [ { name : "Wil Wheaton" }, { name : "Barry Kripke" } ] } } ) // Adding a value to the array > db.user.update( { name : "Sheldon" }, { $push : { enemies : { name : "Leslie Winkle" } } } )
  67. 67. Deleting from arrays // Deleting a value from an array > db.user.update( { name : "Sheldon" }, { $pull : { enemies : { name : "Barry Kripke" } } } ) // Deleting a complete array > db.user.update( { name : "Sheldon" }, { $unset : { enemies : 1 } } )
  68. 68. Adding a subdocument // Adding a subdocument to an existing document > db.user.update( { name : "Sheldon" }, { $set : { mother : { name : "Mary Cooper", residence : "Galveston, Texas", religion : "Evangelical Christian" } } } ) { "_id" : ObjectId("5166cf162f391f3c2fcb80f7"), "mail" : "sheldon@bigbang.com", "mother" : { "name" : "Mary Cooper", "residence" : "Galveston, Texas", "religion" : "Evangelical Christian" }, "name" : "Sheldon" }
  69. 69. Querying subdocuments // Finding out the name of the mother > db.user.find( { name : "Sheldon" }, { "mother.name" : 1 } ) { "_id" : ObjectId("5166cf162f391f3c2fcb80f7"), "mother" : { "name" : "Mary Cooper" } } Dotted field names need to be quoted ("…")!
  70. 70. Overview of all update operators For fields: $inc $rename $set $unset Bitwise operation: $bit Isolation: $isolated For arrays: $addToSet $pop $pullAll $pull $pushAll $push $each (Modifier) $slice (Modifier) $sort (Modifier)
  71. 71. Documentation Create http://docs.mongodb.org/manual/core/create/ Read http://docs.mongodb.org/manual/core/read/ Update http://docs.mongodb.org/manual/core/update/ Delete http://docs.mongodb.org/manual/core/delete/
  72. 72. Lab time! Lab Nr. 02 Time box: 20 min
  73. 73. Indexing
  74. 74. What is an index?
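  75. 75. Linked lists (figure: nodes 1 to 7 chained one after another)
  76. 76. Find Nr. 7 in the linked list! (figure: the search visits nodes 1, 2, 3, 4, 5, 6, 7 in order)
  77. 77. Find Nr. 7 in a tree! (figure: tree with root 4, inner nodes 2 and 6, leaves 1, 3, 5, 7)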
  78. 78. Indices in MongoDB are B-Trees
  79. 79. Find, Insert and Delete Operations: O(log(n))
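The difference between walking a list and descending a tree can be counted directly. This is a small sketch in plain JavaScript, where binary search over a sorted array stands in for a balanced tree (a simplification, not MongoDB's actual B-tree code); the function names are hypothetical.

```javascript
// Count comparisons needed to find `target` by walking a sorted list front to back.
function linearSteps(sorted, target) {
  let steps = 0;
  for (const value of sorted) {
    steps++;
    if (value === target) break;
  }
  return steps;
}

// Count comparisons for binary search, which mimics descending a balanced tree.
function treeSteps(sorted, target) {
  let lo = 0, hi = sorted.length - 1, steps = 0;
  while (lo <= hi) {
    const mid = Math.floor((lo + hi) / 2);
    steps++;
    if (sorted[mid] === target) break;
    if (sorted[mid] < target) lo = mid + 1; else hi = mid - 1;
  }
  return steps;
}

const values = [1, 2, 3, 4, 5, 6, 7];
console.log(linearSteps(values, 7), treeSteps(values, 7)); // 7 and 3
```

For the seven values from the slides the tree visits 4, 6 and 7; the gap widens quickly as the number of entries grows.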
  80. 80. Missing or non-optimal indices are the single most avoidable performance issue
  81. 81. How do I create an index? // Create a non-existing index for a field > db.recipes.createIndex({ main_ingredient: 1 }) // Make sure there is an index on the field > db.recipes.ensureIndex({ main_ingredient: 1 }) * 1 for ascending, -1 for descending
  82. 82. What can be indexed? // Multiple fields (Compound Key Indexes) > db.recipes.ensureIndex({ main_ingredient: 1, calories: -1 }) // Arrays with values (Multikey Indexes) { name: 'Chicken Noodle Soup', ingredients : ['chicken', 'noodles'] } > db.recipes.ensureIndex({ ingredients: 1 })
  83. 83. What can be indexed? // Subdocuments { name : 'Apple Pie', contributor: { name: 'Joe American', id: 'joea123' } } db.recipes.ensureIndex({ 'contributor.id': 1 }) db.recipes.ensureIndex({ 'contributor': 1 })
  84. 84. How to maintain indices? // List all indices of a collection > db.recipes.getIndexes() > db.recipes.getIndexKeys() // Drop an index > db.recipes.dropIndex({ ingredients: 1 }) // Drop and recreate all indices of a collection db.recipes.reIndex()
  85. 85. More options • Unique Index – Allows only unique values in the indexed field(s) • Sparse Index – For fields that are not available in all documents • Geospatial Index – For indexing 2D and 3D geospatial data • TTL Collections – Documents are automatically deleted after x seconds
  86. 86. Unique Index // Make sure the name of a recipe is unique > db.recipes.ensureIndex( { name: 1 }, { unique: true } ) // Force an index on a collection with non-unique values // Duplicates will be deleted more or less randomly! > db.recipes.ensureIndex( { name: 1 }, { unique: true, dropDups: true } ) * dropDups should be used only with caution!
  87. 87. Sparse Index // Only documents with the field calories will be indexed > db.recipes.ensureIndex( { calories: -1 }, { sparse: true } ) // Combination with unique index is possible > db.recipes.ensureIndex( { name: 1 , calories: -1 }, { unique: true, sparse: true } ) * Without the sparse option, documents that lack the field are stored in the index with a null value!
  88. 88. Geospatial Index // Add latitude and longitude { name: 'codecentric Frankfurt', loc: [ 50.11678, 8.67206 ] } // Index the 2D coordinates > db.locations.ensureIndex( { loc : '2d' } ) // Find locations near codecentric Frankfurt > db.locations.find({ loc: { $near: [ 50.1, 8.7 ] } })
  89. 89. TTL Collections // Documents need a field of the BSON date type (UTC datetime) { 'submitted_date' : ISODate('2012-10-12T05:24:07.211Z'), … } // Documents will be deleted automatically by a background process // after 'expireAfterSeconds' > db.recipes.ensureIndex( { submitted_date: 1 }, { expireAfterSeconds: 3600 } )
  90. 90. Limitations of indices • Collections can't have more than 64 indices • Index keys are not allowed to be larger than 1024 bytes • The name of an index (including namespace) must be shorter than 128 characters • Queries can only make use of one index – Exception: Queries using $or • MongoDB tries to keep indices in memory • Indices slow down the writing of data
  91. 91. Optimizing indices
  92. 92. Best practice 1. Identify slow queries 2. Find out more about the slow queries using explain() 3. Create appropriate indices on the fields being queried 4. Optimize the query taking the available indices into account
  93. 93. 1. Identify slow queries > db.setProfilingLevel( n, slowms ) slowms: threshold in milliseconds (default 100) n=0: Profiler off n=1: Log all operations slower than slowms n=2: Log all operations > db.system.profile.find() * The collection system.profile is a capped collection with a limited number of entries
  94. 94. 2. Usage of explain() > db.recipes.find( { calories: { $lt : 40 } } ).explain() { "cursor" : "BasicCursor", "n" : 42, "nscannedObjects" : 53641, "nscanned" : 53641, ... "millis" : 252, ... }
  95. 95. 2. Metrics of the execution plan I • cursor – The type of the cursor: BasicCursor means no index has been used • n – The number of matched documents • nscannedObjects – The number of scanned documents • nscanned – The number of scanned entries (index entries or documents)
  96. 96. 2. Metrics of the execution plan II • millis – Execution time of the query • Complete reference can be found here – http://docs.mongodb.org/manual/reference/explain Optimize for nscanned / n = 1
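One way to read the explain() metrics is to compare the number of scanned entries with the number of matched documents. The sketch below is plain JavaScript operating on an object shaped like the legacy explain() output from the earlier slide; the helper name scanRatio is hypothetical and not part of the MongoDB shell.

```javascript
// Ratio of scanned entries to matched documents for a legacy explain() result.
// A value near 1 means almost every scanned entry was a hit.
function scanRatio(explain) {
  if (explain.n === 0) return explain.nscanned === 0 ? 1 : Infinity;
  return explain.nscanned / explain.n;
}

// Values taken from the explain() output shown on the earlier slide.
const unindexed = { cursor: "BasicCursor", n: 42, nscanned: 53641 };
console.log(Math.round(scanRatio(unindexed))); // 1277
```

In that example roughly 1277 entries are scanned per matching document, which is a strong hint that an index on the queried field is missing.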
  97. 97. 3. Create appropriate indices on the fields being queried
  98. 98. 4. Optimize queries taking the available indices into account // Using the following index… > db.collection.ensureIndex({ a:1, b:1 , c:1, d:1 }) // … these queries and sorts can make use of the index > db.collection.find( ).sort({ a:1 }) > db.collection.find( ).sort({ a:1, b:1 }) > db.collection.find({ a:4 }).sort({ a:1, b:1 }) > db.collection.find({ b:5 }).sort({ a:1, b:1 })
  99. 99. 4. Optimize queries taking the available indices into account // Using the following index… > db.collection.ensureIndex({ a:1, b:1, c:1, d:1 }) // … these queries can not make use of it > db.collection.find( ).sort({ b: 1 }) > db.collection.find({ b: 5 }).sort({ b: 1 })
  100. 100. 4. Optimize queries taking the available indices into account // Using the following index… > db.recipes.ensureIndex({ main_ingredient: 1, name: 1 }) // … this query can be completely satisfied using the index! > db.recipes.find( { main_ingredient: 'chicken' }, { _id: 0, name: 1 } ) // The metric indexOnly using explain() verifies this: > db.recipes.find( { main_ingredient: 'chicken' }, { _id: 0, name: 1 } ).explain() { "indexOnly": true }
  101. 101. Use specific indices // Tell MongoDB explicitly which index to use > db.recipes.find({ calories: { $lt: 1000 } } ).hint({ _id: 1 }) // Switch the usage of indices completely off (e.g. for performance // measurements) > db.recipes.find( { calories: { $lt: 1000 } } ).hint({ $natural: 1 })
  102. 102. Caveats using indices
  103. 103. Using multiple indices // MongoDB can only use one index per query! > db.collection.ensureIndex({ a: 1 }) > db.collection.ensureIndex({ b: 1 }) // For this query only one of those two indices can be used > db.collection.find({ a: 3, b: 4 })
  104. 104. Compound indices // Compound indices are often very efficient! > db.collection.ensureIndex({ a: 1, b: 1, c: 1 }) // But only if the query is a prefix of the index... // This query can't make use of the index > db.collection.find({ c: 2 }) // …but this query can > db.collection.find({ a: 3, b: 5 })
  105. 105. Indices with low selectivity // The following field has only few distinct values > db.collection.distinct('status') [ 'new', 'processed' ] // An index on this field is not the best idea… > db.collection.ensureIndex({ status: 1 }) > db.collection.find({ status: 'new' }) // Better use an adequate compound index with other fields > db.collection.ensureIndex({ status: 1, created_at: -1 }) > db.collection.find( { status: 'new' } ).sort({ created_at: -1 })
  106. 106. Regular expressions & Indices > db.users.ensureIndex({ username: 1 }) // Left-anchored regular expressions can make use of this index > db.users.find({ username: /^joe smith/ }) // But not queries with regular expressions in general… > db.users.find({ username: /smith/ }) // Also not case-insensitive queries… > db.users.find({ username: /^Joe/i })
  107. 107. Negations & Indices // Negations can not make efficient use of indices > db.things.ensureIndex({ x: 1 }) // e.g. queries using not equal > db.things.find({ x: { $ne: 3 } }) // …or queries with not in > db.things.find({ x: { $nin: [2, 3, 4 ] } }) // …or queries with the $not operator (which expects a regular expression or an operator expression) > db.people.find({ name: { $not: /^John Doe$/ } })
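The prefix rule for compound indices can be expressed as a small check. This is a simplified sketch in plain JavaScript that only looks at which fields a query constrains; it ignores sorts, range conditions and the query planner, and the function name usablePrefixLength is hypothetical.

```javascript
// Number of leading index fields that the query constrains.
// 0 means the query does not start at the index prefix.
function usablePrefixLength(indexFields, query) {
  let length = 0;
  for (const field of indexFields) {
    if (!(field in query)) break;
    length++;
  }
  return length;
}

const index = ["a", "b", "c"];
console.log(usablePrefixLength(index, { a: 3, b: 5 })); // 2
console.log(usablePrefixLength(index, { c: 2 }));       // 0
```

A query on a and c would still return 1 here: it can use the index for a, but not for c.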
  108. 108. Lab time! Lab Nr. 03 Time box: 20 min
  109. 109. Map/Reduce
  110. 110. What is Map/Reduce? • Programming model coming from functional languages • Framework for – parallel processing – of large data volumes – using distributed systems • Made popular by Google – Was invented to calculate the inverted search index that maps keywords to web sites (Page Rank) – http://research.google.com/archive/mapreduce.html
  111. 111. Basics • Not specific to MongoDB – Hadoop – Disco – Amazon Elastic MapReduce – … • Based on key-value pairs • Prior to version 2.4 and the introduction of the V8 JavaScript engine only one thread per shard
  112. 112. The „Hello world“ of Map/Reduce: Word Count
  113. 113. Word Count: Problem Problem: How often does one word appear in all documents? INPUT { MongoDB uses MapReduce } { There is a map phase } { There is a reduce phase } MAPPER GROUP/SORT REDUCER OUTPUT a: 2 is: 2 map: 1 mapreduce: 1 mongodb: 1 phase: 2 reduce: 1 there: 2 uses: 1
  114. 114. Word Count: Mapping INPUT { MongoDB uses MapReduce } { There is a map phase } { There is a reduce phase } MAPPER GROUP/SORT (doc1, “…“) (mongodb, 1) (uses, 1) (mapreduce, 1) (doc2, “…“) (there, 1) (is, 1) (a, 1) (map, 1) (phase, 1) (doc3, “…“) (there, 1) (is, 1) (a, 1) (reduce, 1) (phase, 1) REDUCER OUTPUT
  115. 115. Word Count: Group/Sort INPUT { MongoDB uses MapReduce } MAPPER GROUP/SORT REDUCER a-l (doc1, “…“) m-q { There is a map phase } { There is a reduce phase } (doc2, “…“) (map, 1) (phase, 1) r-z (doc3, “…“) (there, 1) (reduce, 1) OUTPUT
  116. 116. Word Count: Reduce INPUT { MongoDB uses MapReduce } MAPPER GROUP/SORT REDUCER (doc1, “…“) (a, [1, 1]) (is, [1, 1]) (map, [1]) { There is a map phase } (doc2, “…“) (mapreduce, [1]) (mongodb, [1]) (phase, [1, 1]) { There is a reduce phase } (doc3, “…“) (reduce, [1]) (there, [1, 1]) (uses, [1]) OUTPUT
  117. 117. Word Count: Result INPUT { MongoDB uses MapReduce } MAPPER GROUP/SORT REDUCER OUTPUT (doc1, “…“) (a, [1, 1]) (is, [1, 1]) (map, [1]) a: 2 is: 2 map: 1 { There is a map phase } (doc2, “…“) (mapreduce, [1]) (mongodb, [1]) (phase, [1, 1]) mapreduce: 1 mongodb: 1 phase: 2 { There is a reduce phase } (doc3, “…“) (reduce, [1]) (there, [1, 1]) (uses, [1]) reduce: 1 there: 2 uses: 1
  118. 118. Word Count: In a nutshell INPUT { MongoDB uses MapReduce } MAPPER (doc1, "…") GROUP/SORT REDUCER (a, [1, 1]) (is, [1, 1]) (map, [1]) OUTPUT a: 2 is: 2 map: 1 map(): Transforms one key-value pair into 0–N key-value pairs reduce(): Reduces 0–N key-value pairs into one key-value pair
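The three phases of the word count can be replayed without any database. The following is a minimal sketch in plain JavaScript using the three input documents from the slides; the text field and the function names map, group, reduce and wordCount are chosen for illustration and are not MongoDB's API.

```javascript
// map(): turn one document into zero or more (word, 1) pairs.
function map(doc) {
  return doc.text
    .toLowerCase()
    .split(/\s+/)
    .filter(word => word !== "")
    .map(word => [word, 1]);
}

// group/sort: collect all values emitted for the same key, ordered by key.
function group(pairs) {
  const grouped = new Map();
  for (const [key, value] of pairs) {
    if (!grouped.has(key)) grouped.set(key, []);
    grouped.get(key).push(value);
  }
  return new Map([...grouped.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0)));
}

// reduce(): fold the values of one key into a single value.
function reduce(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

function wordCount(docs) {
  const result = {};
  for (const [key, values] of group(docs.flatMap(map))) {
    result[key] = reduce(key, values);
  }
  return result;
}

console.log(wordCount([
  { text: "MongoDB uses MapReduce" },
  { text: "There is a map phase" },
  { text: "There is a reduce phase" }
]));
```

The printed object matches the OUTPUT column of the slides, for example a: 2, phase: 2 and mongodb: 1.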
  119. 119. Map/Reduce: Overview MongoDB Data group(k) map() emit(k,v) Shard 1 Iterates all documents sort(k) Shard 2 … Shard n reduce(k, values) finalize(k, v) • • Input = Output Can run multiple times
  120. 120. Word Count: Tweets // Example: Twitter database with tweets > db.tweets.findOne() { "_id" : ObjectId("4fb9fb91d066d657de8d6f38"), "text" : "RT @RevRunWisdom: The bravest thing that men do is love women #love", "created_at" : "Thu Sep 02 18:11:24 +0000 2010", … "user" : { "friends_count" : 0, "profile_sidebar_fill_color" : "252429", "screen_name" : "RevRunWisdom", "name" : "Rev Run", }, …
  121. 121. Word Count: map() // Map function with simple data cleansing map = function() { this.text.split(' ').forEach(function(word) { // Remove whitespace word = word.replace(/\s/g, ""); // Remove all non-word-characters word = word.replace(/\W/gm, ""); // Finally emit the cleaned up word if (word != "") { emit(word, 1) } }); };
  122. 122. Word Count: reduce() // Reduce function: sum the partial counts, because reduce() may be called again on its own output reduce = function(key, values) { return Array.sum(values); };
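Because reduce() can run several times for the same key (input = output, as noted on the overview slide), a reduce function that counts with values.length goes wrong as soon as partial results are reduced again. The sketch below simulates such a re-reduce in plain JavaScript; all function names are hypothetical.

```javascript
// Counting by length: only correct if reduce() runs exactly once per key.
function reduceByLength(key, values) {
  return values.length;
}

// Counting by sum: stays correct when partial results are reduced again.
function reduceBySum(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Simulate a re-reduce: reduce two partial batches, then reduce the two results.
function reduceInBatches(reduceFn, key, batchA, batchB) {
  return reduceFn(key, [reduceFn(key, batchA), reduceFn(key, batchB)]);
}

const ones = [1, 1, 1, 1, 1];
console.log(reduceInBatches(reduceBySum, "love", ones, ones));    // 10
console.log(reduceInBatches(reduceByLength, "love", ones, ones)); // 2
```

Ten emitted values give 10 with the summing variant, but only 2 with the length variant, because the second pass sees two partial results instead of ten ones.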
  123. 123. Word Count: Call // Show the results using the console > db.tweets.mapReduce(map, reduce, { out : { inline : 1 } } ); // Save the results to a collection > db.tweets.mapReduce(map, reduce, { out : "tweets_word_count"} ); { "result" : "tweets_word_count", "timeMillis" : 19026, "counts" : { "input" : 53641, "emit" : 559217, "reduce" : 102057, "output" : 131003 }, "ok" : 1, }
  124. 124. Word Count: Result // Top-10 of most common words in tweets > db.tweets_word_count.find().sort({"value" : -1}).limit(10) { "_id" : "Miley", "value" : 31 } { "_id" : "mil", "value" : 31 } { "_id" : "andthenihitmydougie", "value" : 30 } { "_id" : "programa", "value" : 30 } { "_id" : "Live", "value" : 29 } { "_id" : "Super", "value" : 29 } { "_id" : "cabelo", "value" : 29 } { "_id" : "listen", "value" : 29 } { "_id" : "Call", "value" : 28 } { "_id" : "DA", "value" : 28 }
  125. 125. Recommendation
  126. 126. Typical use cases • Counting, Aggregating & Summing up – Analyzing log entries & Generating log reports – Generating an inverted index – Substitute existing ETL processes • Counting unique values – Counting the number of unique visitors of a website • Filtering, Parsing & Validation – Filtering of user data – Consolidation of user-generated data • Sorting – Data analysis using complex sorting
  127. 127. Summary • The Map/Reduce framework is very versatile & powerful • Is implemented in JavaScript – You have to write your own map() and reduce() functions in JavaScript – Difficult to debug – Performance is highly influenced by the JavaScript engine • Can be used for complex data analytics • Lots of overhead for simple aggregation tasks – Summing up of data – Average of data – Grouping of data
  128. 128. Map/Reduce should be used as a last resort!
  129. 129. Lab time! Lab Nr. 04 Time box: 20 min
  130. 130. Aggregation Framework
  131. 131. Why? SELECT customer_id, SUM(price) FROM orders WHERE active=true GROUP BY customer_id
  132. 132. That's why! SELECT customer_id, SUM(price) [calculation of fields] FROM orders WHERE active=true GROUP BY customer_id [grouping of data]
  133. 133. The Aggregation Framework • Has been introduced to cover 90% of real-world aggregation use cases without using the "big hammer" Map/Reduce • Framework of methods & operators – Declarative – No own JavaScript code needed – Fixed set of methods and operators (but constantly under development by MongoDB Inc.) • Implemented in C++ – Limitations of the JavaScript engine are avoided – Better performance
  134. 134. The Aggregation Pipeline { document } → Pipeline Operator → Pipeline Operator → Pipeline Operator → Result { sum: 337, avg: 24.53, min: 2, max: 99 }
  135. 135. The Aggregation Pipeline • Processes a stream of documents – Input is a complete collection – Output is a document containing the results • Succession of pipeline operators – Each tier filters or transforms the documents – Input documents of a tier are the output documents of the previous tier
  136. 136. Call db.tweets.aggregate( { $pipeline_operator_1 }, { $pipeline_operator_2 }, { $pipeline_operator_3 }, { $pipeline_operator_4 }, ... );
  137. 137. Pipeline Operators // Old friends* $match $sort $limit $skip * from the query functionality // New friends $project $group $unwind
  138. 138. Example: Tweets // Example: Twitter database with tweets > db.tweets.findOne() { "_id" : ObjectId("4fb9fb91d066d657de8d6f38"), "text" : "RT @RevRunWisdom: The bravest thing that men do is love women #love", "created_at" : "Thu Sep 02 18:11:24 +0000 2010", … "user" : { "friends_count" : 0, "profile_sidebar_fill_color" : "252429", "screen_name" : "RevRunWisdom", "name" : "Rev Run", }, …
  139. 139. $match // Show all German users > db.tweets.aggregate( { $match : {"user.lang" : "de"} } ); // Show all users with 0 to 10 followers > db.tweets.aggregate( { $match : {"user.followers_count" : { $gte : 0, $lt : 10 } } } ); > Filters documents > Equivalent to .find()
  140. 140. $sort // Sorting using one field > db.tweets.aggregate( { $sort : {"user.friends_count" : -1} } ); // Sorting using multiple fields > db.tweets.aggregate( { $sort : {"user.lang" : 1, "user.time_zone" : 1, "user.friends_count" : -1} } ); > Sorts documents > Equivalent to .sort()
  141. 141. $limit // Limit the number of resulting documents to 3 > db.tweets.aggregate( { $sort : {"user.friends_count" : -1} }, { $limit : 3 } ); > Limits resulting documents > Equivalent to .limit()
  142. 142. $skip // Get the No.4-Twitterer according to number of friends > db.tweets.aggregate( { $sort : {"user.friends_count" : -1} }, { $skip : 3 }, { $limit : 1 } ); > Skips documents > Equivalent to .skip()
  143. 143. $project I // Limit the result document to only one field > db.tweets.aggregate( { $project : {text : 1} } ); // Remove _id > db.tweets.aggregate( { $project : {_id: 0, text : 1} } ); > Limits the fields in resulting documents
  144. 144. $project II // Rename a field > db.tweets.aggregate( { $project : {_id: 0, content_of_tweet : "$text"} } ); // Add a calculated field > db.tweets.aggregate( { $project : {_id: 0, content_of_tweet : "$text", number_of_friends : {$add: ["$user.friends_count", 10]} } } );
  145. 145. $project III // Add a subdocument > db.tweets.aggregate( { $project : {_id: 0, content_of_tweet : "$text", user : { name : "$user.name", number_of_friends : {$add: ["$user.friends_count", 10]} } } } );
  146. 146. $group I // Grouping using a single field > db.tweets.aggregate( { $group : { _id : "$user.lang", number_of_tweets : {$sum : 1} } } ); > Groups documents > Equivalent to GROUP BY in SQL
  147. 147. $group II // Grouping using multiple fields > db.tweets.aggregate( { $group : { _id : { background_image: "$user.profile_use_background_image", language: "$user.lang" }, number_of_tweets: {$sum : 1} } } );
  148. 148. $group III // Grouping with multiple calculated fields > db.tweets.aggregate( { $group : { _id : "$user.lang", number_of_tweets : {$sum : 1}, average_of_followers : {$avg : "$user.followers_count"}, minimum_of_followers : {$min : "$user.followers_count"}, maximum_of_followers : {$max : "$user.followers_count"} } } );
  149. 149. Group Aggregation Functions $min $max $avg $sum $addToSet $first $last $push
  150. 150. $unwind I // Unwind an array > db.tweets.aggregate( { $project : {_id: 0, content_of_tweet : "$text", mentioned_users : "$entities.user_mentions.name" } }, { $skip : 18 }, { $limit : 1 }, { $unwind : "$mentioned_users" } ); > Unwinds arrays and creates one document per value in the array
  151. 151. $unwind II // Resulting document without $unwind { "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS", "mentioned_users" : [ "Philanthropy", "Allison Fine" ] }
  152. 152. $unwind III // Resulting documents with $unwind { "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS", "mentioned_users" : "Philanthropy" }, { "content_of_tweet" : "RT @Philanthropy: How should nonprofit groups measure their social-media efforts? A new podcast from @afine http://ht.ly/2yFlS", "mentioned_users" : "Allison Fine" }
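What $unwind does to a document can be imitated in a few lines. This is a rough sketch in plain JavaScript, not the server implementation; the function name unwind is hypothetical and the tweet text is shortened for the example.

```javascript
// Emulate $unwind: one output document per element of doc[field].
// A missing field or an empty array yields no output; non-array values
// are also dropped in this sketch.
function unwind(docs, field) {
  return docs.flatMap(doc =>
    Array.isArray(doc[field])
      ? doc[field].map(value => ({ ...doc, [field]: value }))
      : []
  );
}

const tweet = {
  content_of_tweet: "RT @Philanthropy: How should nonprofit groups measure ...",
  mentioned_users: ["Philanthropy", "Allison Fine"]
};
console.log(unwind([tweet], "mentioned_users"));
```

The single input document becomes two documents that share content_of_tweet and each carry one of the mentioned users.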
  153. 153. Best Practices
  154. 154. Place $match at the beginning of the pipeline to reduce the number of documents as soon as possible! Best Practice #1
  155. 155. Use $project to remove unneeded fields from the documents as soon as possible! Best Practice #2
  156. 156. When being placed at the beginning of the pipeline these operators can make use of indices: $match $sort $limit $skip The above operators can equally use indices when placed before these operators: $project $unwind $group Best Practice #3
  157. 157. Mapping of MongoDB to SQL
  158. 158. Mapping SQL → MongoDB Aggregation: WHERE → $match; GROUP BY → $group; HAVING → $match; SELECT → $project; ORDER BY → $sort; LIMIT → $limit; SUM() → $sum; COUNT() → $sum; JOIN → no equivalent operator ($unwind offers somewhat similar functionality for embedded fields)
  159. 159. Example: Online shopping { cust_id: "sheldon1", ord_date: ISODate("2013-04-18T19:38:11.102Z"), status: 'purchased', price: 105.69, items: [ { sku: "nobel_price_replica", qty: 3, price: 29.90 }, { sku: "wheaton_voodoo_doll", qty: 1, price: 15.99 } ] }
  160. 160. Count all orders SQL: SELECT COUNT(*) AS count FROM orders MongoDB Aggregation: db.orders.aggregate( [ { $group: { _id: null, count: { $sum: 1 } } } ] )
  161. 161. Total order price per customer SQL: SELECT cust_id, SUM(price) AS total FROM orders GROUP BY cust_id ORDER BY total MongoDB Aggregation: db.orders.aggregate( [ { $group: { _id: "$cust_id", total: { $sum: "$price" } } }, { $sort: { total: 1 } } ] )
  162. 162. Sum up all orders over $250 SQL: SELECT cust_id, SUM(price) AS total FROM orders WHERE status = 'purchased' GROUP BY cust_id HAVING total > 250 MongoDB Aggregation: db.orders.aggregate( [ { $match: { status: 'purchased' } }, { $group: { _id: "$cust_id", total: { $sum: "$price" } } }, { $match: { total: { $gt: 250 } } } ] )
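The WHERE, GROUP BY and HAVING steps of the last example can be traced by hand in plain JavaScript. The sketch below uses hypothetical sample orders with whole-number prices and a hypothetical helper totalsOver; it only mirrors the shape of the pipeline and does not talk to MongoDB.

```javascript
// Hypothetical sample orders; only the fields used by the pipeline are included.
const orders = [
  { cust_id: "sheldon1", status: "purchased", price: 105 },
  { cust_id: "sheldon1", status: "purchased", price: 200 },
  { cust_id: "penny1", status: "purchased", price: 80 },
  { cust_id: "penny1", status: "cancelled", price: 500 }
];

// Plain JavaScript equivalent of $match -> $group -> $match.
function totalsOver(docs, status, threshold) {
  const totals = new Map();
  // WHERE / first $match: keep only orders with the given status
  for (const doc of docs.filter(d => d.status === status)) {
    // GROUP BY / $group with $sum: add up the price per customer
    totals.set(doc.cust_id, (totals.get(doc.cust_id) || 0) + doc.price);
  }
  // HAVING / second $match: keep only totals above the threshold
  return [...totals]
    .filter(([, total]) => total > threshold)
    .map(([cust_id, total]) => ({ _id: cust_id, total }));
}

console.log(totalsOver(orders, "purchased", 250)); // [ { _id: 'sheldon1', total: 305 } ]
```

The cancelled order of 500 never reaches the grouping step, which is why penny1 does not appear in the result.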
  163. 163. More examples http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/
  164. 164. Lab time! Lab Nr. 05 Time box: 20 min
  165. 165. Replication: High Availability with MongoDB
  166. 166. Why do we need replication? • Hardware is unreliable and is doomed to fail! • Do you want to be the person being called at night to do a manual failover? • How about network latency? • Different use cases for your data – “Regular” processing – Data for analysis – Data for backup
  167. 167. Life cycle of a replica set
  168. 168. Replica set – Create
  169. 169. Replica set – Initializing
  170. 170. Replica set – Node down
  171. 171. Replica set – Failover
  172. 172. Replica set – Recovery
  173. 173. Replica set – Back to normal
  174. 174. Roles & Configuration
  175. 175. Replica sets - Roles
  176. 176. Configuration I > conf = { _id : "mySet", members : [ {_id : 0, host : "A", priority : 3}, {_id : 1, host : "B", priority : 2}, {_id : 2, host : "C"}, {_id : 3, host : "D", hidden : true, priority : 0}, {_id : 4, host : "E", hidden : true, priority : 0, slaveDelay : 3600} ] } > rs.initiate(conf) (hidden members must have priority 0)
  177. 177. Configuration II (same configuration) Hosts A and B (priority 3 and 2): Primary data center
  178. 178. Configuration III (same configuration) Host C: Secondary data center (Default priority = 1)
  179. 179. Configuration IV (same configuration) Host D (hidden): Analytical data e.g. for Hadoop, Storm, BI, …
  180. 180. Configuration V (same configuration) Host E (hidden, slaveDelay 3600): Back-up node
  181. 181. Data consistency
  182. 182. Strong consistency
  183. 183. Eventual consistency
  184. 184. Write Concern • Different levels of data consistency • Acknowledged by – Network – MongoDB – Journal – Secondaries – Tagging
  185. 185. Acknowledged by network „Fire and forget“
  186. 186. Acknowledged by MongoDB Wait for Error
  187. 187. Acknowledged by Journal Wait for Journal Sync
  188. 188. Acknowledged by Secondaries Wait for Replication
  189. 189. Tagging while writing data • Available since 2.0 • Allows for fine-grained control • Each node can have multiple tags – tags: {dc: "ny"} – tags: {dc: "ny", subnet: "192.168", rack: "row3rk7"} • Allows for creating Write Concern Rules (per replica set) • Tags can be adapted without code changes and restarts
  190. 190. Tagging - Example { _id : "mySet", members : [ {_id : 0, host : "A", tags : {"dc": "ny"}}, {_id : 1, host : "B", tags : {"dc": "ny"}}, {_id : 2, host : "C", tags : {"dc": "sf"}}, {_id : 3, host : "D", tags : {"dc": "sf"}}, {_id : 4, host : "E", tags : {"dc": "cloud"}}], settings : { getLastErrorModes : { allDCs : {"dc" : 3}, someDCs : {"dc" : 2}} } } > db.blogs.insert({...}) > db.runCommand({getLastError : 1, w : "someDCs"})
  191. 191. Acknowledged by Tagging Wait for Replication (Tagging)
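How a tag-based rule such as someDCs : {"dc" : 2} is evaluated can be sketched as follows: the write counts as acknowledged once the acknowledging members cover at least two distinct values of the dc tag. The members and tags below are copied from the tagging example; the function modeSatisfied is a hypothetical illustration in plain JavaScript, not server code.

```javascript
// Members and tags copied from the tagging example slide.
const members = {
  A: { dc: "ny" }, B: { dc: "ny" }, C: { dc: "sf" }, D: { dc: "sf" }, E: { dc: "cloud" }
};

// A mode such as { dc: 2 } is met once the acknowledging members
// carry at least 2 distinct values for the "dc" tag.
function modeSatisfied(mode, ackedHosts) {
  return Object.entries(mode).every(([tag, required]) => {
    const distinct = new Set(
      ackedHosts.map(host => members[host][tag]).filter(v => v !== undefined)
    );
    return distinct.size >= required;
  });
}

console.log(modeSatisfied({ dc: 2 }, ["A", "B"])); // false: both nodes are in "ny"
console.log(modeSatisfied({ dc: 2 }, ["A", "C"])); // true: "ny" and "sf"
```

Note that the number counts distinct tag values, not nodes: two acknowledgements from the same data center do not satisfy someDCs.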
  192. 192. Configure the Write Concern // Wait for network acknowledgement > db.runCommand( { getLastError: 1, w: 0 } ) // Wait for error (Default) > db.runCommand( { getLastError: 1, w: 1 } ) // Wait for journal sync > db.runCommand( { getLastError: 1, w: 1, j: true } ) // Wait for replication > db.runCommand( { getLastError: 1, w: "majority" } ) > db.runCommand( { getLastError: 1, w: 3 } ) // # of nodes (including the primary)
  193. 193. Read Preferences • Only primary (primary) • Primary preferred (primaryPreferred) • Only secondaries (secondary) • Secondaries preferred (secondaryPreferred) • Nearest node (nearest) General: If more than one node is available, the nearest node will be chosen (all modes except primary)
  194. 194. Only primary (primary)
  195. 195. Primary preferred (primaryPreferred)
  196. 196. Only secondaries (secondary)
  197. 197. Secondaries preferred (secondaryPreferred)
  198. 198. Nearest node (nearest)
  199. 199. Tagging while reading data • Allows for more fine-grained control over where data is read from – e.g. { "disk": "ssd", "use": "reporting" } • Can be combined with other read modes – Except for mode "Only primary"
  200. 200. Configure the Read Preference // Only primary > cursor.readPref( "primary" ) // Primary preferred > cursor.readPref( "primaryPreferred" ) … // Only secondaries with tagging > cursor.readPref( "secondary", [ { rack : 2 } ] ) The read preference must be configured before using the cursor to read data!
  201. 201. MongoDB Operation
  202. 202. Maintenance & Upgrades • Zero downtime • Rolling upgrades and maintenance – Start with all secondaries – Step down the current primary – Primary as last one – Restore previous primary (if needed) • Commands: – rs.stepDown(<secs>) – db.version() – db.serverBuildInfo()
  203. 203. Replica set – 1 data center • One – Data center – Switch – Power Supply • Possible errors: – Failure of 2 nodes – Power Supply – Network – Data Center • Automatic recovery
  204. 204. Replica set – 2 data centers • Additional node for data recovery • No guarantee that writes reach both data centers, since there is only one node in data center No. 2
  205. 205. Replica set – 3 data centers • Can recover from a complete data center failure • Allows for usage of w = { dc : 2 } to guarantee writing to 2 data centers (via tagging)
  206. 206. Commands • Administration of the nodes – rs.conf() – rs.initiate(<conf>) & rs.reconfig(<conf>) – rs.add("<host>:<port>") & rs.addArb("<host>:<port>") – rs.status() – rs.stepDown(<secs>) • Reconfiguration if only a minority of the nodes is available – rs.reconfig( cfg, { force: true } )
  207. 207. Best Practices
  208. 208. Best Practices • Odd number of nodes • Adapt the write concern to your use case • Read from primary except for – Geographical distribution – Data analytics • Use logical names and not IP addresses for configuration • Monitor the replication lag of the secondaries (e.g. with MMS)
  209. 209. Lab time! Lab Nr. 06 Time box: 20 min
  210. 210. Sharding: Scaling with MongoDB
  211. 211. Visual representation of vertical scaling 1970 - 2000: Vertical Scaling „Scale up“
  212. 212. Visual representation of horizontal scaling Since 2000: Horizontal Scaling „Scale out“
  213. 213. When to use Sharding?
  214. 214. Not enough disk space
  215. 215. The working set doesn't fit into memory
  216. 216. The needs for read-/write throughput are higher than the I/O capabilities
  217. 217. Sharding MongoDB
  218. 218. Partitioning of data • The user needs to define a shard key • The shard key defines the distribution of data across the shards
  219. 219. Partitioning of data into chunks • Initially all data is in one chunk • Maximum chunk size: 64 MB • MongoDB divides and distributes chunks automatically once the maximum size is met
  220. 220. One chunk contains data of a certain value range
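Since every chunk covers one value range of the shard key, finding the right shard for a key is a range lookup. The following sketch in plain JavaScript uses a made-up chunk table with three chunks on two shards; the chunk boundaries, shard names and the function shardFor are hypothetical and only illustrate the idea.

```javascript
// Hypothetical chunk table: each chunk covers [min, max) of the shard key.
const chunks = [
  { min: -Infinity, max: "g", shard: "shard0000" },
  { min: "g", max: "p", shard: "shard0001" },
  { min: "p", max: Infinity, shard: "shard0000" }
];

// Find the shard whose chunk range contains the given shard key value.
function shardFor(key) {
  const chunk = chunks.find(c =>
    (c.min === -Infinity || key >= c.min) && (c.max === Infinity || key < c.max)
  );
  return chunk.shard;
}

console.log(shardFor("amy"), shardFor("leonard"), shardFor("sheldon"));
```

A query that contains the shard key can be sent to exactly one shard this way; a query without it has to be sent to all shards.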
  221. 221. Chunks & Shards • A shard is one node in the cluster • A shard can be one single mongod or a replica set
  222. 222. Metadata Management • Config Server – Stores the value ranges of the chunks and their location – Number of config servers is 1 or 3 (Production: 3) – Two Phase Commit
  223. 223. Balancing & Routing Service • mongos balances the data in the cluster • mongos distributes data to new nodes • mongos routes queries to the correct shard or collects results if data is spread on multiple shards • No local data
  224. 224. Automatic Balancing Balancing will be automatically done once the number of chunks between shards hits a certain threshold
  225. 225. Splitting of a chunk • Once a chunk hits the maximum size it will be split • Splitting is only a logical operation, no data needs to be moved • If the splitting of a chunk results in an imbalance of data, automatic rebalancing will be started
  226. 226. Sharding Infrastructure
  227. 227. MongoDB Auto Sharding • Minimal effort – Usage of the same interfaces for mongod and mongos • Easy configuration – Enable sharding for a database • sh.enableSharding("<database>") – Shard a collection in a database • sh.shardCollection("<database>.<collection>", shard-key-pattern)
  228. 228. Configuration example
  229. 229. Example of a very simple cluster • Never use this in production! – Only one config server (No fault tolerance) – Shard is no replica set (No high availability) – Only one mongos and one shard (No performance improvement)
  230. 230. Start the config server // Start the config server (Default port 27019) > mongod --configsvr
  231. 231. Start the mongos routing service // Start the mongos router (Default port 27017) > mongos --configdb <hostname>:27019 // When using 3 config servers > mongos --configdb <host1>:<port1>,<host2>:<port2>,<host3>:<port3>
  232. 232. Start the shard // Start a shard with one mongod (Default port 27018) > mongod --shardsvr // Shard is not yet added to the cluster!
  233. 233. Add the shard // Connect to mongos and add the shard > mongo > sh.addShard("<host>:27018") // When adding a replica set, you only need to add one of the nodes!
  234. 234. Check configuration // Check if the shard has been added > db.runCommand({ listShards:1 }) { "shards" : [ { "_id": "shard0000", "host": "<hostname>:27018" } ], "ok" : 1 }
  235. 235. Configure sharding // Enable the sharding for a database > sh.enableSharding("<dbname>") // Shard a collection using a shard key > sh.shardCollection("<dbname>.user", { "name" : 1 } ) // Use a compound shard key > sh.shardCollection("<dbname>.cars", { "year": 1, "uniqueid": 1 })
  236. 236. Shard Key
  237. 237. Shard Key • The shard key cannot be changed • The values of a shard key cannot be changed • The shard key needs to be indexed • The uniqueness of the field _id is only guaranteed within a shard • The size of a shard key is limited to 512 bytes
  238. 238. Considerations for the shard key • Cardinality of data – The value range needs to be rather large. For example, sharding on the field loglevel with the 3 values error, warning, info doesn't make sense. • Distribution of data – Always strive for equal distribution of data throughout all shards! • Patterns during reading and writing – For example, for log data, using the timestamp as a shard key can be useful if chronologically close data needs to be read or written together.
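The cardinality and distribution considerations can be checked on sample data before choosing a key. The sketch below is plain JavaScript; the `keyStats` helper and the sample documents are hypothetical. It counts the distinct values of a candidate field and the share of documents held by the most frequent value. A chunk cannot be divided below a single key value, so the number of distinct values is an upper bound on the number of chunks.

```javascript
// Report cardinality and skew of a candidate shard key field.
function keyStats(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc[field], (counts.get(doc[field]) || 0) + 1);
  }
  return {
    distinctValues: counts.size, // upper bound on the number of chunks
    largestShare: Math.max(...counts.values()) / docs.length, // skew
  };
}

// Hypothetical sample of log documents.
const logs = [
  { _id: 1, loglevel: "info" },
  { _id: 2, loglevel: "info" },
  { _id: 3, loglevel: "warning" },
  { _id: 4, loglevel: "info" },
  { _id: 5, loglevel: "error" },
  { _id: 6, loglevel: "info" },
];

console.log(keyStats(logs, "loglevel").distinctValues); // 3
console.log(keyStats(logs, "_id").distinctValues);      // 6
```

In this sample, loglevel allows at most three chunks, and two thirds of the documents share the value info, so the data could never be spread evenly.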
  239. 239. Choices for the shard key • Single field – If the value range is big enough and data is distributed almost equally • Compound fields – Use this if a single field is not enough with respect to value range and equal distribution • Hash based – In general a random shard key is a good choice for equal distribution of data – For performance the shard key should be part of the queries – Available since version 2.4 • sh.shardCollection( "user.name", { a: "hashed" } )
  240. 240. Example: User { _id: 346, username: "sheldinator", password: "238b8be8bd133b86d1e2ba191a94f549", first_name: "Sheldon", last_name: "Cooper", created_on: "Mon Apr 15 15:30:32 +0000 2013", modified_on: "Thu Apr 18 08:11:23 +0000 2013" } Which shard key would you choose and why?
  241. 241. Example: Log data { log_type: "error", // Possible values: "error", "warn", "info" application: "JBoss v. 4.2.3", message: "Fatal error. Application will quit.", created_on: "Mon Apr 15 15:38:05 +0000 2013" } Which shard key would you choose and why?
  242. 242. Routing of queries
  243. 243. Possible types of queries • Exact queries – Data is on exactly one shard • Distributed query – Data is distributed across different shards • Distributed query with sorting – Data is distributed across different shards and needs to be sorted
  244. 244. Exact queries
  245. 245. 1. mongos receives the query from the client
  246. 246. 2. Query is routed to the shard with the data
  247. 247. 3. Shard returns the data
  248. 248. 4. mongos returns the data to the client
  249. 249. Distributed queries
  250. 250. 1. mongos receives the query from the client
  251. 251. 2. mongos routes the query to all shards
  252. 252. 3. Shards return the data
  253. 253. 4. mongos returns the data to the client
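The difference between an exact and a distributed query comes down to whether the query contains the shard key. A rough sketch of that routing decision in plain JavaScript follows. The `targetShards` function, the chunk table and the shard key `year` are hypothetical, and only simple equality matches on the shard key are handled.

```javascript
// Hypothetical chunk table for a collection sharded on "year".
const chunks = [
  { min: -Infinity, max: 2000, shard: "shard0000" },
  { min: 2000, max: 2010, shard: "shard0001" },
  { min: 2010, max: Infinity, shard: "shard0002" },
];

// Decide which shards a query must be sent to.
function targetShards(chunkTable, shardKey, query) {
  if (shardKey in query) {
    // Exact query: the shard key value identifies a single chunk.
    const value = query[shardKey];
    const chunk = chunkTable.find((c) => value >= c.min && value < c.max);
    return [chunk.shard];
  }
  // Distributed query: without the shard key every shard has to be asked.
  return [...new Set(chunkTable.map((c) => c.shard))];
}

console.log(targetShards(chunks, "year", { year: 2005 }));
// [ 'shard0001' ]
console.log(targetShards(chunks, "year", { color: "red" }));
// [ 'shard0000', 'shard0001', 'shard0002' ]
```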
  254. 254. Distributed queries with sorting
  255. 255. 1. mongos receives the query from the client
  256. 256. 2. mongos routes the query to all shards
  257. 257. 3. Execute the query and local sorting
  258. 258. 4. Shards return sorted data
  259. 259. 5. mongos sorts the data globally
  260. 260. 6. mongos returns the sorted data to the client
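Since every shard already returns its results sorted, the global sort in step 5 can be done as a merge of sorted lists instead of a full re-sort. The sketch below shows one way to do such a merge in plain JavaScript. The `mergeSorted` function is hypothetical and makes no claim about how mongos is actually implemented.

```javascript
// Merge result lists that are already sorted per shard into one
// globally sorted list. compare works like a comparator for Array.sort.
function mergeSorted(shardResults, compare) {
  const positions = shardResults.map(() => 0); // read position per shard
  const merged = [];
  while (true) {
    let best = -1; // index of the shard whose next element is smallest
    for (let i = 0; i < shardResults.length; i++) {
      if (positions[i] >= shardResults[i].length) continue; // exhausted
      if (
        best === -1 ||
        compare(shardResults[i][positions[i]],
                shardResults[best][positions[best]]) < 0
      ) {
        best = i;
      }
    }
    if (best === -1) break; // all shard results consumed
    merged.push(shardResults[best][positions[best]]);
    positions[best] += 1;
  }
  return merged;
}

const perShard = [[1, 4, 9], [2, 3], [], [5]];
console.log(mergeSorted(perShard, (a, b) => a - b));
// [ 1, 2, 3, 4, 5, 9 ]
```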
  261. 261. Lab time! Lab No. 07 Time box: 20 min
  262. 262. Still want moar? https://education.mongodb.com
