This document provides an introduction to MongoDB, including:
- MongoDB is a document-based NoSQL database that stores data in BSON format and allows indexing on any field.
- It explains the basic concepts of databases, shows how data is structured differently in relational and non-relational databases, and demonstrates some examples of MongoDB documents.
- The document then covers many MongoDB concepts like collections, documents, queries, indexes, updates, and more - providing examples for each. It is a concise yet comprehensive overview of MongoDB fundamentals.
2. About me
● Delta Electronic CTBD Senior Engineer
● Main developer of http://loltw.net
○ Website built on MongoDB, serving 600k PV (page views) daily
○ Data grows every day via automated crawler bots
3. MongoDB - Simple Introduction
● Document based NOSQL (Not Only SQL) database
● Development started in 2007 at 10gen
● Written in C++
● Fast (But takes lots of memory)
● Stores JSON documents in BSON format
● Full index on any document attribute
● Horizontal scalability with auto sharding
● High availability & replica ready
4. What is database?
● Raw data
○ John is a student, he's 12 years old.
● Data
○ Student
■ name = "John"
■ age = 12
● Records
○ Student(name="John", age=12)
○ Student(name="Alice", age=11)
● Database
○ Student Table
○ Grades Table
5. Example of (relational) database
● Student
○ Student ID, Name, Age, Class ID
● Class
○ Class ID, Name
● Grade
○ Grade ID, Name
● Student Grade
○ Grade ID, StudentID, Grade
6. SQL Language - How to find data?
● Find students named John
○ select * from student where name="John"
● Find the class name of John
○ select s.name, c.name as class_name from student s, class c where s.name="John" and s.class_id=c.class_id
7. Why NOSQL?
● Big data
○ Modern data sets are too big for a single DB server
○ Google search engine
● Connectivity
○ Facebook like button
● Semi-structured data
○ Car equipment database
● High availability
○ The basis of cloud services
8. Common NOSQL DB characteristics
● Schemaless
● No join, stores pre-joined/embedded data
● Horizontal scalability
● Replica ready - High availability
9. Common types of NOSQL DB
● Key-Value
○ Based on Amazon's Dynamo paper
○ Stores K-V pairs
○ Example:
■ Dynomite
■ Voldemort
10. Common types of NOSQL DB
● Bigtable clones
○ Based on Google Bigtable paper
○ Column oriented, but handles semi-structured data
○ Data keyed by: row, column, time, index
○ Example:
■ Google Bigtable
■ HBase
■ Cassandra(FB)
11. Common types of NOSQL DB
● Document based
○ Stores multi-level K-V pairs
○ Usually use JSON as document format
○ Example:
■ MongoDB
■ CouchDB (Apache)
■ Redis (usually classed as a key-value store)
12. Common types of NOSQL DB
● Graph
○ Focus on modeling the structure of data -
interconnectivity
○ Example
■ Neo4j
■ AllegroGraph
13. Start using MongoDB - Installation
● From apt-get (debian / ubuntu only)
○ sudo apt-get install mongodb
● Using the 10gen MongoDB repository
○ http://docs.mongodb.org/manual/tutorial/install-mongodb-on-debian-or-ubuntu-linux/
● From pre-built binary or source
○ http://www.mongodb.org/downloads
● Note:
32-bit builds limited to around 2GB of data
14. Manually start your MongoDB
mkdir -p /tmp/mongo
mongod --dbpath /tmp/mongo
or
mongod -f mongodb.conf
15. Verify your MongoDB installation
$ mongo
MongoDB shell version: 2.2.0
connecting to: test
>_
--------------------------------------------------------
mongo localhost/test2
mongo 127.0.0.1/test
22. Save document into MongoDB
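// assumption: start from a fresh object (the slide that defines s is not included here)
s = {grades: {}}
s.name = "Alice"
s.age = 14
s.grades.math = 2.0
db.students.save(s)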
23. What is _id / ObjectId ?
● _id is the default primary key used to index documents; it can be any JSON-acceptable value.
● By default, MongoDB auto-generates an ObjectId as _id
● An ObjectId is a 12-byte value that is unique for each document
● Use ObjectId().getTimestamp() to extract the creation timestamp from an ObjectId
Byte layout of an ObjectId:
bytes 0-3: unix timestamp
bytes 4-6: machine
bytes 7-8: process id
bytes 9-11: increment
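The byte layout above can be checked by hand. The following is a minimal sketch in plain JavaScript (no MongoDB connection) that splits a 24-character hex ObjectId string into the four fields from the slide. The function name `parseObjectId` and the sample id are hypothetical, and newer server versions are understood to replace the machine and process id bytes with a random value, so only the timestamp part should be relied on.

```javascript
// Decode the 12-byte ObjectId layout from the slide, given its 24-character
// hex string. parseObjectId is a hypothetical helper, not a MongoDB API.
function parseObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16);   // bytes 0-3: unix timestamp
  return {
    timestamp: new Date(seconds * 1000),           // Date wants milliseconds
    machine: hex.slice(8, 14),                     // bytes 4-6
    processId: parseInt(hex.slice(14, 18), 16),    // bytes 7-8
    increment: parseInt(hex.slice(18, 24), 16),    // bytes 9-11
  };
}

// Hypothetical sample id, used only for illustration.
const parts = parseObjectId("507f1f77bcf86cd799439011");
console.log(parts.timestamp.toISOString(), parts.machine, parts.processId, parts.increment);
```

In the mongo shell the same timestamp is what ObjectId().getTimestamp() reports; the sketch only shows where that value lives inside the id.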
24. Save document with id into MongoDB
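// assumption: start from a fresh object (the slide that defines s is not included here)
s = {grades: {}}
s.name = "Bob"
s.age = 11
s['favorite subjects'] = ["music", "math", "art"]
s.grades.chinese = 3.0
s._id = 1
db.students.save(s)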
26. How to find documents?
● db.xxxx.find()
○ list all documents in collection
● db.xxxx.find(
find spec, // what matching documents look like
find fields, // which fields to return
...
)
● db.xxxx.findOne()
○ returns only the first document that matches the find spec.
29. find by name - equal or not equal
db.students.find({name: "John"})
db.students.find({name: "Alice"})
db.students.find({name: {$ne: "John"}})
● $ne : not equal
30. find by name - ignorecase ($regex)
db.students.find({name: "john"}) => X
db.students.find({name: /john/i}) => O
db.students.find({
name: {
$regex: "^b",
$options: "i"
}
})
31. find by range of names - $in, $nin
db.students.find({name: {$in: ["John", "Bob"]}})
db.students.find({name: {$nin: ["John", "Bob"]}})
● $in : value matches any item in the given array
● $nin : value matches none of the items in the given array
32. find by age - $gt, $gte, $lt, $lte
db.students.find({age: {$gt: 12}})
db.students.find({age: {$gte: 12}})
db.students.find({age: {$lt: 12}})
db.students.find({age: {$lte: 12}})
● $gt : greater than
● $gte : greater than or equal
● $lt : less than
● $lte : less than or equal
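To make the semantics of the operators above concrete, here is a toy matcher in plain JavaScript. It is a sketch only, not how MongoDB evaluates queries: it handles flat fields, ignores type ordering and missing-field rules, and the names `matches`, `find` and `names` are hypothetical. The sample students are the ones used on the slides.

```javascript
// Toy implementations of the find operators shown on the slides.
const operators = {
  $ne:  (value, arg) => value !== arg,
  $in:  (value, arg) => arg.includes(value),
  $nin: (value, arg) => !arg.includes(value),
  $gt:  (value, arg) => value > arg,
  $gte: (value, arg) => value >= arg,
  $lt:  (value, arg) => value < arg,
  $lte: (value, arg) => value <= arg,
};

// Returns true when doc satisfies every field condition in spec.
function matches(doc, spec) {
  return Object.entries(spec).every(([field, cond]) => {
    const value = doc[field];
    if (cond instanceof RegExp) {
      return typeof value === "string" && cond.test(value);
    }
    if (cond !== null && typeof cond === "object" && !Array.isArray(cond)) {
      // {field: {$op: arg, ...}}: every operator must hold
      return Object.entries(cond).every(([op, arg]) => {
        if (!(op in operators)) throw new Error("unsupported operator: " + op);
        return operators[op](value, arg);
      });
    }
    return value === cond; // plain equality, e.g. {name: "John"}
  });
}

const students = [
  { name: "John", age: 12 },
  { name: "Alice", age: 14 },
  { name: "Bob", age: 11 },
];
const find = (spec) => students.filter((doc) => matches(doc, spec));
const names = (spec) => find(spec).map((s) => s.name);

console.log(names({ age: { $gte: 12 } }));
console.log(names({ name: { $nin: ["John", "Bob"] } }));
console.log(names({ name: /john/i }));
```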
33. find by field existence - $exists
db.students.find({registered: {$exists: true}})
db.students.find({registered: {$exists: false}})
34. find by field type - $type
db.students.find({_id: {$type: 7}})
db.students.find({_id: {$type: 1}})
Code  Type          Code  Type
1     Double        11    Regular expression
2     String        13    JavaScript code
3     Object        14    Symbol
4     Array         15    JavaScript code with scope
5     Binary Data   16    32 bit integer
7     Object id     17    Timestamp
8     Boolean       18    64 bit integer
9     Date          255   Min key
10    Null          127   Max key
42. find with bool operators - $not
$not can only be used to negate another find operator, not a plain value
X db.students.find({registered: {$not: false}})
O db.students.find({registered: {$ne: false}})
O db.students.find({age: {$not: {$gte: 12}}})
46. more cursor functions
● snapshot
ensure cursor returns
○ no duplicates
○ misses no object
○ returns all matching objects that were present at
the beginning and the end of the query.
○ usually for export/dump usage
47. more cursor functions
● batchSize
tell MongoDB how many documents should
be sent to client at once
● explain
for performance profiling
● hint
tell MongoDB which index should be used
for querying/sorting
48. list current running operations
● list operations
db.currentOp()
● cancel an operation
db.killOp(opid)
49. MongoDB index - when to use index?
● when running complicated find queries
● when sorting lots of data
50. MongoDB index - sort() example
for (i=0; i<1000000; i++){
db.many.save({value: i});
}
db.many.find().sort({value: -1})
error: {
"$err" : "too much data for sort() with no index. add an index or specify a smaller limit",
"code" : 10128
}
51. MongoDB index - how to build index
db.many.ensureIndex({value: 1})
● Index options
○ background
○ unique
○ dropDups
○ sparse
52. MongoDB index - index commands
● list index
db.many.getIndexes()
● drop index
db.many.dropIndex({value: 1})
db.many.dropIndexes() <-- DANGER!
53. MongoDB Index - find() example
db.many.dropIndex({value: 1})
db.many.find({value: 5555}).explain()
db.many.ensureIndex({value: 1})
db.many.find({value: 5555}).explain()
54. MongoDB Index - Compound Index
db.xxx.ensureIndex({a:1, b:-1, c:1})
query/sort with fields
● a
● a, b
● a, b, c
will be accelerated by this index
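The prefix rule above can be written down as a small check. This is a simplified sketch in plain JavaScript: it only tests whether a set of fields is exactly the first N keys of the index, which is the rule the slide lists. The real query planner is more nuanced (for example, it can still use part of an index), and `usesIndexPrefix` is a hypothetical helper name.

```javascript
// Returns true when `fields` are exactly the first N keys of the index spec,
// in any order. Simplified model of the rule on the slide.
function usesIndexPrefix(indexSpec, fields) {
  const indexKeys = Object.keys(indexSpec);            // keeps declaration order
  const unique = new Set(fields);
  if (unique.size === 0 || unique.size !== fields.length) return false;
  if (fields.length > indexKeys.length) return false;
  const prefix = new Set(indexKeys.slice(0, fields.length));
  return fields.every((f) => prefix.has(f));
}

const index = { a: 1, b: -1, c: 1 };
console.log(usesIndexPrefix(index, ["a"]));            // a alone is a prefix
console.log(usesIndexPrefix(index, ["a", "b"]));       // a, b is a prefix
console.log(usesIndexPrefix(index, ["b", "c"]));       // skips a, so not a prefix
```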
55. Remove/Drop data from MongoDB
● Remove
db.many.remove({value: 5555})
db.many.find({value: 5555})
db.many.remove()
● Drop
db.many.drop()
● Drop database
db.dropDatabase() EXTREMELY DANGEROUS!!!
56. How to update data in MongoDB
Easiest way:
s = db.students.findOne({_id: 1})
s.registered = true
db.students.save(s)
57. In place update - update()
update( {find spec},
{update spec},
upsert=false)
db.students.update(
{_id: 1},
{$set: {registered: false}}
)
66. Practice: add comments to student
Add a field into students ({_id: 1}):
● field name: comments
● field type: array of dictionary
● field content:
○ {
by: author name, string
text: content of comment, string
}
● add at least 3 comments to this field
67. Example answer to practice
db.students.update({_id: 1}, {
$addToSet: { comments: {$each: [
{by: "teacher01", text: "text 01"},
{by: "teacher02", text: "text 02"},
{by: "teacher03", text: "text 03"},
]}}
})
68. The $ position operator (for array)
db.students.update({
_id: 1,
"comments.by": "teacher02"
}, {
$inc: {"comments.$.vote": 1}
})
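What the update above does can be modelled in plain JavaScript: "$" stands for the position of the first array element matched by the query, and $inc creates the field when it is missing. The sketch below works on an in-memory object only; `incPositional` is a hypothetical helper, and the comments array mirrors the practice answer from the earlier slide.

```javascript
// Toy model of {"comments.by": X} + {$inc: {"comments.$.vote": n}}.
function incPositional(doc, arrayField, matchField, matchValue, incField, amount) {
  // "$" resolves to the index of the FIRST matching array element
  const position = doc[arrayField].findIndex((el) => el[matchField] === matchValue);
  if (position === -1) return false;                       // nothing matched, nothing updated
  const element = doc[arrayField][position];
  element[incField] = (element[incField] || 0) + amount;   // $inc starts from 0 if the field is absent
  return true;
}

const student = {
  _id: 1,
  comments: [
    { by: "teacher01", text: "text 01" },
    { by: "teacher02", text: "text 02" },
    { by: "teacher03", text: "text 03" },
  ],
};

incPositional(student, "comments", "by", "teacher02", "vote", 1);
console.log(student.comments[1]);
```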
69. Atomically update - findAndModify
● Atomically update SINGLE DOCUMENT and
return it
● By default, returned document won't
contain the modification made in
findAndModify command.
70. findAndModify parameters
db.xxx.findAndModify({
query: filter to query
sort: how to sort and select 1st document in query results
remove: set true if you want to remove it
update: update content
new: set true if you want to get the modified object
fields: which fields to fetch
upsert: create object if not exists
})
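The effect of the `new` parameter is easiest to see with a counter. The sketch below is an in-memory stand-in written in plain JavaScript, not the real command: `findAndModifySketch`, the `counters` array and its contents are hypothetical, the query and update are passed as functions for brevity, and it does not model concurrency, which is the whole point of the real atomic command.

```javascript
// Toy findAndModify: pick the first matching document, update it in place,
// and return either the old or the new version depending on `new`.
function findAndModifySketch(collection, { query, update, new: returnNew = false }) {
  const doc = collection.find(query);
  if (!doc) return null;
  const before = { ...doc };              // shallow copy of the pre-update state
  update(doc);
  return returnNew ? { ...doc } : before;
}

// Hypothetical counter collection, used only for illustration.
const counters = [{ _id: "ticket", seq: 0 }];
const isTicket = (d) => d._id === "ticket";
const incSeq = (d) => { d.seq += 1; };

const oldDoc = findAndModifySketch(counters, { query: isTicket, update: incSeq });
const newDoc = findAndModifySketch(counters, { query: isTicket, update: incSeq, new: true });
console.log(oldDoc.seq, newDoc.seq);      // prints "0 2"
```

The first call returns the document as it was before the update (the default), the second returns it after the update.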
71. GridFS
● MongoDB has a 16MB document size limit
● GridFS is for storing larger binary objects in MongoDB
● GridFS is a specification, not an implementation
● The implementation is provided by MongoDB drivers
● Current supported drivers:
○ PHP
○ Java
○ Python
○ Ruby
○ Perl
72. GridFS - command line tools
● List
mongofiles list
● Put
mongofiles put xxx.txt
● Get
mongofiles get xxx.txt
73. MongoDB config - basic
● dbpath
○ Which folder to put MongoDB database files
○ MongoDB must have write permission to this folder
● logpath, logappend
○ logpath = log filename
○ MongoDB must have write permission to log file
● bind_ip
○ IP(s) MongoDB will bind to; the default is all
○ Use commas to separate more than 1 IP
● port
○ Port number MongoDB will use
○ Default port = 27017
74. Small tip - rotate MongoDB log
db.getMongo().getDB("admin").runCommand("logRotate")
75. MongoDB config - journal
● journal
○ Set journal on/off
○ Usually you should keep this on
76. MongoDB config - http interface
● nohttpinterface
○ Disables the HTTP interface, which by default listens on http://localhost:28017
○ The HTTP interface shows statistics info
● rest
○ Only takes effect when the HTTP interface is enabled
○ Example:
http://localhost:28017/test/students/
http://localhost:28017/test/students/?filter_name=John
77. MongoDB config - authentication
● auth
○ By default, MongoDB runs with no authentication
○ If no admin account is created, you could login with
no authentication through local mongo shell and
start managing user accounts.
78. MongoDB account management
● Add admin user
> mongo localhost/admin
db.addUser("testadmin", "1234")
● Authenticated as admin user
use admin
db.auth("testadmin", "1234")
79. MongoDB account management
● Add user to test database
use test
db.addUser("testrw", "1234")
● Add read only user to test database
db.addUser("testro", "1234", true)
● List users
db.system.users.find()
● Remove user
db.removeUser("testro")
80. MongoDB config - authentication
● keyFile
○ At least 6 characters and size smaller than 1KB
○ Used only for replica/sharding servers
○ Every replica/sharding server should use the same
key file for communication
○ On U*ix system, file permission to key file for
group/everyone must be none, or MongoDB will
refuse to start
81. MongoDB configuration - Replica Set
● replSet
○ Indicate the replica set name
○ All MongoDB in same replica set should use the
same name
○ Limitation
■ Maximum 12 nodes in a single replica set
■ Maximum 7 nodes can vote
○ MongoDB replica sets are eventually consistent
82. How's MongoDB replica set working?
● Each replica set has a single primary (master) node and multiple slave nodes
● Data is only written to the primary node, then synced to the slave nodes.
● Use getLastError() to confirm that the previous write operation has been committed to the whole replica set; otherwise the write may be rolled back if the primary node goes down before the sync.
83. How's MongoDB replica set working?
● Once the primary node is down, the whole replica set is marked as failed and accepts no operations until the other nodes vote and elect a new primary node.
● During failover, any write operation not committed to the whole replica set will be rolled back
84. Simple replica set configuration
mkdir -p /tmp/db01
mkdir -p /tmp/db02
mkdir -p /tmp/db03
mongod --replSet test --port 29001 --dbpath /tmp/db01
mongod --replSet test --port 29002 --dbpath /tmp/db02
mongod --replSet test --port 29003 --dbpath /tmp/db03
86. Another way to configure a replica set
rs.initiate()
rs.add("localhost:29001")
rs.add("localhost:29002")
rs.add("localhost:29003")
87. Extra options for setting replica set
● arbiterOnly
○ Arbiter nodes don't receive data, can't become
primary node but can vote.
● priority
○ Node with priority 0 will never be elected as
primary node.
○ Higher priority nodes will be preferred as primary
○ To force a node to become the primary node, do not change its votes; update the node's priority value and reconfigure the replica set.
● buildIndexes
○ Can only be set to false on nodes with priority 0
○ Use false for backup only nodes
88. Extra options for setting replica set
● hidden
○ Nodes marked with hidden option will not be
exposed to MongoDB clients.
○ Nodes marked with hidden option will not receive
queries.
○ Only use this option for nodes with usage like
reporting, integration, backup, etc.
● slaveDelay
○ How many seconds this slave node is kept behind the primary node
○ Can only be set on nodes with priority 0
○ Used for preventing some human errors
89. Extra options for setting replica set
● votes
If set to 1, this node can vote; otherwise it cannot.
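How the options above interact during failover can be sketched in plain JavaScript. This is a heavily simplified model under stated assumptions: a primary needs a strict majority of all votes to be reachable, arbiters and priority 0 nodes are never candidates, and the highest priority wins. It ignores replication lag and everything else the real election considers. `electPrimary`, the priorities, and the member list are hypothetical; the ports come from the earlier example.

```javascript
// Hypothetical three-member set: two data nodes and one arbiter.
const members = [
  { host: "localhost:29001", priority: 2, votes: 1, arbiterOnly: false },
  { host: "localhost:29002", priority: 1, votes: 1, arbiterOnly: false },
  { host: "localhost:29003", priority: 0, votes: 1, arbiterOnly: true },
];

// Returns the host that would become primary, or null if no election can succeed.
function electPrimary(allMembers, downHosts = []) {
  const up = allMembers.filter((m) => !downHosts.includes(m.host));
  const totalVotes = allMembers.reduce((sum, m) => sum + m.votes, 0);
  const upVotes = up.reduce((sum, m) => sum + m.votes, 0);
  if (upVotes * 2 <= totalVotes) return null;       // no strict majority of votes
  const candidates = up.filter((m) => !m.arbiterOnly && m.priority > 0);
  if (candidates.length === 0) return null;         // only arbiters / priority 0 nodes left
  return candidates.reduce((best, m) => (m.priority > best.priority ? m : best)).host;
}

console.log(electPrimary(members));
console.log(electPrimary(members, ["localhost:29001"]));
console.log(electPrimary(members, ["localhost:29001", "localhost:29003"]));
```

The last call shows why an arbiter is useful: with only one of three votes reachable, the surviving data node cannot become primary.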
91. What is sharding?
[Figure: a table with Name and Value columns, sorted by Name (Alice, Amy, Bob, ..., Yoko, Zeus), is split by name range into three shards: A to F, G to N, and O to Z.]
93. Elements of MongoDB sharding cluster
● Config Server
Storing sharding cluster metadata
● mongos Router
Routing database operations to correct
shard server
● Shard Server
Hold real user data
94. Sharding config - config server
● A config server is a MongoDB instance that runs with the --configsvr option
● Config servers are automatically kept in sync by the mongos process, so DO NOT run them with the --replSet option
● The synchronous replication protocol is optimized for three machines.
95. Sharding config - mongos Router
● Use mongos (not mongod) for starting a
mongos router
● mongos routes database operations to
correct shard servers
● Example command for starting mongos
mongos --configdb db01,db02,db03
● With --chunkSize option, you could specify
a smaller sharding chunk if you're just
testing.
96. Sharding config - shard server
● A shard server is a MongoDB instance that runs with the --shardsvr option
● Shard servers don't need to know where the config servers / mongos routers are
99. Let us insert some documents
use test
for (i=0; i<1000000; i++) {
db.shardtest.insert({value: i});
}
100. Remove 1 shard & see what happens
use admin
db.runCommand({removeshard: "shard0002"})
Let's add it back
db.runCommand({addshard: "localhost:29003"})
101. Pick your sharding key wisely
● The sharding key cannot be changed after sharding is enabled
● To update any document in a sharded cluster, the sharding key MUST BE INCLUDED in the find spec
EX:
sharding key= {name: 1, class: 1}
db.xxx.update({name: "xxxx", class: "ooo"},{
..... update spec
})
102. Pick your sharding key wisely
● Sharding key will strongly affect your data
distribution model
EX:
sharding by ObjectId
shard001 => data saved 2 months ago
shard002 => data saved 1 month ago
shard003 => data saved recently
103. Other sharding key examples
EX:
sharding by Username
shard001 => Username starts with a to k
shard002 => Username starts with l to r
shard003 => Username starts with s to z
EX:
sharding by md5
completely random distribution
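The two examples above can be compared with a small routing sketch in plain JavaScript (Node's built-in crypto module is assumed for md5). It is an illustration only: real MongoDB splits data into chunks by key range and balances those chunks across shards, it does not take a hash modulo the shard count. `shardByRange` and `shardByMd5` are hypothetical names; the ranges are the ones from the slide.

```javascript
const crypto = require("crypto");

// Ranges copied from the slide's username example.
const ranges = [
  { shard: "shard001", from: "a", to: "k" },
  { shard: "shard002", from: "l", to: "r" },
  { shard: "shard003", from: "s", to: "z" },
];

// Range-based routing: neighbouring usernames land on the same shard.
function shardByRange(username) {
  if (!username) return null;
  const first = username[0].toLowerCase();
  const hit = ranges.find((r) => first >= r.from && first <= r.to);
  return hit ? hit.shard : null;
}

// md5-based routing: similar usernames scatter across shards.
function shardByMd5(username, shardCount) {
  const digest = crypto.createHash("md5").update(username).digest();
  return digest.readUInt32BE(0) % shardCount;   // first 4 bytes as an unsigned int
}

for (const name of ["Alice", "Amy", "Bob", "Yoko", "Zeus"]) {
  console.log(name, shardByRange(name), shardByMd5(name, 3));
}
```

With range routing, Alice, Amy and Bob all hit shard001, which is the uneven distribution the slides warn about; the md5 column shows the same names spread without regard to their spelling.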
104. What is Mapreduce?
● Map then Reduce
● Map calls a function on each document; the function emits keys & values, which are sent to the reduce function
● Reduce calls a function that combines the keys & values emitted by the map function into a single reduced result.
● Example: map students' grades and reduce them into the students' total grades.
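The grades example can be traced in plain JavaScript without a server. The sketch below mimics the shape of the idea only: in the mongo shell, map reads the document through `this` and calls a global emit, whereas here emit is passed in as an argument, and reduce is called exactly once per key, which the real engine does not promise. `mapReduceSketch` and the sample grade documents are hypothetical.

```javascript
// Hypothetical grade documents, used only for illustration.
const gradeDocs = [
  { student: "John", subject: "math", grade: 3 },
  { student: "John", subject: "chinese", grade: 2 },
  { student: "Alice", subject: "math", grade: 2 },
  { student: "Bob", subject: "chinese", grade: 3 },
];

function mapReduceSketch(docs, map, reduce) {
  const emitted = new Map();                       // key -> list of emitted values
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  docs.forEach((doc) => map.call(doc, emit));      // map phase: doc is bound to `this`
  const results = {};
  for (const [key, values] of emitted) {
    results[key] = reduce(key, values);            // reduce phase: one result per key
  }
  return results;
}

const totals = mapReduceSketch(
  gradeDocs,
  function (emit) { emit(this.student, this.grade); },
  (key, values) => values.reduce((sum, v) => sum + v, 0)
);
console.log(totals);
```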
105. How to call mapreduce in MongoDB
db.xxx.mapReduce(
map function,
reduce function,{
out: output option,
query: query filter, optional,
sort: sort filter, optional,
finalize: finalize function,
.... etc
})
111. Run mapreduce again with finalize
db.grades.mapReduce(
map,
reduce,
{out:{inline: 1}, finalize: finalize}
)
112. Mapreduce output options
● {replace: <result collection name>}
Replace the result collection if it already exists.
● {merge: <result collection name>}
Merge into the existing result collection; keys that already exist are overwritten with the new results.
● {reduce: <result collection name>}
Run reduce if the same key exists in both the old and current result collections. Will run the finalize function, if any.
● {inline: 1}
Return the result in memory instead of writing it to a collection
113. Other mapreduce output options
● db - put the result collection in a different database
● sharded - the output collection will be sharded using key = _id
● nonAtomic - partial reduce results will be visible while processing.
114. MongoDB backup & restore
● mongodump
mongodump -h localhost:27017
● mongorestore
mongorestore -h localhost:27017 --drop
● mongoexport
mongoexport -d test -c students -h localhost:27017 > students.json
● mongoimport
mongoimport -d test -c students -h localhost:27017 < students.json
115. Conclusion - Pros of MongoDB
● Agile (Schemaless)
● Easy to use
● Built in replica & sharding
● Mapreduce with sharding
116. Conclusion - Cons of MongoDB
● Schemaless = everyone needs to know what the data looks like
● Wastes space on keys
● Eats lots of memory
● Mapreduce is hard to handle
117. Cautions of MongoDB
● Global write lock
○ Add more RAM
○ Use a newer version (MongoDB 2.2 replaces the global write lock with a DB-level write lock)
○ Split your database properly
● Removing documents won't free disk space
○ You need to run the compact command periodically
● Don't let your MongoDB data disk get full
○ Once the disk used by MongoDB is full, you won't be able to move/delete documents in it.