Thiago Veiga
MongoDB is an open-source document database that provides high performance, high
availability, and automatic scaling.
What is MongoDB?
Why should I use MongoDB?
When should I use MongoDB?
• Account and user profiles: can store arrays of addresses
• CMS: the flexible schema of MongoDB is great for heterogeneous collections of content
types
• Form data: MongoDB makes it easy to evolve the structure of form data over time
• Logs / user-generated content: can keep data with complex relationships together in one
object
• Messaging: vary message meta-data easily per message or message type without needing
to maintain separate collections or schemas
• System configuration: just a nice object graph of configuration values, which is very
natural in MongoDB
• Log data of any kind: structured log data is the future
• Graphs: just objects and pointers – a perfect fit
• Location based data: MongoDB understands geo-spatial coordinates and natively supports
geo-spatial indexing
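As a sketch of that schema flexibility, two hypothetical documents of different content types can live side by side in one collection (the names and fields below are invented for illustration):

```javascript
// Two hypothetical documents that could live in the same "content"
// collection: MongoDB does not force them to share a schema.
const article = { type: "article", title: "Hello", body: "...", tags: ["intro"] };
const video   = { type: "video",   title: "Demo",  url: "http://example.com/v1", durationSec: 90 };

// Application code can branch on a discriminator field instead of
// maintaining one table per content type.
function describe(doc) {
  return doc.type === "video"
    ? `${doc.title} (${doc.durationSec}s)`
    : `${doc.title} [${doc.tags.join(", ")}]`;
}
```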
• Queries: MongoDB supports field, range, and regular-expression queries
• Indexing: Any field in a MongoDB collection can be indexed
• Replication: MongoDB provides high availability with replica sets
• Load Balancing: MongoDB scales horizontally using sharding; a shard key determines
how the data in a collection will be distributed
• File Storage: MongoDB provides GridFS for storing files
• Aggregation: The MongoDB aggregation framework can be used for map-reduce and batch
processing
• Fixed size collections: MongoDB supports fixed-size collections called capped collections
Document Database
A record in MongoDB is a document, which is a data structure composed of field and
value pairs. MongoDB documents are similar to JSON objects. The values of fields may
include other documents, arrays, and arrays of documents.
The advantages of using documents are:
•Documents (i.e. objects) correspond to native data types in many
programming languages.
•Embedded documents and arrays reduce need for expensive joins.
•Dynamic schema supports fluent polymorphism.
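For example, a hypothetical order document with embedded line items (invented data) shows how a read avoids a join:

```javascript
// A hypothetical order document: line items are embedded as an array
// of sub-documents, so reading the whole order needs no join.
const order = {
  _id: "ord-1",
  customer: { name: "Ana", city: "Recife" },  // embedded document
  line_items: [                               // array of documents
    { sku: "555b", name: "Coltrane: Impressions", qty: 1 },
    { sku: "123a", name: "Davis: Kind of Blue",   qty: 2 }
  ]
};

// Total quantity is computed from the single object in memory.
const totalQty = order.line_items.reduce((sum, li) => sum + li.qty, 0);
```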
MongoDB Installation:
• Download a build from https://www.mongodb.com/download-center
• Decompress and run
Start Mongod
• Create the default data directory: /data/db (or C:\data\db on Windows)
• Start mongod or mongod.exe
• To verify that you can connect to the server, start the shell: mongo or mongo.exe.
• Then just type exit and press Enter.
Shutting Down Mongod
• 1. When mongod is running attached to a controlling terminal, press Control-C.
• 2. Execute the following command from the operating system prompt:
• mongo --eval 'db.adminCommand( { "shutdown" : 1 } )'
• 3. On Linux/Unix systems, sending a TERM or INT signal, e.g., kill -TERM <pid-of-mongod>.
Data File Allocation
Each database will have at least two data files: one ending in .ns and the rest named with
integers starting at 0.
-rw------- 1 tveiga group 67108864 Aug 29 12:57 pessoa.0
-rw------- 1 tveiga group 134217728 Aug 29 12:57 pessoa.1
-rw------- 1 tveiga group 16777216 Aug 29 12:57 pessoa.ns
The .ns file stores metadata about namespaces (collections and indexes). The number of namespaces is proportional to
the size of the .ns file. Each database can have up to 24,000 namespaces by default, although the size of these files,
and thus the number of namespaces, can be increased with the --nssize option (up to 2GB).
By default, datafiles start at 64MB and double in size with each additional datafile, up to 2GB. Additionally, on some
platforms, mongod allocates one more numbered data file than it needs, to improve throughput.
Thus, it’s possible for allocated size to be much larger than data size. If this presents a problem, you can use some
combination of server options:
--smallfiles // quarters the sizes of data files
--noprealloc // inhibits preallocation of extra files
The Lock File
In order to protect against the possibility that multiple mongod processes might try to use a set of database files in
conflicting ways, there is a lock file called mongod.lock.
-rw------- 1 tveiga group 5 Aug 29 12:57 mongod.lock
The Journal Subdirectory
The mongod process is able to employ a write-ahead journal to speed up data file recovery in the event of a server
crash. The journal’s files are stored in a subdirectory of the dbpath called journal.
Log Files
MongoDB servers log informational messages as part of normal operation. By default, a server process’s log is written to standard
output. You can have the server write the log to a file with the options
--logpath /var/mongodb/mongodb.log --logappend
To rotate the log file, run:
db.adminCommand( { "logRotate" : 1 } ) ;
Config Files
All of these options can be specified in a config file. Any option that takes an argument is specified as option = argument.
Options that don’t take arguments are specified as option = true. An example config file would look something like this:
fork = true
# vvv = true
logpath = /var/mongodb/mongodb.log
You can then invoke mongod with the config file like so:
mongod --config mongod.conf
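As an illustration of the `option = argument` format, a minimal parser sketch (not mongod's actual parser) behaves like this:

```javascript
// A minimal sketch of the "option = argument" config-file format
// described above; real mongod parsing is more involved.
function parseConfig(text) {
  const opts = {};
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith("#")) continue; // skip blanks/comments
    const [key, value] = trimmed.split("=").map(s => s.trim());
    opts[key] = value === "true" ? true : value;       // bare flags become true
  }
  return opts;
}

const conf = parseConfig("fork = true\n# vvv = true\nlogpath = /var/mongodb/mongodb.log");
```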
MongoDB’s concurrency model
∙ Read operations block write operations.
∙ A write operation blocks everything.
∙ A pending write operation prevents new read operations.
∙ All operations yield occasionally, but only between documents.
Indexing
An index is a data structure that is used by Mongo’s query optimizer to quickly sort through and order the
documents in a collection. Formally speaking, these indexes are implemented as B-Tree-style indexes.
Try this query with the twitter data set:
use twitter
db.tweets.find( { "user.followers_count" : 1000 } ) ;
db.tweets.find( { "user.followers_count" : 1000 } ).explain() ;
Look at the output from explain.
Explain()
A great way to get more information on the performance of your database queries is to use the explain
method on the cursor. The result will be a document that contains the explain output. Note that explain
runs the actual query to determine the result.
Some of the important fields in the explain output are explained below:
cursor : This is either a BasicCursor which indicates a table scan operation or a BtreeCursor which means
an index was used.
nscanned : Number of items (documents or index entries) examined.
n : Number of documents matched (on all criteria specified).
The ratio n / nscanned is a rough measure of how effective the index is for that query. For an effective index, this
ratio should be close to 1.
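A small sketch with hypothetical explain documents makes the ratio concrete:

```javascript
// Sketch: judging index effectiveness from hypothetical explain output.
function selectivity(explainDoc) {
  return explainDoc.n / explainDoc.nscanned;
}

// With an index, ideally every scanned entry is a match (ratio near 1).
const indexed = { cursor: "BtreeCursor user.followers_count_1", n: 50, nscanned: 50 };
// A table scan examines every document to find the same 50 matches.
const scanned = { cursor: "BasicCursor", n: 50, nscanned: 10000 };
```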
Create Index
MongoDB by default creates a unique index on the _id field for all collections.
db.tweets.ensureIndex( { "user.followers_count" : 1 } ) ;
db.tweets.ensureIndex( { "user.screen_name" : 1, "created_at" : -1 } ) ;
{
"name" : "Raleigh",
"tags" : [ "north" , "carolina" , "unc" ]
}
{
"line_items" :
[
{
"sku" : "555b",
"name" : "Coltrane: Impressions"
},
{
"sku" : "123a",
"name" : "Davis: Kind of Blue"
}
]
}
db.cities.ensureIndex( { "tags" : 1 } ) ;
db.cities.find( { "tags" : "south" } ) ;
db.orders.ensureIndex( { "line_items.sku" : 1 } ) ;
db.orders.find( { "line_items.sku" : "123a" } ) ;
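Conceptually, a multi-key index on an array field creates one entry per array element, all pointing back at the same document. A simplified sketch (not MongoDB's actual B-tree code):

```javascript
// Sketch of how a multi-key index conceptually expands an array field:
// one index entry per array element, each pointing at the same document.
function multikeyEntries(doc, field) {
  const values = doc[field];
  return (Array.isArray(values) ? values : [values])
    .map(v => ({ key: v, docId: doc._id }));
}

const city = { _id: "raleigh", name: "Raleigh", tags: ["north", "carolina", "unc"] };
const entries = multikeyEntries(city, "tags");
```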
Schema Design
In MongoDB, the basic rubric for schema design is store your data the way your application wants to see it.
Some things to keep in mind
1. Whether to embed data in subdocuments or to refer to separate documents by key fields. Usually,
one embeds data that is seldom changed (either truly immutable or only rarely mutated), and
data that is not interesting enough to be represented as a document on its own (e.g., tags or
labels tend to be represented as strings rather than normalized into their own documents).
2. Whether to store embedded data positionally (with arrays) or by named fields (with nested
documents). This is often a matter of taste, but sometimes relates to what can be queried/indexed
efficiently (i.e., whether you need to be able to use a multi-key index).
3. Whether to put possibly-related data together into fewer, larger documents or to split them into
more numerous but smaller documents (possibly across separate collections). In general, it’s best
to design your documents to fit what the application needs; data you store but never look at
just costs you working space.
4. When you have immutable (or seldom mutated) fields, whether to denormalize values over documents.
If business requirements permit some data to be immutable, then you can freely duplicate
data around in any document to reduce round-trips to your servers. (For instance, in a product
review system, there might be a Users collection with canonical username information. If username
is permitted to be immutable, then you can embed it in review documents without concern
about update inconsistencies.)
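The review example can be sketched with invented documents:

```javascript
// Hypothetical documents for the product-review example: the immutable
// username is copied into each review, so listing reviews needs no
// second round-trip to the users collection.
const user = { _id: "u1", username: "tveiga", email: "t@example.com" };

const review = {
  _id: "r1",
  product: "p42",
  username: user.username, // denormalized; safe because it never changes
  stars: 5
};
```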
Storage Engines
The storage engine is the component of the database that is responsible for managing how data is stored, both in memory and
on disk. MongoDB supports multiple storage engines, as different engines perform better for specific workloads. Choosing the
appropriate storage engine for your use case can significantly impact the performance of your applications.
WiredTiger is the default storage engine starting in MongoDB 3.2. It is well-suited for most workloads and is recommended for
new deployments. WiredTiger provides a document-level concurrency model, checkpointing, and compression, among other
features. In MongoDB Enterprise, WiredTiger also supports Encryption at Rest.
MMAPv1 is the original MongoDB storage engine and is the default storage engine for MongoDB versions before 3.2. It performs
well on workloads with high volumes of reads and writes, as well as in-place updates.
The In-Memory Storage Engine is available in MongoDB Enterprise. Rather than storing documents on-disk, it retains them in-
memory for more predictable data latencies.
Journaling
To provide durability in the event of a failure, MongoDB uses write ahead logging to on-disk journal files.
Journal Files
For the journal files, MongoDB creates a subdirectory named journal under the dbPath directory. WiredTiger journal files have
names with the following format WiredTigerLog.<sequence> where <sequence> is a zero-padded number starting from
000000001.
Journal files contain one record per write operation. Each record has a unique identifier.
MongoDB configures WiredTiger to use snappy compression for the journaling data.
Minimum log record size for WiredTiger is 128 bytes. If a log record is 128 bytes or smaller, WiredTiger does not compress that
record.
WiredTiger journal files for MongoDB have a maximum size limit of approximately 100 MB. Once the file exceeds that limit,
WiredTiger creates a new journal file.
WiredTiger automatically removes old journal files to maintain only the files needed to recover from last checkpoint.
WiredTiger will pre-allocate journal files.
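The naming scheme can be sketched like this (the padding width is inferred from the example sequence shown above):

```javascript
// Sketch of the journal file naming scheme described above:
// WiredTigerLog.<sequence>, a zero-padded number starting from 000000001.
function journalFileName(sequence) {
  return "WiredTigerLog." + String(sequence).padStart(9, "0");
}
```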
Durability, Availability, and Replica Sets
Like any other data storage system, unless you’re making sure to put copies of your data into places that fail
separately from one another, your data isn’t really durable or available in the presence of failures (power outages,
network partitions, hardware failures, etc.) For this reason, MongoDB has a built-in replication model based on
coordination among a number of mongod processes, called a Replica Set.
Replica Set Basics
A replica set is a group of mongod processes that allow you to have your data duplicated over several hosts, ideally
distributed among several data centers. Replica set members all know about each other, and each member
communicates with every other member occasionally, so it’s important to ensure network connectivity between all
the hosts where your replica set members will run.
In a replica set, at any moment there is at most one writable set member, called the primary node, or just the
primary. By default, all other members of a replica set request descriptions of the data changes that happen on the
primary, and apply those changes to their own copies of the primary’s data and indexes; these members that store
data but aren’t writable at a particular point in time are called secondary nodes, or just secondaries. Secondaries
constantly request new data changes, but it’s important to know that replication in MongoDB is asynchronous and in
no way a distributed transaction.
Automatic Failover and Primary Elections
Whenever a replica set’s primary becomes unavailable (e.g., goes offline), the remainder of the set may try to elect a
new primary node. In order for a subset of a replica set to perform an election, the subset must consist of a strict
majority of the set’s normal composition. For example, if the replica set normally has 4 members, the set will be able
to elect a primary whenever 3 or 4 members are online and able to communicate with each other; if only 2 members
can communicate, neither of those members will be a primary.
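The strict-majority rule is just an integer comparison; a sketch:

```javascript
// Sketch of the strict-majority rule for elections described above:
// a primary can only be elected by more than half of the set's
// normal membership.
function canElectPrimary(reachableMembers, totalMembers) {
  return reachableMembers > totalMembers / 2;
}
```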
How Clients Work with Replica Sets
All 10gen-supported MongoDB drivers implement special logic for connecting to replica sets, often as a distinct class in
the language. When a client connects to a replica set, the driver automatically discovers what nodes exist in the set
and which node is primary. At all times, the driver always routes write operations to the primary; by default, read
operations go to the primary, too.
Reading Documents From Secondaries
As mentioned, MongoDB’s drivers send all read operations to the replica set’s primary by default. To send a read
operation to a secondary, the application must authorize the driver to send reads to non-primary nodes.
All supported 10gen driver APIs have a tunable ReadPreference setting for controlling read operation routing.
In the mongo shell you can use the rs.slaveOk() function to permit secondary reads on a per-connection basis. In 2.2,
the shell also includes a readPref() method for cursor objects.
Ensuring Replication of Write Operations
Because MongoDB’s replication is asynchronous and non-transactional, it can happen that a primary node performs a
write operation and subsequently fails before that operation gets replicated to any secondary. In this case, even if the
client asked the primary to acknowledge the write’s success with getLastError or a WriteConcern object, the result of
the write operation won’t be reflected in the state of the replica set after the primary fails.
Consequently, the getLastError command has an option (which the WriteConcern object encapsulates), called the w
parameter, that tells the primary not to confirm a write operation’s success until the write has been replicated.
Here’s how it works: if the w parameter is a number, the primary won’t confirm success until that number of nodes in
the set, including the primary, have performed the write operation. If w is a string, then it names a so-called
getLastErrorMode.
Finally, when an application calls getLastError with a w parameter, the primary will block until an appropriate number
of secondaries have replicated. Because it’s often undesirable to block indefinitely for replication, getLastError also
supports a wtimeout option, to tell the primary to return a timeout error after some number of milliseconds, rather
than blocking.
// Ensure that a write has been propagated to 2 secondaries
// before returning.
db.runCommand( { "getLastError": 1, "w" : 2 } ) ;
// Like the previous, but time out after 2 seconds.
db.runCommand( { "getLastError" : 1, "w" : 2, "wtimeout" : 2000}) ;
How Replication Works Internally
Each node contains a database called local. The local database contains a collection called oplog.rs.
All writes to the primary are written to the oplog, in an idempotent form. Secondary nodes also hold a local
database, where they keep track of how far into the primary’s oplog they have caught up.
Getting stats on the oplog:
db.printReplicationInfo()
configured oplog size: 192MB
log length start to end: 7878secs (2.19hrs)
oplog first event time: Mon Sep 13 2010 15:15:53 GMT-0400 (EDT)
oplog last event time: Mon Sep 13 2010 17:27:11 GMT-0400 (EDT)
now: Mon Sep 13 2010 17:27:17 GMT-0400 (EDT)
The oplog is a special kind of collection, called a capped collection. MongoDB allocates a capped
collection’s space once, and it never grows; instead, when the collection runs out of room for more
documents, new documents replace the oldest documents. You may think of a capped collection as a circular file.
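The ring-buffer behavior can be sketched in a few lines (a simplification: real capped collections are bounded by bytes, not by document count):

```javascript
// Sketch of capped-collection behavior as a fixed-size ring: once the
// collection is full, each insert evicts the oldest document.
class CappedCollection {
  constructor(maxDocs) {
    this.maxDocs = maxDocs;
    this.docs = [];
  }
  insert(doc) {
    this.docs.push(doc);
    if (this.docs.length > this.maxDocs) this.docs.shift(); // drop oldest
  }
}

const oplog = new CappedCollection(3);
[1, 2, 3, 4].forEach(op => oplog.insert({ op }));
```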
On most platforms, the default oplog size is 5% of free disk space at the time the oplog is allocated (remember,
oplogs never change in size). However, this default size is arbitrary, and unlikely to be what you need. You can set
the oplog size when initially starting mongod like so:
mongod --oplogSize 200 // in MB
Sharding
MongoDB supports an automated partitioning architecture called sharding, enabling horizontal scaling across multiple
nodes. Sharding operates by breaking up selected collections into smallish chunks of documents based on ranges of a
user-specified field, called the shard key, and then distributing those chunks across a number of cooperating replica
sets, called shards.
Sharding is made to support applications that outgrow the capacity of a single replica set. A replica set can be
converted to a sharded cluster fairly straightforwardly, and relatively few changes are necessary to convert an
application to work with a sharded cluster.
Operationally, a sharded cluster consists of many processes, grouped into 3 categories:
• Shards. Each shard should be a replica set. Shards store non-overlapping subsets of your applications’ databases.
mongod processes are started with all their normal options plus the --shardsvr option to operate in sharded mode.
You may have any number of shards. Although it’s possible to have a cluster with just one shard, such a cluster has
no performance or scaling benefits over a single replica set.
• Config servers. The config servers store routing information and some bookkeeping metadata. You need three of
these. Config servers are mongod processes started with the --configsvr option, along with other standard mongod
options.
• Routing nodes. Applications communicate with the sharded cluster via one or more mongos processes, never by
contacting the config servers or shard members directly. The mongos is a non-data-storing process that usually
lives on each application server, but can be deployed anywhere that has good connectivity to the shards and the
config servers. mongos processes must be started with a --configdb argument that specifies the addresses of all
3 config servers.
How Sharding Works
After all the processes in a cluster are set up and configured, it’s up to you, the application’s designers and
operators, to decide which collections would benefit from automatic data balancing. Typically, only the
largest or most volatile collections gain much by being sharded; and a cluster may contain both sharded and
non-sharded collections. You run a command to shard a collection on a shard key; any field or compound
field can be given as the shard key, but the shard key on a collection cannot be changed, so it’s important to
pick a good one.
Once a collection is sharded, the cluster automatically breaks up the collection into ranges of shard key
values, called chunks. Each document in the collection falls into exactly one chunk, based on the value of the
shard key in that document. Chunks are automatically split into smaller, non-overlapping chunks in order
to keep the data volume within each chunk about the same size. Finally, whenever any shard has too many
chunks, the cluster will have that shard migrate some of its chunks in order to even out the distribution of
chunks over the group of shards.
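Range-based routing can be sketched with hypothetical chunks (in a real cluster the chunk metadata lives on the config servers):

```javascript
// Sketch of range-based chunk routing: each chunk owns a half-open
// range [min, max) of shard key values, so every document falls into
// exactly one chunk.
const chunks = [
  { min: -Infinity, max: 100,      shard: "shardA" },
  { min: 100,       max: 1000,     shard: "shardB" },
  { min: 1000,      max: Infinity, shard: "shardC" }
];

function routeToShard(shardKeyValue) {
  return chunks.find(c => shardKeyValue >= c.min && shardKeyValue < c.max).shard;
}
```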
All of the splitting and migrating activity is invisible to ordinary applications, because the mongos hides it
all. The mongos routes read and write requests to whichever shard or shards is the current holder of the
chunk that needs to be accessed. But at any particular moment, a document might exist in multiple copies
over a couple of shards, because a migration may be in progress. (Consequently, if you ever bypass the
mongos and connect directly to a shard, you may find confusing data. So don’t do that.)
The config servers are the canonical repository of which collections are sharded, what chunks exist in those
collections, and on which shards those chunks reside. The mongos processes keep a cache of the config servers’
state, and both the shards and the mongos processes occasionally update the config servers. However, the
protocol for updating config servers is carefully designed to prevent the system from losing track of any
chunks; in particular, whenever any config server is unavailable, no chunks may be split or migrated, no
collections may be newly sharded, and no shards may be added or removed.
Primary Shard
Every database has a primary shard that holds all the un-sharded collections for a database. The primary shard
has no relation to the primary in a replica set.
Shard Status
Use the sh.status() method in the mongo shell to see an overview of the cluster. This
report includes which shard is primary for the database and the chunk distribution
across the shards. See the sh.status() method for more details.
Introduction to MongoDB and its best practicesIntroduction to MongoDB and its best practices
Introduction to MongoDB and its best practices
 
What are the major components of MongoDB and the major tools used in it.docx
What are the major components of MongoDB and the major tools used in it.docxWhat are the major components of MongoDB and the major tools used in it.docx
What are the major components of MongoDB and the major tools used in it.docx
 
Top MongoDB interview Questions and Answers
Top MongoDB interview Questions and AnswersTop MongoDB interview Questions and Answers
Top MongoDB interview Questions and Answers
 
Mongo db transcript
Mongo db transcriptMongo db transcript
Mongo db transcript
 
Mongodb
MongodbMongodb
Mongodb
 
What is the significance of MongoDB and what are its usages.docx
What is the significance of MongoDB and what are its usages.docxWhat is the significance of MongoDB and what are its usages.docx
What is the significance of MongoDB and what are its usages.docx
 
MongoDB - An Introduction
MongoDB - An IntroductionMongoDB - An Introduction
MongoDB - An Introduction
 
SQL vs NoSQL, an experiment with MongoDB
SQL vs NoSQL, an experiment with MongoDBSQL vs NoSQL, an experiment with MongoDB
SQL vs NoSQL, an experiment with MongoDB
 
Mongo db pefrormance optimization strategies
Mongo db pefrormance optimization strategiesMongo db pefrormance optimization strategies
Mongo db pefrormance optimization strategies
 
mongodb tutorial
mongodb tutorialmongodb tutorial
mongodb tutorial
 
MongoDB 3.2 - a giant leap. What’s new?
MongoDB 3.2 - a giant leap. What’s new?MongoDB 3.2 - a giant leap. What’s new?
MongoDB 3.2 - a giant leap. What’s new?
 
MongoDB
MongoDBMongoDB
MongoDB
 
Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC
Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYCHands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC
Hands on Big Data Analysis with MongoDB - Cloud Expo Bootcamp NYC
 
UNIT-1 MongoDB.pptx
UNIT-1 MongoDB.pptxUNIT-1 MongoDB.pptx
UNIT-1 MongoDB.pptx
 
how_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxhow_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptx
 
how_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfhow_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdf
 
NOSQL and MongoDB Database
NOSQL and MongoDB DatabaseNOSQL and MongoDB Database
NOSQL and MongoDB Database
 
Node Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js TutorialNode Js, AngularJs and Express Js Tutorial
Node Js, AngularJs and Express Js Tutorial
 
MongoDB Knowledge Shareing
MongoDB Knowledge ShareingMongoDB Knowledge Shareing
MongoDB Knowledge Shareing
 

Kürzlich hochgeladen

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Mongodb

  • 2. What is MongoDB?
MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling.
Why should I use MongoDB?
• Queries: MongoDB supports field queries, range queries, and regular expressions
• Indexing: any field in a MongoDB collection can be indexed
• Replication: MongoDB provides high availability with replica sets
• Load balancing: MongoDB scales horizontally using sharding; the shard key determines how the data in a collection is distributed
• File storage: MongoDB includes a grid file system (GridFS)
• Aggregation: the MongoDB aggregation framework can be used for map-reduce or batch processing
• Fixed-size collections: MongoDB supports fixed-size collections, called capped collections
When should I use MongoDB?
• Account and user profiles: can store arrays of addresses
• CMS: the flexible schema of MongoDB is great for heterogeneous collections of content types
• Form data: MongoDB makes it easy to evolve the structure of form data over time
• Logs / user-generated content: can keep data with complex relationships together in one object
• Messaging: vary message metadata easily per message or message type without needing to maintain separate collections or schemas
• System configuration: just a nice object graph of configuration values, which is very natural in MongoDB
• Log data of any kind: structured log data is the future
• Graphs: just objects and pointers – a perfect fit
• Location-based data: MongoDB understands geospatial coordinates and natively supports geospatial indexing
  • 3. Document Database
A record in MongoDB is a document, which is a data structure composed of field and value pairs. MongoDB documents are similar to JSON objects. The values of fields may include other documents, arrays, and arrays of documents.
The advantages of using documents are:
• Documents (i.e. objects) correspond to native data types in many programming languages.
• Embedded documents and arrays reduce the need for expensive joins.
• Dynamic schema supports fluent polymorphism.
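The JSON correspondence above can be illustrated with a plain JavaScript object; the field names and values here are made up for illustration:

```javascript
// A MongoDB document maps directly onto a native object literal.
// Field values may themselves be documents, arrays, or arrays of documents.
const user = {
  _id: 1,
  name: "Ada",
  addresses: [                        // array of embedded documents
    { city: "Recife", country: "BR" },
    { city: "London", country: "UK" }
  ],
  profile: { followers_count: 1000 }  // embedded document
};

// No join is needed to read related data: it travels with the document.
console.log(user.addresses[0].city);        // "Recife"
console.log(user.profile.followers_count);  // 1000
```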
  • 4. MongoDB Installation
• Download a build from https://www.mongodb.com/download-center
• Decompress and run
Start Mongod
• Create the default data directory: /data/db (or C:\data\db on Windows)
• Start mongod (or mongod.exe)
• To verify that you can connect to the server, start the shell: mongo (or mongo.exe)
• Then just type exit and press Enter.
Shutting Down Mongod
1. When mongod is running attached to a controlling terminal, enter Control-C.
2. Execute the following command from the operating system prompt:
mongo --eval 'db.adminCommand( { "shutdown" : 1 } )'
3. On Linux/Unix systems, send a TERM or INT signal, e.g., kill -TERM <pid-of-mongod>
  • 5. Data File Allocation
Each database has at least two data files: one ending in .ns, and the rest named with integers starting at 0.
-rw------- 1 tveiga group 67108864 Aug 29 12:57 pessoa.0
-rw------- 1 tveiga group 134217728 Aug 29 12:57 pessoa.1
-rw------- 1 tveiga group 16777216 Aug 29 12:57 pessoa.ns
The .ns file stores metadata about namespaces (collections and indexes). The number of namespaces is proportional to the size of the .ns file. Each database can have up to 24,000 namespaces by default, although the size of these files, and thus the number of namespaces, can be increased with the --nssize option (up to 2 GB).
By default, data files start at 64 MB and double in size with each additional data file, up to 2 GB. Additionally, on some platforms, mongod allocates one more numbered data file than it needs, to improve throughput. Thus, it's possible for the allocated size to be much larger than the data size. If this presents a problem, you can use some combination of server options:
--smallfiles // quarters the sizes of data files
--noprealloc // inhibits preallocation of extra files
The Lock File
To protect against the possibility that multiple mongod processes might try to use a set of database files in conflicting ways, there is a lock file called mongod.lock.
-rw------- 1 tveiga group 5 Aug 29 12:57 mongod.lock
The Journal Subdirectory
The mongod process can employ a write-ahead journal to speed up data file recovery in the event of a server crash. The journal's files are stored in a subdirectory of the dbpath called journal.
  • 6. Log Files
MongoDB servers log informational messages as part of normal operation. By default, a server process's log is written to standard output. You can have the server write the log to a file with the options:
--logpath /var/mongodb/mongodb.log --logappend
To rotate the log file at runtime:
db.adminCommand( { "logRotate" : 1 } ) ;
Config Files
All of these options can be specified in a config file. Any option that takes an argument is specified as option = argument. Options that don't take arguments are specified as option = true. An example config file looks something like this:
fork = true
# vvv = true
logpath = /var/mongodb/mongodb.log
You can then invoke mongod with the config file like so:
mongod --config mongod.conf
  • 7. MongoDB's Concurrency Model
∙ Read operations block write operations.
∙ A write operation blocks everything.
∙ A pending write operation prevents new read operations.
∙ All operations yield occasionally, but only between documents.
Indexing
An index is a data structure used by Mongo's query optimizer to quickly sort through and order the documents in a collection. Formally speaking, these indexes are implemented as B-tree-style indexes.
Try this query with the twitter data set:
use twitter
db.tweets.find( { "user.followers_count" : 1000 } ) ;
db.tweets.find( { "user.followers_count" : 1000 } ).explain() ;
Look at the output from explain.
explain()
A great way to get more information on the performance of your database queries is to use the explain method on the cursor. The result is a document that contains the explain output. Note that explain runs the actual query to determine the result. Some of the important fields in the explain output:
cursor: either a BasicCursor, which indicates a table scan, or a BtreeCursor, which means an index was used.
nscanned: number of items (documents or index entries) examined.
n: number of documents matched (on all criteria specified).
The ratio n / nscanned is a rough measure of how effective the index is for that query. For an effective index, this ratio should be close to 1.
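The n / nscanned rule of thumb can be sketched as a small helper; the field names mirror the legacy explain() format described above, and the sample numbers are made up for illustration:

```javascript
// Sketch: judging index effectiveness from legacy explain() output.
function indexEffectiveness(explainDoc) {
  // n / nscanned close to 1 means the query examined few items
  // beyond the documents it actually matched.
  return explainDoc.nscanned === 0 ? 1 : explainDoc.n / explainDoc.nscanned;
}

// A table scan examines every document to match a handful:
const tableScan = { cursor: "BasicCursor", nscanned: 10000, n: 12 };
// An indexed query examines only the matching index entries:
const indexed = { cursor: "BtreeCursor user.followers_count_1", nscanned: 12, n: 12 };

console.log(indexEffectiveness(tableScan)); // 0.0012 -- poor
console.log(indexEffectiveness(indexed));   // 1 -- effective
```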
  • 8. Create Index
MongoDB by default creates a unique index on the _id field for all collections.
db.tweets.ensureIndex( { "user.followers_count" : 1 } ) ;
db.tweets.ensureIndex( { "user.screen_name" : 1, "created_at" : -1 } ) ;
Indexes also work on array fields and on fields of embedded documents:
{ "name" : "Raleigh", "tags" : [ "north" , "carolina" , "unc" ] }
{ "line_items" : [ { "sku" : "555b", "name" : "Coltrane: Impressions" }, { "sku" : "123a", "name" : "Davis: Kind of Blue" } ] }
db.cities.ensureIndex( { "tags" : 1 } ) ;
db.cities.find( { "tags" : "south" } ) ;
db.orders.ensureIndex( { "line_items.sku" : 1 } ) ;
db.orders.find( { "line_items.sku" : "123a" } ) ;
  • 9. Schema Design
In MongoDB, the basic rubric for schema design is: store your data the way your application wants to see it. Some things to keep in mind:
1. Whether to embed data in subdocuments or to refer to separate documents by key fields. Usually, one embeds data that is seldom changed (either truly immutable or only rarely mutated), and data that is not interesting enough to be represented as a document on its own (e.g., tags or labels tend to be represented as strings rather than normalized into their own documents).
2. Whether to store embedded data positionally (with arrays) or by named fields (with nested documents). This is often a matter of taste, but sometimes relates to what can be queried/indexed efficiently (i.e., whether you need to be able to use a multi-key index).
3. Whether to put possibly related data together into fewer, larger documents or to split it into more numerous but smaller documents (possibly across separate collections). In general, it's best to design your documents to fit what the application needs; data you store but never look at just costs you working space.
4. When you have immutable (or seldom mutated) fields, whether to denormalize values over documents. If business requirements permit some data to be immutable, then you can freely duplicate it across documents to reduce round-trips to your servers. (For instance, in a product review system, there might be a Users collection with canonical username information. If username is permitted to be immutable, then you can embed it in review documents without concern about update inconsistencies.)
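Points 1 and 4 above can be sketched with the product-review example; the document shapes and values here are hypothetical:

```javascript
// Embedded (denormalized): the username is copied into each review.
// This is safe if business rules make usernames immutable, and it
// means reading a review needs no second lookup.
const reviewEmbedded = {
  product: "Kind of Blue",
  rating: 5,
  user: { username: "tveiga", country: "BR" } // copied from Users
};

// Referenced (normalized): the review stores only a key, and the
// application performs a second query against the Users collection.
const reviewReferenced = {
  product: "Kind of Blue",
  rating: 5,
  user_id: 42 // foreign-key-style reference
};

// The embedded form answers "who wrote this?" with zero extra round-trips:
console.log(reviewEmbedded.user.username); // "tveiga"
```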
  • 10. Storage Engines
The storage engine is the component of the database responsible for managing how data is stored, both in memory and on disk. MongoDB supports multiple storage engines, as different engines perform better for specific workloads. Choosing the appropriate storage engine for your use case can significantly impact the performance of your applications.
WiredTiger is the default storage engine starting in MongoDB 3.2. It is well suited for most workloads and is recommended for new deployments. WiredTiger provides a document-level concurrency model, checkpointing, and compression, among other features. In MongoDB Enterprise, WiredTiger also supports Encryption at Rest.
MMAPv1 is the original MongoDB storage engine and the default for MongoDB versions before 3.2. It performs well on workloads with high volumes of reads and writes, as well as in-place updates.
The In-Memory Storage Engine is available in MongoDB Enterprise. Rather than storing documents on disk, it retains them in memory for more predictable data latencies.
Journaling
To provide durability in the event of a failure, MongoDB uses write-ahead logging to on-disk journal files.
Journal Files
For the journal files, MongoDB creates a subdirectory named journal under the dbPath directory. WiredTiger journal files have names with the format WiredTigerLog.<sequence>, where <sequence> is a zero-padded number starting from 000000001.
Journal files contain one record per write operation; each record has a unique identifier. MongoDB configures WiredTiger to use snappy compression for the journaling data. The minimum log record size for WiredTiger is 128 bytes; if a log record is 128 bytes or smaller, WiredTiger does not compress it.
WiredTiger journal files have a maximum size limit of approximately 100 MB; once a file exceeds that limit, WiredTiger creates a new journal file. WiredTiger automatically removes old journal files, maintaining only the files needed to recover from the last checkpoint, and pre-allocates journal files.
  • 11. Durability, Availability, and Replica Sets
Like any other data storage system, unless you're making sure to put copies of your data into places that fail separately from one another, your data isn't really durable or available in the presence of failures (power outages, network partitions, hardware failures, etc.). For this reason, MongoDB has a built-in replication model based on coordination among a number of mongod processes, called a replica set.
Replica Set Basics
A replica set is a group of mongod processes that allow you to have your data duplicated over several hosts, ideally distributed among several data centers. Replica set members all know about each other, and each member communicates with every other member occasionally, so it's important to ensure network connectivity between all the hosts where your replica set members will run.
In a replica set, at any moment there is at most one writable set member, called the primary node, or just the primary. By default, all other members of a replica set request descriptions of the data changes that happen on the primary, and apply those changes to their own copies of the primary's data and indexes; these members that store data but aren't writable at a particular point in time are called secondary nodes, or just secondaries. Secondaries constantly request new data changes, but it's important to know that replication in MongoDB is asynchronous and in no way a distributed transaction.
  • 12. Automatic Failover and Primary Elections
Whenever a replica set's primary becomes unavailable (e.g., goes offline), the remainder of the set may try to elect a new primary node. In order for a subset of a replica set to perform an election, the subset must consist of a strict majority of the set's normal composition. For example, if the replica set normally has 4 members, the set will be able to elect a primary whenever 3 or 4 members are online and able to communicate with each other; if only 2 members can communicate, neither of those members can become primary.
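The strict-majority rule above can be sketched as a one-line predicate (a simplified illustration that ignores member votes and priorities):

```javascript
// A partition of a replica set can elect a primary only if it holds
// strictly more than half of the set's normal membership.
function canElectPrimary(onlineMembers, totalMembers) {
  return onlineMembers > totalMembers / 2;
}

// The 4-member example from the slide:
console.log(canElectPrimary(3, 4)); // true  -- 3 of 4 is a strict majority
console.log(canElectPrimary(2, 4)); // false -- 2 of 4 is only half
// This is why odd-sized sets tolerate failures more gracefully:
console.log(canElectPrimary(2, 3)); // true  -- 2 of 3 is a strict majority
```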
  • 13. How Clients Work with Replica Sets
All 10gen-supported MongoDB drivers implement special logic for connecting to replica sets, often as a distinct class in the language. When a client connects to a replica set, the driver automatically discovers which nodes exist in the set and which node is primary. The driver always routes write operations to the primary; by default, read operations go to the primary, too.
  • 14. Reading Documents from Secondaries
As mentioned, MongoDB's drivers send all read operations to the replica set's primary by default. To send a read operation to a secondary, the application must authorize the driver to send reads to non-primary nodes. All supported 10gen driver APIs have a tunable ReadPreference setting for controlling read operation routing. In the mongo shell you can use the rs.slaveOk() function to permit secondary reads on a per-connection basis. In 2.2, the shell also includes a readPref() method for cursor objects.
Ensuring Replication of Write Operations
Because MongoDB's replication is asynchronous and non-transactional, it can happen that a primary node performs a write operation and then fails before that operation is replicated to any secondary. In this case, even if the client asked the primary to acknowledge the write's success with getLastError or a WriteConcern object, the result of the write operation won't be reflected in the state of the replica set after the primary fails.
Consequently, the getLastError command has an option (which the WriteConcern object encapsulates), called the w parameter, that tells the primary not to confirm a write operation's success until the write has been replicated. Here's how it works: if the w parameter is a number, the primary won't confirm success until that number of nodes in the set, including the primary, have performed the write operation. If w is a string, then it names a so-called getLastErrorMode. When an application calls getLastError with a w parameter, the primary blocks until the appropriate number of secondaries have replicated the write. Because it's often undesirable to block indefinitely for replication, getLastError also supports a wtimeout option, which tells the primary to return a timeout error after some number of milliseconds rather than blocking.
// Ensure that a write has been acknowledged by 2 members
// (the primary plus one secondary) before returning.
db.runCommand( { "getLastError" : 1, "w" : 2 } ) ;
// Like the previous, but time out after 2 seconds.
db.runCommand( { "getLastError" : 1, "w" : 2, "wtimeout" : 2000 } ) ;
  • 15. How Replication Works Internally
Each node contains a database called local. The local database contains a collection called oplog.rs. All writes to the primary are written to the oplog in an idempotent form. Secondary nodes also hold a local database, where they keep track of how far into the primary's oplog they have caught up.
Getting stats on the oplog:
db.printReplicationInfo()
configured oplog size: 192MB
log length start to end: 7878secs (2.19hrs)
oplog first event time: Mon Sep 13 2010 15:15:53 GMT-0400 (EDT)
oplog last event time: Mon Sep 13 2010 17:27:11 GMT-0400 (EDT)
now: Mon Sep 13 2010 17:27:17 GMT-0400 (EDT)
The oplog is a special kind of collection, called a capped collection. MongoDB allocates the space for a capped collection once, and it never grows; instead, when the capped collection runs out of room for more documents, new documents replace the oldest documents. You may think of a capped collection as a circular file.
On most platforms, the default oplog size is 5% of free disk space at the time the oplog is allocated (remember, oplogs never change in size). However, this default size is arbitrary, and unlikely to be what you need. You can set the oplog size when initially starting mongod like so:
mongod --oplogSize 200 // in MB
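The "idempotent form" mentioned above means an oplog entry can be applied more than once without changing the outcome; for example, an increment is recorded as the resulting $set rather than as the $inc itself. A simplified, hypothetical applier that handles only $set illustrates the property:

```javascript
// Sketch: replaying an idempotent oplog entry is harmless.
// This toy applier handles only the $set operator; real oplog
// application is far more involved.
function applyOplogEntry(doc, entry) {
  return Object.assign({}, doc, entry.o.$set);
}

// Hypothetical oplog entry recording "n became 5" (not "n += 1"):
const entry = { op: "u", ns: "test.counters", o: { $set: { n: 5 } } };

let doc = { _id: 1, n: 4 };
doc = applyOplogEntry(doc, entry); // n is now 5
doc = applyOplogEntry(doc, entry); // replaying it changes nothing
console.log(doc.n); // 5
```

This is what lets a secondary safely re-apply oplog entries from its last known position after a restart.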
  • 16. Sharding
MongoDB supports an automated partitioning architecture called sharding, enabling horizontal scaling across multiple nodes. Sharding operates by breaking up selected collections into smallish chunks of documents based on ranges of a user-specified field, called the shard key, and then distributing those chunks across a number of cooperating replica sets, called shards. Sharding is made to support applications that outgrow the capacity of a single replica set. A replica set can be converted to a sharded cluster fairly straightforwardly, and relatively few changes are necessary to convert an application to work with a sharded cluster.
Operationally, a sharded cluster consists of many processes, grouped into three categories:
• Shards. Each shard should be a replica set. Shards store non-overlapping subsets of your applications' databases. mongod processes are started with all their normal options plus the --shardsvr option to operate in sharded mode. You may have any number of shards, although a cluster with just one shard has no performance or scaling benefits over a single replica set.
• Config servers. The config servers store routing information and some bookkeeping metadata. You need three of these. Config servers are mongod processes started with the --configsvr option, along with other standard mongod options.
• Routing nodes. Applications communicate with the sharded cluster via one or more mongos processes, never by contacting the config servers or shard members directly. The mongos is a non-data-storing process that usually lives on each application server, but can be deployed anywhere that has good connectivity to the shards and the config servers. mongos processes must be started with a --configdb argument that specifies the addresses of all three config servers.
  • 17. How Sharding Works
After all the processes in a cluster are set up and configured, it’s up to you, the application’s designers and operators, to decide which collections would benefit from automatic data balancing. Typically, only the largest or most volatile collections gain much by being sharded, and a cluster may contain both sharded and non-sharded collections. You run a command to shard a collection on a shard key; any field or compound field can be given as the shard key, but the shard key on a collection cannot be changed, so it’s important to pick a good one.

Once a collection is sharded, the cluster automatically breaks up the collection into ranges of shard key values, called chunks. Each document in the collection falls into exactly one chunk, based on the value of the shard key in that document. Chunks are automatically split into smaller, non-overlapping chunks in order to keep the data volume within each chunk about the same size. Finally, whenever any shard has too many chunks, the cluster will have that shard migrate some of its chunks in order to even out the distribution of chunks over the group of shards.

All of the splitting and migrating activity is invisible to ordinary applications, because the mongos hides it all. The mongos routes read and write requests to whichever shard or shards currently hold the chunk that needs to be accessed. But at any particular moment, a document might exist in multiple copies on a couple of shards, because a migration may be in progress. (Consequently, if you ever bypass the mongos and connect directly to a shard, you may find confusing data. So don’t do that.)

The config servers are the canonical repository of which collections are sharded, what chunks exist in those collections, and on which shards those chunks reside. The mongos processes keep a cache of the config servers’ state, and both the shards and the mongos processes occasionally update the config servers.
However, the protocol for updating config servers is carefully designed to prevent the system from losing track of any chunks; in particular, whenever any config server is unavailable, no chunks may be split or migrated, no collections may be newly sharded, and no shards may be added or removed.
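The chunk mechanics described above can be sketched as range-based routing: each chunk owns a half-open range of shard-key values, and a lookup walks the sorted chunk map much as a mongos consults its cached config data. A toy model with hypothetical shard names and chunk boundaries:

```python
import bisect

# Chunk map: (lower_bound, shard) entries sorted by lower bound.
# Each chunk covers [lower_bound, next_lower_bound) of shard-key values.
chunk_map = [
    (float("-inf"), "shard-A"),
    (100, "shard-B"),
    (500, "shard-A"),
    (1000, "shard-C"),
]

def route(shard_key_value):
    """Return the shard holding the chunk whose range contains the key."""
    bounds = [lower for lower, _ in chunk_map]
    # rightmost chunk whose lower bound is <= the key
    i = bisect.bisect_right(bounds, shard_key_value) - 1
    return chunk_map[i][1]

print(route(42))     # falls in (-inf, 100)  -> shard-A
print(route(750))    # falls in [500, 1000)  -> shard-A
print(route(99999))  # falls in [1000, inf)  -> shard-C
```

Splitting a chunk inserts a new boundary into this map, and migrating one rewrites a chunk's shard entry; either way the routing function is unchanged, which is why applications never notice the rebalancing.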
  • 19. Primary Shard
Every database has a primary shard that holds all the un-sharded collections for that database. The primary shard has no relation to the primary in a replica set.

Shard Status
Use the sh.status() method in the mongo shell to see an overview of the cluster. This report includes which shard is primary for the database and the chunk distribution across the shards. See the sh.status() method for more details.
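The primary-shard rule can be sketched as a small bookkeeping model: every database is pinned to one shard, and any collection that has not been sharded lives entirely on that shard. All names below are illustrative:

```python
class ClusterCatalog:
    """Toy model: un-sharded collections live on the database's primary shard."""

    def __init__(self):
        self.primary = {}     # database name -> its primary shard
        self.sharded = set()  # (db, collection) pairs that have been sharded

    def create_database(self, db, primary_shard):
        self.primary[db] = primary_shard

    def shard_collection(self, db, coll):
        self.sharded.add((db, coll))

    def location_of(self, db, coll):
        if (db, coll) in self.sharded:
            return "distributed across the cluster's shards"
        # an un-sharded collection is held entirely by the primary shard
        return self.primary[db]

catalog = ClusterCatalog()
catalog.create_database("app", primary_shard="rs0")
catalog.shard_collection("app", "events")

print(catalog.location_of("app", "users"))   # un-sharded -> "rs0"
print(catalog.location_of("app", "events"))  # sharded -> spread over shards
```

This is the distinction sh.status() reports: one primary shard per database, plus the chunk distribution for each sharded collection.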