One of the reasons companies are turning to NoSQL databases is performance. This presentation highlights MongoDB's performance advantages over other data stores and covers the key hardware requirements for performance, sharding, choosing a database type, the main factors in an optimal schema design, and the types and importance of indexes.
5. MWLUG 2017
Moving Collaboration Forward
Inserting Data: MongoDB vs. MySQL
• Inserting 1,615 chemical compound records into two parent-child tables
• Turned off foreign keys during the insert and used a string builder to
create a bulk-insert SQL statement in MySQL
6. MWLUG 2017
MongoDB vs. S3 Performance
• Downloading a 220 KB object from MongoDB was
7x faster than from S3 when cold, and 3x faster when warm
8. MWLUG 2017
Hardware Requirements
• Can use commodity hardware all the way up to
IBM Power and zSeries
– Use multi-core systems when possible
• Ensure indexes and the most frequently accessed
data (the working set) fit in RAM
• RAM is the most important factor for
hardware
• db.serverStatus()
– Use to obtain info on current working set
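As a rough sketch of how the `db.serverStatus()` output might be used, the Python below computes how full the WiredTiger cache is from a dictionary shaped like the `wiredTiger.cache` section of that output. The helper name `cache_fill_ratio` and the sample numbers are invented for illustration; with a driver, the dictionary would come from running the `serverStatus` command. A cache that sits near its limit is only a hint that the working set may not fit in RAM, not proof.

```python
def cache_fill_ratio(server_status: dict) -> float:
    """Return the fraction of the configured WiredTiger cache currently in use."""
    cache = server_status["wiredTiger"]["cache"]
    used = cache["bytes currently in the cache"]
    limit = cache["maximum bytes configured"]
    return used / limit


# Hypothetical excerpt of serverStatus output (values invented for the example).
sample_status = {
    "wiredTiger": {
        "cache": {
            "bytes currently in the cache": 6 * 1024**3,
            "maximum bytes configured": 8 * 1024**3,
        }
    }
}

ratio = cache_fill_ratio(sample_status)
print(f"cache is {ratio:.0%} full")
```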
9. MWLUG 2017
Hardware Requirements
• Data placement is key!
– Use SSDs for:
• Write-heavy data
• Placement of journals
• Compression
– Can reduce footprint by up to 80%
– Equals fewer bits read from disk
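One rough way to get a feel for footprint reduction is to compress some document-like data and compare sizes. The sketch below uses Python's built-in `zlib` module on made-up JSON documents. It is not WiredTiger's block compressor, and the two compression levels are only a stand-in for the trade-off between CPU cost and compression ratio, so the percentages it prints say nothing about what MongoDB itself would achieve.

```python
import json
import zlib

# Hypothetical documents with repetitive field names, as is common in JSON-like data.
docs = [
    {"_id": i, "status": "active", "city": "Chicago", "score": i % 10}
    for i in range(1000)
]
raw = "\n".join(json.dumps(d) for d in docs).encode("utf-8")

fast = zlib.compress(raw, 1)   # lowest compression level: least CPU effort
small = zlib.compress(raw, 9)  # highest compression level: most CPU effort


def saved(compressed: bytes) -> float:
    """Fraction of the original size removed by compression."""
    return 1 - len(compressed) / len(raw)


print(f"raw={len(raw)} bytes, level 1 saves {saved(fast):.0%}, level 9 saves {saved(small):.0%}")
```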
10. MWLUG 2017
Compression
• WiredTiger has native compression
• Compression options for documents and indexes
– Snappy
• Default; a good balance between compression ratio (documents and
journal) and CPU cost
• Low CPU overhead
– zlib
• Higher compression
• Additional CPU overhead
– Prefix
• Used by indexes by default; reduces index size by about 50%
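The idea behind prefix compression can be sketched in a few lines: in a sorted list of keys, each key stores only how many leading characters it shares with the previous key, plus the remaining suffix. This is a simplified illustration of the general technique, not WiredTiger's on-disk format, and the keys and helper names are hypothetical.

```python
import os.path


def prefix_compress(sorted_keys):
    """Store each key as (shared_prefix_length, suffix) relative to the previous key."""
    entries = []
    prev = ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([prev, key]))
        entries.append((shared, key[shared:]))
        prev = key
    return entries


def prefix_decompress(entries):
    """Rebuild the original keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    prev = ""
    for shared, suffix in entries:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


keys = sorted(["smith:alan", "smith:alice", "smith:bob", "smithson:carl"])
entries = prefix_compress(keys)

raw_chars = sum(len(k) for k in keys)
stored_chars = sum(len(suffix) for _, suffix in entries)
print(f"{raw_chars} characters of keys stored as {stored_chars} characters of suffixes")
```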
11. MWLUG 2017
Compression
• Snappy and SSDs
– Use for frequently accessed data
• zlib and rotational disks
– Use for older, less frequently accessed data
13. MWLUG 2017
Sharding
• Place a portion of data on certain servers
• Use with
– Very large data sets
– High throughput demands
– A need to place data by geographic location
15. MWLUG 2017
Sharding
• Distribute data across cluster based on query
patterns or data locality
• Types of sharding:
– Range
– Hash
– Zone
16. MWLUG 2017
Sharding
• Range sharding
– Divides data into ranges based on shard key values
– Efficient queries when reading documents in a contiguous range
– A poorly chosen shard key range can hurt both read and write
performance
• Hash sharding
– More even data distribution
– Can impact performance of range-based queries
• Zone sharding
– Used to improve locality of data
• By geographic region
• By hardware configuration for tiered storage architectures
• By application feature
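The trade-off between range and hash sharding described above can be illustrated with a small simulation. The shard names, range boundaries, and the use of MD5 as the hash are assumptions made for this sketch only; it does not reproduce how MongoDB actually routes documents.

```python
import bisect
import hashlib
from collections import Counter

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def range_shard(key: int, boundaries=(1000, 2000)) -> str:
    """Range sharding: keys < 1000 go to shard0, 1000-1999 to shard1, the rest to shard2."""
    return SHARDS[bisect.bisect_right(boundaries, key)]


def hash_shard(key: int) -> str:
    """Hash sharding: route on a hash of the key so neighbouring keys scatter."""
    digest = hashlib.md5(str(key).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


# Monotonically increasing keys, e.g. new order numbers arriving over time.
new_keys = range(5000, 5300)
range_load = Counter(range_shard(k) for k in new_keys)
hash_load = Counter(hash_shard(k) for k in new_keys)

# A contiguous range query touches one shard under range sharding,
# but may touch several under hash sharding.
query_keys = range(1000, 1100)
range_query_shards = {range_shard(k) for k in query_keys}
hash_query_shards = {hash_shard(k) for k in query_keys}

print("range writes:", dict(range_load))
print("hash writes: ", dict(hash_load))
```

In this sketch every new key lands on the last shard under range sharding, which is the kind of hot spot a poor shard key range selection produces, while the hashed keys spread across all shards at the cost of scattering the contiguous read.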
18. MWLUG 2017
4 Types of Databases (Storage Engines)
• WiredTiger
– Most commonly used database type, the default
• Encrypted
– For highly sensitive data
• In-memory
– For performance critical data
• MMAPv1
– Improved version of the database used in earlier
versions of MongoDB
19. MWLUG 2017
In-Memory Database
• Doesn’t maintain any on-disk data, including
configuration data, indexes, user credentials,
etc.
• Entire database needs to be able to fit into
memory
– Key to know the true “working set”
21. MWLUG 2017
Schema Design
• Schema design is critical
– Most performance problems stem from poor
schema design
• RDBMS schema design
– What answers do I have?
• MongoDB schema design
– What questions will I have?
22. MWLUG 2017
Schema Design
• Key items of focus
– How will the data be accessed?
– What is the projected read-to-write ratio?
– How large will documents become?
• Want to structure data to match how it is
queried and updated
24. MWLUG 2017
Embedding
• To embed or not to embed
– Favor embedding unless there is a compelling
reason not to
– If an object needs to be accessed frequently on
its own, it’s best not to embed
26. MWLUG 2017
Embedding
• Use when all of the data is manipulated
together
• Use when the relationship between collections is one-to-one
• Where it can be used, embedding normally reduces the
latency of get requests by 50%
27. MWLUG 2017
Referencing
• Link to other documents when:
– The relationship is one-to-many
– Parts of the data need to be accessed on their own
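To make the embedding and referencing choices concrete, the sketch below models both designs with plain Python dictionaries standing in for collections. The collection names, fields, and helper functions are hypothetical. The point is only that the referenced design needs two lookups to answer a question the embedded design answers in one.

```python
# Hypothetical in-memory stand-ins for two collections, keyed by _id.
customers = {1: {"_id": 1, "name": "Ada", "address_id": 10}}
addresses = {10: {"_id": 10, "city": "Chicago", "zip": "60601"}}

# Embedded design: the address lives inside the customer document.
customers_embedded = {
    1: {"_id": 1, "name": "Ada", "address": {"city": "Chicago", "zip": "60601"}}
}


def city_referenced(customer_id):
    """Referenced design: fetch the customer, then follow the link to the address."""
    customer = customers[customer_id]             # lookup 1
    address = addresses[customer["address_id"]]   # lookup 2
    return address["city"], 2


def city_embedded(customer_id):
    """Embedded design: everything needed arrives with the customer document."""
    customer = customers_embedded[customer_id]    # lookup 1
    return customer["address"]["city"], 1


print(city_referenced(1))
print(city_embedded(1))
```

The referenced design still has its place: the address can be read or updated on its own without touching the customer document.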
28. MWLUG 2017
Denormalizing
• Read/write ratio is key for deciding on
denormalizing
– Fields primarily read and rarely updated are good
candidates
– If a field is updated frequently, don’t denormalize it
29. MWLUG 2017
Denormalization
• Avoids application-level joins for the
denormalized fields
30. MWLUG 2017
Denormalization
• Consider the write/read ratio when
denormalizing
– A field that will mostly be read and only seldom
updated is a good candidate for denormalization
– As updates become more frequent relative to
queries, the savings from denormalization
decrease
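The read/write trade-off can be put into back-of-the-envelope arithmetic. The cost model below is a deliberately crude assumption, not a measurement: each read of a denormalized field saves one extra lookup, and each update has to rewrite the field in every document that holds a copy.

```python
def denormalization_savings(reads: int, updates: int, copies: int) -> int:
    """Net operations saved by denormalizing one field.

    Assumes each read avoids one extra lookup, and each update must
    rewrite the field in `copies` additional documents.
    """
    return reads - updates * copies


# Mostly read, seldom updated: a good candidate.
print(denormalization_savings(reads=1000, updates=10, copies=5))

# Updated often relative to reads: the savings turn negative.
print(denormalization_savings(reads=1000, updates=500, copies=5))
```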
31. MWLUG 2017
Back to Embedding
• Embed computed information when you write
it
– Prevents needing to retrieve and compute over
and over
– Works well if writes are infrequent
– Pushes work to the application at write time; the result
is dramatically improved read time
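A minimal sketch of embedding computed information at write time: the document keeps a running count, sum, and average alongside the raw values, so readers never have to recompute them. The product document and its field names are hypothetical, and a real application would apply the same change as an update to the stored document.

```python
def add_review(product: dict, rating: int) -> None:
    """Append a rating and refresh the stored aggregates in the same write."""
    product["reviews"].append(rating)
    product["review_count"] += 1
    product["rating_sum"] += rating
    product["avg_rating"] = product["rating_sum"] / product["review_count"]


# Hypothetical product document with precomputed fields.
product = {
    "_id": "sku-1",
    "reviews": [],
    "review_count": 0,
    "rating_sum": 0,
    "avg_rating": None,
}

for rating in (5, 4, 3):
    add_review(product, rating)

# Readers use the stored value directly instead of scanning the reviews.
print(product["avg_rating"])
```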
32. MWLUG 2017
Back to Embedding
• What to look for when choosing referencing vs
embedding data in a document
– Things that don’t change often and aren’t read
often are best stored in a separate document
– Parent document contains a reference to the less
frequently accessed/updated document
33. MWLUG 2017
Schema Design
• The MongoDB data schema design of choice
depends – entirely – on your particular
application’s data access patterns
• Structure your data to match the ways that
your application queries and updates it
35. MWLUG 2017
Indexes
• Half of all performance issues are due to missing
or incorrect secondary indexes
• Index early
• Index often
36. MWLUG 2017
Types of Secondary Indexes
• Unique
• Compound
• Array
• Time to Live (TTL)
• Geospatial
• Partial
• Sparse
• Text search
37. MWLUG 2017
Types of Secondary Indexes
• Unique
– Rejects any insert or update that would duplicate
an existing value in the field the index is built
over
• Compound
– Useful for queries that specify multiple predicates
• Example: Find customers based on last name, first
name, and city of residence
– Can reduce the need for single-field indexes, because any
leading prefix of a compound index's fields can be used
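The leading-field rule for compound indexes can be sketched as a simple prefix check. The helper `usable_prefix` and the field names are hypothetical, and the real query optimizer is more nuanced than this; the sketch only shows why an index on last name, first name, and city can also serve a query on last name alone but not one on city alone.

```python
def usable_prefix(index_fields, query_fields):
    """Return the leading index fields that the query's predicates can use."""
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break
        prefix.append(field)
    return prefix


# Field order of a hypothetical compound index.
index = ["last_name", "first_name", "city"]

print(usable_prefix(index, {"last_name"}))
print(usable_prefix(index, {"last_name", "first_name", "city"}))
print(usable_prefix(index, {"last_name", "city"}))  # gap at first_name
print(usable_prefix(index, {"city"}))               # no leading field
```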
38. MWLUG 2017
Types of Secondary Indexes
• Array
– For fields that contain an array, each array value is
stored as a separate index entry
• Time to Live (TTL)
– Specify a period of time after which the data is
automatically deleted from the database
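Both index types on this slide can be pictured with small helpers. The first shows an array field producing one index entry per element; the second shows the expiry rule a TTL index applies to a date field. The function names and sample values are hypothetical, and the sketch models the behaviour only, not how MongoDB stores or removes the data.

```python
from datetime import datetime, timedelta, timezone


def multikey_entries(doc_id, values):
    """Array index: one entry per array element, each pointing back to the document."""
    return [(value, doc_id) for value in values]


def is_expired(created_at, expire_after_seconds, now):
    """TTL rule: a document expires once its date field is older than the limit."""
    return now - created_at >= timedelta(seconds=expire_after_seconds)


# A document with _id 7 and tags ["red", "blue"] yields two index entries.
print(multikey_entries(7, ["red", "blue"]))

now = datetime(2017, 8, 1, 12, 0, tzinfo=timezone.utc)
print(is_expired(now - timedelta(hours=2), 3600, now))     # older than one hour
print(is_expired(now - timedelta(minutes=10), 3600, now))  # still within one hour
```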
39. MWLUG 2017
Types of Secondary Indexes
• Geospatial
– Allow MongoDB to optimize queries for
documents that contain points or a polygon that
are closest to a given point or line; that are within
a circle, rectangle, or polygon; or that intersect
with a circle, rectangle, or polygon
• Partial
– Index only the documents that meet a specified
filter condition
40. MWLUG 2017
Types of Secondary Indexes
• Sparse
– Contain entries only for documents that contain the
specified field
– Allow for smaller, more efficient indexes when
fields are not present in all documents
• Text search
– Specialized index for text search that uses
advanced, language-specific linguistic rules for
stemming, tokenization, case sensitivity and stop
words
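Partial and sparse indexes can both be pictured as filters that decide which documents get an index entry. The sketch below models each as a dictionary from indexed value to `_id`. The sample documents and helper names are hypothetical, and the model ignores details such as duplicate values.

```python
docs = [
    {"_id": 1, "email": "a@example.com", "status": "active"},
    {"_id": 2, "status": "archived"},
    {"_id": 3, "email": "c@example.com", "status": "archived"},
]


def sparse_index(docs, field):
    """Sparse: index only documents that contain the field."""
    return {d[field]: d["_id"] for d in docs if field in d}


def partial_index(docs, field, predicate):
    """Partial: index only documents that satisfy the filter condition."""
    return {d.get(field): d["_id"] for d in docs if predicate(d)}


by_email_sparse = sparse_index(docs, "email")
by_email_active = partial_index(docs, "email", lambda d: d["status"] == "active")

print(by_email_sparse)
print(by_email_active)
```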
41. MWLUG 2017
Indexing Tidbits
• Query optimizer
– Selects the best index to use by periodically running
alternative query plans
• Index intersection
– Allows MongoDB to use more than one index to
optimize ad-hoc queries at run-time
• Covered queries
– Return results containing only indexed fields
– Very efficient, results returned without reading from
source documents
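Whether a query is covered comes down to a set check: every field it filters on and every field it returns must be part of the index. The helper below is a hypothetical sketch of that rule. One practical consequence is that `_id`, which is returned by default, has to be excluded from the projection unless it is part of the index.

```python
def is_covered(index_fields, filter_fields, projected_fields):
    """A query is covered when every filtered and returned field is in the index."""
    indexed = set(index_fields)
    return set(filter_fields) <= indexed and set(projected_fields) <= indexed


index = ["last_name", "city"]  # hypothetical index fields

print(is_covered(index, ["last_name"], ["city"]))          # served from the index
print(is_covered(index, ["last_name"], ["city", "_id"]))   # _id forces a document read
print(is_covered(index, ["first_name"], ["city"]))         # filter field not indexed
```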
42. MWLUG 2017
Aggregation Pipeline
• Replaces find in certain scenarios
• Improves performance significantly
– Moves processing from the client side to the
server
– Saves CPU and bandwidth
• Reduces the amount of data transmitted to the
application layer
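To show why server-side aggregation cuts the data sent to the application, the sketch below pairs a pipeline written in MongoDB's aggregation syntax with a pure-Python function that computes the same result locally. The collection contents and field names are hypothetical, and `run_locally` only mimics what the server would do for this one pipeline.

```python
from collections import defaultdict

# Pipeline in MongoDB's aggregation syntax; field names are hypothetical.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$region", "total": {"$sum": "$amount"}}},
]

# Stand-in for the documents stored on the server.
orders = [
    {"region": "east", "status": "shipped", "amount": 10},
    {"region": "east", "status": "shipped", "amount": 5},
    {"region": "west", "status": "pending", "amount": 7},
    {"region": "west", "status": "shipped", "amount": 3},
]


def run_locally(orders):
    """Pure-Python equivalent of the pipeline above, for illustration only."""
    totals = defaultdict(int)
    for order in orders:
        if order["status"] == "shipped":              # $match
            totals[order["region"]] += order["amount"]  # $group with $sum
    return [{"_id": region, "total": total} for region, total in sorted(totals.items())]


result = run_locally(orders)
print(result)
print(f"{len(orders)} documents scanned, {len(result)} rows returned to the application")
```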
44. MWLUG 2017
Where to Find More Information
• MongoDB University
– university.mongodb.com
• YouTube tutorials
– youtube.com/mongodb
• MongoDB Performance Best Practices white paper
– mongodb.com/collateral/mongodb-performance-best-practices