Introduction to Microsoft Azure DocumentDB. The slides have sections on Overview, Resource Model, Data Modeling, Performance, Development, Pricing and DocumentDB resources.
This talk was given at the following locales:
- DevTeach Montreal (July 6, 2016)
1. NoSQL, NO PROBLEM:
USING AZURE
DOCUMENTDB
{
"name": "Ken Cenerelli",
"twitter": "@KenCenerelli",
"e-mail": "Ken_Cenerelli@Outlook.com",
"hashtags": ["#DevTeach", "#DocumentDB"]
}
2. ABOUT ME
Twitter: @KenCenerelli
Email: Ken_Cenerelli@Outlook.com
Blog: kencenerelli.wordpress.com
LinkedI
n:
linkedin.com/in/kencenerelli
Bio:
Content Developer / Programmer
Writer
Microsoft MVP - Visual Studio and
Development Technologies
Microsoft TechNet Wiki Guru
Co-Organizer of CTTDNUG
Technical reviewer of multiple
booksCTTDNU
G
Ken
Cenerelli
2
3. ROAD MAP
1. Overview
2. The Resource Model
3. Modeling Your Data
4. Performance
5. Developing with DocumentDB & Demos {“aka”: “The Good Stuff”}
6. Pricing
7. Wrap-up
3
4. WHAT IS NoSQL?
NoSQL → Not Only SQL
No up-front (schema) design
Easier to scale horizontally
Easier to develop iteratively
Types & Examples:
Document databases: DocumentDB, MongoDB, CouchDB
Key-value stores: Redis
Graph stores: Neo4J, Giraph
Wide-column: Cassandra, HBase
4
5. WHAT IS AZURE DOCUMENTDB?
NoSQL document database fully managed by Microsoft Azure
Part of the NoSQL family of databases
For rapid development of cloud-designed apps (web, mobile,
gaming, IoT)
Store and query schema agnostic JSON data with SQL-like grammar
Fast, predictable performance
Transactionally process multiple documents via native JavaScript
processing
Tunable consistency levels
Built with familiar tools – REST, JSON, JavaScript
5
8. WHEN TO USE DOCUMENTDB?
In General
You don’t want to do replication and scale-out by yourself
You want ACID transactions
You want to have tunable consistency
You want to do rapid development where models can evolve
You want to utilize your .NET, JavaScript and MongoDB skills
Compared to relational databases
You don’t want predefined columns
Compared to other document stores
You want to use a SQL-like grammar
8
9. WHEN TO NOT USE DOCUMENTDB?
If your data has complex relationships
If your data has rigid schemas
If your data has complex transactions
If your data needs aggregation
If your data needs encrypted storage
If you’re planning to move your entire data store to DocumentDB
If you do not want your data to be locked into Azure
9
10. DOCUMENTDB USE CASES
User generated content
Blog posts, chat sessions, ratings, comments, feedback, polls
Catalog data
User accounts, product catalogs, device registries for IoT
Logging and Time-series data
Event logs, input source for data analytics jobs performed offline
Gaming
In-game stats, social media integration, and high-score leaderboards
User preferences data
Modern web and mobile applications
IoT and Device sensor data
Ingest bursts of data from device sensors, ad-hoc querying and offline analytics
10
16. RESOURCE ADDRESSING
Native REST Interface
Each resource has a permanent unique ID
API URL:
https://{database account}.documents.azure.com
Document Path:
/dbs/{database id}/colls/{collection id}/docs/{document id}
16
17. DOCUMENTDB JSON DOCUMENTS
JSON
Intersection of most
modern type systems
JSON values
Self-describable,
self-contained values
Are trivially serialized
to/from text
17
{
"locations":
[
{"country": "Germany", "city": "Berlin"},
{"country": "France", "city": "Paris"},
],
"headquarters": "Belgium",
"exports":[{"city"; "Moscow"},{"city: "Athens"}]
};
a JSON document, as a tree
Locations
Headquarte
rs
Belgium
Country City Country City
Germany Berlin France Paris
Exports
CityCity
Moscow Athens
0 10 1
18. DATA MODELING WITH RDBMS
18
Doing it the RDBMS way: normalize everything!
To query for Person joins are needed to related
tables:
SELECT p.name, p.lastName, p.age, cd.detail,
cdt.type, a.street, a.city, a.state, a.zip
FROM Person p
INNER JOIN Address a
ON a.person_id = p.id
INNER JOIN ContactDetail cd
ON cd.person_id = p.id
INNER JOIN ContactDetailType cdt
ON cd.type_id = cdt.id
multiple
table updates
19. DATA MODELING WITH
DENORMALIZATION
19
{
"id": "1",
"firstName": "Thomas",
"lastName": "Andersen",
"addresses": [
{
"line1": "100 Some Street",
"line2": "Unit 1",
"city": "Seattle",
"state": "WA",
"zip": 98012 }
],
"contactDetails": [
{"email: "thomas@andersen.com"},
{"phone": "+1 555 555-5555", "extension": 5555}
]
}
Try to model your entity as a self-
contained document
Generally, use embedded data models
when:
contains
one-to-few
changes infrequently
won’t grow
integral
better read
performance
20. DATA MODELING WITH
REFERENCING
20
In general, use normalized
data models when:
Write performance is more
important than read
performance
Representing one-to-many
relationships
Can representing many-to-many
relationships
Related data changes frequently
Provides more flexibility than
embedding
More round trips to read data
{
"id": "xyz",
"username: "user xyz"
}
{
"id": "address_xyz",
"userid": "xyz",
"address" : {
…
}
}
{
"id: "contact_xyz",
"userid": "xyz",
"email" : "user@user.com"
"phone" : "555 5555"
}
Normalizing typically provides better write performance
21. HYBRID MODELS: DENORMALIZE +
REFERENCE
21
No magic bullet!
Think about how your data is
going to be written and read
then model accordingly
{
"id": "1",
"firstName": "Thomas",
"lastName": "Andersen",
"countOfBooks": 3,
"books": [1, 2, 3],
"images": [
{"thumbnail": "http://....png"}
{"profile": "http://....png"}
]
}
{
"id": 1,
"name": "DocumentDB 101",
"authors": [
{"id": 1, "name": "Thomas Andersen", "thumbnail": "http://....png"},
{"id": 2, "name": "William Wakefield", "thumbnail": "http://....png"}
]
}
Author document
Book document
22. DATA MODELLING TIPS
Map properties to JSON types
Prefer smaller documents (<16KB) for smaller footprint, less IO,
lower RU charges
Maximum size is 512KB – watch unbounded arrays leading to
document bloat
Store metadata on attachments, reference binary data/free text as
external links
Prefer sparse properties – skip rather than explicit null
Use fullName = "Azure DocumentDB" instead of firstName =
"Azure" AND lastName = "DocumentDB"
22
23. TUNABLE CONSISTENCY
Set at the account level
Can be overridden at the query level
Levels:
Strong
Session (default option)
Bounded Staleness
Eventual
23
Strong consistency; slow write
speeds
Weak consistency; fast write
speeds
24. INDEXING
Automatic indexing of documents and its properties when added
to the collection
Instantly queryable by property using a SQL-like grammar
No need to define secondary indices / schema hints for indexing
24
Indexing Modes
Consistent
Default mode
Index updated
synchronously on
writes
Lazy
Useful for bulk
ingestion scenarios
Indexing Policies
Automatic
Default
Manual
Can manually opt-
out of automatic
indexing
Indexing Types
Hash
For equality queries
Strings and
numbers
Range
For comparison
queries
Numbers
25. INDEXING POLICIES
25
Configuration Level Options
Automatic Per collection True (default) or False
Override with each document write
Indexing Mode Per collection Consistent or Lazy
Lazy for eventual updates/bulk ingestion
Included and excluded
paths
Per path Individual path or recursive includes (? And *)
Indexing Type Per path Support Hash (Default) and Range
Hash for equality, range for range queries
Indexing Precision Per path Supports 3 – 7 per path
Tradeoff storage, query RUs and write RUs
26. DOCUMENTDB FOR DEVELOPERS
Promotes code-first development
Resilient to iterative schema changes
Low impedance as object / JSON store; no ORM required
Richer query and indexing
Has a REST API
Available SDKs and libraries:
.NET (LINQ to SQL is supported)
Node.js
JavaScript
Python
Java
JavaScript for server-side app logic
26
27. QUERYING LIMITATION
Within a collection
Besides filtering, ORDER BY and TOP is supported
No aggregation yet
No COUNT
No GROUP BY
No SUM, AVG, etc.
SQL for queries only
No batch UPDATE or DELETE or CREATE
27
29. REQUEST UNITS
DocumentDB unit of scale
Throughput (in terms of rate of transactions / second)
Measured in Request Units (RUs)
1 RU = throughput for a 1KB document/second
2,000 requests per second allowed
“Request” depends on the size of the document
For example, uploading 1,000 large JSON documents might count as more than
one request
Max throughput per collection, measured in RUs per second per
collection, is 250,000 RUs/second
29
30. REQUEST UNITS
30
Request Unit (RU) is
the normalized
currency
%
Memory
% IOPS
% CPU
Replica gets a fixed
budget of Request Units
Resource
Resource
set
Resource
Resource
DocumentsSQL
sprocsargs
Resource Resource
Predictable Performance
Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents. Mongo, Azure DocumentDB. Work best with hierarchical documents that are entirely or almost entirely self contained.
Graph stores are used to store information about networks of data, such as social connections. Graph stores include Neo4J and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as 'integer', which adds functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
In DocumentDB, persist whole object in DB ; query the whole object and display it
Service: no need for containers or VMs to run it; provision the service; has fine grain control b/c of this
Indexing: builds the structure to start right away; can shape the data; no need to spend time on ERD or large schemas to persist data; indexes evolve as models do
DocumentDB database can grow to hundred of terabytes or event petabytes; thousands of nodes in data centers
Table Storage (Key Value)
Azure Blob Storage can be used to store full user profiles including images.
Azure Tables is cheap, scalable No SQL solution
Good for key value patterns to be scored at scale
Runs on spinning disk and has higher latency when getting to 95th percentile
But no secondary indexes either
Mongo protocol support now supports Mongo drivers
Can leverage existing skills and tools to interact with a DocumentDB service
This supports the native MongoDB wire protocol
Complex relationships – obvious (denormalized data and flex schema)
Rigid schema – for instance if your data was imported from XML docs
Complex transactions – supports some but can’t cross boundaries
Aggregation – since some properties could be present or not, it’s not a good fit for data aggregation
Encrypted data storage – protocol is encrypted, data is not. No current in-built mechanisms for encrypted data storage
Moving entire datastore to Azure DocumentDB – use as supplement not replacement
Azure specific – although adding protocol support for MongoDB
UGC in social media applications is a blend of free form text, properties, tags and relationships not bounded by rigid structure.
Content such as chats, comments, and posts can be stored in DocumentDB without requiring transformations or complex object to relational mapping layers.
Data properties can be added or modified easily to match requirements as developers iterate over the application code, thus promoting rapid development.
Catalog Data - Attributes for this data may vary and can change over time to fit application requirements. Consider an example of a product catalog for an automotive parts supplier. Every part may have its own attributes in addition to the common attributes that all parts share. Furthermore, attributes for a specific part can change the following year when a new model is released. As a JSON document store, DocumentDB supports flexible schemas and allows you to represent data with nested properties, and thus it is well suited for storing product catalog data.
Logging - Perform ad-hoc queries over a subset of data for troubleshooting. Subset of data is first retrieved from the logs, typically by time series. Then, a drill-down is performed by filtering the dataset with error levels or error messages. Long running data analytics jobs performed offline over a large volume of log data. Examples of this use case include server availability analysis, application error analysis, and clickstream data analysis. Typically, Hadoop is used to perform these types of analyses with data from DocumentDB using Hadoop connector
Gaming - Handle updating profile and stats from millions of simultaneous gamers, millisecond reads and writes to help avoid any lags, automatic indexing allows for filtering against multiple different properties in real-time
User preferences data - most modern web and mobile applications come with complex views and experiences. These views and experiences are usually dynamic, catering to user preferences or moods and branding needs. Hence, applications need to be able to retrieve personalized settings effectively in order to render UI elements and experiences quickly.
IoT - Bursts of data can be ingested by Azure Event Hubs as it offers high throughput data ingestion with low latency. Data ingested that needs to be processed for real time insight can be funneled to Azure Stream Analytics for real time analytics. Data can be loaded into DocumentDB for ad-hoc querying. Once the data is loaded into DocumentDB, the data is ready to be queried. The data in DocumentDB can be used as reference data as part of real time analytics. In addition, data can further be refined and processed by connecting DocumentDB data to HDInsight for Pig, Hive or Map/Reduce jobs. Refined data is then loaded back to DocumentDB for reporting.
Provides keys and resource location for databases and contents
Container of JSON documents and the associated JavaScript application logic (stored proc, trigger, UDF)
Collections are isolated and independent from one another. Queries run against a single collection and returns documents from that collection
Read consistency level of documents follows the consistency policy on the database account
DocumentDB cannot have poor performing code affect the server so it uses bounded execution to limit how long code can run
Requires multiple reads
Typically provides faster write speeds
Data from entities are queried together
Only making one round trip to server for all the data
One to many relationships (unbounded)
Blog article comments array could have millions of entries
Many-to-many relationships
For example, one speaker could have multiple sessions and each session could have multiple speakers
Related data changes with differing volatility
Document does not change but document with views or likes does
Use Amazon/Best Buy product example
Can have embedded product info that also contains current review summary so don’t have to retrieve each time
When you want the rating summary or review you have different query
Trade-off between performance and consistency
Strong: Strong consistency guarantees that a write is only visible after it is committed durably by the majority quorum of replicas. Strong consistency provides absolute guarantees on data consistency, but offers the lowest level of read and write performance
Session: ability to read your own writes. A read request for session consistency is issued against a replica that can serve the client requested version (part of the session cookie). Session consistency provides predictable read data consistency for a session while offering the lowest latency writes. Reads are also low latency as except in the rare cases, the read will be served by a single replica.
Bounded staleness: provides more predictable behavior for read consistency while offering the lowest latency writes. As reads are acknowledged by a majority quorum, read latency is not the lowest offered by the system. This provides a stronger guarantee than Session or Eventual.
Eventual: weakest form of consistency wherein a client may get the values which are older than the ones it had seen before, over time. In the absence of any further writes, the replicas within the group will eventually converge. The read request is served by any secondary index.
Eventual consistency provides the weakest read consistency but offers the lowest latency for both reads and writes.
DocumentDB supports a RESTful protocol
Any DocumentDB operation can be performed from any HTTP client as long as request URL points to valid DocumentDB resource and the headers contain the required authentication info
Throughput = amount of items passing through a system
Every request you do has a response that displays the RequestUnit charge whether action succeeds or fails
Can see charge in QueryExplorer
Writes are more expensive than reads
Standard – pay as you go, allows partitioning, storage measured in GB, 400-250K RUs, 250GB storage max (can be increased)
S1, S2, S3 – predefined. 10GB storage max, throughput max 2.5K Rus
Can have mix of S1, S2 and S3 collections in a DB; can archive data in an S1 and keep active data in an S3
Once you create an account there is a Minimum S1 charge ($25) even with no collections created
But when you create a collection you satisfy the charge and no longer have the $25 fee