17/07/2019 Big Data class by Alexandre Bergere 1
Big Data
ESAIP – IR4
17/07/2019 Big Data class by Alexandre Bergere 2
alexandre.bergere@gmail.com
https://fr.linkedin.com/in/alexandrebergere
@AlexPhile
ESAIP
2013 - 2016
Student
Avanade
2016 - 2019
Sr Analyst, Data Engineering
As a senior analyst at Avanade France, I developed my skills in data
analysis (MSBI, Power BI, R, Python) by working on innovative projects
and proofs of concept in the energy industry.
ESAIP
Teacher
2016 - ?
Freelance
2019 - x
Data Analyst & Data Architect
17/07/2019 Big Data class by Alexandre Bergere 3
Planning
D-1 D-2 D-3 D-4 D-5
Morning / Afternoon
What’s Big Data
+
No SQL
+
Cloud Architecture
Azure IOT + Azure Stream
Analytics + Power BI
Theoretical AWS Practice Azure Practice Exam
Oral Exam
Written Exam
SPARK
SPARK
Free time
Prep. Oral
Analyse Big Data with
Hadoop
SPARK
Redshift
Cosmos DB
Serverless architecture :
AWS Lambda +
DynamoDB + NodeJS
Cosmos DB
SPARK
On Prem
Neo4J
Mongo DB
Cloud
SPARK
17/07/2019 Big Data class by Alexandre Bergere 4
Planning
D-1 D-2 D-3
Morning / Afternoon
What’s Big Data
Azure IOT + Azure Stream
Analytics + Power BI
Theoretical Azure Practice
Cosmos DB
SPARK
On Prem
Neo4J
Mongo DB
Cloud
Cloud architecture
Written Exam
BI & Machine Learning
Analyse Big Data with
Hadoop
17/07/2019 Big Data class by Alexandre Bergere 6
Data Storage
17/07/2019 Big Data class by Alexandre Bergere 7
Data Storage
Relational data store HDFS Key Value data store Columnar data store
Object store Search data store Graph data store Document data store
17/07/2019 Big Data class by Alexandre Bergere 8
Mongo DB
17/07/2019 Big Data class by Alexandre Bergere 9
Mongo DB
Created in 2007 & first released
in 2009.
Easy and simple … as a leaf.
Document data store &
Schemaless.
Nexus Architecture
17/07/2019 Big Data class by Alexandre Bergere 10
Driver & Framework
17/07/2019 Big Data class by Alexandre Bergere 11
MongoDB is easy
17/07/2019 Big Data class by Alexandre Bergere 12
For many developers, data model goes hand in hand with object mapping, and for that purpose
you may have used an object-relational mapping library, such as Java’s Hibernate framework or
Ruby’s ActiveRecord.
Such libraries can be useful for efficiently building applications with a RDBMS, but they’re less
necessary with MongoDB. This is due in part to the fact that a document is already an object-
like representation. It’s also partly due to the MongoDB drivers, which already provide a fairly
high-level interface to MongoDB. Without question, you can build applications on MongoDB
using the driver interface alone.
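For example, a minimal sketch with the official Node.js driver (the connection string, database, and collection names here are assumptions for illustration):

const { MongoClient } = require('mongodb');

async function main() {
  // connect to a local mongod on the default port
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const users = client.db('app').collection('users');
  // the object is persisted as a document directly: no mapping layer required
  await users.insertOne({ name: 'Ada', skills: ['math', 'programming'] });
  const ada = await users.findOne({ name: 'Ada' });
  console.log(ada.skills); // [ 'math', 'programming' ]
  await client.close();
}

main();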
Use cases
17/07/2019 Big Data class by Alexandre Bergere 13
o Web application (MongoDB is well-suited as a primary datastore for web applications)
o Agile development
o Analytics and logging
o Caching
o Variable Schemas
Mongo DB 4.0 : ACID transactions
17/07/2019 Big Data class by Alexandre Bergere 14
More info.
Beta test.
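A minimal sketch of a multi-document transaction in the mongo shell (MongoDB 4.0+ on a replica set; the bank database, accounts collection, and _id values are assumptions for illustration):

# Transfer 100 between two accounts atomically
> var session = db.getMongo().startSession()
> var accounts = session.getDatabase("bank").accounts
> session.startTransaction()
> accounts.updateOne({ _id: "A" }, { $inc: { balance: -100 } })
> accounts.updateOne({ _id: "B" }, { $inc: { balance: 100 } })
> session.commitTransaction()
# On error, session.abortTransaction() rolls both writes back
> session.endSession()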
Mongo DB releases
17/07/2019 Big Data class by Alexandre Bergere 15
Companies
17/07/2019 Big Data class by Alexandre Bergere 16
Analytics – use case
17/07/2019 Big Data class by Alexandre Bergere 17
More info.
The City of Chicago cuts crime and improves citizen
welfare with a real-time geospatial analytics platform
called WindyGrid. Using MongoDB, it analyzes data
from 30+ different departments – like bus locations,
911 calls, and even tweets – to better understand and
respond to emergencies.
The case for adding NoSQL
17/07/2019 Big Data class by Alexandre Bergere 18
o Large volumes of rapidly changing structured, semi-structured, and unstructured data
o Agile sprints, quick schema iteration, and frequent code pushes
o API-driven, object-oriented programming that is easy to use and flexible
o Geographically distributed scale-out architecture instead of expensive, monolithic
architecture
Consider, for example, enterprise resource planning (ERP), a standard for relational databases.
What if you want to offer ERP forms users can actually modify if they need to? A document-
based NoSQL database such as MongoDB can provide that functionality without requiring you
to rebuild your whole data schema every time a user wants to change the data format.
White papers
17/07/2019 Big Data class by Alexandre Bergere 19
MongoDB – BI &
Analytics
MongoDB – Kafka MongoDB – Spark
Leader in The Forrester Wave™: Big Data NoSQL, Q1 2019
17/07/2019 Big Data class by Alexandre Bergere 20
o Data Types
o Streaming and Loading
o Big Data Support
o In-memory
o Performance
o Scalability
o High Availability & Disaster
Recovery
o Tools
o Workloads
o Use Cases
o Ability to Execute
o Road Map
o Open Source and Licensing
o Support
17/07/2019 Big Data class by Alexandre Bergere 21
Tools
MongoDB Compass
17/07/2019 Big Data class by Alexandre Bergere 22
Mongo DB Atlas
17/07/2019 Big Data class by Alexandre Bergere 23
DBaaS: Database as a Service
• Schema design
• Query and index optimization
• Server size selection: you must select the appropriate size of server, coupled with IO and storage capacity
• Capacity planning: you must determine when you need additional capacity, typically using the monitoring telemetry provided by MongoDB Atlas, but you can make these changes with no downtime
• Initiating database restores
• You pay for how much you use
Mongo DB Cloud Manager
17/07/2019 Big Data class by Alexandre Bergere 24
Mongo DB Connector for BI
17/07/2019 Big Data class by Alexandre Bergere 25
MongoDB Charts
(beta)
17/07/2019 Big Data class by Alexandre Bergere 26
MongoDB Charts is the fastest and
easiest way to build visualizations of
MongoDB data.
Pseudo on-premises architecture
17/07/2019 Big Data class by Alexandre Bergere 27
Change Streams
17/07/2019 Big Data class by Alexandre Bergere 28
More info.
Change streams allow applications to access real-time data changes without the complexity and risk of
tailing the oplog. Applications can use change streams to subscribe to all data changes on a collection and
immediately react to them.
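A minimal sketch in the mongo shell (MongoDB 3.6+, replica set required; the artists collection is an assumption):

# Subscribe to all changes on a collection
> var cursor = db.artists.watch()
> while (cursor.hasNext()) { printjson(cursor.next()) }
# Each returned document describes one change, e.g. operationType: "insert"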
Stitch
17/07/2019 Big Data class by Alexandre Bergere 29
Full access to MongoDB, declarative read/write
controls, and integration with your choice of services
MongoDB Stitch lets developers focus on building applications rather than on managing data manipulation code, service
integration, or backend infrastructure. Whether you’re just starting up and want a fully managed backend as a service, or
you’re part of an enterprise and want to expose existing MongoDB data to new applications, Stitch lets you focus on
building the app users want, not on writing boilerplate backend logic.
17/07/2019 Big Data class by Alexandre Bergere 30
Modeling & querying
Documents are rich data structures
17/07/2019 Big Data class by Alexandre Bergere 31
• JSON:
• String, Number, Array, Object, NULL, Boolean.
• BSON:
• Date, BinData, ObjectID, Geo-Location.
• Better storage performance.
ObjectID:
◦ _id : DATE[4] | MAC_ADDR[3] | PID[2] | COUNTER[3]
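The embedded creation date can be read back directly in the shell:

> var id = ObjectId()   # generated client-side, e.g. at insert time
> id.getTimestamp()     # returns the creation time as an ISODate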
Available Types
17/07/2019 Big Data class by Alexandre Bergere 32
Type Number Alias Notes
Double 1 “double”
String 2 “string”
Object 3 “object”
Array 4 “array”
Binary data 5 “binData”
Undefined 6 “undefined” Deprecated.
ObjectId 7 “objectId”
Boolean 8 “bool”
Date 9 “date”
Null 10 “null”
Regular Expression 11 “regex”
DBPointer 12 “dbPointer” Deprecated.
JavaScript 13 “javascript”
Symbol 14 “symbol” Deprecated.
JavaScript (with scope) 15 “javascriptWithScope”
32-bit integer 16 “int”
Timestamp 17 “timestamp”
64-bit integer 18 “long”
Decimal128 19 “decimal” New in version 3.4.
Min key -1 “minKey”
Max key 127 “maxKey”
SQL vs MongoDB Terms
17/07/2019 Big Data class by Alexandre Bergere 33
SQL Terms/Concepts MongoDB Terms/Concepts
Database Database
Table Collection
Row Document
Column Field
Index Index
Join Embedded or linked document
Primary key Primary key (the _id field)
Documents are Flexible
17/07/2019 Big Data class by Alexandre Bergere 34
Document Model
17/07/2019 Big Data class by Alexandre Bergere 35
Pers_ID Surname First_Name City
0 Miller Paul London
1 Ortega Alvaro Valencia
2 Huber Urs Zurich
3 Blanc Gaston Paris
4 Bertolini Fabrizio Rome
Car_ID Model Year Value Pers_ID
101 Bentley 1973 100000 0
102 Rolls Royce 1965 330000 0
103 Peugeot 1993 500 3
104 Ferrari 2005 150000 4
105 Renault 1998 2000 3
106 Renault 2001 7000 3
107 Smart 1999 2000 2
CAR
PERSON
Mongo DB
RDBMS
One to many
17/07/2019 Big Data class by Alexandre Bergere 36
CRUD
17/07/2019 Big Data class by Alexandre Bergere 37
# FIND()
> db.<collection>.find ({<conditions>},{<fields>})
> db.products.find( { qty: { $gt: 25 } }, { item: 1, qty: 1 } )
Options:
.pretty()
.sort() : 1 : ASC, -1 : DESC : sort({"name": -1})
.skip() : number
.limit() : number
.count()
sort first, skip second, and limit last, because that is the only order that makes
sense.
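For instance, chaining the options on a cursor (collection and field names as in the labs):

# 6th to 10th artists by descending last name
> db.artists.find().sort({"last_name": -1}).skip(5).limit(5)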
CRUD
17/07/2019 Big Data class by Alexandre Bergere 38
# INSERT()
> db.<collection>.insert ({<value>})
> db.<collection>.insertMany([{<values>}])
> db.inventory.insertMany([
{ item: "journal", qty: 25, tags: ["blank", "red"], size: { h: 14, w: 21, uom: "cm" } },
{ item: "mat", qty: 85, tags: ["gray"], size: { h: 27.9, w: 35.5, uom: "cm" } },
{ item: "mousepad", qty: 25, tags: ["gel", "blue"], size: { h: 19, w: 22.85, uom: "cm" } }
])
db.collection.insertOne() Inserts a single document into a collection.
db.collection.insertMany() Inserts multiple documents into a collection.
db.collection.insert() Inserts a single document or multiple documents into a collection.
CRUD
17/07/2019 Big Data class by Alexandre Bergere 39
# UPDATE()
> db.<collection>.update
({<conditions>},{<fields>},{upsert:true/false},{multi:true/false}
)
> { "_id": "artist:271", "last_name": "Cotillard", "first_name": "Marion", "birth_date": "1975" }
# Operator Update
> db.artists.update({"_id": "artist:271"},{ $set : {"last_name" : "Page"}})
> { "_id": "artist:271", "last_name": "Page", "first_name": "Marion", "birth_date": "1975" }
# Replacement Update
> db.artists.update({"_id": "artist:271"},{"last_name" : "Page"})
> { "_id": "artist:271", "last_name": "Page"}
❑ Operator Update
❑ Replacement Update
upsert: boolean. Optional. If set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found.
multi: boolean. Optional. If set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false.
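A short sketch of both options (the filter values are assumptions):

# upsert: insert the document if nothing matches the filter
> db.artists.update({"_id": "artist:999"}, {$set: {"last_name": "Doe"}}, {upsert: true})
# multi: apply the $set to every matching document
> db.artists.update({"birth_date": "1975"}, {$set: {"decade": "70s"}}, {multi: true})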
CRUD
17/07/2019 Big Data class by Alexandre Bergere 40
# DELETE()
> db.<collection>.remove ({<conditions>})
> db.artists.remove({"_id": "artist:39"})
# Remove all documents
> db.artists.remove({})
Query Operator
17/07/2019 Big Data class by Alexandre Bergere 41
Name Description
$eq Matches values that are equal to a specified value.
$gt Matches values that are greater than a specified value.
$gte Matches values that are greater than or equal to a specified value.
$lt Matches values that are less than a specified value.
$lte Matches values that are less than or equal to a specified value.
$ne Matches all values that are not equal to a specified value.
$in Matches any of the values specified in an array.
Query Operator : $set
17/07/2019 Big Data class by Alexandre Bergere 42
# $set
> db.products.update(
{ _id: 100 },
{ $set:
{
quantity: 500,
details: { model: "14Q3", make: "xyz" },
tags: [ "coats", "outerwear", "clothing" ]
}
}
)
# $set Embedded Documents
> db.products.update(
{ _id: 100 },
{ $set: { "details.make": "zzz" } }
)
# $set in Arrays
> db.products.update(
{ _id: 100 },
{ $set:
{
"tags.1": "rain gear",
"ratings.0.rating": 2
}
}
)
Query Operator : Arrays
17/07/2019 Big Data class by Alexandre Bergere 43
Name Description
$pull Removes all array elements that match a specified query.
$push Add an element to an array.
$pop Removes the first or last item of an array.
$addToSet Adds elements to an array only if they do not already exist in the set.
$in Matches any of the values specified in an array.
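For example, on a hobbies array field (as in the labs):

> db.artists.update({"_id": "artist:280"}, {$push: {"hobbies": "chess"}})
> db.artists.update({"_id": "artist:280"}, {$addToSet: {"hobbies": "chess"}})   # no duplicate added
> db.artists.update({"_id": "artist:280"}, {$pop: {"hobbies": 1}})   # 1: last element, -1: first
> db.artists.update({"_id": "artist:280"}, {$pull: {"hobbies": "chess"}})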
DML
17/07/2019 Big Data class by Alexandre Bergere 44
# Returns all databases
> show dbs
# The current database name:
> db.getName()
# Returns all collections in the current
database:
> db.getCollectionNames()
# Returns a collection or a view object:
> db.getCollection(name)
# The current database connection:
> db.getMongo()
# Clean the console log:
> cls
# Return collection information:
> db.getCollectionInfos({name: "name"})
Command-line tools
17/07/2019 Big Data class by Alexandre Bergere 45
# Import multiple documents:
> mongoimport -d crunchbase -c companies
D:\MongoDB\src\companies.json
# Import multiple documents in an array:
> mongoimport -d crunchbase -c companies
D:\MongoDB\src\companies.json --jsonArray
# Export
> mongoexport -d crunchbase -c artists --out
D:\MongoDB\artists.json
Run these from the system shell, not inside a mongoDB instance.
Command Description
mongodump mongodump is a utility for creating a binary export of the
contents of a database. mongodump can export data from
either mongod or mongos instances.
mongorestore The mongorestore program loads data from either a binary
database dump created by mongodump or the standard input
(starting in version 3.0.0) into a mongod or mongos instance.
mongostat This utility constantly polls MongoDB and the system to
provide helpful stats, including the number of operations per
second (inserts, queries, updates, deletes, and so on), the
amount of virtual memory allocated, and the number of
connections to the server.
mongoperf Helps you understand the disk operations happening in a
running MongoDB instance.
mongotop Similar to top, this utility polls MongoDB and shows the
amount of time it spends reading and writing data in each
collection.
mongosniff A wire-sniffing tool for viewing operations sent to the
database. It essentially translates the BSON going over the
wire to human-readable shell statements.
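For example, a binary backup and restore (run from the system shell; the paths are assumptions):

# Dump the crunchbase database to BSON files
> mongodump -d crunchbase --out D:\MongoDB\backup
# Load the dump back into a running instance
> mongorestore D:\MongoDB\backup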
$text
17/07/2019 Big Data class by Alexandre Bergere 46
# $text
> db.articles.find( { $text: { $search: "coffee" } } )
$text performs a text search on the content of the fields indexed with a text index. A $text expression has the following
syntax:
# $text
> {
$text:
{
$search: <string>,
$language: <string>,
$caseSensitive: <boolean>,
$diacriticSensitive: <boolean>
}
}
# Create index first - You can index multiple fields for the
text index:
db.reviews.createIndex(
{
subject: "text",
comments: "text"
}
)
Schema Validation
17/07/2019 Big Data class by Alexandre Bergere 47
Implement data governance without sacrificing
the agility that comes from a dynamic schema.
With schema validation, developers and
operations spend less time defining data quality
controls in their applications, and instead
delegate these tasks to the database.
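A minimal sketch using a $jsonSchema validator (MongoDB 3.6+; the collection and field names are assumptions):

# Documents that fail validation are rejected (validationAction: "error")
> db.createCollection("artists", {
    validator: { $jsonSchema: {
      bsonType: "object",
      required: ["last_name"],
      properties: {
        last_name: { bsonType: "string" },
        birth_date: { bsonType: "string" }
      }
    }},
    validationAction: "error"
  })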
Aggregation
17/07/2019 Big Data class by Alexandre Bergere 48
Swiss Army knife
Executes in native code
o Written in C++
o JSON parameter
Flexible, functional, simple
o Operation pipeline
o Computational expressions
Pipeline operators
17/07/2019 Big Data class by Alexandre Bergere 49
Operator Description
$match Filter documents
$project Reshape documents
$group Summarize documents
$unwind Expand arrays in documents
$sort Order documents
$limit / $skip Paginate documents
$redact Restrict documents
$geoNear Proximity sort documents
$let, $map Define variables
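These stages chain into a single pipeline call; for example (the books collection and its fields are assumptions, matching the examples that follow):

# Filter, summarize, order, and paginate in one pipeline
> db.books.aggregate([
    { $match: { pages: { $gt: 100 } } },
    { $group: { _id: "$language", books: { $sum: 1 }, avgPages: { $avg: "$pages" } } },
    { $sort: { books: -1 } },
    { $limit: 5 }
  ])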
$match
17/07/2019 Big Data class by Alexandre Bergere 50
# Matching field values
> {$match:{
language:"Russian"
}}
{
title:"War and Peace",
pages:1440,
language:"Russian"
}
# Matching with query operators
> {$match:{
pages:{$gt:100}
}}
{
title:"War and Peace",
pages:1440,
language:"Russian"
},
{
title:"Atlas Shrugged",
pages:1088,
language:"English"
}
$project
17/07/2019 Big Data class by Alexandre Bergere 51
# Renaming and computing fields
> {$project:{
avgChapterLength:{
$divide:["$pages", "$chapters" ]
},
lang: "$language"
}}
{
_id:375,
avgChapterLength: 24.2222,
lang:"English"
}
# Including & excluding fields
> {$project:{
_id:0,
title:1,
language:1
}}
{
title:"Great Gatsby",
language:"English"
}
$group
17/07/2019 Big Data class by Alexandre Bergere 52
# Collect distinct values
> {$group:{
_id:"$language",
title:{$addToSet:"$title"}
}}
{
_id:"English",
title:["Atlas Shrugged", "The
Great Gatsby"]
},
{
_id:"Russian",
title:["War and Peace"]
}
# Calculating average, summing fields…
> {$group:{
_id:"$language",
pages:{$sum:"$pages"},
books:{$sum:1},
avgPages:{$avg:"$pages"}
}}
{
_id:"Russian",
pages:1440,
books:1,
avgPages:1440
}
$unwind
17/07/2019 Big Data class by Alexandre Bergere 53
# Expand an array into one document per element
> {$unwind: "$subjects"}
{
title:"The Great Gatsby",
ISBN:"9762832930920323" ,
subjects:"Long Island"
},
{
title:"The Great Gatsby",
ISBN:"9762832930920323" ,
subjects:"New York"
},
{
title:"The Great Gatsby",
ISBN:"9762832930920323" ,
subjects:"1920s"
}
{
title:"The Great Gatsby",
ISBN:"9762832930920323" ,
subjects:[
"Long Island",
"New York",
"1920s"
]
}
17/07/2019 Big Data class by Alexandre Bergere 54
LABS
Installation
17/07/2019 Big Data class by Alexandre Bergere 55
Download & Install
Instance
17/07/2019 Big Data class by Alexandre Bergere 56
Launch as a service:
mongod --dbpath C:\Users\alexa\Documents\MongoDB\data
Launch as a connection:
mongo
Options Shortcut
--db -d
--collection -c
--username -u
--password -p
--host -h
Query practice
17/07/2019 Big Data class by Alexandre Bergere 57
# 1.0 Load artists.json
> mongoimport -d crunchbase -c artists --file C:\Users\alexa\Documents\Cours\MongoDB\2017-2018\src\artists.json --jsonArray --port 27018
# 1.1 Return first_name and birth_date to all artists born in 1964
> db.artists.find({"birth_date": "1964"},{"_id":0,"first_name":1, "birth_date":1})
# 1.2 Return all artists born after 1980 or whose first name begins with 'Chri'
> db.artists.find({$or:[{"birth_date": {$gte:"1980"}},{"first_name":/^Chri/}]},{})
> db.artists.find({$or:[{"birth_date": {$gte:"1980"}},{"first_name":{$regex : /^Chri/}}]},{})
# 1.3 Return the 6th to the 9th artist, sorted by last name descending
> db.artists.find().pretty().sort({"last_name":-1}).skip(5).limit(4)
# 1.4 Insert the following artist:
{"_id": "artist:282", "last_name": "Bergere", "first_name": "Alexandre", "birth_date": "1992"} : (Replace
the id)
> db.artists.insert({ "_id": "artist:282", "last_name": "Bergere", "first_name": "Alexandre",
"birth_date": "1992" })
Query practice
17/07/2019 Big Data class by Alexandre Bergere 58
# 1.5 Change the first_name of the artist with the id artist:266 to "Jonathan"
> db.artists.update({"_id": "artist:266"},{$set:{"first_name":"Jonathan"}})
# 1.6 Add "golf" to artist 280's hobbies
> db.artists.update({"_id": "artist:280"},{$push:{"hobbies":"golf"}})
# 1.7 Add "yoga" to artist 282's hobbies
> db.artists.update({"_id": "artist:282"},{$push:{"hobbies":"yoga"}})
# 1.8 Remove "poney" and "photo" from artist 280's hobbies
> db.artists.update({"_id": "artist:280"},{$pull:{"hobbies": {$in:["poney","photo"]}}})
Query practice
17/07/2019 Big Data class by Alexandre Bergere 59
# Convert string to integer
> db.artists.find({birth_date: {$exists: true}}).forEach(function(obj) {
obj.birth_date = new NumberInt(obj.birth_date);
db.artists.save(obj);
});
17/07/2019 Big Data class by Alexandre Bergere 60
Go Deeper
Support
MongoDB in action, 2nd Edition docs.mongodb.com
17/07/2019 MongoDB class by Alexandre Bergere 61
Summer Internship
https://www.mongodb.com/careers/college-students
17/07/2019 MongoDB class by Alexandre Bergere 62
Learning
https://www.university.mongodb.com
17/07/2019 MongoDB class by Alexandre Bergere 63
17/07/2019 Big Data class by Alexandre Bergere 64
17/07/2019 Big Data class by Alexandre Bergere 65
Graph database
What is a graph database?
17/07/2019 Big Data class by Alexandre Bergere 66
A graph database is an online database management system with Create, Read, Update and Delete (CRUD)
operations working on a graph data model. Graph databases are generally built for use with online
transaction processing (OLTP) systems. Accordingly, they are normally optimized for transactional
performance, and engineered with transactional integrity and operational availability in mind. ~ Neo4j
Unlike other databases, relationships take first priority in graph databases.
The case for graph databases
17/07/2019 Big Data class by Alexandre Bergere 67
What is a Graph?
17/07/2019 Big Data class by Alexandre Bergere 68
A graph is just a collection of vertices and edges—or, in less intimidating language, a set of nodes and the
relationships that connect them.
Definitions
17/07/2019 Big Data class by Alexandre Bergere 69
• Nodes
o Nodes are the main data elements
o Nodes are connected to other nodes
via relationships
o Nodes can have one or more properties (i.e.,
attributes stored as key/value pairs)
o Nodes have one or more labels that describes
its role in the graph
o Example: Person nodes vs Car nodes
• Relationships
o Relationships connect two nodes
o Relationships are directional
o Nodes can have multiple, even recursive
relationships
o Relationships can have one or
more properties (i.e., attributes stored as
key/value pairs)
Properties
o Properties are named values where the name (or
key) is a string
o Properties can be indexed and constrained
o Composite indexes can be created from multiple
properties
Labels
o Labels are used to group nodes into sets
o A node may have multiple labels
o Labels are indexed to accelerate finding nodes in
the graph
o Native label indexes are optimized for speed
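As a sketch in Cypher (assumed sample data), these pieces fit together like this:

// Two labeled nodes with properties, connected by a directed
// relationship that carries a property of its own
CREATE (p:Person {name: 'Paul Miller', city: 'London'})
CREATE (c:Car {model: 'Bentley', year: 1973})
CREATE (p)-[:OWNS {since: 2010}]->(c)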
Modelling relational to graph
17/07/2019 Big Data class by Alexandre Bergere 70
Similarities:
Relational → Graph
Rows → Nodes
Joins → Relationships
Table names → Labels
Columns → Properties
How the relational model differs from the graph model:
Relational: Each column must have a field value. Graph: Nodes with the same label aren't required to have the same set of properties.
Relational: Joins are calculated at query time. Graph: Relationships are stored on disk when they are created.
Relational: A row can belong to one table. Graph: A node can have many labels.
RDBMS vs graph
17/07/2019 Big Data class by Alexandre Bergere 71
17/07/2019 Big Data class by Alexandre Bergere 72
Neo4j
Neo4j Graph Platform
17/07/2019 Big Data class by Alexandre Bergere 73
The Neo4j Graph Platform includes out-of-the-box tooling that enables you to access graphs in Neo4j Databases. In
addition, Neo4j provides APIs and drivers that enable you to create applications and custom tooling for accessing and
visualizing graphs.
Dev env.
17/07/2019 Big Data class by Alexandre Bergere 75
Neo4j Desktop
o Neo4j Database server
o graph engine
o kernel (Cypher execution)
o Neo4j Browser
o additional libraries and drivers for accessing the Neo4j database
Neo4j Sandbox
o temporary, cloud-based instance of a Neo4j Server with its associated graph that you can access from any Web browser
o available for three days, but you can extend it for up to 10 days
o you can use Neo4j Browser Sync to save Cypher scripts from your sandbox
Neo4j Browser
17/07/2019 Big Data class by Alexandre Bergere 76
17/07/2019 Big Data class by Alexandre Bergere 77
Introduction to Cypher
What’s Cypher?
17/07/2019 Big Data class by Alexandre Bergere 78
Cypher is a declarative query language that allows for expressive and efficient querying and updating of graph data.
Cypher is ASCII art: it focuses on the clarity of expressing what to retrieve from a graph.
Cypher is inspired by SPARQL, SQL, Python, and Haskell.
Node & Label
17/07/2019 Big Data class by Alexandre Bergere 79
() // anonymous node not be referenced later in the query
(p) // variable p, a reference to a node used later
(:Person) // anonymous node of type Person
(p:Person) // p, a reference to a node of type Person
(p:Actor:Director) // p, a reference to a node of types Actor and Director
Examining the data model
CALL db.schema
Using MATCH to retrieve nodes
17/07/2019 Big Data class by Alexandre Bergere 80
MATCH (n) // returns all nodes in the graph
RETURN n
MATCH (p:Person) // returns all Person nodes in the graph
RETURN p
When you specify a pattern for a MATCH clause, you should always specify a node label if possible. In doing so, the graph
engine uses an index to retrieve the nodes which will perform better than not using a label for the MATCH.
Properties
17/07/2019 Big Data class by Alexandre Bergere 81
A property is defined for a node and not for a type of node. All nodes of the same type need not have the same properties.
// Query the database for all property keys
CALL db.propertyKeys
MATCH (variable:Label {propertyKey: propertyValue, propertyKey2: propertyValue2})
RETURN variable
MATCH (m:Movie {released: 2003, tagline: 'Free your mind'})
RETURN m
Filtering queries using property values
17/07/2019 Big Data class by Alexandre Bergere 82
// Retrieve all Movie nodes that have a released property value of 2003.
MATCH (m:Movie {released:2003}) RETURN m
// Retrieve all Movies released in 2006, returning their titles
MATCH (m:Movie {released: 2006}) RETURN m.title
// Display title, released, and tagline values for every Movie node in the graph
MATCH (m:Movie) RETURN m.title AS `movie title`, m.released AS released, m.tagline
AS tagLine
Relationships
17/07/2019 Big Data class by Alexandre Bergere 83
A relationship is a directed connection between two nodes that has a relationship type (name). In addition, a relationship
can have properties, just like nodes.
() // a node
()--() // 2 nodes have some type of relationship
()-->() // the first node has a relationship to the second node
()<--() // the second node has a relationship to the first node
Here is how Cypher uses ASCII art to specify path used for a query:
Querying using relationships:
MATCH (node1)-[:REL_TYPE]->(node2)
RETURN node1, node2
MATCH (node1)-[:REL_TYPEA | :REL_TYPEB]->(node2)
RETURN node1, node2
node1 is a specification of a node where you may include node labels and property values for filtering.
:REL_TYPE is the type (name) for the relationship. For this syntax the relationship is from node1 to node2.
:REL_TYPEA , :REL_TYPEB are the relationships from node1 to node2. The nodes are returned if at least one of the relationships exists.
node2 is a specification of a node where you may include node labels and property values for filtering.
Relationships
17/07/2019 Big Data class by Alexandre Bergere 84
Using patterns for queries:
MATCH (p:Person)-[:FOLLOWS]->(:Person {name:'Angela Scope'})
RETURN p
MATCH (p:Person)<-[:FOLLOWS]-(:Person {name:'Angela Scope'})
RETURN p
Relationships
17/07/2019 Big Data class by Alexandre Bergere 85
Using patterns for queries:
// Querying by any direction of the relationship
MATCH (p1:Person)-[:FOLLOWS]-(p2:Person {name:'Angela Scope'})
RETURN p1, p2
Relationships
17/07/2019 Big Data class by Alexandre Bergere 86
Using patterns for queries:
// Traversing relationships : query to return all followers of the followers
of Jessica Thompson.
MATCH (p:Person)-[:FOLLOWS]->(:Person)-[:FOLLOWS]->(:Person {name:'Jessica
Thompson'})
RETURN p
// Traversing relationships : return each person along the path by specifying
variables for the nodes and returning them
MATCH path = (:Person)-[:FOLLOWS]->(:Person)-[:FOLLOWS]->(:Person {name:'Jessica
Thompson'})
RETURN path
Relationships
17/07/2019 Big Data class by Alexandre Bergere 87
Using a relationship in a query:
MATCH (p:Person)-[rel:ACTED_IN]->(m:Movie {title: 'The Matrix'})
RETURN p, rel, m
Variables:
o p to represent the Person nodes during the query, the
variable
o m to represent the Movie node retrieved
o rel to represent the relationship for the relationship
type, ACTED_IN
Querying by multiple relationships:
MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN|:DIRECTED]->(m:Movie)
RETURN p.name, m.title
Relationships
17/07/2019 Big Data class by Alexandre Bergere 88
Using anonymous nodes in a query:
MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'The Matrix'})
RETURN p.name
A best practice is to place named nodes (those with variables) before anonymous nodes in a MATCH clause.
Using an anonymous relationship for a query:
// find all people who are in any way connected to the movie
MATCH (p:Person)-->(m:Movie {title: 'The Matrix'})
RETURN p, m
MATCH (p:Person)--(m:Movie {title: 'The Matrix'})
RETURN p, m
Relationships
17/07/2019 Big Data class by Alexandre Bergere 89
Retrieving the relationship types:
MATCH (p:Person)-[rel]->(:Movie {title:'The Matrix'})
RETURN p.name, type(rel)
Retrieving properties for relationships:
MATCH (p:Person)-[:REVIEWED {rating: 65}]->(:Movie {title: 'The Da Vinci Code'})
RETURN p.name
Filtering queries using relationships
17/07/2019 Big Data class by Alexandre Bergere 90
// Retrieve all people who wrote the movie Speed Racer
MATCH (p:Person)-[:WROTE]->(:Movie {title: 'Speed Racer'}) RETURN p.name
// Retrieve all movies that are connected to the person, Tom Hanks
MATCH (m:Movie)<--(:Person {name: 'Tom Hanks'}) RETURN m.title
or
MATCH(:Person {name: 'Tom Hanks'})-->(m:Movie) RETURN m.title
// Retrieve information about the relationships Tom Hanks has with the set of
movies retrieved earlier
MATCH (m:Movie)-[rel]-(:Person {name: 'Tom Hanks'}) RETURN m.title, type(rel)
// Retrieve information about the roles that Tom Hanks acted in
MATCH (m:Movie)-[rel:ACTED_IN]-(:Person {name: 'Tom Hanks'}) RETURN m.title,
rel.roles
Cypher style recommendations
17/07/2019 Big Data class by Alexandre Bergere 91
Here are the Neo4j-recommended Cypher coding standards:
o Node labels are CamelCase and begin with an upper-case letter (examples: Person, NetworkAddress). Note that node
labels are case-sensitive.
o Property keys, variables, parameters, aliases, and functions are camelCase and begin with a lower-case letter
(examples: businessAddress, title). Note that these elements are case-sensitive.
o Relationship types are in upper-case and can use the underscore. (examples: ACTED_IN, FOLLOWS). Note that
relationship types are case-sensitive and that you cannot use the “-” character in a relationship type.
o Cypher keywords are upper-case (examples: MATCH, RETURN). Note that Cypher keywords are case-insensitive, but a
best practice is to use upper-case.
o String constants are in single quotes, unless the string contains a quote or apostrophe (examples: ‘The Matrix’,
“Something’s Gotta Give”). Note that you can also escape single or double quotes within strings that are quoted with
the same using a backslash character.
o Specify variables only when needed for use later in the Cypher statement.
o Place named nodes and relationships (that use variables) before anonymous nodes and relationships in your MATCH
clauses when possible.
o Specify anonymous relationships with -->, --, or <--
MATCH (:Person {name: 'Diane Keaton'})-[movRel:ACTED_IN]->
(:Movie {title:"Something's Gotta Give"})
RETURN movRel.roles
Follow the Cypher Style Guide when writing your Cypher statements.
17/07/2019 Big Data class by Alexandre Bergere 92
Getting More Out of Queries
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 93
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released = 2008
RETURN p, m
// complex conditions
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released = 2003 OR m.released = 2004
RETURN p, m
// same as previous
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE 2003 <= m.released <= 2004
RETURN p.name, m.title, m.released
// equivalent shorthand using an inline property filter
MATCH (p:Person)-[:ACTED_IN]->(m:Movie {released: 2008})
RETURN p, m
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 94
// Testing labels
MATCH (p:Person)
RETURN p.name
MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'The Matrix'})
RETURN p.name
// equivalent queries, testing labels and properties in WHERE:
MATCH (p)
WHERE p:Person
RETURN p.name
MATCH (p)-[:ACTED_IN]->(m)
WHERE p:Person AND m:Movie AND m.title='The Matrix'
RETURN p.name
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 95
// Testing the existence of a property
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name='Jack Nicholson' AND exists(m.tagline)
RETURN m.title, m.tagline
// Testing strings : You can specify STARTS WITH, ENDS WITH, and CONTAINS
MATCH (p:Person)-[:ACTED_IN]->()
WHERE toLower(p.name) STARTS WITH 'michael'
RETURN p.name
// Testing with regular expressions; You use the syntax =~
MATCH (p:Person)
WHERE p.name =~'Tom.*'
RETURN p.name
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 96
// Testing with patterns
// exclude people who directed that movie
MATCH (p:Person)-[:WROTE]->(m:Movie)
WHERE NOT exists( (p)-[:DIRECTED]->() )
RETURN p.name, m.title
// find Gene Hackman and the movies that he acted in with another person who also
directed the movie
MATCH (gene:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(other:Person)
WHERE gene.name= 'Gene Hackman'
AND exists( (other)-[:DIRECTED]->() )
RETURN gene, other, m
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 97
// Testing with list values : elements of the list have to be the same type of data
MATCH (p:Person)
WHERE p.born IN [1965, 1970]
RETURN p.name as name, p.born as yearBorn
// You can also compare a value to an existing list in the graph.
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WHERE 'Neo' IN r.roles AND m.title='The Matrix'
RETURN p.name
There are a number of syntax elements of Cypher that we have not covered in this training. For example, you can specify
CASE logic in your conditional testing for your WHERE clauses. You can learn more about these syntax elements in the
Neo4j Cypher Manual and the Cypher Refcard.
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 98
// Retrieve all actors that were born in the 70’s
MATCH (a:Person)
WHERE a.born >= 1970 AND a.born < 1980
RETURN a.name as Name, a.born as `Year Born`
// Retrieve all movies released in 2000 by testing the node label and the released
property, returning the movie titles
MATCH (m)
WHERE m:Movie AND m.released = 2000 and exists(m.released)
RETURN m.title
// Retrieve all people that wrote movies by testing the relationship between two
nodes
MATCH (a)-[rel]->(m)
WHERE a:Person AND type(rel) = 'WROTE' AND m:Movie
RETURN a.name as Name, m.title as Movie
// Retrieve all people in the graph that do not have the property ‘born’
MATCH (a:Person)
WHERE NOT exists(a.born)
RETURN a.name as Name
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 99
// Retrieve all people related to movies where the relationship has the rating
property, then return their name, movie title, and the rating.
MATCH (a:Person)-[rel]->(m:Movie)
WHERE exists(rel.rating)
RETURN a.name as Name, m.title as Movie, rel.rating as Rating
// Retrieve all REVIEW relationships from the graph where the summary of the review
contains the string fun, returning the movie title reviewed and the rating and
summary of the relationship.
MATCH (:Person)-[r:REVIEWED]->(m:Movie)
WHERE toLower(r.summary) CONTAINS 'fun'
RETURN m.title as Movie, r.summary as Review, r.rating as Rating
// Retrieve all people who have produced a movie, but have not directed a movie
MATCH (a:Person)-[:PRODUCED]->(m:Movie)
WHERE NOT ((a)-[:DIRECTED]->(:Movie))
RETURN a.name, m.title
// Retrieve the movies and their actors where one of the actors also directed the
movie
MATCH (a1:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a2:Person)
WHERE exists( (a2)-[:DIRECTED]->(m) )
RETURN a1.name as Actor, a2.name as `Actor/Director`, m.title as Movie
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 100
// Retrieve the movies that have an actor’s role that is the name of the movie
MATCH (a:Person)-[r:ACTED_IN]->(m:Movie)
WHERE m.title in r.roles
RETURN m.title as Movie, a.name as Actor
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 101
MATCH (a:Person)-[:ACTED_IN]->(m:Movie),
(m:Movie)<-[:DIRECTED]-(d:Person)
WHERE m.released = 2000
RETURN a.name, m.title, d.name
Specifying multiple MATCH patterns
This MATCH clause includes a pattern specified by two paths separated by a comma:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)
WHERE m.released = 2000
RETURN a.name, m.title, d.name
If possible, you should write the same query as follows:
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 102
// retrieve the actors who acted in the same movies as Keanu Reeves, but not when
Hugo Weaving acted in the same movie
MATCH (keanu:Person)-[:ACTED_IN]->(movie:Movie)<-[:ACTED_IN]-(n:Person),
(hugo:Person)
WHERE keanu.name='Keanu Reeves' AND
hugo.name='Hugo Weaving'
AND NOT (hugo)-[:ACTED_IN]->(movie)
RETURN n.name
Specifying multiple MATCH patterns
// Suppose we want to retrieve the movies that Meg Ryan acted in and their
respective directors, as well as the other actors that acted in these movies.
MATCH (meg:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person),
(other:Person)-[:ACTED_IN]->(m)
WHERE meg.name = 'Meg Ryan'
RETURN m.title as movie, d.name AS director , other.name AS `co-actors`
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 103
MATCH megPath = (meg:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person),
(other:Person)-[:ACTED_IN]->(m)
WHERE meg.name = 'Meg Ryan'
RETURN megPath
Setting path variables
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 104
Specifying varying length paths
// all of the followers of the followers of a Person
MATCH (follower:Person)-[:FOLLOWS*2]->(p:Person)
WHERE follower.name = 'Paul Blythe'
RETURN p
// Retrieve all paths of any length with the relationship, :RELTYPE from nodeA to
nodeB and beyond:
(nodeA)-[:RELTYPE*]->(nodeB)
// Retrieve all paths of any length with the relationship, :RELTYPE from nodeA to
nodeB or from nodeB to nodeA and beyond:
(nodeA)-[:RELTYPE*]-(nodeB)
// Retrieve the paths of length 3 with the relationship,
(node1)-[:RELTYPE*3]->(node2)
// Retrieve the paths of lengths 1, 2, or 3 with the relationship
(node1)-[:RELTYPE*1..3]->(node2)
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 105
Finding the shortest path
MATCH p = shortestPath((m1:Movie)-[*]-(m2:Movie))
WHERE m1.title = 'A Few Good Men' AND
m2.title = 'The Matrix'
RETURN p
A built-in function that you may find useful in a graph that has many ways of traversing the graph to get to the same node
is the shortestPath() function. Using the shortest path between two nodes improves the performance of the query.
When you use the shortestPath() function, the query editor will show a warning that this type of query could potentially
run for a long time. You should heed the warning, especially for large graphs. Read the Graph Algorithms documentation
about the shortest path algorithm.
When you use shortestPath(), you can specify an upper limit for the shortest path. In addition, you should aim to provide
patterns for the from and to nodes that execute efficiently. For example, use labels and indexes.
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 106
Specifying optional pattern matching
MATCH (p:Person)
WHERE p.name STARTS WITH 'James'
OPTIONAL MATCH (p)-[r:REVIEWED]->(m:Movie)
RETURN p.name, type(r), m.title
OPTIONAL MATCH matches patterns with your graph, just like MATCH does. The difference is that if no matches are found,
OPTIONAL MATCH will use NULLs for missing parts of the pattern. OPTIONAL MATCH could be considered the Cypher
equivalent of the outer join in SQL.
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 107
Collecting results
// the list of movies that Tom Cruise acted in
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name ='Tom Cruise'
RETURN collect(m.title) AS `movies for Tom Cruise`
Cypher has a built-in function, collect() that enables you to aggregate a value into a list.
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 108
Aggregation in Cypher
// implicitly groups by a.name and d.name
MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
RETURN a.name, d.name, count(*)
// count the paths retrieved where an actor and director collaborated in a movie
MATCH (actor:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name, director.name, count(m) AS collaborations, collect(m.title) AS
movies
Aggregation in Cypher is different from aggregation in SQL. In Cypher, you need not specify a grouping key. As soon as an
aggregation function is used, all non-aggregated result columns become grouping keys. The grouping is implicitly done,
based upon the fields in the RETURN clause.
There are more aggregating functions such as min()
or max() that you can also use in your queries.
These are described in the Aggregating Functions
section of the Neo4j Cypher Manual.
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 109
Additional processing using WITH
// only return actors that have 2 or 3 movies
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WITH a, count(a) AS numMovies, collect(m.title) as movies
WHERE numMovies > 1 AND numMovies < 4
RETURN a.name, numMovies, movies
During the execution of a MATCH clause, you can specify that you want some intermediate calculations or values that will
be used for further processing of the query, or for limiting the number of results before further processing is done. You use
the WITH clause to perform intermediate processing or data flow operations.
In a WITH clause, you have to give an alias to all expressions that are not simple variables.
// find all actors who have acted in at least five movies, and find (optionally)
the movies they directed and return the person and those movies
MATCH (p:Person)
WITH p, size((p)-[:ACTED_IN]->(:Movie)) AS movies
WHERE movies >= 5
OPTIONAL MATCH (p)-[:DIRECTED]->(m:Movie)
RETURN p.name, m.title
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 110
Additional processing using WITH
// retrieves all actors that acted in movies, and collects the list of movies for
any actor that acted in more than five movies.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH p, collect(m) AS movies
WHERE size(movies) > 5
RETURN p.name, movies
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 111
// Write a Cypher query that retrieves all movies that Gene Hackman has acted in,
along with the directors of the movies. In addition, retrieve the actors that acted
in the same movies as Gene Hackman. Return the name of the movie, the name of the
director, and the names of actors that worked with Gene Hackman.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person),
(a2:Person)-[:ACTED_IN]->(m)
WHERE a.name = 'Gene Hackman'
RETURN m.title as movie, d.name AS director , a2.name AS `co-actors`
// Retrieve all people connected to James Thompson through the FOLLOWS
relationship, in either direction
MATCH (p1:Person)-[:FOLLOWS]-(p2:Person)
WHERE p1.name = 'James Thompson'
RETURN p1, p2
// Modify the query to retrieve nodes that are one and two hops away
MATCH (p1:Person)-[:FOLLOWS*1..2]-(p2:Person)
WHERE p1.name = 'James Thompson'
RETURN p1, p2
// Modify the query to retrieve particular nodes that are connected no matter how
many hops are required
MATCH (p1:Person)-[:FOLLOWS*]-(p2:Person)
WHERE p1.name = 'James Thompson'
RETURN p1, p2
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 112
// Retrieve all movie by collecting a list of all people who acted in it
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
RETURN p.name as actor, collect(m.title) AS `movie list`
// Retrieve all movies that Tom Cruise has acted in and the co-actors that acted in
the same movie by collecting a list
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p2:Person)
WHERE p.name ='Tom Cruise'
RETURN m.title as movie, collect(p2.name) AS `co-actors`
// Retrieve all people who reviewed a movie, returning the list of reviewers and
how many reviewers reviewed the movie
MATCH (p:Person)-[:REVIEWED]->(m:Movie)
RETURN m.title as movie, count(p) as numReviews, collect(p.name) as reviewers
// Retrieve all directors, their movies, and people who acted in the movies,
returning the name of the director, the number of actors the director has worked
with, and the list of actors.
MATCH (d:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a:Person)
RETURN d.name AS director, count(a) AS `number actors` , collect(a.name) AS `actors
worked with`
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 113
// Retrieve the movies that have at least 2 directors, and optionally the names of
people who reviewed the movies.
MATCH (m:Movie)
WITH m, size((:Person)-[:DIRECTED]->(m)) AS directors
WHERE directors >= 2
OPTIONAL MATCH (p:Person)-[:REVIEWED]->(m)
RETURN m.title, p.name
Controlling how results are returned
17/07/2019 Big Data class by Alexandre Bergere 114
Eliminating duplication
MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
RETURN m.released, collect(DISTINCT m.title) AS movies
You have seen a number of query results where there is duplication in the results returned. In most cases, you want to
eliminate duplicated results. You do so by using the DISTINCT keyword.
Using WITH and DISTINCT to eliminate duplication
MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
WITH DISTINCT m
RETURN m.released, m.title
Another way that you can avoid duplication is to use WITH and DISTINCT together as follows:
Controlling how results are returned
17/07/2019 Big Data class by Alexandre Bergere 115
Ordering results
MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
RETURN m.released, collect(DISTINCT m.title) AS movies ORDER BY m.released DESC
If you want the results to be sorted, you specify the expression to use for the sort using the ORDER BY keyword and
whether you want the order to be descending using the DESC keyword. Ascending order is the default.
Controlling how results are returned
17/07/2019 Big Data class by Alexandre Bergere 116
Limiting the number of results
MATCH (m:Movie)
RETURN m.title as title, m.released as year ORDER BY m.released DESC LIMIT 10
Although you can filter queries to reduce the number of results returned, you may also want to limit the number of results.
Controlling results returned
17/07/2019 Big Data class by Alexandre Bergere 117
// write a query to retrieve all actors that acted in movies during the 1990s,
where you return the released date, the movie title, and the collected actor names
for the movie. For now do not worry about duplication.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released >= 1990 AND m.released < 2000
RETURN DISTINCT m.released, m.title, collect(a.name)
// modify the query so that the released date records returned are not duplicated.
To implement this, you must add the collection of the movie titles to the results
returned.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released >= 1990 AND m.released < 2000
RETURN m.released, collect(m.title), collect(a.name)
// The results returned from the previous query returns the collection of movie
titles with duplicates. That is because there are multiple actors per released
year. Next, modify the query so that there is no duplication of the movies listed
for a year.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released >= 1990 AND m.released < 2000
RETURN m.released, collect(DISTINCT m.title), collect(a.name)
Controlling results returned
17/07/2019 Big Data class by Alexandre Bergere 118
// Retrieve the top 5 ratings and their associated movies, returning the movie
title and the rating.
MATCH (:Person)-[r:REVIEWED]->(m:Movie)
RETURN m.title AS movie, r.rating AS rating
ORDER BY r.rating DESC LIMIT 5
Working with Cypher data
17/07/2019 Big Data class by Alexandre Bergere 119
Unwinding lists
// create a list with three elements, unwind the list and then return the values
WITH [1, 2, 3] AS list
UNWIND list AS row
RETURN list, row
There may be some situations where you want to perform the opposite of collecting results, but rather separate the lists
into separate rows. This functionality is done using the UNWIND clause.
The UNWIND clause is frequently used when importing data into a graph.
Working with Cypher data
17/07/2019 Big Data class by Alexandre Bergere 120
Dates
MATCH (actor:Person)-[:ACTED_IN]->(:Movie)
WHERE exists(actor.born)
// calculate the age
WITH DISTINCT actor, date().year - actor.born as age
RETURN actor.name, age as `age today`
ORDER BY actor.born DESC
Cypher has a built-in date() function, as well as other temporal values and functions that you can use to calculate temporal
values.
You use a combination of numeric, temporal, spatial, list and string functions to calculate values that are useful to your
application. For example, suppose you wanted to calculate the age of a Person node, given a year they were born (the born
property must exist and have a value).
Working with Cypher data
17/07/2019 Big Data class by Alexandre Bergere 121
// Modify the query you just wrote so that before the query processing ends, you
unwind the list of movies and then return the name of the actor and the title of
the associated movie
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH p, collect(m) AS movies
WHERE size(movies) > 5
WITH p, movies UNWIND movies AS movie
RETURN p.name, movie.title
// retrieves all movies that Tom Hanks acted in, returning the title of the movie,
the year the movie was released, the number of years ago that the movie was
released, and the age of Tom when the movie was released
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE a.name = 'Tom Hanks'
RETURN m.title, m.released, date().year - m.released as yearsAgoReleased,
m.released - a.born AS `age of Tom`
ORDER BY yearsAgoReleased
17/07/2019 Big Data class by Alexandre Bergere 122
Go further
Neo4j Bookshelf
17/07/2019 Big Data class by Alexandre Bergere 123
Resources
17/07/2019 Big Data class by Alexandre Bergere 124
Training & Certification
17/07/2019 Big Data class by Alexandre Bergere 125
Labs
17/07/2019 Big Data class by Alexandre Bergere 126
GraphGists
17/07/2019 Big Data class by Alexandre Bergere 127
Azure Cosmos DB
Azure Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 129
A globally distributed, massively scalable, multi-model database service
Azure Cosmos DB
Global Distribution
17/07/2019 Big Data class by Alexandre Bergere 130
o Policy-based geo-fencing
o Dynamically add and remove regions
o Failover priorities
o Dynamically configurable read and write regions
o Geo-local reads and writes
o 99.99% SLA for read availability
Database designed for modern web and mobile applications, which are (typically) global applications in nature.
Multi-Master
17/07/2019 Big Data class by Alexandre Bergere 131
Improved write latency for end users
Improved write scalability and write throughput
Better support for disconnected environments (for example, edge devices)
Load balancing
Consistency
17/07/2019 Big Data class by Alexandre Bergere 133
Level | Guarantees
Strong | Linearizability (once an operation is complete, it will be visible to all).
Bounded Staleness | Consistent Prefix. Reads lag behind writes by at most k prefixes or t interval. Similar properties to strong consistency (except within the staleness window), while preserving 99.99% availability and low latency.
Session | Consistent Prefix. Within a session: monotonic reads, monotonic writes, read-your-writes, write-follows-reads. Predictable consistency for a session, high read throughput + low latency.
Consistent Prefix | Reads will never see out-of-order writes (no gaps).
Eventual | Potential for out-of-order reads. Lowest cost for reads of all consistency levels.
COMPREHENSIVE SLAs
17/07/2019 Big Data class by Alexandre Bergere 134
RUN YOUR APP ON WORLD-CLASS INFRASTRUCTURE
Azure Cosmos DB is the only service with financially-backed SLAs for
millisecond latency at the 99th percentile, 99.999% HA and guaranteed
throughput and consistency
Latency: <10 ms at the 99th percentile
HA: 99.999%
Throughput: guaranteed
Consistency: guaranteed
Trust your data to industry-leading Security & Compliance
17/07/2019 Big Data class by Alexandre Bergere 135
Azure is the world’s most trusted cloud, with more certifications
than any other cloud provider.
• Enterprise grade security
• Encryption at Rest
• Encryption is enabled automatically by default
• Comprehensive Azure compliance certification
Throughput
17/07/2019 Big Data class by Alexandre Bergere 136
Request unit calculator
Request unit considerations:
o Item size
o Item property count
o Data consistency
o Indexed properties
o Document indexing
o Script usage
The currency of Azure Cosmos DB is the request unit (RU). With request units, you don't need to reserve read/write capacities or provision CPU, memory, and IOPS.
Serverless database
17/07/2019 Big Data class by Alexandre Bergere 137
Serverless computing is all about the ability to focus on individual pieces of logic that are repeatable and stateless.
o no infrastructure management.
o consume resources only for the seconds, or milliseconds, they run for.
Azure Cosmos DB trigger to invoke an Azure Function
Use an input binding to get data from Azure
Cosmos DB
Use an output binding to write data to Azure Cosmos DB
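A minimal sketch of an Azure Function fired by the Cosmos DB trigger, in the JavaScript programming model (the binding names and the container being watched are configured in function.json and are assumptions here):

module.exports = async function (context, documents) {
  // 'documents' holds the batch of created/updated items from the change feed
  if (documents && documents.length > 0) {
    context.log(documents.length + ' document(s) changed');
    // an output binding could write results to another container here
  }
};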
Cosmos DB Change Feed
17/07/2019 Big Data class by Alexandre Bergere 140
17/07/2019 Big Data class by Alexandre Bergere 141
Use cases
Top 10 reasons why customers use
Azure Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 142
o Handles many different types of data
o Multi-tenancy and enterprise-grade security
o Turnkey global distribution
o Mission-critical workloads
o Massive storage/throughput scalability
o 5 well-defined consistency models to optimize for speed and cost
o Analytics-ready worldwide big data
o Event-driven architectures
o Single-digit millisecond latency at the 99th percentile
o High availability and reliability
Powering global solutions
17/07/2019 Big Data class by Alexandre Bergere 143
Azure Cosmos DB was built to support modern app patterns and use cases.
It enables industry-leading organizations to unlock the value of data, and respond to
global customers and changing business dynamics in real-time.
o Data distributed and available globally: puts data where your users are
o Build real-time customer experiences: enable latency-sensitive personalization, bidding, and fraud detection
o Ideal for gaming, IoT & eCommerce: predictable and fast service, even during traffic spikes
o Simplified development with serverless architecture: fully-managed event-driven micro-services with elastic computing power
o Run Spark analytics over operational data: accelerate insights from fast, global data
o Lift and shift NoSQL data: lift and shift MongoDB and Cassandra workloads
Data distributed and available globally
17/07/2019 Big Data class by Alexandre Bergere 144
Put your data where your users are to give real-time access and
uninterrupted service to customers anywhere in the world.
o Turnkey global data replication across all Azure regions
o Guaranteed low-latency experience for global users
o Resiliency for high availability and disaster recovery
Build Real-Time Customer experiences
17/07/2019 Big Data class by Alexandre Bergere 145
Offer latency-sensitive applications with personalization, bidding, and
fraud-detection.
o Machine learning models generate real-time
recommendations across product catalogues
o Product analysis in milliseconds
o Low-latency ensures high app performance worldwide
o Tunable consistency models for rapid insight
Online Recommendations Service
HOT path
Offline Recommendations Engine
COLD path
Ideal for gaming, IoT and ecommerce
17/07/2019 Big Data class by Alexandre Bergere 146
Maintain service quality during high-traffic periods requiring
massive scale and performance.
o Instant, elastic scaling handles traffic bursts
o Uninterrupted global user experience
o Low-latency data access and processing for large and
changing user bases
o High availability across multiple data centers
Massive Scale Telemetry Stores for IOT
17/07/2019 Big Data class by Alexandre Bergere 147
Diverse and unpredictable IoT sensor workloads require a
responsive data platform
o Seamless handling of any data output or volume
o Data made available immediately, and indexed
automatically
o High writes per second, with stable ingestion and
query performance
Simplified development with serverless architecture
17/07/2019 Big Data class by Alexandre Bergere 148
Experience decreased time-to-market, enhanced scalability, and
freedom from framework management with event-driven
micro-services.
o Seamless handling of any data output or volume
o Data made available immediately, and indexed
automatically
o High writes per second, with stable ingestion and
query performance
o Real-time, resilient change feeds logged forever and
always accessible
o Native integration with Azure Functions
Run Spark over operational data
17/07/2019 Big Data class by Alexandre Bergere 149
Accelerate analysis of fast-changing, high-volume, global data.
o Real-time big data processing across any data model
o Machine learning at scale over globally-distributed data
o Speeds analytical queries with automatic indexing and
push-down predicate filtering
o Native integration with Spark Connector
Lift and shift NoSQL apps
17/07/2019 Big Data class by Alexandre Bergere 150
Make data modernization easy with seamless lift and shift
migration of NoSQL workloads to the cloud.
o Azure Cosmos DB APIs for MongoDB and Cassandra
bring app data from anywhere to Azure Cosmos DB
o Leverage existing tools, drivers, and libraries, and
continue using existing apps’ current SDKs
o Turnkey geo-replication
o No infrastructure or VM management required
Retail and marketing
17/07/2019 Big Data class by Alexandre Bergere 151
17/07/2019 Big Data class by Alexandre Bergere 152
Model
Document Data Model
17/07/2019 Big Data class by Alexandre Bergere 153
“Because at the end of the day, it’s all just keys and values – not just the key-value data model, but all these data models.”
“When it comes to actually building applications – well, that’s the developer’s job, and this is where the decision of which data model to
choose comes into play.”
o Document: SQL API (JSON), MongoDB API
o Graph: Gremlin API (graph traversal language)
o Key-Value: Table API (replaces Azure Table Storage)
o Columnar: Cassandra API
Atom Record Sequence (ARS)
17/07/2019 Big Data class by Alexandre Bergere 154
Your data is always stored as ARS – or Atom Record Sequence – a Microsoft creation that defines the
persistence layer for key-value pairs.
Switching Between Data Models
choosing an API = choosing a data model
Switching Between Data Models
17/07/2019 Big Data class by Alexandre Bergere 155
Each data model is merely a projection of the same underlying ARS format, and so eventually you will be
able to create a single account, and then switch freely between different APIs within the account. So that
then, you’ll be able to access one database as graph, key-value, document, or columnar, all at once.
Future release?
Resource Model
17/07/2019 Big Data class by Alexandre Bergere 156
Resource Model
17/07/2019 Big Data class by Alexandre Bergere 157
Account
Database
Container
Item
User
Permission
Resource Model
17/07/2019 Big Data class by Alexandre Bergere 158
Account
Database
Container
Item
User
Permission
Resource Model
17/07/2019 Big Data class by Alexandre Bergere 159
Account
Database
Container
Item
= Collection / Graph / Table, depending on the API
Handle any data with no schema or indexing required
17/07/2019 Big Data class by Alexandre Bergere 160
Azure Cosmos DB’s schema-less service automatically indexes all your data,
regardless of the data model, to deliver blazing-fast queries.
Item            | Color    | Microwave safe | Liquid capacity | CPU                                 | Memory | Storage
Geek mug        | Graphite | Yes            | 16oz            | ???                                 | ???    | ???
Coffee Bean mug | Tan      | No             | 12oz            | ???                                 | ???    | ???
Surface book    | Gray     | ???            | ???             | 3.4 GHz Intel Skylake Core i7-6600U | 16GB   | 1 TB SSD
o Automatic index management
o Synchronous auto-indexing
o No schemas or secondary indices needed
o Works across every data model
Index
17/07/2019 Big Data class by Alexandre Bergere 161
Schema-agnostic, automatic indexing
o Automatically index every property of every record without
having to define schemas and indices upfront.
o No need for schema and index management
o Works across every data model
o Latch free data structure for highly write-optimized database
engine
o Multiple index types: Hash, range, and geospatial
Index POLICIES
17/07/2019 Big Data class by Alexandre Bergere 162
CUSTOM INDEXING POLICIES
Though all Azure Cosmos DB data is indexed by default, you
can specify a custom indexing policy for your collections. Custom
indexing policies allow you to design and customize the shape of
your index while maintaining schema flexibility.
o Define trade-offs between storage, write and query
performance, and query consistency
o Include or exclude documents and paths to and from the
index
o Configure various index types
{
  "automatic": true,
  "indexingMode": "Consistent",
  "includedPaths": [{
    "path": "/*",
    "indexes": [{
      "kind": "Hash",
      "dataType": "String",
      "precision": -1
    }, {
      "kind": "Range",
      "dataType": "Number",
      "precision": -1
    }, {
      "kind": "Spatial",
      "dataType": "Point"
    }]
  }],
  "excludedPaths": [{
    "path": "/nonIndexedContent/*"
  }]
}
Resource Model in Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 163
17/07/2019 Big Data class by Alexandre Bergere 164
SQL QUERY SYNTAX
SQL SYNTAX
17/07/2019 Big Data class by Alexandre Bergere 165
Use the popular query language SQL to access semi-structured
JSON data.
This module will reference querying in the context of the SQL
API for Azure Cosmos DB.
SQL QUERY SYNTAX
17/07/2019 Big Data class by Alexandre Bergere 166
BASIC QUERY SYNTAX
The SELECT & FROM keywords are the basic components of
every query.
> SELECT
tickets.id,
tickets.pricePaid
FROM tickets
> SELECT
t.id,
t.pricePaid
FROM tickets t
SQL QUERY SYNTAX - WHERE
17/07/2019 Big Data class by Alexandre Bergere 167
FILTERING
WHERE supports complex scalar expressions including
arithmetic, comparison and logical operators
> SELECT
tickets.id,
tickets.pricePaid
FROM tickets
WHERE
tickets.pricePaid > 500.00 AND
tickets.pricePaid <= 1000.00
SQL QUERY SYNTAX - PROJECTION
17/07/2019 Big Data class by Alexandre Bergere 168
PROJECTION
If your workloads require a specific JSON schema, Azure
Cosmos DB supports JSON projection within its queries
> SELECT {
"id": tickets.id,
"flightNumber": tickets.assignedFlight.flightNumber,
"purchase": {
"cost": tickets.pricePaid
},
"stops": [
tickets.assignedFlight.origin,
tickets.assignedFlight.destination
]
} AS ticket
FROM tickets
SQL QUERY SYNTAX - PROJECTION
17/07/2019 Big Data class by Alexandre Bergere 169
PROJECTION
If your workloads require a specific JSON schema, Azure
Cosmos DB supports JSON projection within its queries
> SELECT VALUE {
"id": tickets.id,
"flightNumber": tickets.assignedFlight.flightNumber,
"purchase": {
"cost": tickets.pricePaid
},
"stops": [
tickets.assignedFlight.origin,
tickets.assignedFlight.destination
]
}
FROM tickets
INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 170
Azure Cosmos DB supports intra-document JOINs for de-normalized arrays
Let’s assume that we have two JSON documents in a collection:
{
"pricePaid": 575.5,
"assignedFlight": {
"number": "F125",
"origin": "SEA",
"destination": "JFK"
},
"seat": “12A",
"requests": [
"kosher_meal",
"aisle_seat"
],
"id": "6ebe1165836a"
}
{
"pricePaid": 234.75,
"assignedFlight": {
"number": "F752",
"origin": "SEA",
"destination": "LGA"
},
"seat": "14C",
"requests": [
"early_boarding",
"window_seat"
],
"id": "c4991b4d2efc"
}
INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 171
We can filter on a particular array index position without JOIN:
> SELECT
    tickets.assignedFlight.number,
    tickets.seat,
    tickets.requests
  FROM
    tickets
  WHERE
    tickets.requests[1] = "aisle_seat"
[
{
"number":"F125","seat":"12A",
"requests": [
"kosher_meal",
"aisle_seat"
]
}
]
INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 172
JOIN allows us to merge embedded documents or arrays across multiple documents and
return a flattened result set:
> SELECT
tickets.assignedFlight.number,
tickets.seat,
requests
FROM
tickets
JOIN
requests IN tickets.requests
[
{
"number":"F125","seat":"12A",
"requests":"kosher_meal"
},
{
"number":"F125","seat":"12A",
"requests":"aisle_seat"
},
{
"number":"F752","seat":"14C",
"requests":"early_boarding"
},
{
"number":"F752","seat":"14C",
"requests":"window_seat"
}
]
INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 173
Along with JOIN, we can also filter the cross products without knowing the array index
position:
> SELECT
    tickets.assignedFlight.number,
    tickets.seat,
    requests
  FROM
    tickets
  JOIN
    requests IN tickets.requests
  WHERE
    requests IN ("aisle_seat", "window_seat")
[
{
"number":"F125","seat":"12A“,
"requests": "aisle_seat"
},
{
"number":"F752","seat":"14C",
"requests": "window_seat"
}
]
17/07/2019 Big Data class by Alexandre Bergere 174
Tools
Cosmos DB Emulator
17/07/2019 Big Data class by Alexandre Bergere 175
The Azure Cosmos DB Emulator provides a local environment that emulates the Azure Cosmos DB service for development
purposes. Using the Azure Cosmos DB Emulator, you can develop and test your application locally, without creating an Azure
subscription or incurring any costs. When you're satisfied with how your application is working in the Azure Cosmos DB
Emulator, you can switch to using an Azure Cosmos DB account in the cloud.
At this time the Data Explorer in the emulator only fully supports SQL API collections and MongoDB collections. Table, Graph, and Cassandra containers are not
fully supported.
The Azure Cosmos DB Emulator provides a high-fidelity emulation of the Azure Cosmos DB service. It supports identical
functionality as Azure Cosmos DB, including support for creating and querying JSON documents, provisioning and scaling
collections, and executing stored procedures and triggers. You can develop and test applications using the Azure Cosmos DB
Emulator, and deploy them to Azure at global scale by just making a single configuration change to the connection endpoint for
Azure Cosmos DB.
The Azure Cosmos DB Emulator by default runs on the local machine ("localhost") listening on port 8081.
Azure Cosmos DB: Data migration tools
17/07/2019 Big Data class by Alexandre Bergere 176
Data Migration Tools
SQL API · MongoDB API · Graph API · Table API · Cassandra API
Cosmos DB Explorer
17/07/2019 Big Data class by Alexandre Bergere 177
With Cosmos DB Explorer you can:
o Take advantage of the full screen real estate for your queries and
results.
o Access your database account and collections with a connection string,
without needing access to the Azure subscription or portal.
o Share query results with authorized peers who do not have Azure
portal access.
o Work with Cosmos DB data without having to download any desktop
tools locally.
https://cosmos.azure.com/
Azure Cosmos DB – Interface demo
17/07/2019 Big Data class by Alexandre Bergere 178
Azure Cosmos DB – SQL Query Exercise
17/07/2019 Big Data class by Alexandre Bergere 179
Add data using Data Explorer
https://docs.microsoft.com/en-ie/learn/modules/access-data-with-cosmos-db-and-sql-api/3-add-data
Explore SQL query types
https://docs.microsoft.com/en-ie/learn/modules/access-data-with-cosmos-db-and-sql-api/4-query-types
17/07/2019 Big Data class by Alexandre Bergere 180
Add Cosmos DB to your architecture
Partitioning
17/07/2019 Big Data class by Alexandre Bergere 181
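The partition key is chosen when a container is created and cannot be changed afterwards. As a minimal sketch, assuming the @azure/cosmos Node.js SDK (v3) with placeholder account, database, and container names:

const { CosmosClient } = require("@azure/cosmos");

async function createPartitionedContainer() {
  // Placeholder endpoint and key: replace with your account's values
  const client = new CosmosClient({
    endpoint: "https://<account>.documents.azure.com",
    key: "<primary-key>"
  });
  const { database } = await client.databases.createIfNotExists({ id: "store" });
  // The partition key path controls how items spread across physical partitions
  const { container } = await database.containers.createIfNotExists({
    id: "orders",
    partitionKey: { paths: ["/customerId"] },
    throughput: 400 // RU/s provisioned on the container
  });
  return container;
}

A good partition key (here /customerId, an assumption) spreads both storage and request load evenly across partitions.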
17/07/2019 Big Data class by Alexandre Bergere 182
Stored procedure & UDFs
Stored Procedures
17/07/2019 Big Data class by Alexandre Bergere 183
BENEFITS
o Familiar programming language
o Atomic Transactions
o Built-in Optimizations
o Business Logic Encapsulation
Stored procedures perform complex transactions on documents and properties.
Stored procedures are written in JavaScript and are stored in a container on Azure
Cosmos DB. By performing the stored procedures on the database engine and
close to the data, you can improve performance over client-side programming.
Stored procedures are the only way to achieve atomic transactions within Azure
Cosmos DB; the client-side SDKs do not support transactions.
Performing batch operations in stored procedures is also recommended because
of the reduced need to create separate transactions.
Simple Stored Procedure
17/07/2019 Big Data class by Alexandre Bergere 184
function createSampleDocument(documentToCreate) {
  // The context gives access to the collection and the HTTP response
  var context = getContext();
  var collection = context.getCollection();
  // Queue an async create; the callback returns the new document's id
  var accepted = collection.createDocument(
    collection.getSelfLink(),
    documentToCreate,
    function (error, documentCreated) {
      if (error) throw error;
      context.getResponse().setBody(documentCreated.id);
    }
  );
  // false means the request was not accepted (time limit approaching)
  if (!accepted) return;
}
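As a hedged usage sketch (not from the deck), assuming the @azure/cosmos Node.js SDK and a container partitioned on /customerId, the procedure can be registered and executed like this:

// Register the procedure once per container
await container.scripts.storedProcedures.create({
  id: "createSampleDocument",
  body: createSampleDocument.toString() // the function defined above
});

// Execute it against a single partition key value
const { resource: newDocId } = await container.scripts
  .storedProcedure("createSampleDocument")
  .execute("customer-42", [{ id: "doc1", customerId: "customer-42" }]);
// newDocId is the value set by context.getResponse().setBody(...)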
Multi-DOCUMENT Transactions
17/07/2019 Big Data class by Alexandre Bergere 185
DATABASE TRANSACTIONS
In a typical database, a transaction can be defined as a sequence of operations performed as a single
logical unit of work. Each transaction provides ACID guarantees.
In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence,
requests made within stored procedures and triggers execute in the same scope of a database
session.
Example transaction scope: Create New Document → Query Collection → Update Existing Document → Delete Existing Document.
Stored procedures utilize snapshot
isolation to guarantee all reads within the
transaction will see a consistent snapshot
of the data
Bounded Execution
17/07/2019 Big Data class by Alexandre Bergere 186
EXECUTION WITHIN TIME BOUNDARIES
All Azure Cosmos DB operations must complete within the server-specified request timeout duration. If an
operation does not complete within that time limit, the transaction is rolled back.
HELPER BOOLEAN VALUE
All functions under the collection object (for create, read, replace, and delete of documents and
attachments) return a Boolean value that represents whether that operation will complete:
o If true, the operation is expected to complete
o If false, the time limit will soon be reached and your function should end execution as soon as
possible.
Transaction Continuation Model
17/07/2019 Big Data class by Alexandre Bergere 187
CONTINUING LONG-RUNNING TRANSACTIONS
o JavaScript functions can implement a continuation-based model to batch/resume execution
o The continuation value can be any value of your own choosing. This value can then be used by your
applications to resume a transaction from a new “starting point”
Bulk Create Documents: try to create each document, observe the return value, and if it is false return a "pointer" to resume later; otherwise continue until done.
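A minimal sketch of this pattern inside a stored procedure (not the official sample; the { done, index } response shape is an invented continuation value):

function bulkCreate(docs, startIndex) {
  var context = getContext();
  var collection = context.getCollection();
  var collLink = collection.getSelfLink();
  var index = startIndex || 0;

  tryCreate();

  function tryCreate() {
    if (index >= docs.length) {
      // Done: nothing left to resume
      context.getResponse().setBody({ done: true, index: index });
      return;
    }
    var accepted = collection.createDocument(collLink, docs[index], onCreated);
    // false means the time limit is near: return a continuation "pointer"
    if (!accepted) context.getResponse().setBody({ done: false, index: index });
  }

  function onCreated(error, doc) {
    if (error) throw error;
    index++;
    tryCreate();
  }
}

The client inspects done and, if it is false, calls the procedure again with the returned index as startIndex to resume the batch.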
Control Flow
17/07/2019 Big Data class by Alexandre Bergere 188
JAVASCRIPT CONTROL FLOW
Stored procedures allow you to naturally express control flow, variable scoping, assignment, and
integration of exception handling primitives with database transactions directly in terms of the JavaScript
programming language.
ES6 PROMISES
ES6 promises can be used within Azure Cosmos DB stored procedures. Unfortunately,
promises "swallow" exceptions by default, so it is recommended to use callbacks instead of ES6 promises.
Stored Procedure Control Flow
17/07/2019 Big Data class by Alexandre Bergere 189
function createTwoDocuments(docA, docB) {
  var context = getContext();
  var collection = context.getCollection();
  var collLink = collection.getSelfLink();

  var aAccepted = collection.createDocument(collLink, docA, docACallback);
  if (!aAccepted) return;

  function docACallback(error, createdA) {
    if (error) throw error;
    var bAccepted = collection.createDocument(collLink, docB, function (error, createdB) {
      if (error) throw error;
      context.getResponse().setBody({
        "firstDocId": createdA.id,
        "secondDocId": createdB.id
      });
    });
    if (!bAccepted) return;
  }
}
Rolling Back Transactions
17/07/2019 Big Data class by Alexandre Bergere 190
TRANSACTION ROLL-BACK
Inside a JavaScript function, all operations are automatically wrapped under a single transaction:
o If the function completes without any exception, all data changes are committed
o If there is any exception that’s thrown from the script, Azure Cosmos DB’s JavaScript runtime will
roll back the whole transaction.
Transaction scope: Create New Document → Query Collection → Update Existing Document → Delete Existing Document. If an exception is thrown, all changes are undone.
Transaction ROLLBACK in Stored Procedure
17/07/2019 Big Data class by Alexandre Bergere 191
collection.createDocument(
collection.getSelfLink(),
documentToCreate,
function (error, documentCreated) {
if (error) throw "Unable to create document, aborting...";
}
);
collection.replaceDocument(
  documentToReplace._self,
  replacementDocument,
  function (error, documentReplaced) {
    if (error) throw "Unable to update document, aborting...";
  }
);
User-defined Functions
17/07/2019 Big Data class by Alexandre Bergere 192
UDF
User-defined functions (UDFs) are used to extend the Azure Cosmos DB SQL API's query language
grammar and implement custom business logic. UDFs can only be called from inside queries;
they do not have access to the context object and are meant to be used as compute-only code.
User-Defined Function Definition
17/07/2019 Big Data class by Alexandre Bergere 193
var taxUdf = {
  id: "tax",
  serverScript: function tax(income) {
    if (income == undefined)
      throw "no input";
    if (income < 1000)
      return income * 0.1;
    else if (income < 10000)
      return income * 0.2;
    else
      return income * 0.4;
  }
}
User-Defined Function USAGE in Queries
17/07/2019 Big Data class by Alexandre Bergere 194
> SELECT
*
FROM
TaxPayers t
WHERE
udf.tax(t.income) > 20000
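For completeness, a hedged sketch of registering this UDF with the @azure/cosmos Node.js SDK (the container reference is an assumption) so that queries can call udf.tax():

// Register the UDF once per container before querying with it
await container.scripts.userDefinedFunctions.create({
  id: "tax",
  body: taxUdf.serverScript.toString()
});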
Create multiple Cosmos DB triggers
17/07/2019 Big Data class by Alexandre Bergere 195
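Such triggers are typically built on the change feed via Azure Functions. As a hedged sketch (database, collection, and connection-setting names are assumptions), a JavaScript function that fires on every change:

// function.json: a cosmosDBTrigger binding that reads from the change feed
{
  "bindings": [{
    "type": "cosmosDBTrigger",
    "name": "documents",
    "direction": "in",
    "connectionStringSetting": "CosmosDBConnection",
    "databaseName": "store",
    "collectionName": "orders",
    "createLeaseCollectionIfNotExists": true
  }]
}

// index.js
module.exports = async function (context, documents) {
  // Each invocation receives a batch of changed documents from the change feed
  if (documents && documents.length > 0) {
    context.log("Processing " + documents.length + " changed document(s)");
  }
};

Multiple functions can each consume the same change feed independently, as long as each uses its own lease collection.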
17/07/2019 Big Data class by Alexandre Bergere 196
Modeling
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 197
Embedded
“The guiding premise when normalizing data is to avoid storing redundant data on each
record and rather refer to data.”
Embedding data
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 198
Embedded data
When to embed (see the example below):
o There are "contains" relationships between entities.
o There are one-to-few relationships between entities.
o The embedded data changes infrequently.
o The embedded data won't grow without bound.
o The embedded data is integral to data in the document.
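For illustration, a hypothetical blog post (field names are invented) with its comments embedded directly in the post document:

{
  "id": "post:1",
  "title": "Modelling data in Cosmos DB",
  "author": "alex",
  "comments": [
    { "user": "sam", "text": "Great post!" },
    { "user": "kim", "text": "Very helpful." }
  ]
}

A single read returns the post and its comments together, which is ideal while the comments list stays small.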
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 199
Referenced data
The problem with this example is that the comments array is unbounded, meaning that there is no (practical) limit to the
number of comments any single post can have.
Referencing data
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 200
Referenced data
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 201
Referenced data
When to reference (see the example below):
o Representing one-to-many relationships.
o Representing many-to-many relationships.
o Related data changes frequently.
o Referenced data could be unbounded.
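The same hypothetical data in referenced form: each comment becomes its own document pointing back to the post, so the comment set can grow without bound:

// Post document
{ "id": "post:1", "title": "Modelling data in Cosmos DB", "author": "alex" }

// Comment documents referencing the post
{ "id": "comment:17", "postId": "post:1", "user": "sam", "text": "Great post!" }
{ "id": "comment:18", "postId": "post:1", "user": "kim", "text": "Very helpful." }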
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 202
Where do I put the relationship?
We have dropped the unbounded collection on the publisher document.
Instead we just have a reference to the publisher on each book document.
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 203
The “Ladder” pattern
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 204
How do I model many-to-many relationships?
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 205
Hybrid data models
Pre-calculated aggregate values save expensive processing on read operations. In
the example, some of the data embedded in the author document is
calculated at run-time. Every time a new book is published, a book document is
created and the countOfBooks field is set to a calculated value based on the number of
book documents that exist for a particular author. This optimization suits
read-heavy systems where we can afford to do computations on writes in order to
optimize reads.
We could have just stuck with id and left the application to get any additional information
it needed from the respective author document using the "link", but because our
application displays the author's name and a thumbnail picture with every book
displayed, we can save a round trip to the server per book in a list by
denormalizing some data from the author.
Sure, if the author's name changed or they wanted to update their photo we'd have to
go and update every book they ever published, but for our application, based on the
assumption that authors don't change their names very often, this is an acceptable
design decision.
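A hypothetical sketch of both techniques (field names invented): the author document carries the pre-calculated count, and each book denormalizes the author's display data:

// Author document with a pre-calculated aggregate
{
  "id": "author:7",
  "name": "Ada Writer",
  "thumbnailUrl": "https://example.com/ada.jpg",
  "countOfBooks": 3
}

// Book document denormalizing the author's name and thumbnail
{
  "id": "book:42",
  "title": "Cosmic Modelling",
  "author": {
    "id": "author:7",
    "name": "Ada Writer",
    "thumbnailUrl": "https://example.com/ada.jpg"
  }
}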
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 206
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 207
17/07/2019 Big Data class by Alexandre Bergere 208
Architectures
Azure Cosmos DB - Change Feed Lab
17/07/2019 Big Data class by Alexandre Bergere 209
Cosmos DB & Spark
17/07/2019 Big Data class by Alexandre Bergere 210
Broadcast Real-time Updates from Cosmos DB with SignalR
Service and Azure Functions
17/07/2019 Big Data class by Alexandre Bergere 211
Advanced Analytics on big data architecture
17/07/2019 Big Data class by Alexandre Bergere 212
STRIIM FOR AZURE COSMOS DB
17/07/2019 Big Data class by Alexandre Bergere 213
Continuous, Real-Time Data Movement
Querying An Azure Cosmos DB Database using the SQL API
17/07/2019 Big Data class by Alexandre Bergere 214
https://cosmosdb.github.io/labs/dotnet/technical_deep_dive/03-querying_the_database_using_sql.html
Azure Data Factory
Azure Cosmos DB
Visual Studio Code
17/07/2019 Big Data class by Alexandre Bergere 215
Through examples
How Skype modernized its backend infrastructure using Azure
Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 216
Lessons learned
Looking back at the project, Kaduk recalls several “lessons learned.” These include:
o Use direct mode for better performance – How a client connects to Azure Cosmos DB has important performance implications, especially
with respect to observed client side latency. The team began by using the default Gateway Mode connection policy, but switched to a Direct
Mode connection policy because it delivers better performance.
o Learn how to write and handle stored procedures – With Azure Cosmos DB, transactions can only be implemented using stored
procedures—pieces of application logic that are written in JavaScript that are registered and executed against a collection as a single
transaction. (In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence, requests made within stored
procedures execute in the same scope of a database session, which enables Azure Cosmos DB to guarantee ACID for all operations that are
part of a single stored procedure.)
o Pay attention to query design – With Azure Cosmos DB, queries have a large impact in terms of RU consumption. Developers didn’t pay
much attention to query design at first, but soon found that RU costs were higher than desired. This led to an increased focus on optimizing
query design, such as using point document reads wherever possible and optimizing the query selections per API.
o Use the Azure Cosmos DB SDK 2.x to optimize connection usage – Within Azure Cosmos DB, the data stored in each region is distributed
across tens of thousands of physical partitions. To serve reads and writes, the Azure Cosmos DB client SDK must establish a connection with
the physical node hosting the partition. The team started by using the Azure Cosmos DB SDK 1.x, but found that its lack of support for
connection multiplexing led to excessive connection establishment and closing rates. Switching to the Azure Cosmos DB SDK 2.x, which
supports connection multiplexing, helped solve the problem, and also helped mitigate SNAT port exhaustion issues.
17/07/2019 Big Data class by Alexandre Bergere 217
Deeper
Cosmic notes
17/07/2019 Big Data class by Alexandre Bergere 218
Become an Azure Cosmonaut
17/07/2019 Big Data class by Alexandre Bergere 219

Weitere ähnliche Inhalte

Was ist angesagt?

Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Mariano Gonzalez
 
Webinar on MongoDB BI Connectors
Webinar on MongoDB BI ConnectorsWebinar on MongoDB BI Connectors
Webinar on MongoDB BI ConnectorsSumit Sarkar
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML Amazon Web Services
 
How data modelling helps serve billions of queries in millisecond latency wit...
How data modelling helps serve billions of queries in millisecond latency wit...How data modelling helps serve billions of queries in millisecond latency wit...
How data modelling helps serve billions of queries in millisecond latency wit...DataWorks Summit
 
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao Kamble
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao KambleGoDaddy Customer Success Dashboard Using Apache Spark with Baburao Kamble
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao KambleDatabricks
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014James Chittenden
 
The Scout24 Data Landscape Manifesto: Building an Opinionated Data Platform
The Scout24 Data Landscape Manifesto: Building an Opinionated Data PlatformThe Scout24 Data Landscape Manifesto: Building an Opinionated Data Platform
The Scout24 Data Landscape Manifesto: Building an Opinionated Data PlatformRising Media Ltd.
 
IBM Cognos Business Intelligence using dashDB
IBM Cognos Business Intelligence using dashDBIBM Cognos Business Intelligence using dashDB
IBM Cognos Business Intelligence using dashDBIBM Cloud Data Services
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglyTyler Wishnoff
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceCambridge Semantics
 
Syngenta's Predictive Analytics Platform for Seeds R&D
Syngenta's Predictive Analytics Platform for Seeds R&DSyngenta's Predictive Analytics Platform for Seeds R&D
Syngenta's Predictive Analytics Platform for Seeds R&DMichael Swanson
 
Airbyte - Seed deck
Airbyte  - Seed deckAirbyte  - Seed deck
Airbyte - Seed deckAirbyte
 
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."Gustavo Cuervo
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileRoy Kim
 
Net conf ar v2018 real time analytics
Net conf ar v2018 real time analyticsNet conf ar v2018 real time analytics
Net conf ar v2018 real time analyticsGaston Cruz
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopCCG
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleAdam Doyle
 

Was ist angesagt? (20)

Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
 
Webinar on MongoDB BI Connectors
Webinar on MongoDB BI ConnectorsWebinar on MongoDB BI Connectors
Webinar on MongoDB BI Connectors
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML
 
How data modelling helps serve billions of queries in millisecond latency wit...
How data modelling helps serve billions of queries in millisecond latency wit...How data modelling helps serve billions of queries in millisecond latency wit...
How data modelling helps serve billions of queries in millisecond latency wit...
 
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao Kamble
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao KambleGoDaddy Customer Success Dashboard Using Apache Spark with Baburao Kamble
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao Kamble
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014
 
Google and big query
Google and big queryGoogle and big query
Google and big query
 
The Scout24 Data Landscape Manifesto: Building an Opinionated Data Platform
The Scout24 Data Landscape Manifesto: Building an Opinionated Data PlatformThe Scout24 Data Landscape Manifesto: Building an Opinionated Data Platform
The Scout24 Data Landscape Manifesto: Building an Opinionated Data Platform
 
Practical advice to build a data driven company
Practical advice to build a data driven companyPractical advice to build a data driven company
Practical advice to build a data driven company
 
IBM Cognos Business Intelligence using dashDB
IBM Cognos Business Intelligence using dashDBIBM Cognos Business Intelligence using dashDB
IBM Cognos Business Intelligence using dashDB
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data Science
 
Big Data Application Architectures - IoT
Big Data Application Architectures - IoTBig Data Application Architectures - IoT
Big Data Application Architectures - IoT
 
Syngenta's Predictive Analytics Platform for Seeds R&D
Syngenta's Predictive Analytics Platform for Seeds R&DSyngenta's Predictive Analytics Platform for Seeds R&D
Syngenta's Predictive Analytics Platform for Seeds R&D
 
Airbyte - Seed deck
Airbyte  - Seed deckAirbyte  - Seed deck
Airbyte - Seed deck
 
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
Net conf ar v2018 real time analytics
Net conf ar v2018 real time analyticsNet conf ar v2018 real time analytics
Net conf ar v2018 real time analytics
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 

Ähnlich wie Big dataclasses 2019_nosql

how_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfhow_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfsarah david
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB
 
how_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxhow_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxsarah david
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryMárton Kodok
 
An introduction to MongoDB by César Trigo #OpenExpoDay 2014
An introduction to MongoDB by César Trigo #OpenExpoDay 2014An introduction to MongoDB by César Trigo #OpenExpoDay 2014
An introduction to MongoDB by César Trigo #OpenExpoDay 2014OpenExpoES
 
An introduction to MongoDB
An introduction to MongoDBAn introduction to MongoDB
An introduction to MongoDBCésar Trigo
 
How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists CCG
 
Introduction to mago3D, an Open Source Based Digital Twin Platform
Introduction to mago3D, an Open Source Based Digital Twin PlatformIntroduction to mago3D, an Open Source Based Digital Twin Platform
Introduction to mago3D, an Open Source Based Digital Twin PlatformSANGHEE SHIN
 
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoDB Database
 
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation FrameworkMongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation FrameworkMongoDB
 
Accelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data StrategyAccelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data StrategyMongoDB
 
Beyond the Basics 3: Introduction to the MongoDB BI Connector
Beyond the Basics 3: Introduction to the MongoDB BI ConnectorBeyond the Basics 3: Introduction to the MongoDB BI Connector
Beyond the Basics 3: Introduction to the MongoDB BI ConnectorMongoDB
 
Everything You Need to Know About MongoDB Development.pptx
Everything You Need to Know About MongoDB Development.pptxEverything You Need to Know About MongoDB Development.pptx
Everything You Need to Know About MongoDB Development.pptx75waytechnologies
 
Machine Learning for z/OS
Machine Learning for z/OSMachine Learning for z/OS
Machine Learning for z/OSCuneyt Goksu
 
MongoDB_Spark
MongoDB_SparkMongoDB_Spark
MongoDB_SparkMat Keep
 
Idc datadog-expands-into-apm
Idc datadog-expands-into-apmIdc datadog-expands-into-apm
Idc datadog-expands-into-apmBrett Sheppard
 
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...SANGHEE SHIN
 

Ähnlich wie Big dataclasses 2019_nosql (20)

how_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfhow_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdf
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
 
how_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxhow_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptx
 
MongoDB DOC v1.5
MongoDB DOC v1.5MongoDB DOC v1.5
MongoDB DOC v1.5
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
 
An introduction to MongoDB by César Trigo #OpenExpoDay 2014
An introduction to MongoDB by César Trigo #OpenExpoDay 2014An introduction to MongoDB by César Trigo #OpenExpoDay 2014
An introduction to MongoDB by César Trigo #OpenExpoDay 2014
 
An introduction to MongoDB
An introduction to MongoDBAn introduction to MongoDB
An introduction to MongoDB
 
How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists
 
Introduction to mago3D, an Open Source Based Digital Twin Platform
Introduction to mago3D, an Open Source Based Digital Twin PlatformIntroduction to mago3D, an Open Source Based Digital Twin Platform
Introduction to mago3D, an Open Source Based Digital Twin Platform
 
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
 
Introduction to mongodb
Introduction to mongodbIntroduction to mongodb
Introduction to mongodb
 
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation FrameworkMongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
 
Accelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data StrategyAccelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data Strategy
 
Beyond the Basics 3: Introduction to the MongoDB BI Connector
Beyond the Basics 3: Introduction to the MongoDB BI ConnectorBeyond the Basics 3: Introduction to the MongoDB BI Connector
Beyond the Basics 3: Introduction to the MongoDB BI Connector
 
Everything You Need to Know About MongoDB Development.pptx
Everything You Need to Know About MongoDB Development.pptxEverything You Need to Know About MongoDB Development.pptx
Everything You Need to Know About MongoDB Development.pptx
 
Machine Learning for z/OS
Machine Learning for z/OSMachine Learning for z/OS
Machine Learning for z/OS
 
MongoDB_Spark
MongoDB_SparkMongoDB_Spark
MongoDB_Spark
 
Idc datadog-expands-into-apm
Idc datadog-expands-into-apmIdc datadog-expands-into-apm
Idc datadog-expands-into-apm
 
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...
 
Dataweek-Talk-2014
Dataweek-Talk-2014Dataweek-Talk-2014
Dataweek-Talk-2014
 

Kürzlich hochgeladen

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Kürzlich hochgeladen (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Big dataclasses 2019_nosql

  • 1. 17/07/2019 Big Data class by Alexandre Bergere 1 Big Data ESAIP – IR4
  • 2. 17/07/2019 Big Data class by Alexandre Bergere 2 alexandre.bergere@gmail.com https://fr.linkedin.com/in/alexandrebergere @AlexPhile ESAIP 2013 - 2016 Avanade 2016 - 2019 Sr Anls, Data EngineeringStudent Worked as a senior analyst at Avanade France, I have developed my skills in data analysis (MSBI, Power BI, R, Python) by working on innovative projects and proofs of concept in the energy industry. ESAIP Teacher 2016 - ? Freelance 2019 - x Data Analyst & Data Architect
  • 3. 17/07/2019 Big Data class by Alexandre Bergere 3 Planning D-1 D-2 D-3 D-4 D-5 MorningAfternoon What’s Big Data + No SQL + Cloud Architecture Azure IOT + Azure Stream Analytics + Power BI Theorical AWS Practice Azure Practice Exam Oral Exam Written Exam SPARK SPARK Free time Prep. Oral Analyse Big Data with Hadoop SPARK Redshift Cosmos DB Serverless architecture : AWS Lambda + DynamoDB + NodeJS Cosmos DB SPARK On Prem Neo4J Mongo DB Cloud SPARK
  • 4. 17/07/2019 Big Data class by Alexandre Bergere 4 Planning D-1 D-2 D-3 MorningAfternoon What’s Big Data Azure IOT + Azure Stream Analytics + Power BI Theorical Azure Practice Cosmos DB SPARK On Prem Neo4J Mongo DB Cloud Cloud architecture Written Exam BI & Machine Learning Analyse Big Data with Hadoop
  • 5. 17/07/2019 Big Data class by Alexandre Bergere 6 Data Storage
  • 6. 17/07/2019 Big Data class by Alexandre Bergere 7 Data Storage Relational data store HDFS Key Value data store Columnar data store Object store Search data store Graph data store Document data store
  • 7. 17/07/2019 Big Data class by Alexandre Bergere 8 Mongo DB
  • 8. 17/07/2019 Big Data class by Alexandre Bergere 9 Mongo DB Created in 2007 & first release in 2010. Easy and simple … as a leaf. Document data store & Schemaless.
  • 9. Nexus Architecture 17/07/2019 Big Data class by Alexandre Bergere 10
  • 10. Driver & Framework 17/07/2019 Big Data class by Alexandre Bergere 11
  • 11. MongoDB is easy 17/07/2019 Big Data class by Alexandre Bergere 12 For many developers, data model goes hand in hand with object mapping, and for that purpose you may have used an object-relational mapping library, such as Java’s Hibernate framework or Ruby’s ActiveRecord. Such libraries can be useful for efficiently building applications with a RDBMS, but they’re less necessary with MongoDB. This is due in part to the fact that a document is already an object- like representation. It’s also partly due to the MongoDB drivers, which already provide a fairly high-level interface to MongoDB. Without question, you can build applications on MongoDB using the driver interface alone.
  • 12. Use cases 17/07/2019 Big Data class by Alexandre Bergere 13 o Web application (mongoDB is well-suited as primary datastore for web application) o Agile development o Analytics and logging o Caching o Variable Schemas
  • 13. Mongo DB 4.0 : ACID transactions 17/07/2019 Big Data class by Alexandre Bergere 14 More info. Bêta test.
  • 14. Mongo DB releases 17/07/2019 Big Data class by Alexandre Bergere 15
  • 15. Compagnies 17/07/2019 Big Data class by Alexandre Bergere 16
  • 16. Analytics – use case 17/07/2019 Big Data class by Alexandre Bergere 17 More info. The City of Chicago cuts crime and improves citizen welfare with a real-time geospatial analytics platform called WindyGrid. Using MongoDB, it analyzes data from 30+ different departments – like bus locations, 911 calls, and even tweets – to better understand and respond to emergencies.
  • 17. The case for adding NoSQL 17/07/2019 Big Data class by Alexandre Bergere 18 o Large volumes of rapidly changing structured, semi-structured, and unstructured data o Agile sprints, quick schema iteration, and frequent code pushes o API-driven, object-oriented programming that is easy to use and flexible o Geographically distributed scale-out architecture instead of expensive, monolithic architecture Consider, for example, enterprise resource planning (ERP), a standard for relational databases. What if you want to offer ERP forms users can actually modify if they need to? A document- based NoSQL database such as MongoDB can provide that functionality without requiring you to rebuild your whole data schema every time a user wants to change the data format.
  • 18. White papers 17/07/2019 Big Data class by Alexandre Bergere 19 MongoDB – BI & Analytics MongoDB – Kafka MongoDB – Spark
  • 19. Leader in The Forrester Wave™: Big Data NoSQL, Q1 2019 17/07/2019 Big Data class by Alexandre Bergere 20 o Data Types o Streaming and Loading o Big Data Support o In-memory o Performance o Scalability o High Availability & Disaster Recovery o Tools o Workloads o Use Cases o Ability to Execute o Road Map o Open Source and Licensing o Support
  • 20. 17/07/2019 Big Data class by Alexandre Bergere 21 Tools
  • 21. MongoDB Compass 17/07/2019 Big Data class by Alexandre Bergere 22
  • 22. Mongo DB Atlas 17/07/2019 Big Data class by Alexandre Bergere 23 DAAS : Database As A Service • Schema design • Query and index optimization • Server size selection - you must select the appropriate size of server, coupled with IO and storage capacity • Capacity planning - you must determine when you need additional capacity, typically using the monitoring telemetry provided by MongoDB Atlas, but you can make these changes with no downtime • Initiating database restores • How much you use
  • 23. Mongo DB Cloud Manager 17/07/2019 Big Data class by Alexandre Bergere 24
  • 24. Mongo DB Connector for BI 17/07/2019 Big Data class by Alexandre Bergere 25
  • 25. MongoDB Charts (beta) 17/07/2019 Big Data class by Alexandre Bergere 26 MongoDB Charts is the fastest and easiest way to build visualizations of MongoDB data.
  • 26. Architecture pseudo On premise 17/07/2019 Big Data class by Alexandre Bergere 27
  • 27. Change Streams 17/07/2019 Big Data class by Alexandre Bergere 28 More info. Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a collection and immediately react to them.
  • 28. Stitch 17/07/2019 Big Data class by Alexandre Bergere 29 Full access to MongoDB, declarative read/write controls, and integration with your choice of services MongoDB Stitch lets developers focus on building applications rather than on managing data manipulation code, service integration, or backend infrastructure. Whether you’re just starting up and want a fully managed backend as a service, or you’re part of an enterprise and want to expose existing MongoDB data to new applications, Stitch lets you focus on building the app users want, not on writing boilerplate backend logic.
  • 29. 17/07/2019 Big Data class by Alexandre Bergere 30 Modeling & request
  • 30. Document are rich data structure 17/07/2019 Big Data class by Alexandre Bergere 31 • JSON: • String, Number, Array, Object, NULL, Boolean. • BSON: • Date, BinData, ObjectID, Geo-Location. • Better storage performance. ObjectID: ◦ _id : 'DATE[4] | MAC_ADDR[3] | PID[2] | COUNTER[3]
  • 31. Available Types 17/07/2019 Big Data class by Alexandre Bergere 32 Type Number Alias Notes Double 1 “double” String 2 “string” Object 3 “object” Array 4 “array” Binary data 5 “binData” Undefined 6 “undefined” Deprecated. ObjectId 7 “objectId” Boolean 8 “bool” Date 9 “date” Null 10 “null” Regular Expression 11 “regex” DBPointer 12 “dbPointer” Deprecated. JavaScript 13 “javascript” Symbol 14 “symbol” Deprecated. JavaScript (with scope) 15 “javascriptWithScope” 32-bit integer 16 “int” Timestamp 17 “timestamp” 64-bit integer 18 “long” Decimal128 19 “decimal” New in version 3.4. Min key -1 “minKey” Max key 127 “maxKey”
  • 32. SQL vs MongoDB Terms 17/07/2019 Big Data class by Alexandre Bergere 33 SQL Terms/Concepts MongoDB Terms/Concepts Database Database Table Collection Line Document Column Field Index Index Join Embeded or linked document Primary key Primary key (start by « _id »)
  • 33. Documents are Flexible 17/07/2019 Big Data class by Alexandre Bergere 34
  • 34. Document Model 17/07/2019 Big Data class by Alexandre Bergere 35 Pers_ID Surname First_Name City 0 Miller Paul London 1 Ortega Alvaro Valencia 2 Huber Urs Zurich 3 Blanc Gaston Paris 4 Bertolini Fabrizio Rome Car_ID Model Year Value Pers_ID 101 Bently 1973 100000 0 102 Rolls Royce 1965 330000 0 103 Peugot 1993 500 3 104 Ferrari 2005 150000 4 105 Renault 1998 2000 3 106 Renault 2001 7000 3 107 Smart 1999 2000 2 CAR PERSON Mongo DB RDBMS
  • 35. One to many 17/07/2019 Big Data class by Alexandre Bergere 36
  • 36. CRUD 17/07/2019 Big Data class by Alexandre Bergere 37 # FIND() > db.<collection>.find({<conditions>},{<fields>}) > db.products.find( { qty: { $gt: 25 } }, { item: 1, qty: 1 } ) Options: .pretty() .sort() : 1: ASC, -1: DESC : sort({'name':-1}) .skip() : number .limit() : number .count() Chain sort first, skip second, and limit last, because that is the only order that makes sense.
  • 37. CRUD 17/07/2019 Big Data class by Alexandre Bergere 38 # INSERT() > db.<collection>.insert({<value>}) > db.<collection>.insertMany([{<values>}]) > db.inventory.insertMany([ { item: "journal", qty: 25, tags: ["blank", "red"], size: { h: 14, w: 21, uom: "cm" } }, { item: "mat", qty: 85, tags: ["gray"], size: { h: 27.9, w: 35.5, uom: "cm" } }, { item: "mousepad", qty: 25, tags: ["gel", "blue"], size: { h: 19, w: 22.85, uom: "cm" } } ]) db.collection.insertOne() – Inserts a single document into a collection. db.collection.insertMany() – Inserts multiple documents into a collection. db.collection.insert() – Inserts a single document or multiple documents into a collection.
  • 38. CRUD 17/07/2019 Big Data class by Alexandre Bergere 39 # UPDATE() > db.<collection>.update({<conditions>},{<update>},{upsert: true/false, multi: true/false}) > { "_id": "artist:271", "last_name": "Cotillard", "first_name": "Marion", "birth_date": "1975" } # Operator Update > db.artists.update({"_id": "artist:271"},{ $set : {"last_name" : "Page"}}) > { "_id": "artist:271", "last_name": "Page", "first_name": "Marion", "birth_date": "1975" } # Replacement Update > db.artists.update({"_id": "artist:271"},{"last_name" : "Page"}) > { "_id": "artist:271", "last_name": "Page"} ❑ Operator Update ❑ Replacement Update upsert: boolean – Optional. If set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found. multi: boolean – Optional. If set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false.
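For illustration, the same options through the pymongo driver (a sketch: update_one/update_many are the driver-side equivalents of the shell's upsert and multi flags; artist:999 and the birth_decade field are made up for the example):

from pymongo import MongoClient

artists = MongoClient()["crunchbase"]["artists"]

# upsert=True inserts the document if no _id matches
artists.update_one({"_id": "artist:999"},
                   {"$set": {"last_name": "Doe", "first_name": "Jane"}},
                   upsert=True)

# update_many is the driver equivalent of multi: true
artists.update_many({"birth_date": "1975"},
                    {"$set": {"birth_decade": "1970s"}})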
  • 39. CRUD 17/07/2019 Big Data class by Alexandre Bergere 40 # DELETE() > db.<collection>.remove ({<conditions>}) > db.artists.remove({"_id": "artist:39"}) # Remove all documents > db.artists.remove({})
  • 40. Query Operator 17/07/2019 Big Data class by Alexandre Bergere 41 Name Description $eq Matches values that are equal to a specified value. $gt Matches values that are greater than a specified value. $gte Matches values that are greater than or equal to a specified value. $lt Matches values that are less than a specified value. $lte Matches values that are less than or equal to a specified value. $ne Matches all values that are not equal to a specified value. $in Matches any of the values specified in an array.
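A short sketch combining these operators in one filter, using pymongo against the lab's artists collection (the value ranges and names are illustrative):

from pymongo import MongoClient

artists = MongoClient()["crunchbase"]["artists"]

# $gte/$lte bound a range; $in matches any value in the given list
born_in_70s = artists.find({"birth_date": {"$gte": "1970", "$lte": "1979"}})
named = artists.find({"first_name": {"$in": ["Marion", "Jonathan"]}})

for artist in named:
    print(artist["first_name"], artist.get("last_name"))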
  • 41. Update Operator: $set 17/07/2019 Big Data class by Alexandre Bergere 42 # $set > db.products.update( { _id: 100 }, { $set: { quantity: 500, details: { model: "14Q3", make: "xyz" }, tags: [ "coats", "outerwear", "clothing" ] } } ) # $set Embedded Documents > db.products.update( { _id: 100 }, { $set: { "details.make": "zzz" } } ) # $set in Arrays > db.products.update( { _id: 100 }, { $set: { "tags.1": "rain gear", "ratings.0.rating": 2 } } )
  • 42. Update Operators: Arrays 17/07/2019 Big Data class by Alexandre Bergere 43 Name Description $pull Removes all array elements that match a specified query. $push Adds an element to an array. $pop Removes the first or last item of an array. $addToSet Adds elements to an array only if they do not already exist in the set. $in Matches any of the values specified in an array (a query operator, often combined with $pull). A sketch of $addToSet and $pop follows this table.
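As mentioned above, here is a sketch of the two operators the labs do not exercise, $addToSet and $pop, again via pymongo (the hobby value is illustrative):

from pymongo import MongoClient

artists = MongoClient()["crunchbase"]["artists"]

# $addToSet appends only if "chess" is not already in the array
artists.update_one({"_id": "artist:280"}, {"$addToSet": {"hobbies": "chess"}})

# $pop removes one element from either end: 1 = last, -1 = first
artists.update_one({"_id": "artist:280"}, {"$pop": {"hobbies": 1}})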
  • 43. DML 17/07/2019 Big Data class by Alexandre Bergere 44 # Returns all databases > show dbs # The current database name: > db.getName() # Returns all collections in the current database: > db.getCollectionNames() # Returns a collection or a view object: > db.getCollection(name) # The current database connection: > db.getMongo() # Clear the console log: > cls # Return collection information: > db.getCollectionInfos({name: "name"})
  • 44. Command-line tools 17/07/2019 Big Data class by Alexandre Bergere 45 # Import multiple documents: > mongoimport -d crunchbase -c companies D:\MongoDB\src\companies.json # Import multiple documents from a JSON array: > mongoimport -d crunchbase -c companies D:\MongoDB\src\companies.json --jsonArray # Export > mongoexport -d crunchbase -c artists --out D:\MongoDB\artists.json Run these from the system shell, not inside a mongo shell session. Command Description mongodump mongodump is a utility for creating a binary export of the contents of a database. mongodump can export data from either mongod or mongos instances. mongorestore The mongorestore program loads data from either a binary database dump created by mongodump or the standard input (starting in version 3.0.0) into a mongod or mongos instance. mongostat This utility constantly polls MongoDB and the system to provide helpful stats, including the number of operations per second (inserts, queries, updates, deletes, and so on), the amount of virtual memory allocated, and the number of connections to the server. mongoperf Helps you understand the disk operations happening in a running MongoDB instance. mongotop Similar to top, this utility polls MongoDB and shows the amount of time it spends reading and writing data in each collection. mongosniff A wire-sniffing tool for viewing operations sent to the database. It essentially translates the BSON going over the wire to human-readable shell statements.
  • 45. $text 17/07/2019 Big Data class by Alexandre Bergere 46 # $text > db.articles.find( { $text: { $search: "coffee" } } ) $text performs a text search on the content of the fields indexed with a text index. A $text expression has the following syntax: # $text > { $text: { $search: <string>, $language: <string>, $caseSensitive: <boolean>, $diacriticSensitive: <boolean> } } # Create index first - You can index multiple fields for the text index: db.reviews.createIndex( { subject: "text", comments: "text" } )
  • 46. Schema Validation 17/07/2019 Big Data class by Alexandre Bergere 47 Implement data governance without sacrificing the agility that comes from a dynamic schema. With schema validation, developers and operations spend less time defining data quality controls in their applications, and instead delegate these tasks to the database.
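For example, a collection can be created with a $jsonSchema validator so the server rejects documents that miss required fields; a minimal sketch with pymongo (the contacts collection and its fields are invented for illustration):

from pymongo import MongoClient

db = MongoClient()["crunchbase"]

# documents that fail the schema are rejected on insert/update
db.create_collection("contacts", validator={
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name"],
        "properties": {
            "name":  {"bsonType": "string"},
            "phone": {"bsonType": "string"}
        }
    }
})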
  • 47. Aggregation 17/07/2019 Big Data class by Alexandre Bergere 48 Swiss Army knife Executes in native code o Written in C++ o JSON parameter Flexible, functional, simple o Operation pipeline o Computational expressions
  • 48. Pipeline operators 17/07/2019 Big Data class by Alexandre Bergere 49 Operator Description $match Filter documents $project Reshape documents $group Summarize documents $unwind Expand arrays in documents $sort Order documents $limit / $skip Paginate documents $redact Restrict documents $geoNear Proximity sort documents $let, $map Define variables
  • 49. $match 17/07/2019 Big Data class by Alexandre Bergere 50 # Matching field values > {$match: { language: "Russian" }} { title: "War and Peace", pages: 1440, language: "Russian" } # Matching with query operators > {$match: { pages: {$gt: 100} }} { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" }
  • 50. $project 17/07/2019 Big Data class by Alexandre Bergere 51 # Renaming and computing fields > {$project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" }} { _id: 375, avgChapterLength: 24.2222, lang: "English" } # Including & excluding fields > {$project: { _id: 0, title: 1, language: 1 }} { title: "Great Gatsby", language: "English" }
  • 51. $group 17/07/2019 Big Data class by Alexandre Bergere 52 # Collect distinct values > {$group: { _id: "$language", titles: {$addToSet: "$title"} }} { _id: "English", titles: ["Atlas Shrugged", "The Great Gatsby"] }, { _id: "Russian", titles: ["War and Peace"] } # Calculating averages, summing fields… > {$group: { _id: "$language", pages: {$sum: "$pages"}, books: {$sum: 1}, avgPages: {$avg: "$pages"} }} { _id: "Russian", pages: 1440, books: 1, avgPages: 1440 }
  • 52. $unwind 17/07/2019 Big Data class by Alexandre Bergere 53 # Expand an array into one document per element > {$unwind: "$subjects"} Input: { title: "The Great Gatsby", ISBN: "9762832930920323", subjects: [ "Long Island", "New York", "1920s" ] } Output: { title: "The Great Gatsby", ISBN: "9762832930920323", subjects: "Long Island" }, { title: "The Great Gatsby", ISBN: "9762832930920323", subjects: "New York" }, { title: "The Great Gatsby", ISBN: "9762832930920323", subjects: "1920s" }
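Putting the previous operators together, a full pipeline might look like this in pymongo (a sketch; the books collection is invented to mirror the documents used in the $match/$group/$unwind slides):

from pymongo import MongoClient

books = MongoClient()["crunchbase"]["books"]

pipeline = [
    {"$match":  {"pages": {"$gt": 100}}},          # filter documents
    {"$unwind": "$subjects"},                      # one document per subject
    {"$group":  {"_id": "$language",
                 "books":    {"$sum": 1},
                 "avgPages": {"$avg": "$pages"}}},  # summarize per language
    {"$sort":   {"books": -1}},                    # order by count, descending
]
for doc in books.aggregate(pipeline):
    print(doc)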
  • 53. 17/07/2019 Big Data class by Alexandre Bergere 54 LABS
  • 54. Installation 17/07/2019 Big Data class by Alexandre Bergere 55 Download & Install
  • 55. Instance 17/07/2019 Big Data class by Alexandre Bergere 56 Launch as a service: mongod --dbpath C:\Users\alexa\Documents\MongoDB\data Launch as a connection: mongo Options Shortcut --db -d --collection -c --username -u --password -p --host -h
  • 56. Request practice 17/07/2019 Big Data class by Alexandre Bergere 57 # 1.0 Load artists.json > mongoimport -d crunchbase -c artists --file C:\Users\alexa\Documents\Cours\MongoDB\2017-2018\src\artists.json --jsonArray --port 27018 # 1.1 Return first_name and birth_date for all artists born in 1964 > db.artists.find({"birth_date": "1964"},{"_id":0,"first_name":1, "birth_date":1}) # 1.2 Return all artists born after 1980 or whose first name begins with 'Chri' > db.artists.find({$or:[{"birth_date": {$gte:"1980"}},{"first_name":/^Chri/}]},{}) > db.artists.find({$or:[{"birth_date": {$gte:"1980"}},{"first_name":{$regex : /^Chri/}}]},{}) # 1.3 Return the 6th to the 9th artist, sorted by last name descending > db.artists.find().pretty().sort({"last_name":-1}).skip(5).limit(4) # 1.4 Insert the following artist (replace the id): {"_id": "artist:282", "last_name": "Bergere", "first_name": "Alexandre", "birth_date": "1992"} > db.artists.insert({ "_id": "artist:282", "last_name": "Bergere", "first_name": "Alexandre", "birth_date": "1992" })
  • 57. Request practice 17/07/2019 Big Data class by Alexandre Bergere 58 # 1.5 Set the first_name of the artist with id artist:266 to "Jonathan" > db.artists.update({"_id": "artist:266"},{$set:{"first_name":"Jonathan"}}) # 1.6 Add "golf" to artist:280's hobbies > db.artists.update({"_id": "artist:280"},{$push:{"hobbies":"golf"}}) # 1.7 Add "yoga" to artist:282's hobbies > db.artists.update({"_id": "artist:282"},{$push:{"hobbies":"yoga"}}) # 1.8 Remove "poney" and "photo" from artist:280's hobbies > db.artists.update({"_id": "artist:280"},{$pull:{"hobbies": {$in:["poney","photo"]}}})
  • 58. Request practice 17/07/2019 Big Data class by Alexandre Bergere 59 # Convert string to integer > db.artists.find({birth_date: {$exists: true}}).forEach(function(obj) { obj.birth_date = new NumberInt(obj.birth_date); db.artists.save(obj); });
  • 59. 17/07/2019 Big Data class by Alexandre Bergere 60 Go Deeper
  • 60. Support MongoDB in action, 2nd Edition docs.mongodb.com 17/07/2019 MongoDB class by Alexandre Bergere 61
  • 63. 17/07/2019 Big Data class by Alexandre Bergere 64
  • 64. 17/07/2019 Big Data class by Alexandre Bergere 65 Graph database
  • 65. What is a graph database? 17/07/2019 Big Data class by Alexandre Bergere 66 A graph database is an online database management system with Create, Read, Update and Delete (CRUD) operations working on a graph data model. Graph databases are generally built for use with online transaction processing (OLTP) systems. Accordingly, they are normally optimized for transactional performance, and engineered with transactional integrity and operational availability in mind. ~ Neo4j Unlike other databases, relationships take first priority in graph databases.
  • 66. The case for graph databases 17/07/2019 Big Data class by Alexandre Bergere 67
  • 67. What is a Graph? 17/07/2019 Big Data class by Alexandre Bergere 68 A graph is just a collection of vertices and edges—or, in less intimidating language, a set of nodes and the relationships that connect them.
  • 68. Definitions 17/07/2019 Big Data class by Alexandre Bergere 69 • Nodes o Nodes are the main data elements o Nodes are connected to other nodes via relationships o Nodes can have one or more properties (i.e., attributes stored as key/value pairs) o Nodes have one or more labels that describe their role in the graph o Example: Person nodes vs Car nodes • Relationships o Relationships connect two nodes o Relationships are directional o Nodes can have multiple, even recursive relationships o Relationships can have one or more properties (i.e., attributes stored as key/value pairs) • Properties o Properties are named values where the name (or key) is a string o Properties can be indexed and constrained o Composite indexes can be created from multiple properties • Labels o Labels are used to group nodes into sets o A node may have multiple labels o Labels are indexed to accelerate finding nodes in the graph o Native label indexes are optimized for speed
  • 69. Modelling relational to graph 17/07/2019 Big Data class by Alexandre Bergere 70 Similarities (Relational → Graph): Rows → Nodes; Joins → Relationships; Table names → Labels; Columns → Properties. How the relational model differs from the graph model: each column must have a field value → nodes with the same label aren't required to have the same set of properties; joins are calculated at query time → relationships are stored on disk when they are created; a row can belong to one table → a node can have many labels.
  • 70. RDBMS vs graph 17/07/2019 Big Data class by Alexandre Bergere 71
  • 71. 17/07/2019 Big Data class by Alexandre Bergere 72 Neo4j
  • 72. Neo4j Graph Platform 17/07/2019 Big Data class by Alexandre Bergere 73 The Neo4j Graph Platform includes out-of-the-box tooling that enables you to access graphs in Neo4j Databases. In addition, Neo4j provides APIs and drivers that enable you to create applications and custom tooling for accessing and visualizing graphs.
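As an illustration of the driver route mentioned above, here is a minimal sketch with the official neo4j Python driver (the bolt URI, credentials, and sample Person data are assumptions):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # placeholder credentials

with driver.session() as session:
    # parameters ($name, $born) are passed separately from the Cypher text
    session.run("CREATE (:Person {name: $name, born: $born})",
                name="Ada", born=1815)
    result = session.run("MATCH (p:Person) WHERE p.born < 1900 "
                         "RETURN p.name AS name")
    for record in result:
        print(record["name"])

driver.close()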
  • 73. Dev env. 17/07/2019 Big Data class by Alexandre Bergere 75 Neo4j Desktop: o Neo4j Database server o graph engine o kernel (Cypher execution) o Neo4j Browser o additional libraries and drivers for accessing the Neo4j database Neo4j Sandbox: o temporary, cloud-based instance of a Neo4j Server with its associated graph that you can access from any Web browser o available for three days, but you can extend it for up to 10 days o you can use Neo4j Browser Sync to save Cypher scripts from your sandbox
  • 74. Neo4j Browser 17/07/2019 Big Data class by Alexandre Bergere 76
  • 75. 17/07/2019 Big Data class by Alexandre Bergere 77 Introduction to Cypher
  • 76. What’s Cypher? 17/07/2019 Big Data class by Alexandre Bergere 78 Cypher is a declarative query language that allows for expressive and efficient querying and updating of graph data. Cypher uses ASCII art and focuses on the clarity of expressing what to retrieve from a graph. Cypher was inspired by SPARQL, SQL, Python, and Haskell.
  • 77. Node & Label 17/07/2019 Big Data class by Alexandre Bergere 79 () // anonymous node not be referenced later in the query (p) // variable p, a reference to a node used later (:Person) // anonymous node of type Person (p:Person) // p, a reference to a node of type Person (p:Actor:Director) // p, a reference to a node of types Actor and Director Examining the data model CALL db.schema
  • 78. Using MATCH to retrieve nodes 17/07/2019 Big Data class by Alexandre Bergere 80 MATCH (n) // returns all nodes in the graph RETURN n MATCH (p:Person) // returns all Person nodes in the graph RETURN p When you specify a pattern for a MATCH clause, you should always specify a node label if possible. In doing so, the graph engine uses an index to retrieve the nodes which will perform better than not using a label for the MATCH.
  • 79. Properties 17/07/2019 Big Data class by Alexandre Bergere 81 A property is defined for a node and not for a type of node. All nodes of the same type need not have the same properties. // Query the database for all property keys CALL db.propertyKeys MATCH (variable:Label {propertyKey: propertyValue, propertyKey2: propertyValue2}) RETURN variable MATCH (m:Movie {released: 2003, tagline: 'Free your mind'}) RETURN m
  • 80. Filtering queries using property values 17/07/2019 Big Data class by Alexandre Bergere 82 // Retrieve all Movie nodes that have a released property value of 2003. MATCH (m:Movie {released:2003}) RETURN m // Retrieve all Movies released in 2006, returning their titles MATCH (m:Movie {released: 2006}) RETURN m.title // Display title, released, and tagline values for every Movie node in the graph MATCH (m:Movie) RETURN m.title AS `movie title`, m.released AS released, m.tagline AS tagLine
  • 81. Relationships 17/07/2019 Big Data class by Alexandre Bergere 83 A relationship is a directed connection between two nodes that has a relationship type (name). In addition, a relationship can have properties, just like nodes. () // a node ()--() // 2 nodes have some type of relationship ()-->() // the first node has a relationship to the second node ()<--() // the second node has a relationship to the first node Here is how Cypher uses ASCII art to specify the path used for a query. Querying using relationships: MATCH (node1)-[:REL_TYPE]->(node2) RETURN node1, node2 MATCH (node1)-[:REL_TYPEA | :REL_TYPEB]->(node2) RETURN node1, node2 node1 is a specification of a node where you may include node labels and property values for filtering. :REL_TYPE is the type (name) for the relationship. For this syntax the relationship is from node1 to node2. :REL_TYPEA , :REL_TYPEB are the relationships from node1 to node2. The nodes are returned if at least one of the relationships exists. node2 is a specification of a node where you may include node labels and property values for filtering.
  • 82. Relationships 17/07/2019 Big Data class by Alexandre Bergere 84 Using patterns for queries: MATCH (p:Person)-[:FOLLOWS]->(:Person {name:'Angela Scope'}) RETURN p MATCH (p:Person)<-[:FOLLOWS]-(:Person {name:'Angela Scope'}) RETURN p
  • 83. Relationships 17/07/2019 Big Data class by Alexandre Bergere 85 Using patterns for queries: // Querying by any direction of the relationship MATCH (p1:Person)-[:FOLLOWS]-(p2:Person {name:'Angela Scope'}) RETURN p1, p2
  • 84. Relationships 17/07/2019 Big Data class by Alexandre Bergere 86 Using patterns for queries: // Traversing relationships : query to return all followers of the followers of Jessica Thompson. MATCH (p:Person)-[:FOLLOWS]->(:Person)-[:FOLLOWS]->(:Person {name:'Jessica Thompson'}) RETURN p // Traversing relationships : return each person along the path by specifying variables for the nodes and returning them MATCH path = (:Person)-[:FOLLOWS]->(:Person)-[:FOLLOWS]->(:Person {name:'Jessica Thompson'}) RETURN path
  • 85. Relationships 17/07/2019 Big Data class by Alexandre Bergere 87 Using a relationship in a query: MATCH (p:Person)-[rel:ACTED_IN]->(m:Movie {title: 'The Matrix'}) RETURN p, rel, m Variables: o p represents the Person nodes during the query o m represents the Movie node retrieved o rel represents the relationship for the relationship type, ACTED_IN Querying by multiple relationships: MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN|:DIRECTED]->(m:Movie) RETURN p.name, m.title
  • 86. Relationships 17/07/2019 Big Data class by Alexandre Bergere 88 Using anonymous nodes in a query: MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'The Matrix'}) RETURN p.name A best practice is to place named nodes (those with variables) before anonymous nodes in a MATCH clause. Using an anonymous relationship for a query: // find all people who are in any way connected to the movie MATCH (p:Person)-->(m:Movie {title: 'The Matrix'}) RETURN p, m MATCH (p:Person)--(m:Movie {title: 'The Matrix'}) RETURN p, m
  • 87. Relationships 17/07/2019 Big Data class by Alexandre Bergere 89 Retrieving the relationship types: MATCH (p:Person)-[rel]->(:Movie {title:'The Matrix'}) RETURN p.name, type(rel) Retrieving properties for relationships: MATCH (p:Person)-[:REVIEWED {rating: 65}]->(:Movie {title: 'The Da Vinci Code'}) RETURN p.name
  • 88. Filtering queries using relationships 17/07/2019 Big Data class by Alexandre Bergere 90 // Retrieve all people who wrote the movie Speed Racer MATCH (p:Person)-[:WROTE]->(:Movie {title: 'Speed Racer'}) RETURN p.name // Retrieve all movies that are connected to the person, Tom Hanks MATCH (m:Movie)<--(:Person {name: 'Tom Hanks'}) RETURN m.title or MATCH(:Person {name: 'Tom Hanks'})-->(m:Movie) RETURN m.title // Retrieve information about the relationships Tom Hanks has with the set of movies retrieved earlier MATCH (m:Movie)-[rel]-(:Person {name: 'Tom Hanks'}) RETURN m.title, type(rel) // Retrieve information about the roles that Tom Hanks acted in MATCH (m:Movie)-[rel:ACTED_IN]-(:Person {name: 'Tom Hanks'}) RETURN m.title, rel.roles
  • 89. Cypher style recommendations 17/07/2019 Big Data class by Alexandre Bergere 91 Here are the Neo4j-recommended Cypher coding standards: o Node labels are CamelCase and begin with an upper-case letter (examples: Person, NetworkAddress). Note that node labels are case-sensitive. o Property keys, variables, parameters, aliases, and functions are camelCase and begin with a lower-case letter (examples: businessAddress, title). Note that these elements are case-sensitive. o Relationship types are in upper-case and can use the underscore. (examples: ACTED_IN, FOLLOWS). Note that relationship types are case-sensitive and that you cannot use the “-” character in a relationship type. o Cypher keywords are upper-case (examples: MATCH, RETURN). Note that Cypher keywords are case-insensitive, but a best practice is to use upper-case. o String constants are in single quotes, unless the string contains a quote or apostrophe (examples: ‘The Matrix’, “Something’s Gotta Give”). Note that you can also escape single or double quotes within strings that are quoted with the same using a backslash character. o Specify variables only when needed for use later in the Cypher statement. o Place named nodes and relationships (that use variables) before anonymous nodes and relationships in your MATCH clauses when possible. o Specify anonymous relationships with -->, --, or <-- MATCH (:Person {name: 'Diane Keaton'})-[movRel:ACTED_IN]-> (:Movie {title:"Something's Gotta Give"}) RETURN movRel.roles Follow the Cypher Style Guide when writing your Cypher statements.
  • 90. 17/07/2019 Big Data class by Alexandre Bergere 92 Getting More Out of Queries
  • 91. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 93 MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released = 2008 RETURN p, m // complex conditions MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released >= 2003 AND m.released <= 2004 RETURN p, m // same as previous MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE 2003 <= m.released <= 2004 RETURN p.name, m.title, m.released // equivalent to the first query, filtering inline on the node pattern MATCH (p:Person)-[:ACTED_IN]->(m:Movie {released: 2008}) RETURN p, m
  • 92. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 94 // Testing labels MATCH (p:Person) RETURN p.name MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'The Matrix'}) RETURN p.name // the same queries, testing labels in the WHERE clause instead: MATCH (p) WHERE p:Person RETURN p.name MATCH (p)-[:ACTED_IN]->(m) WHERE p:Person AND m:Movie AND m.title='The Matrix' RETURN p.name
  • 93. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 95 // Testing the existence of a property MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE p.name='Jack Nicholson' AND exists(m.tagline) RETURN m.title, m.tagline // Testing strings : You can specify STARTS WITH, ENDS WITH, and CONTAINS MATCH (p:Person)-[:ACTED_IN]->() WHERE toLower(p.name) STARTS WITH 'michael' RETURN p.name // Testing with regular expressions; You use the syntax =~ MATCH (p:Person) WHERE p.name =~'Tom.*' RETURN p.name
  • 94. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 96 // Testing with patterns // exclude people who directed that movie MATCH (p:Person)-[:WROTE]->(m:Movie) WHERE NOT exists( (p)-[:DIRECTED]->() ) RETURN p.name, m.title // find Gene Hackman and the movies that he acted in with another person who also directed the movie MATCH (gene:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(other:Person) WHERE gene.name= 'Gene Hackman' AND exists( (other)-[:DIRECTED]->() ) RETURN gene, other, m
  • 95. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 97 // Testing with list values : elements of the list have to be the same type of data MATCH (p:Person) WHERE p.born IN [1965, 1970] RETURN p.name as name, p.born as yearBorn // You can also compare a value to an existing list in the graph. MATCH (p:Person)-[r:ACTED_IN]->(m:Movie) WHERE 'Neo' IN r.roles AND m.title='The Matrix' RETURN p.name There are a number of syntax elements of Cypher that we have not covered in this training. For example, you can specify CASE logic in your conditional testing for your WHERE clauses. You can learn more about these syntax elements in the Neo4j Cypher Manual and the Cypher Refcard.
  • 96. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 98 // Retrieve all actors that were born in the 70’s MATCH (a:Person) WHERE a.born >= 1970 AND a.born < 1980 RETURN a.name as Name, a.born as `Year Born` // Retrieve all movies released in 2000 by testing the node label and the released property, returning the movie titles MATCH (m) WHERE m:Movie AND m.released = 2000 and exists(m.released) RETURN m.title // Retrieve all people that wrote movies by testing the relationship between two nodes MATCH (a)-[rel]->(m) WHERE a:Person AND type(rel) = 'WROTE' AND m:Movie RETURN a.name as Name, m.title as Movie // Retrieve all people in the graph that do not have the property ‘born’ MATCH (a:Person) WHERE NOT exists(a.born) RETURN a.name as Name
  • 97. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 99 // Retrieve all people related to movies where the relationship has the rating property, then return their name, movie title, and the rating. MATCH (a:Person)-[rel]->(m:Movie) WHERE exists(rel.rating) RETURN a.name as Name, m.title as Movie, rel.rating as Rating // Retrieve all REVIEW relationships from the graph where the summary of the review contains the string fun, returning the movie title reviewed and the rating and summary of the relationship. MATCH (:Person)-[r:REVIEWED]->(m:Movie) WHERE toLower(r.summary) CONTAINS 'fun' RETURN m.title as Movie, r.summary as Review, r.rating as Rating // Retrieve all people who have produced a movie, but have not directed a movie MATCH (a:Person)-[:PRODUCED]->(m:Movie) WHERE NOT ((a)-[:DIRECTED]->(:Movie)) RETURN a.name, m.title // Retrieve the movies and their actors where one of the actors also directed the movie MATCH (a1:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a2:Person) WHERE exists( (a2)-[:DIRECTED]->(m) ) RETURN a1.name as Actor, a2.name as `Actor/Director`, m.title as Movie
  • 98. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 100 // Retrieve the movies that have an actor’s role that is the name of the movie MATCH (a:Person)-[r:ACTED_IN]->(m:Movie) WHERE m.title in r.roles RETURN m.title as Movie, a.name as Actor
  • 99. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 101 MATCH (a:Person)-[:ACTED_IN]->(m:Movie), (m:Movie)<-[:DIRECTED]-(d:Person) WHERE m.released = 2000 RETURN a.name, m.title, d.name Specifying multiple MATCH patterns This MATCH clause includes a pattern specified by two paths separated by a comma: MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person) WHERE m.released = 2000 RETURN a.name, m.title, d.name If possible, you should write the same query as follows:
  • 100. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 102 // retrieve the actors who acted in the same movies as Keanu Reeves, but not when Hugo Weaving acted in the same movie MATCH (keanu:Person)-[:ACTED_IN]->(movie:Movie)<-[:ACTED_IN]-(n:Person), (hugo:Person) WHERE keanu.name='Keanu Reeves' AND hugo.name='Hugo Weaving' AND NOT (hugo)-[:ACTED_IN]->(movie) RETURN n.name Specifying multiple MATCH patterns // Suppose we want to retrieve the movies that Meg Ryan acted in and their respective directors, as well as the other actors that acted in these movies. MATCH (meg:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person), (other:Person)-[:ACTED_IN]->(m) WHERE meg.name = 'Meg Ryan' RETURN m.title as movie, d.name AS director , other.name AS `co-actors`
  • 101. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 103 MATCH megPath = (meg:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person), (other:Person)-[:ACTED_IN]->(m) WHERE meg.name = 'Meg Ryan' RETURN megPath Setting path variables
  • 102. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 104 Specifying varying length paths // all of the followers of the followers of a Person MATCH (follower:Person)-[:FOLLOWS*2]->(p:Person) WHERE follower.name = 'Paul Blythe' RETURN p // Retrieve all paths of any length with the relationship, :RELTYPE from nodeA to nodeB and beyond: (nodeA)-[:RELTYPE*]->(nodeB) // Retrieve all paths of any length with the relationship, :RELTYPE from nodeA to nodeB or from nodeB to nodeA and beyond: (nodeA)-[:RELTYPE*]-(nodeB) // Retrieve the paths of length 3 with the relationship, (node1)-[:RELTYPE*3]->(node2) // Retrieve the paths of lengths 1, 2, or 3 with the relationship (node1)-[:RELTYPE*1..3]->(node2)
  • 103. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 105 Finding the shortest path MATCH p = shortestPath((m1:Movie)-[*]-(m2:Movie)) WHERE m1.title = 'A Few Good Men' AND m2.title = 'The Matrix' RETURN p A built-in function that you may find useful in a graph that has many ways of traversing the graph to get to the same node is the shortestPath() function. Using the shortest path between two nodes improves the performance of the query. When you use the shortestPath() function, the query editor will show a warning that this type of query could potentially run for a long time. You should heed the warning, especially for large graphs. Read the Graph Algorithms documentation about the shortest path algorithm. When you use shortestPath(), you can specify an upper limit for the length of the shortest path. In addition, you should aim to provide patterns for the from and to nodes that execute efficiently. For example, use labels and indexes.
  • 104. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 106 Specifying optional pattern matching MATCH (p:Person) WHERE p.name STARTS WITH 'James' OPTIONAL MATCH (p)-[r:REVIEWED]->(m:Movie) RETURN p.name, type(r), m.title OPTIONAL MATCH matches patterns with your graph, just like MATCH does. The difference is that if no matches are found, OPTIONAL MATCH will use NULLs for missing parts of the pattern. OPTIONAL MATCH could be considered the Cypher equivalent of the outer join in SQL.
  • 105. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 107 Collecting results // the list of movies that Tom Cruise acted in MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE p.name ='Tom Cruise' RETURN collect(m.title) AS `movies for Tom Cruise` Cypher has a built-in function, collect() that enables you to aggregate a value into a list.
  • 106. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 108 Aggregation in Cypher // implicitly groups by a.name and d.name MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d) RETURN a.name, d.name, count(*) // count the paths retrieved where an actor and director collaborated in a movie MATCH (actor:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(director:Person) RETURN actor.name, director.name, count(m) AS collaborations, collect(m.title) AS movies Aggregation in Cypher is different from aggregation in SQL. In Cypher, you need not specify a grouping key. As soon as an aggregation function is used, all non-aggregated result columns become grouping keys. The grouping is implicitly done, based upon the fields in the RETURN clause. There are more aggregating functions such as min() or max() that you can also use in your queries. These are described in the Aggregating Functions section of the Neo4j Cypher Manual.
  • 107. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 109 Additional processing using WITH // only return actors that have 2 or 3 movies MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WITH a, count(a) AS numMovies, collect(m.title) as movies WHERE numMovies > 1 AND numMovies < 4 RETURN a.name, numMovies, movies During the execution of a MATCH clause, you can specify that you want some intermediate calculations or values that will be used for further processing of the query, or for limiting the number of results before further processing is done. You use the WITH clause to perform intermediate processing or data flow operations. You have to name all expressions with an alias in a WITH that are not simple variables. // find all actors who have acted in at least five movies, and find (optionally) the movies they directed and return the person and those movies MATCH (p:Person) WITH p, size((p)-[:ACTED_IN]->(:Movie)) AS movies WHERE movies >= 5 OPTIONAL MATCH (p)-[:DIRECTED]->(m:Movie) RETURN p.name, m.title
  • 108. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 110 Additional processing using WITH // retrieves all actors that acted in movies, and collects the list of movies for any actor that acted in more than five movies. MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WITH p, collect(m) AS movies WHERE size(movies) > 5 RETURN p.name, movies
  • 109. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 111 // Write a Cypher query that retrieves all movies that Gene Hackman has acted it, along with the directors of the movies. In addition, retrieve the actors that acted in the same movies as Gene Hackman. Return the name of the movie, the name of the director, and the names of actors that worked with Gene Hackman. MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person), (a2:Person)-[:ACTED_IN]->(m) WHERE a.name = 'Gene Hackman' RETURN m.title as movie, d.name AS director , a2.name AS `co-actors` // Retrieve all Person nodes connected to James Thompson by the FOLLOWS relationship, in either direction MATCH (p1:Person)-[:FOLLOWS]-(p2:Person) WHERE p1.name = 'James Thompson' RETURN p1, p2 // Modify the query to retrieve nodes that are one and two hops away MATCH (p1:Person)-[:FOLLOWS*1..2]-(p2:Person) WHERE p1.name = 'James Thompson' RETURN p1, p2 // Modify the query to retrieve particular nodes that are connected no matter how many hops are required MATCH (p1:Person)-[:FOLLOWS*]-(p2:Person) WHERE p1.name = 'James Thompson' RETURN p1, p2
  • 110. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 112 // Retrieve all movie by collecting a list of all people who acted in it MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name as actor, collect(m.title) AS `movie list` // Retrieve all movies that Tom Cruise has acted in and the co-actors that acted in the same movie by collecting a list MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p2:Person) WHERE p.name ='Tom Cruise' RETURN m.title as movie, collect(p2.name) AS `co-actors` // Retrieve all people who reviewed a movie, returning the list of reviewers and how many reviewers reviewed the movie MATCH (p:Person)-[:REVIEWED]->(m:Movie) RETURN m.title as movie, count(p) as numReviews, collect(p.name) as reviewers // Retrieve all directors, their movies, and people who acted in the movies, returning the name of the director, the number of actors the director has worked with, and the list of actors. MATCH (d:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a:Person) RETURN d.name AS director, count(a) AS `number actors` , collect(a.name) AS `actors worked with`
  • 111. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 113 // Retrieve the movies that have at least 2 directors, and optionally the names of people who reviewed the movies. MATCH (m:Movie) WITH m, size((:Person)-[:DIRECTED]->(m)) AS directors WHERE directors >= 2 OPTIONAL MATCH (p:Person)-[:REVIEWED]->(m) RETURN m.title, p.name
  • 112. Controlling how results are returned 17/07/2019 Big Data class by Alexandre Bergere 114 Eliminating duplication MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie) WHERE p.name = 'Tom Hanks' RETURN m.released, collect(DISTINCT m.title) AS movies You have seen a number of query results where there is duplication in the results returned. In most cases, you want to eliminate duplicated results. You do so by using the DISTINCT keyword. Using WITH and DISTINCT to eliminate duplication MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie) WHERE p.name = 'Tom Hanks' WITH DISTINCT m RETURN m.released, m.title Another way that you can avoid duplication is to use WITH and DISTINCT together as follows:
  • 113. Controlling how results are returned 17/07/2019 Big Data class by Alexandre Bergere 115 Ordering results MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie) WHERE p.name = 'Tom Hanks' RETURN m.released, collect(DISTINCT m.title) AS movies ORDER BY m.released DESC If you want the results to be sorted, you specify the expression to use for the sort using the ORDER BY keyword and whether you want the order to be descending using the DESC keyword. Ascending order is the default.
  • 114. Controlling how results are returned 17/07/2019 Big Data class by Alexandre Bergere 116 Limiting the number of results MATCH (m:Movie) RETURN m.title as title, m.released as year ORDER BY m.released DESC LIMIT 10 Although you can filter queries to reduce the number of results returned, you may also want to limit the number of results.
  • 115. Controlling results returned 17/07/2019 Big Data class by Alexandre Bergere 117 // write a query to retrieve all actors that acted in movies during the 1990s, where you return the released date, the movie title, and the collected actor names for the movie. For now do not worry about duplication. MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released >= 1990 AND m.released < 2000 RETURN DISTINCT m.released, m.title, collect(a.name) // modify the query so that the released date records returned are not duplicated. To implement this, you must add the collection of the movie titles to the results returned. MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released >= 1990 AND m.released < 2000 RETURN m.released, collect(m.title), collect(a.name) // The results returned from the previous query returns the collection of movie titles with duplicates. That is because there are multiple actors per released year. Next, modify the query so that there is no duplication of the movies listed for a year. MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released >= 1990 AND m.released < 2000 RETURN m.released, collect(DISTINCT m.title), collect(a.name)
  • 116. Controlling results returned 17/07/2019 Big Data class by Alexandre Bergere 118 // Retrieve the top 5 ratings and their associated movies, returning the movie title and the rating. MATCH (:Person)-[r:REVIEWED]->(m:Movie) RETURN m.title AS movie, r.rating AS rating ORDER BY r.rating DESC LIMIT 5
  • 117. Working with Cypher data 17/07/2019 Big Data class by Alexandre Bergere 119 Unwinding lists // create a list with three elements, unwind the list and then return the values WITH [1, 2, 3] AS list UNWIND list AS row RETURN list, row There may be some situations where you want to perform the opposite of collecting results, but rather separate the lists into separate rows. This functionality is done using the UNWIND clause. The UNWIND clause is frequently used when importing data into a graph.
  • 118. Working with Cypher data 17/07/2019 Big Data class by Alexandre Bergere 120 Dates MATCH (actor:Person)-[:ACTED_IN]->(:Movie) WHERE exists(actor.born) // calculate the age with DISTINCT actor, date().year - actor.born as age RETURN actor.name, age as `age today` ORDER BY actor.born DESC Cypher has a built-in date() function, as well as other temporal values and functions that you can use to calculate temporal values. You use a combination of numeric, temporal, spatial, list and string functions to calculate values that are useful to your application. For example, suppose you wanted to calculate the age of a Person node, given a year they were born (the born property must exist and have a value).
  • 119. Working with Cypher data 17/07/2019 Big Data class by Alexandre Bergere 121 // Modify the query you just wrote so that before the query processing ends, you unwind the list of movies and then return the name of the actor and the title of the associated movie MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WITH p, collect(m) AS movies WHERE size(movies) > 5 WITH p, movies UNWIND movies AS movie RETURN p.name, movie.title // retrieves all movies that Tom Hanks acted in, returning the title of the movie, the year the movie was released, the number of years ago that the movie was released, and the age of Tom when the movie was released MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE a.name = 'Tom Hanks' RETURN m.title, m.released, date().year - m.released as yearsAgoReleased, m.released - a.born AS `age of Tom` ORDER BY yearsAgoReleased
  • 120. 17/07/2019 Big Data class by Alexandre Bergere 122 Go further
  • 121. Neo4j Bookshelf 17/07/2019 Big Data class by Alexandre Bergere 123
  • 122. Resources 17/07/2019 Big Data class by Alexandre Bergere 124 resources: blog:
  • 123. Training & Certification 17/07/2019 Big Data class by Alexandre Bergere 125
  • 124. Labs 17/07/2019 Big Data class by Alexandre Bergere 126
  • 125. GraphGists 17/07/2019 Big Data class by Alexandre Bergere 127
  • 127. Azure Cosmos DB 17/07/2019 Big Data class by Alexandre Bergere 129 A globally distributed, massively scalable, multi-model database service
  • 128. Global Distribution 17/07/2019 Big Data class by Alexandre Bergere 130 Policy-based geo-fencing. Dynamically add and remove regions. Failover priorities. Dynamically configurable read and write regions. Geo-local reads and writes. 99.99% SLA for read availability. Database designed for modern web and mobile applications, which are (typically) global applications in nature.
  • 129. Multi-Master 17/07/2019 Big Data class by Alexandre Bergere 131 Improved write latency for end users Improved write scalability and write throughput Better support for disconnected environments (for example, edge devices) Load balancing
  • 130. Consistency 17/07/2019 Big Data class by Alexandre Bergere 133 Consistency Level – Guarantees:
Strong – Linearizability (once an operation is complete, it will be visible to all).
Bounded Staleness – Consistent prefix. Reads lag behind writes by at most k prefixes or t interval. Similar properties to strong consistency (except within the staleness window), while preserving 99.99% availability and low latency.
Session – Consistent prefix. Within a session: monotonic reads, monotonic writes, read-your-writes, write-follows-reads. Predictable consistency for a session, high read throughput + low latency.
Consistent Prefix – Reads will never see out-of-order writes (no gaps).
Eventual – Potential for out-of-order reads. Lowest cost for reads of all consistency levels.
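With the Python SDK, the consistency level can be relaxed per client relative to the account default; a sketch with the azure-cosmos package (the endpoint and key are placeholders):

from azure.cosmos import CosmosClient

# requests from this client use Session consistency
# (it can be equal to or weaker than the account's default level)
client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<primary-key>",
                      consistency_level="Session")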
  • 131. COMPREHENSIVE SLAs 17/07/2019 Big Data class by Alexandre Bergere 134 RUN YOUR APP ON WORLD-CLASS INFRASTRUCTURE Azure Cosmos DB is the only service with financially-backed SLAs for millisecond latency at the 99th percentile, 99.999% HA and guaranteed throughput and consistency. Latency: <10 ms at the 99th percentile. HA: 99.999%. Throughput: guaranteed. Consistency: guaranteed.
  • 132. Trust your data to industry-leading Security & Compliance 17/07/2019 Big Data class by Alexandre Bergere 135 Azure is the world’s most trusted cloud, with more certifications than any other cloud provider. • Enterprise grade security • Encryption at Rest • Encryption is enabled automatically by default • Comprehensive Azure compliance certification
  • 133. Throughput 17/07/2019 Big Data class by Alexandre Bergere 136 Request unit calculator Request unit considerations: Item size, Item property count, Data consistency, Indexed properties, Document indexing, Script usage. The currency of Azure Cosmos DB is the request unit (RU). With request units, you don't need to reserve read/write capacities or provision CPU, memory, and IOPS.
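Throughput is provisioned when a container is created; a sketch with the azure-cosmos Python SDK (the database and container names are invented, and 400 RU/s is the provisioned minimum for a container):

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<primary-key>")
database = client.create_database_if_not_exists("demo")

# offer_throughput reserves request units per second for the container
container = database.create_container_if_not_exists(
    id="tickets",
    partition_key=PartitionKey(path="/id"),
    offer_throughput=400)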
  • 134. Serverless database 17/07/2019 Big Data class by Alexandre Bergere 137 Serverless computing is all about the ability to focus on individual pieces of logic that are repeatable and stateless. o no infrastructure management. o consume resources only for the seconds, or milliseconds, they run for. Azure Cosmos DB trigger to invoke an Azure Function Use an input binding to get data from Azure Cosmos DB Use an output binding to write data to Azure Cosmos DB
  • 136. Cosmos DB Change Feed 17/07/2019 Big Data class by Alexandre Bergere 140
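The change feed can also be read directly (the pull model) from the Python SDK; a sketch assuming the demo/tickets container created earlier, and hedging on the exact keyword arguments of query_items_change_feed:

from azure.cosmos import CosmosClient

container = (CosmosClient("https://<account>.documents.azure.com:443/",
                          credential="<primary-key>")
             .get_database_client("demo")
             .get_container_client("tickets"))

# reading the feed from the beginning yields items in change order
for item in container.query_items_change_feed(is_start_from_beginning=True):
    print(item["id"])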
  • 137. 17/07/2019 Big Data class by Alexandre Bergere 141 Use cases
  • 138. Top 10 reasons why customers use Azure Cosmos DB 17/07/2019 Big Data class by Alexandre Bergere 142 Different types of data; multi-tenancy and enterprise-grade security; turnkey global distribution; mission critical; massive storage/throughput; scalability to optimize for speed and cost; 5 well-defined consistency models; analytics-ready; event-driven architectures; single-digit millisecond latency at the 99th percentile worldwide; big data; high availability and reliability.
  • 139. Powering global solutions 17/07/2019 Big Data class by Alexandre Bergere 143 Azure Cosmos DB was built to support modern app patterns and use cases. It enables industry-leading organizations to unlock the value of data, and respond to global customers and changing business dynamics in real-time. Data distributed and available globally: puts data where your users are. Build real-time customer experiences: enable latency-sensitive personalization, bidding, and fraud detection. Ideal for gaming, IoT & eCommerce: predictable and fast service, even during traffic spikes. Simplified development with serverless architecture: fully-managed event-driven micro-services with elastic computing power. Run Spark analytics over operational data: accelerate insights from fast, global data. Lift and shift NoSQL data: lift and shift MongoDB and Cassandra workloads.
  • 140. Data distributed and available globally 17/07/2019 Big Data class by Alexandre Bergere 144 Put your data where your users are to give real-time access and uninterrupted service to customers anywhere in the world. o Turnkey global data replication across all Azure regions o Guaranteed low-latency experience for global users o Resiliency for high availability and disaster recovery
  • 141. Build Real-Time Customer experiences 17/07/2019 Big Data class by Alexandre Bergere 145 Offer latency-sensitive applications with personalization, bidding, and fraud-detection. o Machine learning models generate real-time recommendations across product catalogues o Product analysis in milliseconds o Low-latency ensures high app performance worldwide o Tunable consistency models for rapid insight (Diagram: online recommendations service on the hot path; offline recommendations engine on the cold path.)
  • 142. Ideal for gaming, IoT and ecommerce 17/07/2019 Big Data class by Alexandre Bergere 146 Maintain service quality during high-traffic periods requiring massive scale and performance. o Instant, elastic scaling handles traffic bursts o Uninterrupted global user experience o Low-latency data access and processing for large and changing user bases o High availability across multiple data centers
  • 143. Massive Scale Telemetry Stores for IOT 17/07/2019 Big Data class by Alexandre Bergere 147 Diverse and unpredictable IoT sensor workloads require a responsive data platform o Seamless handling of any data output or volume o Data made available immediately, and indexed automatically o High writes per second, with stable ingestion and query performance
  • 144. Simplified development with serverless architecture 17/07/2019 Big Data class by Alexandre Bergere 148 Experience decreased time-to-market, enhanced scalability, and freedom from framework management with event-driven micro-services. o Seamless handling of any data output or volume o Data made available immediately, and indexed automatically o High writes per second, with stable ingestion and query performance o Real-time, resilient change feeds logged forever and always accessible o Native integration with Azure Functions
  • 145. Run spark over operational data 17/07/2019 Big Data class by Alexandre Bergere 149 Accelerate analysis of fast-changing, high-volume, global data. o Real-time big data processing across any data model o Machine learning at scale over globally-distributed data o Speeds analytical queries with automatic indexing and push-down predicate filtering o Native integration with Spark Connector
  • 146. Lift and shift NoSQL apps 17/07/2019 Big Data class by Alexandre Bergere 150 Make data modernization easy with seamless lift and shift migration of NoSQL workloads to the cloud. o Azure Cosmos DB APIs for MongoDB and Cassandra bring app data from anywhere to Azure Cosmos DB o Leverage existing tools, drivers, and libraries, and continue using existing apps’ current SDKs o Turnkey geo-replication o No infrastructure or VM management required
  • 147. Retail and marketing 17/07/2019 Big Data class by Alexandre Bergere 151
  • 148. 17/07/2019 Big Data class by Alexandre Bergere 152 Model
  • 149. Document Data Model 17/07/2019 Big Data class by Alexandre Bergere 153 “Because at the end of the day, it’s all just keys and values – not just the key-value data model, but all these data models.” “When it comes to actually building applications – well, that’s the developer’s job, and this is where the decision of which data model to choose comes into play.” Document SQL API (JSON) MongoDB API Graph Gremlin API (graph traversal language) Key-Value Table API (replaces Azure Table Storage) Columnar Cassandra API
  • 150. Atom Record Sequence (ARS) 17/07/2019 Big Data class by Alexandre Bergere 154 Your data is always stored as ARS – or Atom Record Sequence – a Microsoft creation that defines the persistence layer for key-value pairs. Switching Between Data Models: choosing an API = choosing a data model.
  • 151. Switching Between Data Models 17/07/2019 Big Data class by Alexandre Bergere 155 Each data model is merely a projection of the same underlying ARS format, and so eventually you will be able to create a single account, and then switch freely between different APIs within the account. So that then, you’ll be able to access one database as graph, key-value, document, or columnar, all at once. Future release ?
  • 152. Resource Model 17/07/2019 Big Data class by Alexandre Bergere 156
  • 153. Resource Model 17/07/2019 Big Data class by Alexandre Bergere 157 Account Database Container Item User Permission
  • 155. Resource Model 17/07/2019 Big Data class by Alexandre Bergere 159 Account Database Container Item = Collection Graph Table
  • 156. Handle any data with no schema or indexing required 17/07/2019 Big Data class by Alexandre Bergere 160 Azure Cosmos DB’s schema-less service automatically indexes all your data, regardless of the data model, to deliver blazing-fast queries.
Item | Color | Microwave safe | Liquid capacity | CPU | Memory | Storage
Geek mug | Graphite | Yes | 16oz | ??? | ??? | ???
Coffee Bean mug | Tan | No | 12oz | ??? | ??? | ???
Surface book | Gray | ??? | ??? | 3.4 GHz Intel Skylake Core i7-6600U | 16GB | 1 TB SSD
o Automatic index management o Synchronous auto-indexing o No schemas or secondary indices needed o Works across every data model
  • 157. Index 17/07/2019 Big Data class by Alexandre Bergere 161 Schema-agnostic, automatic indexing o Automatically index every property of every record without having to define schemas and indices upfront. o No need for schema and index management o Works across every data model o Latch-free data structure for a highly write-optimized database engine o Multiple index types: Hash, range, and geospatial
  • 158. Index POLICIES 17/07/2019 Big Data class by Alexandre Bergere 162 CUSTOM INDEXING POLICIES Though all Azure Cosmos DB data is indexed by default, you can specify a custom indexing policy for your collections. Custom indexing policies allow you to design and customize the shape of your index while maintaining schema flexibility. o Define trade-offs between storage, write and query performance, and query consistency o Include or exclude documents and paths to and from the index o Configure various index types
{
  "automatic": true,
  "indexingMode": "Consistent",
  "includedPaths": [{
    "path": "/*",
    "indexes": [
      { "kind": "Hash",    "dataType": "String", "precision": -1 },
      { "kind": "Range",   "dataType": "Number", "precision": -1 },
      { "kind": "Spatial", "dataType": "Point" }
    ]
  }],
  "excludedPaths": [{ "path": "/nonIndexedContent/*" }]
}
  • 159. Resource Model in Cosmos DB 17/07/2019 Big Data class by Alexandre Bergere 163
  • 160. 17/07/2019 Big Data class by Alexandre Bergere 164 SQL QUERY SYNTAX
  • 161. SQL SYNTAX 17/07/2019 Big Data class by Alexandre Bergere 165 Using the popular query language SQL to access semi-structured JSON data. This module references querying in the context of the SQL API for Azure Cosmos DB.
  • 162. SQL QUERY SYNTAX 17/07/2019 Big Data class by Alexandre Bergere 166 BASIC QUERY SYNTAX The SELECT & FROM keywords are the basic components of every query. > SELECT tickets.id, tickets.pricePaid FROM tickets > SELECT t.id, t.pricePaid FROM tickets t
  • 163. SQL QUERY SYNTAX - WHERE 17/07/2019 Big Data class by Alexandre Bergere 167 FILTERING WHERE supports complex scalar expressions including arithmetic, comparison and logical operators > SELECT tickets.id, tickets.pricePaid FROM tickets WHERE tickets.pricePaid > 500.00 AND tickets.pricePaid <= 1000.00
  • 164. SQL QUERY SYNTAX - PROJECTION 17/07/2019 Big Data class by Alexandre Bergere 168 FILTERING If your workloads require a specific JSON schema, Azure Cosmos DB supports JSON projection within its queries > SELECT { "id": tickets.id, "flightNumber": tickets.assignedFlight.flightNumber, "purchase": { "cost": tickets.pricePaid }, "stops": [ tickets.assignedFlight.origin, tickets.assignedFlight.destination ] } AS ticket FROM tickets
  • 165. SQL QUERY SYNTAX - PROJECTION 17/07/2019 Big Data class by Alexandre Bergere 169 FILTERING If your workloads require a specific JSON schema, Azure Cosmos DB supports JSON projection within its queries > SELECT VALUE { "id": tickets.id, "flightNumber": tickets.assignedFlight.flightNumber, "purchase": { "cost": tickets.pricePaid }, "stops": [ tickets.assignedFlight.origin, tickets.assignedFlight.destination ] } FROM tickets
  • 166. INTRA-DOCUMENT JOIN 17/07/2019 Big Data class by Alexandre Bergere 170 Azure Cosmos DB supports intra-document JOIN’s for de-normalized arrays Let’s assume that we have two JSON documents in a collection: { "pricePaid": 575.5, "assignedFlight": { "number": "F125", "origin": "SEA", "destination": "JFK" }, "seat": “12A", "requests": [ "kosher_meal", "aisle_seat" ], "id": "6ebe1165836a" } { "pricePaid": 234.75, "assignedFlight": { "number": "F752", "origin": "SEA", "destination": "LGA" }, "seat": "14C", "requests": [ "early_boarding", "window_seat" ], "id": "c4991b4d2efc" }
  • 167. INTRA-DOCUMENT JOIN 17/07/2019 Big Data class by Alexandre Bergere 171 We can filter on a particular array index position without JOIN: > SELECT tickets.assignedFlight.number, tickets.seat, tickets.requests FROM tickets WHERE tickets.requests[1] = "aisle_seat" [ { "number":"F125","seat":"12A", "requests": [ "kosher_meal", "aisle_seat" ] } ]
  • 168. INTRA-DOCUMENT JOIN 17/07/2019 Big Data class by Alexandre Bergere 172 JOIN allows us to merge embedded documents or arrays across multiple documents and returned a flattened result set: > SELECT tickets.assignedFlight.number, tickets.seat, requests FROM tickets JOIN requests IN tickets.requests [ { "number":"F125","seat":"12A", "requests":"kosher_meal" }, { "number":"F125","seat":"12A", "requests":"aisle_seat" }, { "number":"F752","seat":"14C", "requests":"early_boarding" }, { "number":"F752","seat":"14C", "requests":"window_seat" } ]
• 169. INTRA-DOCUMENT JOIN 17/07/2019 Big Data class by Alexandre Bergere 173 Along with JOIN, we can also filter the cross products without knowing the array index position: > SELECT tickets.id, requests FROM tickets JOIN requests IN tickets.requests WHERE requests IN ("aisle_seat", "window_seat") [ { "id":"6ebe1165836a", "requests": "aisle_seat" }, { "id":"c4991b4d2efc", "requests": "window_seat" } ]
  • 170. 17/07/2019 Big Data class by Alexandre Bergere 174 Tools
• 171. Cosmos DB Emulator 17/07/2019 Big Data class by Alexandre Bergere 175 The Azure Cosmos DB Emulator provides a local environment that emulates the Azure Cosmos DB service for development purposes. Using the emulator, you can develop and test your application locally, without creating an Azure subscription or incurring any costs. When you're satisfied with how your application works in the emulator, you can switch to using an Azure Cosmos DB account in the cloud.
At this time, the Data Explorer in the emulator only fully supports SQL API collections and MongoDB collections. Table, Graph, and Cassandra containers are not fully supported.
The emulator provides a high-fidelity emulation of the Azure Cosmos DB service. It supports identical functionality to Azure Cosmos DB, including creating and querying JSON documents, provisioning and scaling collections, and executing stored procedures and triggers. You can develop and test applications against the emulator, then deploy them to Azure at global scale by changing a single configuration value: the connection endpoint for Azure Cosmos DB.
By default, the emulator runs on the local machine ("localhost"), listening on port 8081.
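A minimal connection sketch, assuming the @azure/cosmos SDK; the endpoint is the emulator default from the slide, and the key shown is the emulator's documented well-known development key (identical for every install, not a secret):
// Sketch: connect to the local Cosmos DB Emulator (assumes @azure/cosmos v3).
const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({
  endpoint: "https://localhost:8081",
  // Well-known emulator key published in the Microsoft docs.
  key: "C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
});
// Note: the emulator uses a self-signed certificate, so Node may need the
// certificate imported (or TLS verification relaxed for local development only).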
• 172. Azure Cosmos DB: Data migration tools 17/07/2019 Big Data class by Alexandre Bergere 176 Data Migration Tools o SQL API o Mongo DB API o Table API o Graph API o Cassandra API
  • 173. Cosmos DB Explorer 17/07/2019 Big Data class by Alexandre Bergere 177 With Cosmos DB Explorer you can: o Take advantage of the full screen real estate for your queries and results. o Access your database account and collections with a connection string, without needing access to the Azure subscription or portal. o Share query results with authorized peers who do not have Azure portal access. o Work with Cosmos DB data without having to download any desktop tools locally. https://cosmos.azure.com/
  • 174. Azure Cosmos DB – Interface demo 17/07/2019 Big Data class by Alexandre Bergere 178
• 175. Azure Cosmos DB – SQL Query Exercise 17/07/2019 Big Data class by Alexandre Bergere 179 Add data using Data Explorer: https://docs.microsoft.com/en-ie/learn/modules/access-data-with-cosmos-db-and-sql-api/3-add-data Explore SQL query types: https://docs.microsoft.com/en-ie/learn/modules/access-data-with-cosmos-db-and-sql-api/4-query-types
• 176. 17/07/2019 Big Data class by Alexandre Bergere 180 Add Cosmos DB to your architecture
  • 177. Partitioning 17/07/2019 Big Data class by Alexandre Bergere 181
• 178. 17/07/2019 Big Data class by Alexandre Bergere 182 Stored procedures & UDFs
• 179. Stored Procedures 17/07/2019 Big Data class by Alexandre Bergere 183 BENEFITS o Familiar programming language o Atomic Transactions o Built-in Optimizations o Business Logic Encapsulation Stored procedures perform complex transactions on documents and properties. Stored procedures are written in JavaScript and are stored in a container on Azure Cosmos DB. By running stored procedures in the database engine, close to the data, you can improve performance over client-side programming. Stored procedures are the only way to achieve atomic transactions within Azure Cosmos DB; the client-side SDKs do not support transactions. Performing batch operations in stored procedures is also recommended because it reduces the need to create separate transactions.
• 180. Simple Stored Procedure 17/07/2019 Big Data class by Alexandre Bergere 184
function createSampleDocument(documentToCreate) {
  var context = getContext();
  var collection = context.getCollection();
  // Queue the write; the callback fires once the document is created.
  var accepted = collection.createDocument(
    collection.getSelfLink(),
    documentToCreate,
    function (error, documentCreated) {
      if (error) throw error;
      // Return the new document's id to the caller.
      context.getResponse().setBody(documentCreated.id);
    }
  );
  // If the server refused the request (time limit approaching), end gracefully.
  if (!accepted) return;
}
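A hedged sketch of registering and calling this procedure from the client (assuming the @azure/cosmos v3 scripts API; the document and partition key value are hypothetical examples):
// Sketch: register and execute the stored procedure (assumes @azure/cosmos v3).
async function runSproc(container) {
  await container.scripts.storedProcedures.create({
    id: "createSampleDocument",
    body: createSampleDocument.toString() // the function defined on the slide
  });
  const doc = { id: "sample-1", assignedFlight: { number: "F125" } };
  // A stored procedure executes inside a single partition, so the partition
  // key value of the target documents must be supplied.
  const { resource: createdId } = await container
    .scripts.storedProcedure("createSampleDocument")
    .execute("F125", [doc]);
  console.log("created document id:", createdId);
}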
• 181. Multi-DOCUMENT Transactions 17/07/2019 Big Data class by Alexandre Bergere 185 DATABASE TRANSACTIONS In a typical database, a transaction can be defined as a sequence of operations performed as a single logical unit of work. Each transaction provides ACID guarantees. In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence, requests made within stored procedures and triggers execute in the same scope of a database session. Within one transaction a stored procedure can: o Create new documents o Query the collection o Update existing documents o Delete existing documents Stored procedures utilize snapshot isolation to guarantee all reads within the transaction see a consistent snapshot of the data.
  • 182. Bounded Execution 17/07/2019 Big Data class by Alexandre Bergere 186 EXECUTION WITHIN TIME BOUNDARIES All Azure Cosmos DB operations must complete within the server-specified request timeout duration. If an operation does not complete within that time limit, the transaction is rolled back. HELPER BOOLEAN VALUE All functions under the collection object (for create, read, replace, and delete of documents and attachments) return a Boolean value that represents whether that operation will complete: o If true, the operation is expected to complete o If false, the time limit will soon be reached and your function should end execution as soon as possible.
• 183. Transaction Continuation Model 17/07/2019 Big Data class by Alexandre Bergere 187 CONTINUING LONG-RUNNING TRANSACTIONS o JavaScript functions can implement a continuation-based model to batch/resume execution o The continuation value can be any value of your own choosing. This value can then be used by your applications to resume a transaction from a new "starting point" Flow: bulk create documents → try to create each document → observe the return value → if the time limit is near, return a "pointer" to resume later; otherwise, done.
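A minimal sketch of this pattern, assuming the standard server-side JavaScript API (getContext/createDocument); the continuation value here is simply the index of the next document to insert:
// Sketch: continuation-based bulk create inside a stored procedure.
function bulkCreate(docs, startIndex) {
  var context = getContext();
  var collection = context.getCollection();
  var index = startIndex || 0;

  tryCreateNext();

  function tryCreateNext() {
    if (index >= docs.length) {
      // All documents created: no continuation needed.
      context.getResponse().setBody({ done: true, nextIndex: index });
      return;
    }
    var accepted = collection.createDocument(
      collection.getSelfLink(),
      docs[index],
      function (error) {
        if (error) throw error;
        index++;
        tryCreateNext();
      }
    );
    if (!accepted) {
      // Time limit approaching: return a "pointer" so the client can
      // call the procedure again with startIndex = nextIndex.
      context.getResponse().setBody({ done: false, nextIndex: index });
    }
  }
}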
• 184. Control Flow 17/07/2019 Big Data class by Alexandre Bergere 188 JAVASCRIPT CONTROL FLOW Stored procedures allow you to naturally express control flow, variable scoping, assignment, and integration of exception-handling primitives with database transactions, directly in the JavaScript programming language. ES6 PROMISES ES6 promises can be used within Azure Cosmos DB stored procedures. Unfortunately, promises "swallow" exceptions by default, so it is recommended to use callbacks instead of ES6 promises.
• 185. Stored Procedure Control Flow 17/07/2019 Big Data class by Alexandre Bergere 189
function createTwoDocuments(docA, docB) {
  var context = getContext();
  var coll = context.getCollection();
  var collLink = coll.getSelfLink();
  var firstDocId;
  var aAccepted = coll.createDocument(collLink, docA, docACallback);
  if (!aAccepted) return;

  function docACallback(error, created) {
    if (error) throw error;
    firstDocId = created.id; // remember doc A's id for the final response
    var bAccepted = coll.createDocument(collLink, docB, docBCallback);
    if (!bAccepted) return;
  }

  function docBCallback(error, created) {
    if (error) throw error;
    context.getResponse().setBody({ firstDocId: firstDocId, secondDocId: created.id });
  }
}
• 186. Rolling Back Transactions 17/07/2019 Big Data class by Alexandre Bergere 190 TRANSACTION ROLL-BACK Inside a JavaScript function, all operations are automatically wrapped under a single transaction: o If the function completes without any exception, all data changes are committed o If any exception is thrown from the script, Azure Cosmos DB's JavaScript runtime rolls back the whole transaction. Transaction scope: create new document, query collection, update existing document, delete existing document; if an exception occurs, all changes are undone.
• 187. Transaction ROLLBACK in Stored Procedure 17/07/2019 Big Data class by Alexandre Bergere 191
collection.createDocument(
  collection.getSelfLink(),
  documentToCreate,
  function (error, documentCreated) {
    if (error) throw "Unable to create document, aborting...";
  }
);
collection.replaceDocument(
  documentToReplace._self,
  replacementDocument,
  function (error, documentReplaced) {
    if (error) throw "Unable to update document, aborting...";
  }
);
• 188. User-defined Functions 17/07/2019 Big Data class by Alexandre Bergere 192 UDF User-defined functions (UDFs) are used to extend the Azure Cosmos DB SQL API's query language grammar and implement custom business logic. UDFs can only be called from inside queries; they do not have access to the context object and are meant to be used as compute-only code.
• 189. User-Defined Function Definition 17/07/2019 Big Data class by Alexandre Bergere 193
var taxUdf = {
  id: "tax",
  serverScript: function tax(income) {
    if (income == undefined) throw 'no input';
    if (income < 1000) return income * 0.1;
    else if (income < 10000) return income * 0.2;
    else return income * 0.4;
  }
}
  • 190. User-Defined Function USAGE in Queries 17/07/2019 Big Data class by Alexandre Bergere 194 > SELECT * FROM TaxPayers t WHERE udf.tax(t.income) > 20000
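As a sketch, the UDF defined above would be registered through the SDK before a query can reference it (assuming the @azure/cosmos v3 scripts API; taxUdf is the variable from the definition slide):
// Sketch: register the "tax" UDF, then call it as udf.tax(...) in a query (assumes @azure/cosmos v3).
async function registerAndQueryTaxUdf(container) {
  await container.scripts.userDefinedFunctions.create({
    id: "tax",
    body: taxUdf.serverScript.toString()
  });
  const { resources } = await container.items
    .query("SELECT * FROM TaxPayers t WHERE udf.tax(t.income) > 20000")
    .fetchAll();
  return resources;
}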
  • 191. Create multiple Cosmos DB triggers 17/07/2019 Big Data class by Alexandre Bergere 195
• 192. 17/07/2019 Big Data class by Alexandre Bergere 196 Modelling Data
• 193. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 197 Embedded data "The guiding premise when normalizing data is to avoid storing redundant data on each record and rather refer to data." Embedding data
• 194. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 198 Embedded data When to embed: o There are contains relationships between entities. o There are one-to-few relationships between entities. o There is embedded data that changes infrequently. o There is embedded data that won't grow without bound. o There is embedded data that is integral to data in a document.
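A hypothetical example of an embedded (denormalized) document: contact details live inside the person document because they are few, stable, and integral to it:
{
  "id": "person-1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "addresses": [
    { "line1": "100 Some Street", "city": "Seattle", "zip": "98012" }
  ],
  "contactDetails": [
    { "email": "thomas@andersen.com" },
    { "phone": "+1 555 555-5555", "extension": 5555 }
  ]
}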
  • 195. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 199 Referenced data The problem with this example is that the comments array is unbounded, meaning that there is no (practical) limit to the number of comments any single post can have. Referencing data
  • 196. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 200 Referenced data
  • 197. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 201 Referenced data When to reference: o Representing one-to-many relationships. o Representing many-to-many relationships. o Related data changes frequently. o Referenced data could be unbounded.
  • 198. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 202 Where do I put the relationship? We have dropped the unbounded collection on the publisher document. Instead we just have a reference to the publisher on each book document.
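A hypothetical sketch of that shape: the publisher document stays small, and each book carries a reference to its publisher instead of the publisher holding an unbounded list of book ids:
{ "id": "mspress", "type": "publisher", "name": "Microsoft Press" }
{ "id": "1", "type": "book", "name": "Modelling Documents", "publisherId": "mspress" }
{ "id": "2", "type": "book", "name": "Querying JSON", "publisherId": "mspress" }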
  • 199. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 203 The “Ladder” pattern
  • 200. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 204 How do I model many:many relationships?
• 201. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 205 Hybrid data models Pre-calculated aggregate values save expensive processing on read operations. In the example, some of the data embedded in the author document is calculated at run-time: every time a new book is published, a book document is created and the countOfBooks field is set to a calculated value based on the number of book documents that exist for that author. This optimization suits read-heavy systems, where we can afford to do computations on writes in order to optimize reads. We could have stuck with just the id and left the application to fetch any additional information it needed from the respective author document using the "link". But because our application displays the author's name and a thumbnail picture with every book, we can save a round trip to the server per book in a list by denormalizing some data from the author. Sure, if the author's name changed or they wanted to update their photo, we would have to go and update every book they ever published; but for our application, based on the assumption that authors don't change their names very often, this is an acceptable design decision.
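A hypothetical hybrid-model sketch: the author document carries the pre-calculated countOfBooks, and each book denormalizes only the author fields the UI needs (the names and URL are placeholders):
{ "id": "author-1", "name": "Thomas Andersen", "thumbnailUrl": "https://example.com/thomas.jpg", "countOfBooks": 3 }
{ "id": "book-1", "title": "Modelling Documents", "authors": [ { "id": "author-1", "name": "Thomas Andersen", "thumbnailUrl": "https://example.com/thomas.jpg" } ] }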
  • 202. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 206
  • 203. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 207
  • 204. 17/07/2019 Big Data class by Alexandre Bergere 208 Architectures
  • 205. Azure Cosmos DB - Change Feed Lab 17/07/2019 Big Data class by Alexandre Bergere 209
  • 206. Cosmos DB & Spark 17/07/2019 Big Data class by Alexandre Bergere 210
  • 207. Broadcast Real-time Updates from Cosmos DB with SignalR Service and Azure Functions 17/07/2019 Big Data class by Alexandre Bergere 211
  • 208. Advanced Analytics on big data architecture 17/07/2019 Big Data class by Alexandre Bergere 212
  • 209. STRIIM FOR AZURE COSMOS DB 17/07/2019 Big Data class by Alexandre Bergere 213 Continuous, Real-Time Data Movement
• 210. Querying An Azure Cosmos DB Database using the SQL API 17/07/2019 Big Data class by Alexandre Bergere 214 https://cosmosdb.github.io/labs/dotnet/technical_deep_dive/03-querying_the_database_using_sql.html Tools: Azure Data Factory, Azure Cosmos DB, Visual Studio Code
  • 211. 17/07/2019 Big Data class by Alexandre Bergere 215 Through examples
• 212. How Skype modernized its backend infrastructure using Azure Cosmos DB 17/07/2019 Big Data class by Alexandre Bergere 216 Lessons learned Looking back at the project, Kaduk recalls several "lessons learned." These include: o Use direct mode for better performance – How a client connects to Azure Cosmos DB has important performance implications, especially with respect to observed client-side latency. The team began by using the default Gateway Mode connection policy, but switched to a Direct Mode connection policy because it delivers better performance. o Learn how to write and handle stored procedures – With Azure Cosmos DB, transactions can only be implemented using stored procedures: pieces of application logic, written in JavaScript, that are registered and executed against a collection as a single transaction. (In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence, requests made within stored procedures execute in the same scope of a database session, which enables Azure Cosmos DB to guarantee ACID for all operations that are part of a single stored procedure.) o Pay attention to query design – With Azure Cosmos DB, queries have a large impact on RU consumption. Developers didn't pay much attention to query design at first, but soon found that RU costs were higher than desired. This led to an increased focus on optimizing query design, such as using point document reads wherever possible and optimizing the query selections per API. o Use the Azure Cosmos DB SDK 2.x to optimize connection usage – Within Azure Cosmos DB, the data stored in each region is distributed across tens of thousands of physical partitions. To serve reads and writes, the Azure Cosmos DB client SDK must establish a connection with the physical node hosting the partition. The team started by using the Azure Cosmos DB SDK 1.x, but found that its lack of support for connection multiplexing led to excessive connection establishment and closing rates. Switching to the Azure Cosmos DB SDK 2.x, which supports connection multiplexing, helped solve the problem and also helped mitigate SNAT port exhaustion issues.
  • 213. 17/07/2019 Big Data class by Alexandre Bergere 217 Deeper
  • 214. Cosmic notes 17/07/2019 Big Data class by Alexandre Bergere 218
• 215. Become an Azure Cosmonaut 17/07/2019 Big Data class by Alexandre Bergere 219