17/07/2019 Big Data class by Alexandre Bergere 1
Big Data
ESAIP – IR4
17/07/2019 Big Data class by Alexandre Bergere 2
alexandre.bergere@gmail.com
https://fr.linkedin.com/in/alexandrebergere
@AlexPhile
ESAIP
2013 - 2016
Student
Avanade
2016 - 2019
Sr Analyst, Data Engineering
As a senior analyst at Avanade France, I developed my skills in data
analysis (MSBI, Power BI, R, Python) by working on innovative projects
and proofs of concept in the energy industry.
ESAIP
Teacher
2016 - ?
Freelance
2019 - x
Data Analyst & Data Architect
17/07/2019 Big Data class by Alexandre Bergere 3
Planning
D-1 D-2 D-3 D-4 D-5
Morning / Afternoon
What’s Big Data
+
No SQL
+
Cloud Architecture
Azure IOT + Azure Stream
Analytics + Power BI
Theoretical AWS Practice Azure Practice Exam
Oral Exam
Written Exam
SPARK
SPARK
Free time
Prep. Oral
Analyse Big Data with
Hadoop
SPARK
Redshift
Cosmos DB
Serverless architecture :
AWS Lambda +
DynamoDB + NodeJS
Cosmos DB
SPARK
On Prem
Neo4J
Mongo DB
Cloud
SPARK
17/07/2019 Big Data class by Alexandre Bergere 4
Planning
D-1 D-2 D-3
Morning / Afternoon
What’s Big Data
Azure IOT + Azure Stream
Analytics + Power BI
Theoretical Azure Practice
Cosmos DB
SPARK
On Prem
Neo4J
Mongo DB
Cloud
Cloud architecture
Written Exam
BI & Machine Learning
Analyse Big Data with
Hadoop
17/07/2019 Big Data class by Alexandre Bergere 6
Data Storage
17/07/2019 Big Data class by Alexandre Bergere 7
Data Storage
Relational data store HDFS Key Value data store Columnar data store
Object store Search data store Graph data store Document data store
17/07/2019 Big Data class by Alexandre Bergere 8
Mongo DB
17/07/2019 Big Data class by Alexandre Bergere 9
Mongo DB
Created in 2007 & first released
in 2009.
Easy and simple … as a leaf.
Document data store &
Schemaless.
Nexus Architecture
17/07/2019 Big Data class by Alexandre Bergere 10
Driver & Framework
17/07/2019 Big Data class by Alexandre Bergere 11
MongoDB is easy
17/07/2019 Big Data class by Alexandre Bergere 12
For many developers, data model goes hand in hand with object mapping, and for that purpose
you may have used an object-relational mapping library, such as Java’s Hibernate framework or
Ruby’s ActiveRecord.
Such libraries can be useful for efficiently building applications with a RDBMS, but they’re less
necessary with MongoDB. This is due in part to the fact that a document is already an object-
like representation. It’s also partly due to the MongoDB drivers, which already provide a fairly
high-level interface to MongoDB. Without question, you can build applications on MongoDB
using the driver interface alone.
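For example, a minimal sketch with the official Node.js driver (the connection string, database, and collection names here are assumptions for illustration):

const { MongoClient } = require('mongodb');

async function main() {
  // connect to a local mongod on the default port
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const users = client.db('app').collection('users');
  // the object is persisted as a document directly: no mapping layer required
  await users.insertOne({ name: 'Ada', skills: ['math', 'programming'] });
  const ada = await users.findOne({ name: 'Ada' });
  console.log(ada.skills); // [ 'math', 'programming' ]
  await client.close();
}

main();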
Use cases
17/07/2019 Big Data class by Alexandre Bergere 13
o Web application (MongoDB is well-suited as a primary datastore for web applications)
o Agile development
o Analytics and logging
o Caching
o Variable Schemas
Mongo DB 4.0 : ACID transactions
17/07/2019 Big Data class by Alexandre Bergere 14
More info.
Beta test.
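A minimal sketch of a multi-document transaction in the mongo shell (MongoDB 4.0+ on a replica set; the bank database, accounts collection, and _id values are assumptions for illustration):

# Transfer 100 between two accounts atomically
> var session = db.getMongo().startSession()
> var accounts = session.getDatabase("bank").accounts
> session.startTransaction()
> accounts.updateOne({ _id: "A" }, { $inc: { balance: -100 } })
> accounts.updateOne({ _id: "B" }, { $inc: { balance: 100 } })
> session.commitTransaction()
# On error, session.abortTransaction() rolls both writes back
> session.endSession()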
Mongo DB releases
17/07/2019 Big Data class by Alexandre Bergere 15
Companies
17/07/2019 Big Data class by Alexandre Bergere 16
Analytics – use case
17/07/2019 Big Data class by Alexandre Bergere 17
More info.
The City of Chicago cuts crime and improves citizen
welfare with a real-time geospatial analytics platform
called WindyGrid. Using MongoDB, it analyzes data
from 30+ different departments – like bus locations,
911 calls, and even tweets – to better understand and
respond to emergencies.
The case for adding NoSQL
17/07/2019 Big Data class by Alexandre Bergere 18
o Large volumes of rapidly changing structured, semi-structured, and unstructured data
o Agile sprints, quick schema iteration, and frequent code pushes
o API-driven, object-oriented programming that is easy to use and flexible
o Geographically distributed scale-out architecture instead of expensive, monolithic
architecture
Consider, for example, enterprise resource planning (ERP), a standard for relational databases.
What if you want to offer ERP forms users can actually modify if they need to? A document-
based NoSQL database such as MongoDB can provide that functionality without requiring you
to rebuild your whole data schema every time a user wants to change the data format.
White papers
17/07/2019 Big Data class by Alexandre Bergere 19
MongoDB – BI &
Analytics
MongoDB – Kafka MongoDB – Spark
Leader in The Forrester Wave™: Big Data NoSQL, Q1 2019
17/07/2019 Big Data class by Alexandre Bergere 20
o Data Types
o Streaming and Loading
o Big Data Support
o In-memory
o Performance
o Scalability
o High Availability & Disaster
Recovery
o Tools
o Workloads
o Use Cases
o Ability to Execute
o Road Map
o Open Source and Licensing
o Support
17/07/2019 Big Data class by Alexandre Bergere 21
Tools
MongoDB Compass
17/07/2019 Big Data class by Alexandre Bergere 22
Mongo DB Atlas
17/07/2019 Big Data class by Alexandre Bergere 23
DBaaS: Database as a Service
• Schema design
• Query and index optimization
• Server size selection: you must select the appropriate size of server, coupled with IO and storage capacity
• Capacity planning: you must determine when you need additional capacity, typically using the monitoring telemetry provided by MongoDB Atlas, but you can make these changes with no downtime
• Initiating database restores
• You pay for how much you use
Mongo DB Cloud Manager
17/07/2019 Big Data class by Alexandre Bergere 24
Mongo DB Connector for BI
17/07/2019 Big Data class by Alexandre Bergere 25
MongoDB Charts
(beta)
17/07/2019 Big Data class by Alexandre Bergere 26
MongoDB Charts is the fastest and
easiest way to build visualizations of
MongoDB data.
Pseudo on-premises architecture
17/07/2019 Big Data class by Alexandre Bergere 27
Change Streams
17/07/2019 Big Data class by Alexandre Bergere 28
More info.
Change streams allow applications to access real-time data changes without the complexity and risk of
tailing the oplog. Applications can use change streams to subscribe to all data changes on a collection and
immediately react to them.
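A minimal sketch in the mongo shell (MongoDB 3.6+, replica set required; the artists collection is an assumption):

# Subscribe to all changes on a collection
> var cursor = db.artists.watch()
> while (cursor.hasNext()) { printjson(cursor.next()) }
# Each returned document describes one change, e.g. operationType: "insert"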
Stitch
17/07/2019 Big Data class by Alexandre Bergere 29
Full access to MongoDB, declarative read/write
controls, and integration with your choice of services
MongoDB Stitch lets developers focus on building applications rather than on managing data manipulation code, service
integration, or backend infrastructure. Whether you’re just starting up and want a fully managed backend as a service, or
you’re part of an enterprise and want to expose existing MongoDB data to new applications, Stitch lets you focus on
building the app users want, not on writing boilerplate backend logic.
17/07/2019 Big Data class by Alexandre Bergere 30
Modeling & querying
Documents are rich data structures
17/07/2019 Big Data class by Alexandre Bergere 31
• JSON:
• String, Number, Array, Object, NULL, Boolean.
• BSON:
• Date, BinData, ObjectID, Geo-Location.
• Better storage performance.
ObjectID:
◦ _id : DATE[4] | MAC_ADDR[3] | PID[2] | COUNTER[3]
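The embedded creation date can be read back directly in the shell:

> var id = ObjectId()   # generated client-side, e.g. at insert time
> id.getTimestamp()     # returns the creation time as an ISODate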
Available Types
17/07/2019 Big Data class by Alexandre Bergere 32
Type Number Alias Notes
Double 1 “double”
String 2 “string”
Object 3 “object”
Array 4 “array”
Binary data 5 “binData”
Undefined 6 “undefined” Deprecated.
ObjectId 7 “objectId”
Boolean 8 “bool”
Date 9 “date”
Null 10 “null”
Regular Expression 11 “regex”
DBPointer 12 “dbPointer” Deprecated.
JavaScript 13 “javascript”
Symbol 14 “symbol” Deprecated.
JavaScript (with scope) 15 “javascriptWithScope”
32-bit integer 16 “int”
Timestamp 17 “timestamp”
64-bit integer 18 “long”
Decimal128 19 “decimal” New in version 3.4.
Min key -1 “minKey”
Max key 127 “maxKey”
SQL vs MongoDB Terms
17/07/2019 Big Data class by Alexandre Bergere 33
SQL Terms/Concepts MongoDB Terms/Concepts
Database Database
Table Collection
Row Document
Column Field
Index Index
Join Embedded or linked document
Primary key Primary key (the _id field)
Documents are Flexible
17/07/2019 Big Data class by Alexandre Bergere 34
Document Model
17/07/2019 Big Data class by Alexandre Bergere 35
Pers_ID Surname First_Name City
0 Miller Paul London
1 Ortega Alvaro Valencia
2 Huber Urs Zurich
3 Blanc Gaston Paris
4 Bertolini Fabrizio Rome
Car_ID Model Year Value Pers_ID
101 Bentley 1973 100000 0
102 Rolls Royce 1965 330000 0
103 Peugeot 1993 500 3
104 Ferrari 2005 150000 4
105 Renault 1998 2000 3
106 Renault 2001 7000 3
107 Smart 1999 2000 2
CAR
PERSON
Mongo DB
RDBMS
One to many
17/07/2019 Big Data class by Alexandre Bergere 36
CRUD
17/07/2019 Big Data class by Alexandre Bergere 37
# FIND()
> db.<collection>.find ({<conditions>},{<fields>})
> db.products.find( { qty: { $gt: 25 } }, { item: 1, qty: 1 } )
Options:
.pretty()
.sort() : 1 : ASC, -1 : DESC : sort({"name": -1})
.skip() : number
.limit() : number
.count()
sort first, skip second, and limit last, because that is the only order that makes
sense.
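For instance, chaining the options on a cursor (collection and field names as in the labs):

# 6th to 10th artists by descending last name
> db.artists.find().sort({"last_name": -1}).skip(5).limit(5)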
CRUD
17/07/2019 Big Data class by Alexandre Bergere 38
# INSERT()
> db.<collection>.insert ({<value>})
> db.<collection>.insertMany([{<values>}])
> db.inventory.insertMany([
{ item: "journal", qty: 25, tags: ["blank", "red"], size: { h: 14, w: 21, uom: "cm" } },
{ item: "mat", qty: 85, tags: ["gray"], size: { h: 27.9, w: 35.5, uom: "cm" } },
{ item: "mousepad", qty: 25, tags: ["gel", "blue"], size: { h: 19, w: 22.85, uom: "cm" } }
])
db.collection.insertOne() Inserts a single document into a collection.
db.collection.insertMany() Inserts multiple documents into a collection.
db.collection.insert() Inserts a single document or multiple documents into a collection.
CRUD
17/07/2019 Big Data class by Alexandre Bergere 39
# UPDATE()
> db.<collection>.update
({<conditions>},{<fields>},{upsert:true/false},{multi:true/false}
)
> { "_id": "artist:271", "last_name": "Cotillard", "first_name": "Marion", "birth_date": "1975" }
# Operator Update
> db.artists.update({"_id": "artist:271"},{ $set : {"last_name" : "Page"}})
> { "_id": "artist:271", "last_name": "Page", "first_name": "Marion", "birth_date": "1975" }
# Replacement Update
> db.artists.update({"_id": "artist:271"},{"last_name" : "Page"})
> { "_id": "artist:271", "last_name": "Page"}
❑ Operator Update
❑ Replacement Update
upsert: boolean. Optional. If set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found.
multi: boolean. Optional. If set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false.
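A short sketch of both options (the filter values are assumptions):

# upsert: insert the document if nothing matches the filter
> db.artists.update({"_id": "artist:999"}, {$set: {"last_name": "Doe"}}, {upsert: true})
# multi: apply the $set to every matching document
> db.artists.update({"birth_date": "1975"}, {$set: {"decade": "70s"}}, {multi: true})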
CRUD
17/07/2019 Big Data class by Alexandre Bergere 40
# DELETE()
> db.<collection>.remove ({<conditions>})
> db.artists.remove({"_id": "artist:39"})
# Remove all documents
> db.artists.remove({})
Query Operator
17/07/2019 Big Data class by Alexandre Bergere 41
Name Description
$eq Matches values that are equal to a specified value.
$gt Matches values that are greater than a specified value.
$gte Matches values that are greater than or equal to a specified value.
$lt Matches values that are less than a specified value.
$lte Matches values that are less than or equal to a specified value.
$ne Matches all values that are not equal to a specified value.
$in Matches any of the values specified in an array.
Query Operator : $set
17/07/2019 Big Data class by Alexandre Bergere 42
# $set
> db.products.update(
{ _id: 100 },
{ $set:
{
quantity: 500,
details: { model: "14Q3", make: "xyz" },
tags: [ "coats", "outerwear", "clothing" ]
}
}
)
# $set Embedded Documents
> db.products.update(
{ _id: 100 },
{ $set: { "details.make": "zzz" } }
)
# $set in Arrays
> db.products.update(
{ _id: 100 },
{ $set:
{
"tags.1": "rain gear",
"ratings.0.rating": 2
}
}
)
Query Operator : Arrays
17/07/2019 Big Data class by Alexandre Bergere 43
Name Description
$pull Removes all array elements that match a specified query.
$push Add an element to an array.
$pop Removes the first or last item of an array.
$addToSet Adds elements to an array only if they do not already exist in the set.
$in Matches any of the values specified in an array.
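For example, on a hobbies array field (as in the labs):

> db.artists.update({"_id": "artist:280"}, {$push: {"hobbies": "chess"}})
> db.artists.update({"_id": "artist:280"}, {$addToSet: {"hobbies": "chess"}})   # no duplicate added
> db.artists.update({"_id": "artist:280"}, {$pop: {"hobbies": 1}})   # 1: last element, -1: first
> db.artists.update({"_id": "artist:280"}, {$pull: {"hobbies": "chess"}})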
DML
17/07/2019 Big Data class by Alexandre Bergere 44
# Returns all databases
> show dbs
# The current database name:
> db.getName()
# Returns all collections in the current
database:
> db.getCollectionNames()
# Returns a collection or a view object:
> db.getCollection(name)
# The current database connection:
> db.getMongo()
# Clean the console log:
> cls
# Return collection information:
> db.getCollectionInfos({name: "name"})
Command-line tools
17/07/2019 Big Data class by Alexandre Bergere 45
# Import multiple documents:
> mongoimport -d crunchbase -c companies
D:\MongoDB\src\companies.json
# Import multiple documents in an array:
> mongoimport -d crunchbase -c companies
D:\MongoDB\src\companies.json --jsonArray
# Export
> mongoexport -d crunchbase -c artists --out
D:\MongoDB\artists.json
Run these from the system shell, not inside a mongoDB instance.
Command Description
mongodump mongodump is a utility for creating a binary export of the
contents of a database. mongodump can export data from
either mongod or mongos instances.
mongorestore The mongorestore program loads data from either a binary
database dump created by mongodump or the standard input
(starting in version 3.0.0) into a mongod or mongos instance.
mongostat This utility constantly polls MongoDB and the system to
provide helpful stats, including the number of operations per
second (inserts, queries, updates, deletes, and so on), the
amount of virtual memory allocated, and the number of
connections to the server.
mongoperf Helps you understand the disk operations happening in a
running MongoDB instance.
mongotop Similar to top, this utility polls MongoDB and shows the
amount of time it spends reading and writing data in each
collection.
mongosniff A wire-sniffing tool for viewing operations sent to the
database. It essentially translates the BSON going over the
wire to human-readable shell statements.
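For example, a binary backup and restore (run from the system shell; the paths are assumptions):

# Dump the crunchbase database to BSON files
> mongodump -d crunchbase --out D:\MongoDB\backup
# Load the dump back into a running instance
> mongorestore D:\MongoDB\backup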
$text
17/07/2019 Big Data class by Alexandre Bergere 46
# $text
> db.articles.find( { $text: { $search: "coffee" } } )
$text performs a text search on the content of the fields indexed with a text index. A $text expression has the following
syntax:
# $text
> {
$text:
{
$search: <string>,
$language: <string>,
$caseSensitive: <boolean>,
$diacriticSensitive: <boolean>
}
}
# Create index first - You can index multiple fields for the
text index:
db.reviews.createIndex(
{
subject: "text",
comments: "text"
}
)
Schema Validation
17/07/2019 Big Data class by Alexandre Bergere 47
Implement data governance without sacrificing
the agility that comes from a dynamic schema.
With schema validation, developers and
operations spend less time defining data quality
controls in their applications, and instead
delegate these tasks to the database.
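A minimal sketch using a $jsonSchema validator (MongoDB 3.6+; the collection and field names are assumptions):

# Documents that fail validation are rejected (validationAction: "error")
> db.createCollection("artists", {
    validator: { $jsonSchema: {
      bsonType: "object",
      required: ["last_name"],
      properties: {
        last_name: { bsonType: "string" },
        birth_date: { bsonType: "string" }
      }
    }},
    validationAction: "error"
  })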
Aggregation
17/07/2019 Big Data class by Alexandre Bergere 48
Swiss Army knife
Executes in native code
o Written in C++
o JSON parameter
Flexible, functional, simple
o Operation pipeline
o Computational expressions
Pipeline operators
17/07/2019 Big Data class by Alexandre Bergere 49
Operator Description
$match Filter documents
$project Reshape documents
$group Summarize documents
$unwind Expand arrays in documents
$sort Order documents
$limit / $skip Paginate documents
$redact Restrict documents
$geoNear Proximity sort documents
$let, $map Define variables
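These stages chain into a single pipeline call; for example (the books collection and its fields are assumptions, matching the examples that follow):

# Filter, summarize, order, and paginate in one pipeline
> db.books.aggregate([
    { $match: { pages: { $gt: 100 } } },
    { $group: { _id: "$language", books: { $sum: 1 }, avgPages: { $avg: "$pages" } } },
    { $sort: { books: -1 } },
    { $limit: 5 }
  ])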
$match
17/07/2019 Big Data class by Alexandre Bergere 50
# Matching field values
> {$match:{
language:"Russian"
}}
{
title:"War and Peace",
pages:1440,
language:"Russian"
}
# Matching with query operators
> {$match:{
pages:{$gt:100}
}}
{
title:"War and Peace",
pages:1440,
language:"Russian"
},
{
title:"Atlas Shrugged",
pages:1088,
language:"English"
}
$project
17/07/2019 Big Data class by Alexandre Bergere 51
# Renaming and computing fields
> {$project:{
avgChapterLength:{
$divide:["$pages", "$chapters" ]
},
lang: "$language"
}}
{
_id:375,
avgChapterLength: 24.2222,
lang:"English"
}
# Including & excluding fields
> {$project:{
_id:0,
title:1,
language:1
}}
{
title:"Great Gatsby",
language:"English"
}
$group
17/07/2019 Big Data class by Alexandre Bergere 52
# Collect distinct values
> {$group:{
_id:"$language",
title:{$addToSet:"$title"}
}}
{
_id:"English",
title:["Atlas Shrugged", "The
Great Gatsby"]
},
{
_id:"Russian",
title:["War and Peace"]
}
# Calculating average, summing fields…
> {$group:{
_id:"$language",
pages:{$sum:"$pages"},
books:{$sum:1},
avgPages:{$avg:"$pages"}
}}
{
_id:"Russian",
pages:1440,
books:1,
avgPages:1440
}
$unwind
17/07/2019 Big Data class by Alexandre Bergere 53
# Expand an array into one document per element
> {$unwind: "$subjects"}
{
title:"The Great Gatsby",
ISBN:"9762832930920323" ,
subjects:"Long Island"
},
{
title:"The Great Gatsby",
ISBN:"9762832930920323" ,
subjects:"New York"
},
{
title:"The Great Gatsby",
ISBN:"9762832930920323" ,
subjects:"1920s"
}
{
title:"The Great Gatsby",
ISBN:"9762832930920323" ,
subjects:[
"Long Island",
"New York",
"1920s"
]
}
17/07/2019 Big Data class by Alexandre Bergere 54
LABS
Installation
17/07/2019 Big Data class by Alexandre Bergere 55
Download & Install
Instance
17/07/2019 Big Data class by Alexandre Bergere 56
Launch as a service:
mongod --dbpath C:\Users\alexa\Documents\MongoDB\data
Launch as a connection:
mongo
Options Shortcut
--db -d
--collection -c
--username -u
--password -p
--host -h
Query practice
17/07/2019 Big Data class by Alexandre Bergere 57
# 1.0 Load artists.json
> mongoimport -d crunchbase -c artists --file C:\Users\alexa\Documents\Cours\MongoDB\2017-2018\src\artists.json --jsonArray --port 27018
# 1.1 Return first_name and birth_date to all artists born in 1964
> db.artists.find({"birth_date": "1964"},{"_id":0,"first_name":1, "birth_date":1})
# 1.2 Return all artists born after 1980 or whose first name begins with 'Chri'
> db.artists.find({$or:[{"birth_date": {$gte:"1980"}},{"first_name":/^Chri/}]},{})
> db.artists.find({$or:[{"birth_date": {$gte:"1980"}},{"first_name":{$regex : /^Chri/}}]},{})
# 1.3 Return the 6th to the 9th artist, sorted by last name descending
> db.artists.find().pretty().sort({"last_name":-1}).skip(5).limit(4)
# 1.4 Insert the following artist:
{"_id": "artist:282", "last_name": "Bergere", "first_name": "Alexandre", "birth_date": "1992"} : (Replace
the id)
> db.artists.insert({ "_id": "artist:282", "last_name": "Bergere", "first_name": "Alexandre",
"birth_date": "1992" })
Query practice
17/07/2019 Big Data class by Alexandre Bergere 58
# 1.5 Change the first_name of the artist with the id artist:266 to "Jonathan"
> db.artists.update({"_id": "artist:266"},{$set:{"first_name":"Jonathan"}})
# 1.6 Add "golf" to artist 280's hobbies
> db.artists.update({"_id": "artist:280"},{$push:{"hobbies":"golf"}})
# 1.7 Add "yoga" to artist 282's hobbies
> db.artists.update({"_id": "artist:282"},{$push:{"hobbies":"yoga"}})
# 1.8 Remove "poney" and "photo" from artist 280's hobbies
> db.artists.update({"_id": "artist:280"},{$pull:{"hobbies": {$in:["poney","photo"]}}})
Query practice
17/07/2019 Big Data class by Alexandre Bergere 59
# Convert string to integer
> db.artists.find({birth_date: {$exists: true}}).forEach(function(obj) {
obj.birth_date = new NumberInt(obj.birth_date);
db.artists.save(obj);
});
17/07/2019 Big Data class by Alexandre Bergere 60
Go Deeper
Support
MongoDB in action, 2nd Edition docs.mongodb.com
17/07/2019 MongoDB class by Alexandre Bergere 61
Summer Internship
https://www.mongodb.com/careers/college-students
17/07/2019 MongoDB class by Alexandre Bergere 62
Learning
https://www.university.mongodb.com
17/07/2019 MongoDB class by Alexandre Bergere 63
17/07/2019 Big Data class by Alexandre Bergere 64
17/07/2019 Big Data class by Alexandre Bergere 65
Graph database
What is a graph database?
17/07/2019 Big Data class by Alexandre Bergere 66
A graph database is an online database management system with Create, Read, Update and Delete (CRUD)
operations working on a graph data model. Graph databases are generally built for use with online
transaction processing (OLTP) systems. Accordingly, they are normally optimized for transactional
performance, and engineered with transactional integrity and operational availability in mind. ~ Neo4j
Unlike other databases, relationships take first priority in graph databases.
The case for graph databases
17/07/2019 Big Data class by Alexandre Bergere 67
What is a Graph?
17/07/2019 Big Data class by Alexandre Bergere 68
A graph is just a collection of vertices and edges—or, in less intimidating language, a set of nodes and the
relationships that connect them.
Definitions
17/07/2019 Big Data class by Alexandre Bergere 69
• Nodes
o Nodes are the main data elements
o Nodes are connected to other nodes
via relationships
o Nodes can have one or more properties (i.e.,
attributes stored as key/value pairs)
o Nodes have one or more labels that describes
its role in the graph
o Example: Person nodes vs Car nodes
• Relationships
o Relationships connect two nodes
o Relationships are directional
o Nodes can have multiple, even recursive
relationships
o Relationships can have one or
more properties (i.e., attributes stored as
key/value pairs)
Properties
o Properties are named values where the name (or
key) is a string
o Properties can be indexed and constrained
o Composite indexes can be created from multiple
properties
Labels
o Labels are used to group nodes into sets
o A node may have multiple labels
o Labels are indexed to accelerate finding nodes in
the graph
o Native label indexes are optimized for speed
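As a sketch in Cypher (assumed sample data), these pieces fit together like this:

// Two labeled nodes with properties, connected by a directed
// relationship that carries a property of its own
CREATE (p:Person {name: 'Paul Miller', city: 'London'})
CREATE (c:Car {model: 'Bentley', year: 1973})
CREATE (p)-[:OWNS {since: 2010}]->(c)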
Modelling relational to graph
17/07/2019 Big Data class by Alexandre Bergere 70
Similarities:
Relational → Graph
Rows → Nodes
Joins → Relationships
Table names → Labels
Columns → Properties
How the relational model differs from the graph model:
Relational: Each column must have a field value. Graph: Nodes with the same label aren't required to have the same set of properties.
Relational: Joins are calculated at query time. Graph: Relationships are stored on disk when they are created.
Relational: A row can belong to one table. Graph: A node can have many labels.
RDBMS vs graph
17/07/2019 Big Data class by Alexandre Bergere 71
17/07/2019 Big Data class by Alexandre Bergere 72
Neo4j
Neo4j Graph Platform
17/07/2019 Big Data class by Alexandre Bergere 73
The Neo4j Graph Platform includes out-of-the-box tooling that enables you to access graphs in Neo4j Databases. In
addition, Neo4j provides APIs and drivers that enable you to create applications and custom tooling for accessing and
visualizing graphs.
Dev env.
17/07/2019 Big Data class by Alexandre Bergere 75
Neo4j Desktop
o Neo4j Database server
o graph engine
o kernel (Cypher execution)
o Neo4j Browser
o additional libraries and drivers for accessing the Neo4j database
Neo4j Sandbox
o temporary, cloud-based instance of a Neo4j Server with its associated graph that you can access from any Web browser
o available for three days, but you can extend it for up to 10 days
o you can use Neo4j Browser Sync to save Cypher scripts from your sandbox
Neo4j Browser
17/07/2019 Big Data class by Alexandre Bergere 76
17/07/2019 Big Data class by Alexandre Bergere 77
Introduction to Cypher
What’s Cypher?
17/07/2019 Big Data class by Alexandre Bergere 78
Cypher is a declarative query language that allows for expressive and efficient querying and updating of graph data.
Cypher is ASCII art: it focuses on the clarity of expressing what to retrieve from a graph.
Cypher is inspired by SPARQL, SQL, Python, and Haskell.
Node & Label
17/07/2019 Big Data class by Alexandre Bergere 79
() // anonymous node not be referenced later in the query
(p) // variable p, a reference to a node used later
(:Person) // anonymous node of type Person
(p:Person) // p, a reference to a node of type Person
(p:Actor:Director) // p, a reference to a node of types Actor and Director
Examining the data model
CALL db.schema
Using MATCH to retrieve nodes
17/07/2019 Big Data class by Alexandre Bergere 80
MATCH (n) // returns all nodes in the graph
RETURN n
MATCH (p:Person) // returns all Person nodes in the graph
RETURN p
When you specify a pattern for a MATCH clause, you should always specify a node label if possible. In doing so, the graph
engine uses an index to retrieve the nodes which will perform better than not using a label for the MATCH.
Properties
17/07/2019 Big Data class by Alexandre Bergere 81
A property is defined for a node and not for a type of node. All nodes of the same type need not have the same properties.
// Query the database for all property keys
CALL db.propertyKeys
MATCH (variable:Label {propertyKey: propertyValue, propertyKey2: propertyValue2})
RETURN variable
MATCH (m:Movie {released: 2003, tagline: 'Free your mind'})
RETURN m
Filtering queries using property values
17/07/2019 Big Data class by Alexandre Bergere 82
// Retrieve all Movie nodes that have a released property value of 2003.
MATCH (m:Movie {released:2003}) RETURN m
// Retrieve all Movies released in 2006, returning their titles
MATCH (m:Movie {released: 2006}) RETURN m.title
// Display title, released, and tagline values for every Movie node in the graph
MATCH (m:Movie) RETURN m.title AS `movie title`, m.released AS released, m.tagline
AS tagLine
Relationships
17/07/2019 Big Data class by Alexandre Bergere 83
A relationship is a directed connection between two nodes that has a relationship type (name). In addition, a relationship
can have properties, just like nodes.
() // a node
()--() // 2 nodes have some type of relationship
()-->() // the first node has a relationship to the second node
()<--() // the second node has a relationship to the first node
Here is how Cypher uses ASCII art to specify path used for a query:
Querying using relationships:
MATCH (node1)-[:REL_TYPE]->(node2)
RETURN node1, node2
MATCH (node1)-[:REL_TYPEA | :REL_TYPEB]->(node2)
RETURN node1, node2
node1 is a specification of a node where you may include node labels and property values for filtering.
:REL_TYPE is the type (name) for the relationship. For this syntax the relationship is from node1 to node2.
:REL_TYPEA , :REL_TYPEB are the relationships from node1 to node2. The nodes are returned if at least one of the relationships exists.
node2 is a specification of a node where you may include node labels and property values for filtering.
Relationships
17/07/2019 Big Data class by Alexandre Bergere 84
Using patterns for queries:
MATCH (p:Person)-[:FOLLOWS]->(:Person {name:'Angela Scope'})
RETURN p
MATCH (p:Person)<-[:FOLLOWS]-(:Person {name:'Angela Scope'})
RETURN p
Relationships
17/07/2019 Big Data class by Alexandre Bergere 85
Using patterns for queries:
// Querying by any direction of the relationship
MATCH (p1:Person)-[:FOLLOWS]-(p2:Person {name:'Angela Scope'})
RETURN p1, p2
Relationships
17/07/2019 Big Data class by Alexandre Bergere 86
Using patterns for queries:
// Traversing relationships : query to return all followers of the followers
of Jessica Thompson.
MATCH (p:Person)-[:FOLLOWS]->(:Person)-[:FOLLOWS]->(:Person {name:'Jessica
Thompson'})
RETURN p
// Traversing relationships : return each person along the path by specifying
variables for the nodes and returning them
MATCH path = (:Person)-[:FOLLOWS]->(:Person)-[:FOLLOWS]->(:Person {name:'Jessica
Thompson'})
RETURN path
Relationships
17/07/2019 Big Data class by Alexandre Bergere 87
Using a relationship in a query:
MATCH (p:Person)-[rel:ACTED_IN]->(m:Movie {title: 'The Matrix'})
RETURN p, rel, m
Variables:
o p to represent the Person nodes during the query, the
variable
o m to represent the Movie node retrieved
o rel to represent the relationship for the relationship
type, ACTED_IN
Querying by multiple relationships:
MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN|:DIRECTED]->(m:Movie)
RETURN p.name, m.title
Relationships
17/07/2019 Big Data class by Alexandre Bergere 88
Using anonymous nodes in a query:
MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'The Matrix'})
RETURN p.name
A best practice is to place named nodes (those with variables) before anonymous nodes in a MATCH clause.
Using an anonymous relationship for a query:
// find all people who are in any way connected to the movie
MATCH (p:Person)-->(m:Movie {title: 'The Matrix'})
RETURN p, m
MATCH (p:Person)--(m:Movie {title: 'The Matrix'})
RETURN p, m
Relationships
17/07/2019 Big Data class by Alexandre Bergere 89
Retrieving the relationship types:
MATCH (p:Person)-[rel]->(:Movie {title:'The Matrix'})
RETURN p.name, type(rel)
Retrieving properties for relationships:
MATCH (p:Person)-[:REVIEWED {rating: 65}]->(:Movie {title: 'The Da Vinci Code'})
RETURN p.name
Filtering queries using relationships
17/07/2019 Big Data class by Alexandre Bergere 90
// Retrieve all people who wrote the movie Speed Racer
MATCH (p:Person)-[:WROTE]->(:Movie {title: 'Speed Racer'}) RETURN p.name
// Retrieve all movies that are connected to the person, Tom Hanks
MATCH (m:Movie)<--(:Person {name: 'Tom Hanks'}) RETURN m.title
or
MATCH(:Person {name: 'Tom Hanks'})-->(m:Movie) RETURN m.title
// Retrieve information about the relationships Tom Hanks has with the set of
movies retrieved earlier
MATCH (m:Movie)-[rel]-(:Person {name: 'Tom Hanks'}) RETURN m.title, type(rel)
// Retrieve information about the roles that Tom Hanks acted in
MATCH (m:Movie)-[rel:ACTED_IN]-(:Person {name: 'Tom Hanks'}) RETURN m.title,
rel.roles
Cypher style recommendations
17/07/2019 Big Data class by Alexandre Bergere 91
Here are the Neo4j-recommended Cypher coding standards:
o Node labels are CamelCase and begin with an upper-case letter (examples: Person, NetworkAddress). Note that node
labels are case-sensitive.
o Property keys, variables, parameters, aliases, and functions are camelCase and begin with a lower-case letter
(examples: businessAddress, title). Note that these elements are case-sensitive.
o Relationship types are in upper-case and can use the underscore. (examples: ACTED_IN, FOLLOWS). Note that
relationship types are case-sensitive and that you cannot use the “-” character in a relationship type.
o Cypher keywords are upper-case (examples: MATCH, RETURN). Note that Cypher keywords are case-insensitive, but a
best practice is to use upper-case.
o String constants are in single quotes, unless the string contains a quote or apostrophe (examples: ‘The Matrix’,
“Something’s Gotta Give”). Note that you can also escape single or double quotes within strings that are quoted with
the same using a backslash character.
o Specify variables only when needed for use later in the Cypher statement.
o Place named nodes and relationships (that use variables) before anonymous nodes and relationships in your MATCH
clauses when possible.
o Specify anonymous relationships with -->, --, or <--
MATCH (:Person {name: 'Diane Keaton'})-[movRel:ACTED_IN]->
(:Movie {title:"Something's Gotta Give"})
RETURN movRel.roles
Follow the Cypher Style Guide when writing your Cypher statements.
17/07/2019 Big Data class by Alexandre Bergere 92
Getting More Out of Queries
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 93
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released = 2008
RETURN p, m
// complex conditions
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released = 2003 OR m.released = 2004
RETURN p, m
// same as previous
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE 2003 <= m.released <= 2004
RETURN p.name, m.title, m.released
// equivalent shorthand using an inline property filter
MATCH (p:Person)-[:ACTED_IN]->(m:Movie {released: 2008})
RETURN p, m
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 94
// Testing labels
MATCH (p:Person)
RETURN p.name
MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'The Matrix'})
RETURN p.name
// equivalent queries, testing labels and properties in WHERE:
MATCH (p)
WHERE p:Person
RETURN p.name
MATCH (p)-[:ACTED_IN]->(m)
WHERE p:Person AND m:Movie AND m.title='The Matrix'
RETURN p.name
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 95
// Testing the existence of a property
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name='Jack Nicholson' AND exists(m.tagline)
RETURN m.title, m.tagline
// Testing strings : You can specify STARTS WITH, ENDS WITH, and CONTAINS
MATCH (p:Person)-[:ACTED_IN]->()
WHERE toLower(p.name) STARTS WITH 'michael'
RETURN p.name
// Testing with regular expressions; You use the syntax =~
MATCH (p:Person)
WHERE p.name =~'Tom.*'
RETURN p.name
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 96
// Testing with patterns
// exclude people who directed that movie
MATCH (p:Person)-[:WROTE]->(m:Movie)
WHERE NOT exists( (p)-[:DIRECTED]->() )
RETURN p.name, m.title
// find Gene Hackman and the movies that he acted in with another person who also
directed the movie
MATCH (gene:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(other:Person)
WHERE gene.name= 'Gene Hackman'
AND exists( (other)-[:DIRECTED]->() )
RETURN gene, other, m
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 97
// Testing with list values : elements of the list have to be the same type of data
MATCH (p:Person)
WHERE p.born IN [1965, 1970]
RETURN p.name as name, p.born as yearBorn
// You can also compare a value to an existing list in the graph.
MATCH (p:Person)-[r:ACTED_IN]->(m:Movie)
WHERE 'Neo' IN r.roles AND m.title='The Matrix'
RETURN p.name
There are a number of syntax elements of Cypher that we have not covered in this training. For example, you can specify
CASE logic in your conditional testing for your WHERE clauses. You can learn more about these syntax elements in the
Neo4j Cypher Manual and the Cypher Refcard.
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 98
// Retrieve all actors that were born in the 70’s
MATCH (a:Person)
WHERE a.born >= 1970 AND a.born < 1980
RETURN a.name as Name, a.born as `Year Born`
// Retrieve all movies released in 2000 by testing the node label and the released
property, returning the movie titles
MATCH (m)
WHERE m:Movie AND m.released = 2000 and exists(m.released)
RETURN m.title
// Retrieve all people that wrote movies by testing the relationship between two
nodes
MATCH (a)-[rel]->(m)
WHERE a:Person AND type(rel) = 'WROTE' AND m:Movie
RETURN a.name as Name, m.title as Movie
// Retrieve all people in the graph that do not have the property ‘born’
MATCH (a:Person)
WHERE NOT exists(a.born)
RETURN a.name as Name
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 99
// Retrieve all people related to movies where the relationship has the rating
property, then return their name, movie title, and the rating.
MATCH (a:Person)-[rel]->(m:Movie)
WHERE exists(rel.rating)
RETURN a.name as Name, m.title as Movie, rel.rating as Rating
// Retrieve all REVIEW relationships from the graph where the summary of the review
contains the string fun, returning the movie title reviewed and the rating and
summary of the relationship.
MATCH (:Person)-[r:REVIEWED]->(m:Movie)
WHERE toLower(r.summary) CONTAINS 'fun'
RETURN m.title as Movie, r.summary as Review, r.rating as Rating
// Retrieve all people who have produced a movie, but have not directed a movie
MATCH (a:Person)-[:PRODUCED]->(m:Movie)
WHERE NOT ((a)-[:DIRECTED]->(:Movie))
RETURN a.name, m.title
// Retrieve the movies and their actors where one of the actors also directed the
movie
MATCH (a1:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a2:Person)
WHERE exists( (a2)-[:DIRECTED]->(m) )
RETURN a1.name as Actor, a2.name as `Actor/Director`, m.title as Movie
Filtering queries using WHERE
17/07/2019 Big Data class by Alexandre Bergere 100
// Retrieve the movies that have an actor’s role that is the name of the movie
MATCH (a:Person)-[r:ACTED_IN]->(m:Movie)
WHERE m.title in r.roles
RETURN m.title as Movie, a.name as Actor
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 101
MATCH (a:Person)-[:ACTED_IN]->(m:Movie),
(m:Movie)<-[:DIRECTED]-(d:Person)
WHERE m.released = 2000
RETURN a.name, m.title, d.name
Specifying multiple MATCH patterns
This MATCH clause includes a pattern specified by two paths separated by a comma:
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person)
WHERE m.released = 2000
RETURN a.name, m.title, d.name
If possible, you should write the same query as follows:
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 102
// retrieve the actors who acted in the same movies as Keanu Reeves, but not when
Hugo Weaving acted in the same movie
MATCH (keanu:Person)-[:ACTED_IN]->(movie:Movie)<-[:ACTED_IN]-(n:Person),
(hugo:Person)
WHERE keanu.name='Keanu Reeves' AND
hugo.name='Hugo Weaving'
AND NOT (hugo)-[:ACTED_IN]->(movie)
RETURN n.name
Specifying multiple MATCH patterns
// Suppose we want to retrieve the movies that Meg Ryan acted in and their
respective directors, as well as the other actors that acted in these movies.
MATCH (meg:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person),
(other:Person)-[:ACTED_IN]->(m)
WHERE meg.name = 'Meg Ryan'
RETURN m.title as movie, d.name AS director , other.name AS `co-actors`
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 103
MATCH megPath = (meg:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person),
(other:Person)-[:ACTED_IN]->(m)
WHERE meg.name = 'Meg Ryan'
RETURN megPath
Setting path variables
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 104
Specifying varying length paths
// all of the followers of the followers of a Person
MATCH (follower:Person)-[:FOLLOWS*2]->(p:Person)
WHERE follower.name = 'Paul Blythe'
RETURN p
// Retrieve all paths of any length with the relationship, :RELTYPE from nodeA to
nodeB and beyond:
(nodeA)-[:RELTYPE*]->(nodeB)
// Retrieve all paths of any length with the relationship, :RELTYPE from nodeA to
nodeB or from nodeB to nodeA and beyond:
(nodeA)-[:RELTYPE*]-(nodeB)
// Retrieve the paths of length 3 with the relationship,
(node1)-[:RELTYPE*3]->(node2)
// Retrieve the paths of lengths 1, 2, or 3 with the relationship
(node1)-[:RELTYPE*1..3]->(node2)
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 105
Finding the shortest path
MATCH p = shortestPath((m1:Movie)-[*]-(m2:Movie))
WHERE m1.title = 'A Few Good Men' AND
m2.title = 'The Matrix'
RETURN p
A built-in function that you may find useful in a graph that has many ways of traversing the graph to get to the same node
is the shortestPath() function. Using the shortest path between two nodes improves the performance of the query.
When you use the shortestPath() function, the query editor will show a warning that this type of query could potentially
run for a long time. You should heed the warning, especially for large graphs. Read the Graph Algorithms documentation
about the shortest path algorithm.
When you use shortestPath(), you can specify an upper limit for the shortest path. In addition, you should aim to provide
patterns for the from and to nodes that execute efficiently. For example, use labels and indexes.
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 106
Specifying optional pattern matching
MATCH (p:Person)
WHERE p.name STARTS WITH 'James'
OPTIONAL MATCH (p)-[r:REVIEWED]->(m:Movie)
RETURN p.name, type(r), m.title
OPTIONAL MATCH matches patterns with your graph, just like MATCH does. The difference is that if no matches are found,
OPTIONAL MATCH will use NULLs for missing parts of the pattern. OPTIONAL MATCH could be considered the Cypher
equivalent of the outer join in SQL.
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 107
Collecting results
// the list of movies that Tom Cruise acted in
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name ='Tom Cruise'
RETURN collect(m.title) AS `movies for Tom Cruise`
Cypher has a built-in function, collect() that enables you to aggregate a value into a list.
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 108
Aggregation in Cypher
// implicitly groups by a.name and d.name
MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d)
RETURN a.name, d.name, count(*)
// count the paths retrieved where an actor and director collaborated in a movie
MATCH (actor:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(director:Person)
RETURN actor.name, director.name, count(m) AS collaborations, collect(m.title) AS
movies
Aggregation in Cypher is different from aggregation in SQL. In Cypher, you need not specify a grouping key. As soon as an
aggregation function is used, all non-aggregated result columns become grouping keys. The grouping is implicitly done,
based upon the fields in the RETURN clause.
There are more aggregating functions such as min()
or max() that you can also use in your queries.
These are described in the Aggregating Functions
section of the Neo4j Cypher Manual.
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 109
Additional processing using WITH
// only return actors that have 2 or 3 movies
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WITH a, count(a) AS numMovies, collect(m.title) as movies
WHERE numMovies > 1 AND numMovies < 4
RETURN a.name, numMovies, movies
During the execution of a MATCH clause, you can specify that you want some intermediate calculations or values that will
be used for further processing of the query, or for limiting the number of results before further processing is done. You use
the WITH clause to perform intermediate processing or data flow operations.
In a WITH clause, you have to give an alias to all expressions that are not simple variables.
// find all actors who have acted in at least five movies, and find (optionally)
the movies they directed and return the person and those movies
MATCH (p:Person)
WITH p, size((p)-[:ACTED_IN]->(:Movie)) AS movies
WHERE movies >= 5
OPTIONAL MATCH (p)-[:DIRECTED]->(m:Movie)
RETURN p.name, m.title
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 110
Additional processing using WITH
// retrieves all actors that acted in movies, and collects the list of movies for
any actor that acted in more than five movies.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH p, collect(m) AS movies
WHERE size(movies) > 5
RETURN p.name, movies
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 111
// Write a Cypher query that retrieves all movies that Gene Hackman has acted in,
along with the directors of the movies. In addition, retrieve the actors that acted
in the same movies as Gene Hackman. Return the name of the movie, the name of the
director, and the names of actors that worked with Gene Hackman.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person),
(a2:Person)-[:ACTED_IN]->(m)
WHERE a.name = 'Gene Hackman'
RETURN m.title as movie, d.name AS director , a2.name AS `co-actors`
// Retrieve all people connected to James Thompson through the FOLLOWS
relationship, in either direction
MATCH (p1:Person)-[:FOLLOWS]-(p2:Person)
WHERE p1.name = 'James Thompson'
RETURN p1, p2
// Modify the query to retrieve nodes that are one and two hops away
MATCH (p1:Person)-[:FOLLOWS*1..2]-(p2:Person)
WHERE p1.name = 'James Thompson'
RETURN p1, p2
// Modify the query to retrieve particular nodes that are connected no matter how
many hops are required
MATCH (p1:Person)-[:FOLLOWS*]-(p2:Person)
WHERE p1.name = 'James Thompson'
RETURN p1, p2
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 112
// Retrieve all movie by collecting a list of all people who acted in it
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
RETURN p.name as actor, collect(m.title) AS `movie list`
// Retrieve all movies that Tom Cruise has acted in and the co-actors that acted in
the same movie by collecting a list
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p2:Person)
WHERE p.name ='Tom Cruise'
RETURN m.title as movie, collect(p2.name) AS `co-actors`
// Retrieve all people who reviewed a movie, returning the list of reviewers and
how many reviewers reviewed the movie
MATCH (p:Person)-[:REVIEWED]->(m:Movie)
RETURN m.title as movie, count(p) as numReviews, collect(p.name) as reviewers
// Retrieve all directors, their movies, and people who acted in the movies,
returning the name of the director, the number of actors the director has worked
with, and the list of actors.
MATCH (d:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a:Person)
RETURN d.name AS director, count(a) AS `number actors` , collect(a.name) AS `actors
worked with`
Controlling query processing
17/07/2019 Big Data class by Alexandre Bergere 113
// Retrieve the movies that have at least 2 directors, and optionally the names of
people who reviewed the movies.
MATCH (m:Movie)
WITH m, size((:Person)-[:DIRECTED]->(m)) AS directors
WHERE directors >= 2
OPTIONAL MATCH (p:Person)-[:REVIEWED]->(m)
RETURN m.title, p.name
Controlling how results are returned
17/07/2019 Big Data class by Alexandre Bergere 114
Eliminating duplication
MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
RETURN m.released, collect(DISTINCT m.title) AS movies
You have seen a number of query results where there is duplication in the results returned. In most cases, you want to
eliminate duplicated results. You do so by using the DISTINCT keyword.
Using WITH and DISTINCT to eliminate duplication
MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
WITH DISTINCT m
RETURN m.released, m.title
Another way that you can avoid duplication is to use WITH and DISTINCT together as follows:
Controlling how results are returned
17/07/2019 Big Data class by Alexandre Bergere 115
Ordering results
MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie)
WHERE p.name = 'Tom Hanks'
RETURN m.released, collect(DISTINCT m.title) AS movies ORDER BY m.released DESC
If you want the results to be sorted, you specify the expression to use for the sort using the ORDER BY keyword and
whether you want the order to be descending using the DESC keyword. Ascending order is the default.
Controlling how results are returned
17/07/2019 Big Data class by Alexandre Bergere 116
Limiting the number of results
MATCH (m:Movie)
RETURN m.title as title, m.released as year ORDER BY m.released DESC LIMIT 10
Although you can filter queries to reduce the number of results returned, you may also want to limit the number of results.
Controlling results returned
17/07/2019 Big Data class by Alexandre Bergere 117
// write a query to retrieve all actors that acted in movies during the 1990s,
where you return the released date, the movie title, and the collected actor names
for the movie. For now do not worry about duplication.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released >= 1990 AND m.released < 2000
RETURN DISTINCT m.released, m.title, collect(a.name)
// modify the query so that the released date records returned are not duplicated.
To implement this, you must add the collection of the movie titles to the results
returned.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released >= 1990 AND m.released < 2000
RETURN m.released, collect(m.title), collect(a.name)
// The results returned from the previous query returns the collection of movie
titles with duplicates. That is because there are multiple actors per released
year. Next, modify the query so that there is no duplication of the movies listed
for a year.
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE m.released >= 1990 AND m.released < 2000
RETURN m.released, collect(DISTINCT m.title), collect(a.name)
Controlling results returned
17/07/2019 Big Data class by Alexandre Bergere 118
// Retrieve the top 5 ratings and their associated movies, returning the movie
title and the rating.
MATCH (:Person)-[r:REVIEWED]->(m:Movie)
RETURN m.title AS movie, r.rating AS rating
ORDER BY r.rating DESC LIMIT 5
Working with Cypher data
17/07/2019 Big Data class by Alexandre Bergere 119
Unwinding lists
// create a list with three elements, unwind the list and then return the values
WITH [1, 2, 3] AS list
UNWIND list AS row
RETURN list, row
There may be some situations where you want to perform the opposite of collecting results, but rather separate the lists
into separate rows. This functionality is done using the UNWIND clause.
The UNWIND clause is frequently used when importing data into a graph.
Working with Cypher data
17/07/2019 Big Data class by Alexandre Bergere 120
Dates
MATCH (actor:Person)-[:ACTED_IN]->(:Movie)
WHERE exists(actor.born)
// calculate the age
WITH DISTINCT actor, date().year - actor.born as age
RETURN actor.name, age as `age today`
ORDER BY actor.born DESC
Cypher has a built-in date() function, as well as other temporal values and functions that you can use to calculate temporal
values.
You use a combination of numeric, temporal, spatial, list and string functions to calculate values that are useful to your
application. For example, suppose you wanted to calculate the age of a Person node, given a year they were born (the born
property must exist and have a value).
Working with Cypher data
17/07/2019 Big Data class by Alexandre Bergere 121
// Modify the query you just wrote so that before the query processing ends, you
unwind the list of movies and then return the name of the actor and the title of
the associated movie
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WITH p, collect(m) AS movies
WHERE size(movies) > 5
WITH p, movies UNWIND movies AS movie
RETURN p.name, movie.title
// retrieves all movies that Tom Hanks acted in, returning the title of the movie,
the year the movie was released, the number of years ago that the movie was
released, and the age of Tom when the movie was released
MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE a.name = 'Tom Hanks'
RETURN m.title, m.released, date().year - m.released as yearsAgoReleased,
m.released - a.born AS `age of Tom`
ORDER BY yearsAgoReleased
17/07/2019 Big Data class by Alexandre Bergere 122
Go further
Neo4j Bookshelf
17/07/2019 Big Data class by Alexandre Bergere 123
Resources
17/07/2019 Big Data class by Alexandre Bergere 124
Training & Certification
17/07/2019 Big Data class by Alexandre Bergere 125
Labs
17/07/2019 Big Data class by Alexandre Bergere 126
GraphGists
17/07/2019 Big Data class by Alexandre Bergere 127
Azure Cosmos DB
Azure Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 129
A globally distributed, massively scalable, multi-model database service
Azure Cosmos DB
Global Distribution
17/07/2019 Big Data class by Alexandre Bergere 130
o Policy-based geo-fencing
o Dynamically add and remove regions
o Failover priorities
o Dynamically configurable read and write regions
o Geo-local reads and writes
o 99.99% SLA for read availability
Database designed for modern web and mobile applications, which are (typically) global applications in nature.
Multi-Master
17/07/2019 Big Data class by Alexandre Bergere 131
Improved write latency for end users
Improved write scalability and write throughput
Better support for disconnected environments (for example, edge devices)
Load balancing
Consistency
17/07/2019 Big Data class by Alexandre Bergere 133
Level | Guarantees
Strong | Linearizability (once an operation is complete, it will be visible to all).
Bounded Staleness | Consistent Prefix. Reads lag behind writes by at most k prefixes or t interval. Similar properties to strong consistency (except within the staleness window), while preserving 99.99% availability and low latency.
Session | Consistent Prefix. Within a session: monotonic reads, monotonic writes, read-your-writes, write-follows-reads. Predictable consistency for a session, high read throughput + low latency.
Consistent Prefix | Reads will never see out-of-order writes (no gaps).
Eventual | Potential for out-of-order reads. Lowest cost for reads of all consistency levels.
COMPREHENSIVE SLAs
17/07/2019 Big Data class by Alexandre Bergere 134
RUN YOUR APP ON WORLD-CLASS INFRASTRUCTURE
Azure Cosmos DB is the only service with financially-backed SLAs for
millisecond latency at the 99th percentile, 99.999% HA and guaranteed
throughput and consistency
Latency: <10 ms at the 99th percentile
HA: 99.999%
Throughput: guaranteed
Consistency: guaranteed
Trust your data to industry-leading Security & Compliance
17/07/2019 Big Data class by Alexandre Bergere 135
Azure is the world’s most trusted cloud, with more certifications
than any other cloud provider.
• Enterprise grade security
• Encryption at Rest
• Encryption is enabled automatically by default
• Comprehensive Azure compliance certification
Throughput
17/07/2019 Big Data class by Alexandre Bergere 136
Request unit calculator
Request unit considerations:
o Item size
o Item property count
o Data consistency
o Indexed properties
o Document indexing
o Script usage
The currency of Azure Cosmos DB is the request unit (RU). With request units, you don't need to reserve read/write capacities or provision CPU, memory, and IOPS.
Serverless database
17/07/2019 Big Data class by Alexandre Bergere 137
Serverless computing is all about the ability to focus on individual pieces of logic that are repeatable and stateless.
o no infrastructure management.
o consume resources only for the seconds, or milliseconds, they run for.
Azure Cosmos DB trigger to invoke an Azure Function
Use an input binding to get data from Azure
Cosmos DB
Use an output binding to write data to Azure Cosmos DB
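A minimal sketch of an Azure Function fired by the Cosmos DB trigger, in the JavaScript programming model (the binding names and the container being watched are configured in function.json and are assumptions here):

module.exports = async function (context, documents) {
  // 'documents' holds the batch of created/updated items from the change feed
  if (documents && documents.length > 0) {
    context.log(documents.length + ' document(s) changed');
    // an output binding could write results to another container here
  }
};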
Cosmos DB Change Feed
17/07/2019 Big Data class by Alexandre Bergere 140
17/07/2019 Big Data class by Alexandre Bergere 141
Use cases
Top 10 reasons why customers use
Azure Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 142
o Handles many different types of data
o Multi-tenancy and enterprise-grade security
o Turnkey global distribution
o Mission-critical workloads
o Massive storage/throughput scalability
o 5 well-defined consistency models to optimize for speed and cost
o Analytics-ready worldwide big data
o Event-driven architectures
o Single-digit millisecond latency at the 99th percentile
o High availability and reliability
Powering global solutions
17/07/2019 Big Data class by Alexandre Bergere 143
Azure Cosmos DB was built to support modern app patterns and use cases.
It enables industry-leading organizations to unlock the value of data, and respond to
global customers and changing business dynamics in real-time.
o Data distributed and available globally: puts data where your users are
o Build real-time customer experiences: enable latency-sensitive personalization, bidding, and fraud detection
o Ideal for gaming, IoT & eCommerce: predictable and fast service, even during traffic spikes
o Simplified development with serverless architecture: fully-managed event-driven micro-services with elastic computing power
o Run Spark analytics over operational data: accelerate insights from fast, global data
o Lift and shift NoSQL data: lift and shift MongoDB and Cassandra workloads
Data distributed and available globally
17/07/2019 Big Data class by Alexandre Bergere 144
Put your data where your users are to give real-time access and
uninterrupted service to customers anywhere in the world.
o Turnkey global data replication across all Azure regions
o Guaranteed low-latency experience for global users
o Resiliency for high availability and disaster recovery
Build Real-Time Customer experiences
17/07/2019 Big Data class by Alexandre Bergere 145
Offer latency-sensitive applications with personalization, bidding, and
fraud-detection.
o Machine learning models generate real-time
recommendations across product catalogues
o Product analysis in milliseconds
o Low-latency ensures high app performance worldwide
o Tunable consistency models for rapid insight
Online Recommendations Service
HOT path
Offline Recommendations Engine
COLD path
Ideal for gaming, IoT and ecommerce
17/07/2019 Big Data class by Alexandre Bergere 146
Maintain service quality during high-traffic periods requiring
massive scale and performance.
o Instant, elastic scaling handles traffic bursts
o Uninterrupted global user experience
o Low-latency data access and processing for large and
changing user bases
o High availability across multiple data centers
Massive Scale Telemetry Stores for IOT
17/07/2019 Big Data class by Alexandre Bergere 147
Diverse and unpredictable IoT sensor workloads require a
responsive data platform
o Seamless handling of any data output or volume
o Data made available immediately, and indexed
automatically
o High writes per second, with stable ingestion and
query performance
Simplified development with serverless architecture
17/07/2019 Big Data class by Alexandre Bergere 148
Experience decreased time-to-market, enhanced scalability, and
freedom from framework management with event-driven
micro-services.
o Seamless handling of any data output or volume
o Data made available immediately, and indexed
automatically
o High writes per second, with stable ingestion and
query performance
o Real-time, resilient change feeds logged forever and
always accessible
o Native integration with Azure Functions
Run Spark over operational data
17/07/2019 Big Data class by Alexandre Bergere 149
Accelerate analysis of fast-changing, high-volume, global data.
o Real-time big data processing across any data model
o Machine learning at scale over globally-distributed data
o Speeds analytical queries with automatic indexing and
push-down predicate filtering
o Native integration with Spark Connector
Lift and shift NoSQL apps
17/07/2019 Big Data class by Alexandre Bergere 150
Make data modernization easy with seamless lift and shift
migration of NoSQL workloads to the cloud.
o Azure Cosmos DB APIs for MongoDB and Cassandra
bring app data from anywhere to Azure Cosmos DB
o Leverage existing tools, drivers, and libraries, and
continue using existing apps’ current SDKs
o Turnkey geo-replication
o No infrastructure or VM management required
Retail and marketing
17/07/2019 Big Data class by Alexandre Bergere 151
17/07/2019 Big Data class by Alexandre Bergere 152
Model
Document Data Model
17/07/2019 Big Data class by Alexandre Bergere 153
“Because at the end of the day, it’s all just keys and values – not just the key-value data model, but all these data models.”
“When it comes to actually building applications – well, that’s the developer’s job, and this is where the decision of which data model to
choose comes into play.”
o Document: SQL API (JSON), MongoDB API
o Graph: Gremlin API (graph traversal language)
o Key-Value: Table API (replaces Azure Table Storage)
o Columnar: Cassandra API
Atom Record Sequence (ARS)
17/07/2019 Big Data class by Alexandre Bergere 154
Your data is always stored as ARS – or Atom Record Sequence – a Microsoft creation that defines the
persistence layer for key-value pairs.
Switching Between Data Models
choosing an API = choosing a data model
Switching Between Data Models
17/07/2019 Big Data class by Alexandre Bergere 155
Each data model is merely a projection of the same underlying ARS format, and so eventually you will be
able to create a single account, and then switch freely between different APIs within the account. So that
then, you’ll be able to access one database as graph, key-value, document, or columnar, all at once.
Future release?
Resource Model
17/07/2019 Big Data class by Alexandre Bergere 156
Resource Model
17/07/2019 Big Data class by Alexandre Bergere 157
Account
Database
Container
Item
User
Permission
Resource Model
17/07/2019 Big Data class by Alexandre Bergere 158
Account
Database
Container
Item
User
Permission
Resource Model
17/07/2019 Big Data class by Alexandre Bergere 159
Account
Database
Container
Item
= Collection / Graph / Table, depending on the API
Handle any data with no schema or indexing required
17/07/2019 Big Data class by Alexandre Bergere 160
Azure Cosmos DB’s schema-less service automatically indexes all your data,
regardless of the data model, to deliver blazing-fast queries.
Item            | Color    | Microwave safe | Liquid capacity | CPU                                 | Memory | Storage
Geek mug        | Graphite | Yes            | 16oz            | ???                                 | ???    | ???
Coffee Bean mug | Tan      | No             | 12oz            | ???                                 | ???    | ???
Surface book    | Gray     | ???            | ???             | 3.4 GHz Intel Skylake Core i7-6600U | 16GB   | 1 TB SSD
o Automatic index management
o Synchronous auto-indexing
o No schemas or secondary indices needed
o Works across every data model
Index
17/07/2019 Big Data class by Alexandre Bergere 161
Schema-agnostic, automatic indexing
o Automatically index every property of every record without
having to define schemas and indices upfront.
o No need for schema and index management
o Works across every data model
o Latch free data structure for highly write-optimized database
engine
o Multiple index types: Hash, range, and geospatial
Index POLICIES
17/07/2019 Big Data class by Alexandre Bergere 162
CUSTOM INDEXING POLICIES
Though all Azure Cosmos DB data is indexed by default, you
can specify a custom indexing policy for your collections. Custom
indexing policies allow you to design and customize the shape of
your index while maintaining schema flexibility.
o Define trade-offs between storage, write and query
performance, and query consistency
o Include or exclude documents and paths to and from the
index
o Configure various index types
{
  "automatic": true,
  "indexingMode": "Consistent",
  "includedPaths": [{
    "path": "/*",
    "indexes": [{
      "kind": "Hash",
      "dataType": "String",
      "precision": -1
    }, {
      "kind": "Range",
      "dataType": "Number",
      "precision": -1
    }, {
      "kind": "Spatial",
      "dataType": "Point"
    }]
  }],
  "excludedPaths": [{
    "path": "/nonIndexedContent/*"
  }]
}
Resource Model in Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 163
17/07/2019 Big Data class by Alexandre Bergere 164
SQL QUERY SYNTAX
SQL SYNTAX
17/07/2019 Big Data class by Alexandre Bergere 165
Use the popular query language SQL to access semi-structured
JSON data.
This module will reference querying in the context of the SQL
API for Azure Cosmos DB.
SQL QUERY SYNTAX
17/07/2019 Big Data class by Alexandre Bergere 166
BASIC QUERY SYNTAX
The SELECT & FROM keywords are the basic components of
every query.
> SELECT
tickets.id,
tickets.pricePaid
FROM tickets
> SELECT
t.id,
t.pricePaid
FROM tickets t
SQL QUERY SYNTAX - WHERE
17/07/2019 Big Data class by Alexandre Bergere 167
FILTERING
WHERE supports complex scalar expressions including
arithmetic, comparison and logical operators
> SELECT
tickets.id,
tickets.pricePaid
FROM tickets
WHERE
tickets.pricePaid > 500.00 AND
tickets.pricePaid <= 1000.00
SQL QUERY SYNTAX - PROJECTION
17/07/2019 Big Data class by Alexandre Bergere 168
PROJECTION
If your workloads require a specific JSON schema, Azure
Cosmos DB supports JSON projection within its queries
> SELECT {
"id": tickets.id,
"flightNumber": tickets.assignedFlight.flightNumber,
"purchase": {
"cost": tickets.pricePaid
},
"stops": [
tickets.assignedFlight.origin,
tickets.assignedFlight.destination
]
} AS ticket
FROM tickets
SQL QUERY SYNTAX - PROJECTION
17/07/2019 Big Data class by Alexandre Bergere 169
PROJECTION
If your workloads require a specific JSON schema, Azure
Cosmos DB supports JSON projection within its queries
> SELECT VALUE {
"id": tickets.id,
"flightNumber": tickets.assignedFlight.flightNumber,
"purchase": {
"cost": tickets.pricePaid
},
"stops": [
tickets.assignedFlight.origin,
tickets.assignedFlight.destination
]
}
FROM tickets
INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 170
Azure Cosmos DB supports intra-document JOINs for de-normalized arrays
Let’s assume that we have two JSON documents in a collection:
{
"pricePaid": 575.5,
"assignedFlight": {
"number": "F125",
"origin": "SEA",
"destination": "JFK"
},
"seat": “12A",
"requests": [
"kosher_meal",
"aisle_seat"
],
"id": "6ebe1165836a"
}
{
"pricePaid": 234.75,
"assignedFlight": {
"number": "F752",
"origin": "SEA",
"destination": "LGA"
},
"seat": "14C",
"requests": [
"early_boarding",
"window_seat"
],
"id": "c4991b4d2efc"
}
INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 171
We can filter on a particular array index position without JOIN:
> SELECT
    tickets.assignedFlight.number,
    tickets.seat,
    tickets.requests
  FROM
    tickets
  WHERE
    tickets.requests[1] = "aisle_seat"
[
{
"number":"F125","seat":"12A",
"requests": [
"kosher_meal",
"aisle_seat"
]
}
]
INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 172
JOIN allows us to merge embedded documents or arrays across multiple documents and
return a flattened result set:
> SELECT
tickets.assignedFlight.number,
tickets.seat,
requests
FROM
tickets
JOIN
requests IN tickets.requests
[
{
"number":"F125","seat":"12A",
"requests":"kosher_meal"
},
{
"number":"F125","seat":"12A",
"requests":"aisle_seat"
},
{
"number":"F752","seat":"14C",
"requests":"early_boarding"
},
{
"number":"F752","seat":"14C",
"requests":"window_seat"
}
]
INTRA-DOCUMENT JOIN
17/07/2019 Big Data class by Alexandre Bergere 173
Along with JOIN, we can also filter the cross products without knowing the array index
position:
> SELECT
    tickets.assignedFlight.number,
    tickets.seat,
    requests
  FROM
    tickets
  JOIN
    requests IN tickets.requests
  WHERE
    requests IN ("aisle_seat", "window_seat")
[
{
"number":"F125","seat":"12A“,
"requests": "aisle_seat"
},
{
"number":"F752","seat":"14C",
"requests": "window_seat"
}
]
17/07/2019 Big Data class by Alexandre Bergere 174
Tools
Cosmos DB Emulator
17/07/2019 Big Data class by Alexandre Bergere 175
The Azure Cosmos DB Emulator provides a local environment that emulates the Azure Cosmos DB service for development
purposes. Using the Azure Cosmos DB Emulator, you can develop and test your application locally, without creating an Azure
subscription or incurring any costs. When you're satisfied with how your application is working in the Azure Cosmos DB
Emulator, you can switch to using an Azure Cosmos DB account in the cloud.
At this time the Data Explorer in the emulator only fully supports SQL API collections and MongoDB collections. Table, Graph, and Cassandra containers are not
fully supported.
The Azure Cosmos DB Emulator provides a high-fidelity emulation of the Azure Cosmos DB service. It supports identical
functionality as Azure Cosmos DB, including support for creating and querying JSON documents, provisioning and scaling
collections, and executing stored procedures and triggers. You can develop and test applications using the Azure Cosmos DB
Emulator, and deploy them to Azure at global scale by just making a single configuration change to the connection endpoint for
Azure Cosmos DB.
The Azure Cosmos DB Emulator by default runs on the local machine ("localhost") listening on port 8081.
Azure Cosmos DB: Data migration tools
17/07/2019 Big Data class by Alexandre Bergere 176
Data Migration Tools
SQL API · MongoDB API · Graph API · Table API · Cassandra API
Cosmos DB Explorer
17/07/2019 Big Data class by Alexandre Bergere 177
With Cosmos DB Explorer you can:
o Take advantage of the full screen real estate for your queries and
results.
o Access your database account and collections with a connection string,
without needing access to the Azure subscription or portal.
o Share query results with authorized peers who do not have Azure
portal access.
o Work with Cosmos DB data without having to download any desktop
tools locally.
https://cosmos.azure.com/
Azure Cosmos DB – Interface demo
17/07/2019 Big Data class by Alexandre Bergere 178
Azure Cosmos DB – SQL Query Exercise
17/07/2019 Big Data class by Alexandre Bergere 179
Add data using Data Explorer
https://docs.microsoft.com/en-ie/learn/modules/access-data-with-cosmos-db-and-sql-api/3-add-data
Explore SQL query types
https://docs.microsoft.com/en-ie/learn/modules/access-data-with-cosmos-db-and-sql-api/4-query-types
17/07/2019 Big Data class by Alexandre Bergere 180
Add Cosmos DB to your architecture
Partitioning
17/07/2019 Big Data class by Alexandre Bergere 181
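The partition key is chosen when a container is created and cannot be changed afterwards. As a minimal sketch, assuming the @azure/cosmos Node.js SDK (v3) with placeholder account, database, and container names:

const { CosmosClient } = require("@azure/cosmos");

async function createPartitionedContainer() {
  // Placeholder endpoint and key: replace with your account's values
  const client = new CosmosClient({
    endpoint: "https://<account>.documents.azure.com",
    key: "<primary-key>"
  });
  const { database } = await client.databases.createIfNotExists({ id: "store" });
  // The partition key path controls how items spread across physical partitions
  const { container } = await database.containers.createIfNotExists({
    id: "orders",
    partitionKey: { paths: ["/customerId"] },
    throughput: 400 // RU/s provisioned on the container
  });
  return container;
}

A good partition key (here /customerId, an assumption) spreads both storage and request load evenly across partitions.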
17/07/2019 Big Data class by Alexandre Bergere 182
Stored procedure & UDFs
Stored Procedures
17/07/2019 Big Data class by Alexandre Bergere 183
BENEFITS
o Familiar programming language
o Atomic Transactions
o Built-in Optimizations
o Business Logic Encapsulation
Stored procedures perform complex transactions on documents and properties.
Stored procedures are written in JavaScript and are stored in a container on Azure
Cosmos DB. By performing the stored procedures on the database engine and
close to the data, you can improve performance over client-side programming.
Stored procedures are the only way to achieve atomic transactions within Azure
Cosmos DB; the client-side SDKs do not support transactions.
Performing batch operations in stored procedures is also recommended because
of the reduced need to create separate transactions.
Simple Stored Procedure
17/07/2019 Big Data class by Alexandre Bergere 184
function createSampleDocument(documentToCreate) {
  // The context gives access to the collection and the HTTP response
  var context = getContext();
  var collection = context.getCollection();
  // Queue an async create; the callback returns the new document's id
  var accepted = collection.createDocument(
    collection.getSelfLink(),
    documentToCreate,
    function (error, documentCreated) {
      if (error) throw error;
      context.getResponse().setBody(documentCreated.id);
    }
  );
  // false means the request was not accepted (time limit approaching)
  if (!accepted) return;
}
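As a hedged usage sketch (not from the deck), assuming the @azure/cosmos Node.js SDK and a container partitioned on /customerId, the procedure can be registered and executed like this:

// Register the procedure once per container
await container.scripts.storedProcedures.create({
  id: "createSampleDocument",
  body: createSampleDocument.toString() // the function defined above
});

// Execute it against a single partition key value
const { resource: newDocId } = await container.scripts
  .storedProcedure("createSampleDocument")
  .execute("customer-42", [{ id: "doc1", customerId: "customer-42" }]);
// newDocId is the value set by context.getResponse().setBody(...)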
Multi-DOCUMENT Transactions
17/07/2019 Big Data class by Alexandre Bergere 185
DATABASE TRANSACTIONS
In a typical database, a transaction can be defined as a sequence of operations performed as a single
logical unit of work. Each transaction provides ACID guarantees.
In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence,
requests made within stored procedures and triggers execute in the same scope of a database
session.
Example transaction scope: Create New Document → Query Collection → Update Existing Document → Delete Existing Document.
Stored procedures utilize snapshot
isolation to guarantee all reads within the
transaction will see a consistent snapshot
of the data
Bounded Execution
17/07/2019 Big Data class by Alexandre Bergere 186
EXECUTION WITHIN TIME BOUNDARIES
All Azure Cosmos DB operations must complete within the server-specified request timeout duration. If an
operation does not complete within that time limit, the transaction is rolled back.
HELPER BOOLEAN VALUE
All functions under the collection object (for create, read, replace, and delete of documents and
attachments) return a Boolean value that represents whether that operation will complete:
o If true, the operation is expected to complete
o If false, the time limit will soon be reached and your function should end execution as soon as
possible.
Transaction Continuation Model
17/07/2019 Big Data class by Alexandre Bergere 187
CONTINUING LONG-RUNNING TRANSACTIONS
o JavaScript functions can implement a continuation-based model to batch/resume execution
o The continuation value can be any value of your own choosing. This value can then be used by your
applications to resume a transaction from a new “starting point”
Bulk Create Documents: try to create each document, observe the return value, and if it is false return a "pointer" to resume later; otherwise continue until done.
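A minimal sketch of this pattern inside a stored procedure (not the official sample; the { done, index } response shape is an invented continuation value):

function bulkCreate(docs, startIndex) {
  var context = getContext();
  var collection = context.getCollection();
  var collLink = collection.getSelfLink();
  var index = startIndex || 0;

  tryCreate();

  function tryCreate() {
    if (index >= docs.length) {
      // Done: nothing left to resume
      context.getResponse().setBody({ done: true, index: index });
      return;
    }
    var accepted = collection.createDocument(collLink, docs[index], onCreated);
    // false means the time limit is near: return a continuation "pointer"
    if (!accepted) context.getResponse().setBody({ done: false, index: index });
  }

  function onCreated(error, doc) {
    if (error) throw error;
    index++;
    tryCreate();
  }
}

The client inspects done and, if it is false, calls the procedure again with the returned index as startIndex to resume the batch.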
Control Flow
17/07/2019 Big Data class by Alexandre Bergere 188
JAVASCRIPT CONTROL FLOW
Stored procedures allow you to naturally express control flow, variable scoping, assignment, and
integration of exception handling primitives with database transactions directly in terms of the JavaScript
programming language.
ES6 PROMISES
ES6 promises can be used within Azure Cosmos DB stored procedures. Unfortunately,
promises "swallow" exceptions by default, so it is recommended to use callbacks instead of ES6 promises.
Stored Procedure Control Flow
17/07/2019 Big Data class by Alexandre Bergere 189
function createTwoDocuments(docA, docB) {
  var context = getContext();
  var collection = context.getCollection();
  var collLink = collection.getSelfLink();

  var aAccepted = collection.createDocument(collLink, docA, docACallback);
  if (!aAccepted) return;

  function docACallback(error, createdA) {
    if (error) throw error;
    var bAccepted = collection.createDocument(collLink, docB, function (error, createdB) {
      if (error) throw error;
      context.getResponse().setBody({
        "firstDocId": createdA.id,
        "secondDocId": createdB.id
      });
    });
    if (!bAccepted) return;
  }
}
Rolling Back Transactions
17/07/2019 Big Data class by Alexandre Bergere 190
TRANSACTION ROLL-BACK
Inside a JavaScript function, all operations are automatically wrapped under a single transaction:
o If the function completes without any exception, all data changes are committed
o If there is any exception that’s thrown from the script, Azure Cosmos DB’s JavaScript runtime will
roll back the whole transaction.
Transaction scope: Create New Document → Query Collection → Update Existing Document → Delete Existing Document. If an exception is thrown, all changes are undone.
Transaction ROLLBACK in Stored Procedure
17/07/2019 Big Data class by Alexandre Bergere 191
collection.createDocument(
collection.getSelfLink(),
documentToCreate,
function (error, documentCreated) {
if (error) throw "Unable to create document, aborting...";
}
);
collection.replaceDocument(
  documentToReplace._self,
  replacementDocument,
  function (error, documentReplaced) {
    if (error) throw "Unable to update document, aborting...";
  }
);
User-defined Functions
17/07/2019 Big Data class by Alexandre Bergere 192
UDF
User-defined functions (UDFs) are used to extend the Azure Cosmos DB SQL API's query language
grammar and implement custom business logic. UDFs can only be called from inside queries;
they do not have access to the context object and are meant to be used as compute-only code.
User-Defined Function Definition
17/07/2019 Big Data class by Alexandre Bergere 193
var taxUdf = {
  id: "tax",
  serverScript: function tax(income) {
    if (income == undefined)
      throw "no input";
    if (income < 1000)
      return income * 0.1;
    else if (income < 10000)
      return income * 0.2;
    else
      return income * 0.4;
  }
}
User-Defined Function USAGE in Queries
17/07/2019 Big Data class by Alexandre Bergere 194
> SELECT
*
FROM
TaxPayers t
WHERE
udf.tax(t.income) > 20000
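For completeness, a hedged sketch of registering this UDF with the @azure/cosmos Node.js SDK (the container reference is an assumption) so that queries can call udf.tax():

// Register the UDF once per container before querying with it
await container.scripts.userDefinedFunctions.create({
  id: "tax",
  body: taxUdf.serverScript.toString()
});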
Create multiple Cosmos DB triggers
17/07/2019 Big Data class by Alexandre Bergere 195
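Such triggers are typically built on the change feed via Azure Functions. As a hedged sketch (database, collection, and connection-setting names are assumptions), a JavaScript function that fires on every change:

// function.json: a cosmosDBTrigger binding that reads from the change feed
{
  "bindings": [{
    "type": "cosmosDBTrigger",
    "name": "documents",
    "direction": "in",
    "connectionStringSetting": "CosmosDBConnection",
    "databaseName": "store",
    "collectionName": "orders",
    "createLeaseCollectionIfNotExists": true
  }]
}

// index.js
module.exports = async function (context, documents) {
  // Each invocation receives a batch of changed documents from the change feed
  if (documents && documents.length > 0) {
    context.log("Processing " + documents.length + " changed document(s)");
  }
};

Multiple functions can each consume the same change feed independently, as long as each uses its own lease collection.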
17/07/2019 Big Data class by Alexandre Bergere 196
Modeling
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 197
Embedded
“The guiding premise when normalizing data is to avoid storing redundant data on each
record and rather refer to data.”
Embedding data
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 198
Embedded data
When to embed (see the example below):
o There are "contains" relationships between entities.
o There are one-to-few relationships between entities.
o The embedded data changes infrequently.
o The embedded data won't grow without bound.
o The embedded data is integral to data in the document.
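For illustration, a hypothetical blog post (field names are invented) with its comments embedded directly in the post document:

{
  "id": "post:1",
  "title": "Modelling data in Cosmos DB",
  "author": "alex",
  "comments": [
    { "user": "sam", "text": "Great post!" },
    { "user": "kim", "text": "Very helpful." }
  ]
}

A single read returns the post and its comments together, which is ideal while the comments list stays small.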
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 199
Referenced data
The problem with this example is that the comments array is unbounded, meaning that there is no (practical) limit to the
number of comments any single post can have.
Referencing data
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 200
Referenced data
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 201
Referenced data
When to reference (see the example below):
o Representing one-to-many relationships.
o Representing many-to-many relationships.
o Related data changes frequently.
o Referenced data could be unbounded.
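The same hypothetical data in referenced form: each comment becomes its own document pointing back to the post, so the comment set can grow without bound:

// Post document
{ "id": "post:1", "title": "Modelling data in Cosmos DB", "author": "alex" }

// Comment documents referencing the post
{ "id": "comment:17", "postId": "post:1", "user": "sam", "text": "Great post!" }
{ "id": "comment:18", "postId": "post:1", "user": "kim", "text": "Very helpful." }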
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 202
Where do I put the relationship?
We have dropped the unbounded collection on the publisher document.
Instead we just have a reference to the publisher on each book document.
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 203
The “Ladder” pattern
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 204
How do I model many-to-many relationships?
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 205
Hybrid data models
Pre-calculated aggregate values save expensive processing on read operations. In
the example, some of the data embedded in the author document is
calculated at run-time. Every time a new book is published, a book document is
created and the countOfBooks field is set to a calculated value based on the number of
book documents that exist for a particular author. This optimization suits
read-heavy systems where we can afford to do computations on writes in order to
optimize reads.
We could have just stuck with id and left the application to get any additional information
it needed from the respective author document using the "link", but because our
application displays the author's name and a thumbnail picture with every book
displayed, we can save a round trip to the server per book in a list by
denormalizing some data from the author.
Sure, if the author's name changed or they wanted to update their photo we'd have to
go and update every book they ever published, but for our application, based on the
assumption that authors don't change their names very often, this is an acceptable
design decision.
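A hypothetical sketch of both techniques (field names invented): the author document carries the pre-calculated count, and each book denormalizes the author's display data:

// Author document with a pre-calculated aggregate
{
  "id": "author:7",
  "name": "Ada Writer",
  "thumbnailUrl": "https://example.com/ada.jpg",
  "countOfBooks": 3
}

// Book document denormalizing the author's name and thumbnail
{
  "id": "book:42",
  "title": "Cosmic Modelling",
  "author": {
    "id": "author:7",
    "name": "Ada Writer",
    "thumbnailUrl": "https://example.com/ada.jpg"
  }
}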
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 206
Modelling Data
17/07/2019 Big Data class by Alexandre Bergere 207
17/07/2019 Big Data class by Alexandre Bergere 208
Architectures
Azure Cosmos DB - Change Feed Lab
17/07/2019 Big Data class by Alexandre Bergere 209
Cosmos DB & Spark
17/07/2019 Big Data class by Alexandre Bergere 210
Broadcast Real-time Updates from Cosmos DB with SignalR
Service and Azure Functions
17/07/2019 Big Data class by Alexandre Bergere 211
Advanced Analytics on big data architecture
17/07/2019 Big Data class by Alexandre Bergere 212
STRIIM FOR AZURE COSMOS DB
17/07/2019 Big Data class by Alexandre Bergere 213
Continuous, Real-Time Data Movement
Querying An Azure Cosmos DB Database using the SQL API
17/07/2019 Big Data class by Alexandre Bergere 214
https://cosmosdb.github.io/labs/dotnet/technical_deep_dive/03-querying_the_database_using_sql.html
Azure Data Factory
Azure Cosmos DB
Visual Studio Code
17/07/2019 Big Data class by Alexandre Bergere 215
Through examples
How Skype modernized its backend infrastructure using Azure
Cosmos DB
17/07/2019 Big Data class by Alexandre Bergere 216
Lessons learned
Looking back at the project, Kaduk recalls several “lessons learned.” These include:
o Use direct mode for better performance – How a client connects to Azure Cosmos DB has important performance implications, especially
with respect to observed client side latency. The team began by using the default Gateway Mode connection policy, but switched to a Direct
Mode connection policy because it delivers better performance.
o Learn how to write and handle stored procedures – With Azure Cosmos DB, transactions can only be implemented using stored
procedures—pieces of application logic that are written in JavaScript that are registered and executed against a collection as a single
transaction. (In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence, requests made within stored
procedures execute in the same scope of a database session, which enables Azure Cosmos DB to guarantee ACID for all operations that are
part of a single stored procedure.)
o Pay attention to query design – With Azure Cosmos DB, queries have a large impact in terms of RU consumption. Developers didn’t pay
much attention to query design at first, but soon found that RU costs were higher than desired. This led to an increased focus on optimizing
query design, such as using point document reads wherever possible and optimizing the query selections per API.
o Use the Azure Cosmos DB SDK 2.x to optimize connection usage – Within Azure Cosmos DB, the data stored in each region is distributed
across tens of thousands of physical partitions. To serve reads and writes, the Azure Cosmos DB client SDK must establish a connection with
the physical node hosting the partition. The team started by using the Azure Cosmos DB SDK 1.x, but found that its lack of support for
connection multiplexing led to excessive connection establishment and closing rates. Switching to the Azure Cosmos DB SDK 2.x, which
supports connection multiplexing, helped solve the problem, and also helped mitigate SNAT port exhaustion issues.
17/07/2019 Big Data class by Alexandre Bergere 217
Deeper
Cosmic notes
17/07/2019 Big Data class by Alexandre Bergere 218
Become an Azure Cosmonaut
17/07/2019 Big Data class by Alexandre Bergere 219

Weitere ähnliche Inhalte

Was ist angesagt?

Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Mariano Gonzalez
 
Webinar on MongoDB BI Connectors
Webinar on MongoDB BI ConnectorsWebinar on MongoDB BI Connectors
Webinar on MongoDB BI ConnectorsSumit Sarkar
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML Amazon Web Services
 
How data modelling helps serve billions of queries in millisecond latency wit...
How data modelling helps serve billions of queries in millisecond latency wit...How data modelling helps serve billions of queries in millisecond latency wit...
How data modelling helps serve billions of queries in millisecond latency wit...DataWorks Summit
 
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao Kamble
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao KambleGoDaddy Customer Success Dashboard Using Apache Spark with Baburao Kamble
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao KambleDatabricks
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014James Chittenden
 
The Scout24 Data Landscape Manifesto: Building an Opinionated Data Platform
The Scout24 Data Landscape Manifesto: Building an Opinionated Data PlatformThe Scout24 Data Landscape Manifesto: Building an Opinionated Data Platform
The Scout24 Data Landscape Manifesto: Building an Opinionated Data PlatformRising Media Ltd.
 
IBM Cognos Business Intelligence using dashDB
IBM Cognos Business Intelligence using dashDBIBM Cognos Business Intelligence using dashDB
IBM Cognos Business Intelligence using dashDBIBM Cloud Data Services
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglyTyler Wishnoff
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceCambridge Semantics
 
Syngenta's Predictive Analytics Platform for Seeds R&D
Syngenta's Predictive Analytics Platform for Seeds R&DSyngenta's Predictive Analytics Platform for Seeds R&D
Syngenta's Predictive Analytics Platform for Seeds R&DMichael Swanson
 
Airbyte - Seed deck
Airbyte  - Seed deckAirbyte  - Seed deck
Airbyte - Seed deckAirbyte
 
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."Gustavo Cuervo
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileRoy Kim
 
Net conf ar v2018 real time analytics
Net conf ar v2018 real time analyticsNet conf ar v2018 real time analytics
Net conf ar v2018 real time analyticsGaston Cruz
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopCCG
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleAdam Doyle
 

Was ist angesagt? (20)

Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
 
Webinar on MongoDB BI Connectors
Webinar on MongoDB BI ConnectorsWebinar on MongoDB BI Connectors
Webinar on MongoDB BI Connectors
 
Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML Preparing Your Data for Cloud Analytics & AI/ML
Preparing Your Data for Cloud Analytics & AI/ML
 
How data modelling helps serve billions of queries in millisecond latency wit...
How data modelling helps serve billions of queries in millisecond latency wit...How data modelling helps serve billions of queries in millisecond latency wit...
How data modelling helps serve billions of queries in millisecond latency wit...
 
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao Kamble
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao KambleGoDaddy Customer Success Dashboard Using Apache Spark with Baburao Kamble
GoDaddy Customer Success Dashboard Using Apache Spark with Baburao Kamble
 
How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014How Google Does Big Data - DevNexus 2014
How Google Does Big Data - DevNexus 2014
 
Google and big query
Google and big queryGoogle and big query
Google and big query
 
The Scout24 Data Landscape Manifesto: Building an Opinionated Data Platform
The Scout24 Data Landscape Manifesto: Building an Opinionated Data PlatformThe Scout24 Data Landscape Manifesto: Building an Opinionated Data Platform
The Scout24 Data Landscape Manifesto: Building an Opinionated Data Platform
 
Practical advice to build a data driven company
Practical advice to build a data driven companyPractical advice to build a data driven company
Practical advice to build a data driven company
 
IBM Cognos Business Intelligence using dashDB
IBM Cognos Business Intelligence using dashDBIBM Cognos Business Intelligence using dashDB
IBM Cognos Business Intelligence using dashDB
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
 
Knowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data ScienceKnowledge Graph for Machine Learning and Data Science
Knowledge Graph for Machine Learning and Data Science
 
Big Data Application Architectures - IoT
Big Data Application Architectures - IoTBig Data Application Architectures - IoT
Big Data Application Architectures - IoT
 
Syngenta's Predictive Analytics Platform for Seeds R&D
Syngenta's Predictive Analytics Platform for Seeds R&DSyngenta's Predictive Analytics Platform for Seeds R&D
Syngenta's Predictive Analytics Platform for Seeds R&D
 
Airbyte - Seed deck
Airbyte  - Seed deckAirbyte  - Seed deck
Airbyte - Seed deck
 
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
IBM + REDHAT "Creating the World's Leading Hybrid Cloud Provider..."
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
Net conf ar v2018 real time analytics
Net conf ar v2018 real time analyticsNet conf ar v2018 real time analytics
Net conf ar v2018 real time analytics
 
Analytics in a Day Virtual Workshop
Analytics in a Day Virtual WorkshopAnalytics in a Day Virtual Workshop
Analytics in a Day Virtual Workshop
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 

Ähnlich wie Big dataclasses 2019_nosql

how_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfhow_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfsarah david
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB
 
how_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxhow_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxsarah david
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryMárton Kodok
 
An introduction to MongoDB by César Trigo #OpenExpoDay 2014
An introduction to MongoDB by César Trigo #OpenExpoDay 2014An introduction to MongoDB by César Trigo #OpenExpoDay 2014
An introduction to MongoDB by César Trigo #OpenExpoDay 2014OpenExpoES
 
An introduction to MongoDB
An introduction to MongoDBAn introduction to MongoDB
An introduction to MongoDBCésar Trigo
 
How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists CCG
 
Introduction to mago3D, an Open Source Based Digital Twin Platform
Introduction to mago3D, an Open Source Based Digital Twin PlatformIntroduction to mago3D, an Open Source Based Digital Twin Platform
Introduction to mago3D, an Open Source Based Digital Twin PlatformSANGHEE SHIN
 
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoDB Database
 
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation FrameworkMongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation FrameworkMongoDB
 
Accelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data StrategyAccelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data StrategyMongoDB
 
Beyond the Basics 3: Introduction to the MongoDB BI Connector
Beyond the Basics 3: Introduction to the MongoDB BI ConnectorBeyond the Basics 3: Introduction to the MongoDB BI Connector
Beyond the Basics 3: Introduction to the MongoDB BI ConnectorMongoDB
 
Everything You Need to Know About MongoDB Development.pptx
Everything You Need to Know About MongoDB Development.pptxEverything You Need to Know About MongoDB Development.pptx
Everything You Need to Know About MongoDB Development.pptx75waytechnologies
 
Machine Learning for z/OS
Machine Learning for z/OSMachine Learning for z/OS
Machine Learning for z/OSCuneyt Goksu
 
MongoDB_Spark
MongoDB_SparkMongoDB_Spark
MongoDB_SparkMat Keep
 
Idc datadog-expands-into-apm
Idc datadog-expands-into-apmIdc datadog-expands-into-apm
Idc datadog-expands-into-apmBrett Sheppard
 
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...SANGHEE SHIN
 

Ähnlich wie Big dataclasses 2019_nosql (20)

how_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdfhow_can_businesses_address_storage_issues_using_mongodb.pdf
how_can_businesses_address_storage_issues_using_mongodb.pdf
 
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB AtlasMongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
MongoDB SoCal 2020: Migrate Anything* to MongoDB Atlas
 
how_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptxhow_can_businesses_address_storage_issues_using_mongodb.pptx
how_can_businesses_address_storage_issues_using_mongodb.pptx
 
MongoDB DOC v1.5
MongoDB DOC v1.5MongoDB DOC v1.5
MongoDB DOC v1.5
 
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQueryCodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
CodeCamp Iasi - Creating serverless data analytics system on GCP using BigQuery
 
An introduction to MongoDB by César Trigo #OpenExpoDay 2014
An introduction to MongoDB by César Trigo #OpenExpoDay 2014An introduction to MongoDB by César Trigo #OpenExpoDay 2014
An introduction to MongoDB by César Trigo #OpenExpoDay 2014
 
An introduction to MongoDB
An introduction to MongoDBAn introduction to MongoDB
An introduction to MongoDB
 
How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists How Cloud is Affecting Data Scientists
How Cloud is Affecting Data Scientists
 
Introduction to mago3D, an Open Source Based Digital Twin Platform
Introduction to mago3D, an Open Source Based Digital Twin PlatformIntroduction to mago3D, an Open Source Based Digital Twin Platform
Introduction to mago3D, an Open Source Based Digital Twin Platform
 
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
 
Introduction to mongodb
Introduction to mongodbIntroduction to mongodb
Introduction to mongodb
 
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation FrameworkMongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
MongoDB World 2019: Unleash the Power of the MongoDB Aggregation Framework
 
Accelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data StrategyAccelerating a Path to Digital with a Cloud Data Strategy
Accelerating a Path to Digital with a Cloud Data Strategy
 
Beyond the Basics 3: Introduction to the MongoDB BI Connector
Beyond the Basics 3: Introduction to the MongoDB BI ConnectorBeyond the Basics 3: Introduction to the MongoDB BI Connector
Beyond the Basics 3: Introduction to the MongoDB BI Connector
 
Everything You Need to Know About MongoDB Development.pptx
Everything You Need to Know About MongoDB Development.pptxEverything You Need to Know About MongoDB Development.pptx
Everything You Need to Know About MongoDB Development.pptx
 
Machine Learning for z/OS
Machine Learning for z/OSMachine Learning for z/OS
Machine Learning for z/OS
 
MongoDB_Spark
MongoDB_SparkMongoDB_Spark
MongoDB_Spark
 
Idc datadog-expands-into-apm
Idc datadog-expands-into-apmIdc datadog-expands-into-apm
Idc datadog-expands-into-apm
 
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...
Let's integrate CAD/BIM/GIS on the same platform: A practical approach in rea...
 
Dataweek-Talk-2014
Dataweek-Talk-2014Dataweek-Talk-2014
Dataweek-Talk-2014
 

Kürzlich hochgeladen

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Kürzlich hochgeladen (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Big dataclasses 2019_nosql

  • 1. 17/07/2019 Big Data class by Alexandre Bergere 1 Big Data ESAIP – IR4
  • 2. 17/07/2019 Big Data class by Alexandre Bergere 2 alexandre.bergere@gmail.com https://fr.linkedin.com/in/alexandrebergere @AlexPhile ESAIP 2013 - 2016 Avanade 2016 - 2019 Sr Anls, Data EngineeringStudent Worked as a senior analyst at Avanade France, I have developed my skills in data analysis (MSBI, Power BI, R, Python) by working on innovative projects and proofs of concept in the energy industry. ESAIP Teacher 2016 - ? Freelance 2019 - x Data Analyst & Data Architect
  • 3. 17/07/2019 Big Data class by Alexandre Bergere 3 Planning D-1 D-2 D-3 D-4 D-5 MorningAfternoon What’s Big Data + No SQL + Cloud Architecture Azure IOT + Azure Stream Analytics + Power BI Theorical AWS Practice Azure Practice Exam Oral Exam Written Exam SPARK SPARK Free time Prep. Oral Analyse Big Data with Hadoop SPARK Redshift Cosmos DB Serverless architecture : AWS Lambda + DynamoDB + NodeJS Cosmos DB SPARK On Prem Neo4J Mongo DB Cloud SPARK
  • 4. 17/07/2019 Big Data class by Alexandre Bergere 4 Planning D-1 D-2 D-3 MorningAfternoon What’s Big Data Azure IOT + Azure Stream Analytics + Power BI Theorical Azure Practice Cosmos DB SPARK On Prem Neo4J Mongo DB Cloud Cloud architecture Written Exam BI & Machine Learning Analyse Big Data with Hadoop
  • 5. 17/07/2019 Big Data class by Alexandre Bergere 6 Data Storage
  • 6. 17/07/2019 Big Data class by Alexandre Bergere 7 Data Storage Relational data store HDFS Key Value data store Columnar data store Object store Search data store Graph data store Document data store
  • 7. 17/07/2019 Big Data class by Alexandre Bergere 8 Mongo DB
  • 8. 17/07/2019 Big Data class by Alexandre Bergere 9 Mongo DB Created in 2007 & first release in 2010. Easy and simple … as a leaf. Document data store & Schemaless.
  • 9. Nexus Architecture 17/07/2019 Big Data class by Alexandre Bergere 10
  • 10. Driver & Framework 17/07/2019 Big Data class by Alexandre Bergere 11
  • 11. MongoDB is easy 17/07/2019 Big Data class by Alexandre Bergere 12 For many developers, data model goes hand in hand with object mapping, and for that purpose you may have used an object-relational mapping library, such as Java’s Hibernate framework or Ruby’s ActiveRecord. Such libraries can be useful for efficiently building applications with a RDBMS, but they’re less necessary with MongoDB. This is due in part to the fact that a document is already an object- like representation. It’s also partly due to the MongoDB drivers, which already provide a fairly high-level interface to MongoDB. Without question, you can build applications on MongoDB using the driver interface alone.
  • 12. Use cases 17/07/2019 Big Data class by Alexandre Bergere 13 o Web application (mongoDB is well-suited as primary datastore for web application) o Agile development o Analytics and logging o Caching o Variable Schemas
  • 13. Mongo DB 4.0 : ACID transactions 17/07/2019 Big Data class by Alexandre Bergere 14 More info. Bêta test.
  • 14. Mongo DB releases 17/07/2019 Big Data class by Alexandre Bergere 15
  • 15. Compagnies 17/07/2019 Big Data class by Alexandre Bergere 16
  • 16. Analytics – use case 17/07/2019 Big Data class by Alexandre Bergere 17 More info. The City of Chicago cuts crime and improves citizen welfare with a real-time geospatial analytics platform called WindyGrid. Using MongoDB, it analyzes data from 30+ different departments – like bus locations, 911 calls, and even tweets – to better understand and respond to emergencies.
  • 17. The case for adding NoSQL 17/07/2019 Big Data class by Alexandre Bergere 18 o Large volumes of rapidly changing structured, semi-structured, and unstructured data o Agile sprints, quick schema iteration, and frequent code pushes o API-driven, object-oriented programming that is easy to use and flexible o Geographically distributed scale-out architecture instead of expensive, monolithic architecture Consider, for example, enterprise resource planning (ERP), a standard for relational databases. What if you want to offer ERP forms users can actually modify if they need to? A document- based NoSQL database such as MongoDB can provide that functionality without requiring you to rebuild your whole data schema every time a user wants to change the data format.
  • 18. White papers 17/07/2019 Big Data class by Alexandre Bergere 19 MongoDB – BI & Analytics MongoDB – Kafka MongoDB – Spark
  • 19. Leader in The Forrester Wave™: Big Data NoSQL, Q1 2019 17/07/2019 Big Data class by Alexandre Bergere 20 o Data Types o Streaming and Loading o Big Data Support o In-memory o Performance o Scalability o High Availability & Disaster Recovery o Tools o Workloads o Use Cases o Ability to Execute o Road Map o Open Source and Licensing o Support
  • 20. 17/07/2019 Big Data class by Alexandre Bergere 21 Tools
  • 21. MongoDB Compass 17/07/2019 Big Data class by Alexandre Bergere 22
  • 22. Mongo DB Atlas 17/07/2019 Big Data class by Alexandre Bergere 23 DAAS : Database As A Service • Schema design • Query and index optimization • Server size selection - you must select the appropriate size of server, coupled with IO and storage capacity • Capacity planning - you must determine when you need additional capacity, typically using the monitoring telemetry provided by MongoDB Atlas, but you can make these changes with no downtime • Initiating database restores • How much you use
  • 23. Mongo DB Cloud Manager 17/07/2019 Big Data class by Alexandre Bergere 24
  • 24. Mongo DB Connector for BI 17/07/2019 Big Data class by Alexandre Bergere 25
  • 25. MongoDB Charts (beta) 17/07/2019 Big Data class by Alexandre Bergere 26 MongoDB Charts is the fastest and easiest way to build visualizations of MongoDB data.
  • 26. Architecture pseudo On premise 17/07/2019 Big Data class by Alexandre Bergere 27
  • 27. Change Streams 17/07/2019 Big Data class by Alexandre Bergere 28 More info. Change streams allow applications to access real-time data changes without the complexity and risk of tailing the oplog. Applications can use change streams to subscribe to all data changes on a collection and immediately react to them.
  • 28. Stitch 17/07/2019 Big Data class by Alexandre Bergere 29 Full access to MongoDB, declarative read/write controls, and integration with your choice of services MongoDB Stitch lets developers focus on building applications rather than on managing data manipulation code, service integration, or backend infrastructure. Whether you’re just starting up and want a fully managed backend as a service, or you’re part of an enterprise and want to expose existing MongoDB data to new applications, Stitch lets you focus on building the app users want, not on writing boilerplate backend logic.
  • 29. 17/07/2019 Big Data class by Alexandre Bergere 30 Modeling & request
  • 30. Document are rich data structure 17/07/2019 Big Data class by Alexandre Bergere 31 • JSON: • String, Number, Array, Object, NULL, Boolean. • BSON: • Date, BinData, ObjectID, Geo-Location. • Better storage performance. ObjectID: ◦ _id : 'DATE[4] | MAC_ADDR[3] | PID[2] | COUNTER[3]
  • 31. Available Types 17/07/2019 Big Data class by Alexandre Bergere 32 Type Number Alias Notes Double 1 “double” String 2 “string” Object 3 “object” Array 4 “array” Binary data 5 “binData” Undefined 6 “undefined” Deprecated. ObjectId 7 “objectId” Boolean 8 “bool” Date 9 “date” Null 10 “null” Regular Expression 11 “regex” DBPointer 12 “dbPointer” Deprecated. JavaScript 13 “javascript” Symbol 14 “symbol” Deprecated. JavaScript (with scope) 15 “javascriptWithScope” 32-bit integer 16 “int” Timestamp 17 “timestamp” 64-bit integer 18 “long” Decimal128 19 “decimal” New in version 3.4. Min key -1 “minKey” Max key 127 “maxKey”
  • 32. SQL vs MongoDB Terms 17/07/2019 Big Data class by Alexandre Bergere 33 SQL Terms/Concepts MongoDB Terms/Concepts Database Database Table Collection Line Document Column Field Index Index Join Embeded or linked document Primary key Primary key (start by « _id »)
  • 33. Documents are Flexible 17/07/2019 Big Data class by Alexandre Bergere 34
  • 34. Document Model 17/07/2019 Big Data class by Alexandre Bergere 35 Pers_ID Surname First_Name City 0 Miller Paul London 1 Ortega Alvaro Valencia 2 Huber Urs Zurich 3 Blanc Gaston Paris 4 Bertolini Fabrizio Rome Car_ID Model Year Value Pers_ID 101 Bently 1973 100000 0 102 Rolls Royce 1965 330000 0 103 Peugot 1993 500 3 104 Ferrari 2005 150000 4 105 Renault 1998 2000 3 106 Renault 2001 7000 3 107 Smart 1999 2000 2 CAR PERSON Mongo DB RDBMS
  • 35. One to many 17/07/2019 Big Data class by Alexandre Bergere 36
  • 36. CRUD 17/07/2019 Big Data class by Alexandre Bergere 37 # FIND() > db.<collection>.find({<conditions>},{<fields>}) > db.products.find( { qty: { $gt: 25 } }, { item: 1, qty: 1 } ) Options: .pretty() .sort() : 1: ASC, -1: DESC : sort({'name':-1}) .skip() : number .limit() : number .count() Chain sort first, skip second, and limit last, because that is the only order that makes sense.
  • 37. CRUD 17/07/2019 Big Data class by Alexandre Bergere 38 # INSERT() > db.<collection>.insert({<value>}) > db.<collection>.insertMany([{<values>}]) > db.inventory.insertMany([ { item: "journal", qty: 25, tags: ["blank", "red"], size: { h: 14, w: 21, uom: "cm" } }, { item: "mat", qty: 85, tags: ["gray"], size: { h: 27.9, w: 35.5, uom: "cm" } }, { item: "mousepad", qty: 25, tags: ["gel", "blue"], size: { h: 19, w: 22.85, uom: "cm" } } ]) db.collection.insertOne() – Inserts a single document into a collection. db.collection.insertMany() – Inserts multiple documents into a collection. db.collection.insert() – Inserts a single document or multiple documents into a collection.
  • 38. CRUD 17/07/2019 Big Data class by Alexandre Bergere 39 # UPDATE() > db.<collection>.update({<conditions>},{<update>},{upsert: true/false, multi: true/false}) > { "_id": "artist:271", "last_name": "Cotillard", "first_name": "Marion", "birth_date": "1975" } # Operator Update > db.artists.update({"_id": "artist:271"},{ $set : {"last_name" : "Page"}}) > { "_id": "artist:271", "last_name": "Page", "first_name": "Marion", "birth_date": "1975" } # Replacement Update > db.artists.update({"_id": "artist:271"},{"last_name" : "Page"}) > { "_id": "artist:271", "last_name": "Page"} ❑ Operator Update ❑ Replacement Update upsert: boolean – Optional. If set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found. multi: boolean – Optional. If set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false.
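For illustration, the same options through the pymongo driver (a sketch: update_one/update_many are the driver-side equivalents of the shell's upsert and multi flags; artist:999 and the birth_decade field are made up for the example):

from pymongo import MongoClient

artists = MongoClient()["crunchbase"]["artists"]

# upsert=True inserts the document if no _id matches
artists.update_one({"_id": "artist:999"},
                   {"$set": {"last_name": "Doe", "first_name": "Jane"}},
                   upsert=True)

# update_many is the driver equivalent of multi: true
artists.update_many({"birth_date": "1975"},
                    {"$set": {"birth_decade": "1970s"}})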
  • 39. CRUD 17/07/2019 Big Data class by Alexandre Bergere 40 # DELETE() > db.<collection>.remove ({<conditions>}) > db.artists.remove({"_id": "artist:39"}) # Remove all documents > db.artists.remove({})
  • 40. Query Operator 17/07/2019 Big Data class by Alexandre Bergere 41 Name Description $eq Matches values that are equal to a specified value. $gt Matches values that are greater than a specified value. $gte Matches values that are greater than or equal to a specified value. $lt Matches values that are less than a specified value. $lte Matches values that are less than or equal to a specified value. $ne Matches all values that are not equal to a specified value. $in Matches any of the values specified in an array.
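A short sketch combining these operators in one filter, using pymongo against the lab's artists collection (the value ranges and names are illustrative):

from pymongo import MongoClient

artists = MongoClient()["crunchbase"]["artists"]

# $gte/$lte bound a range; $in matches any value in the given list
born_in_70s = artists.find({"birth_date": {"$gte": "1970", "$lte": "1979"}})
named = artists.find({"first_name": {"$in": ["Marion", "Jonathan"]}})

for artist in named:
    print(artist["first_name"], artist.get("last_name"))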
  • 41. Update Operator: $set 17/07/2019 Big Data class by Alexandre Bergere 42 # $set > db.products.update( { _id: 100 }, { $set: { quantity: 500, details: { model: "14Q3", make: "xyz" }, tags: [ "coats", "outerwear", "clothing" ] } } ) # $set Embedded Documents > db.products.update( { _id: 100 }, { $set: { "details.make": "zzz" } } ) # $set in Arrays > db.products.update( { _id: 100 }, { $set: { "tags.1": "rain gear", "ratings.0.rating": 2 } } )
  • 42. Update Operators: Arrays 17/07/2019 Big Data class by Alexandre Bergere 43 Name Description $pull Removes all array elements that match a specified query. $push Adds an element to an array. $pop Removes the first or last item of an array. $addToSet Adds elements to an array only if they do not already exist in the set. $in Matches any of the values specified in an array (a query operator, often combined with $pull). A sketch of $addToSet and $pop follows this table.
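As mentioned above, here is a sketch of the two operators the labs do not exercise, $addToSet and $pop, again via pymongo (the hobby value is illustrative):

from pymongo import MongoClient

artists = MongoClient()["crunchbase"]["artists"]

# $addToSet appends only if "chess" is not already in the array
artists.update_one({"_id": "artist:280"}, {"$addToSet": {"hobbies": "chess"}})

# $pop removes one element from either end: 1 = last, -1 = first
artists.update_one({"_id": "artist:280"}, {"$pop": {"hobbies": 1}})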
  • 43. DML 17/07/2019 Big Data class by Alexandre Bergere 44 # Returns all databases > show dbs # The current database name: > db.getName() # Returns all collections in the current database: > db.getCollectionNames() # Returns a collection or a view object: > db.getCollection(name) # The current database connection: > db.getMongo() # Clear the console log: > cls # Return collection information: > db.getCollectionInfos({name: "name"})
  • 44. Command-line tools 17/07/2019 Big Data class by Alexandre Bergere 45 # Import multiple documents: > mongoimport -d crunchbase -c companies D:\MongoDB\src\companies.json # Import multiple documents from a JSON array: > mongoimport -d crunchbase -c companies D:\MongoDB\src\companies.json --jsonArray # Export > mongoexport -d crunchbase -c artists --out D:\MongoDB\artists.json Run these from the system shell, not inside a mongo shell session. Command Description mongodump mongodump is a utility for creating a binary export of the contents of a database. mongodump can export data from either mongod or mongos instances. mongorestore The mongorestore program loads data from either a binary database dump created by mongodump or the standard input (starting in version 3.0.0) into a mongod or mongos instance. mongostat This utility constantly polls MongoDB and the system to provide helpful stats, including the number of operations per second (inserts, queries, updates, deletes, and so on), the amount of virtual memory allocated, and the number of connections to the server. mongoperf Helps you understand the disk operations happening in a running MongoDB instance. mongotop Similar to top, this utility polls MongoDB and shows the amount of time it spends reading and writing data in each collection. mongosniff A wire-sniffing tool for viewing operations sent to the database. It essentially translates the BSON going over the wire to human-readable shell statements.
  • 45. $text 17/07/2019 Big Data class by Alexandre Bergere 46 # $text > db.articles.find( { $text: { $search: "coffee" } } ) $text performs a text search on the content of the fields indexed with a text index. A $text expression has the following syntax: # $text > { $text: { $search: <string>, $language: <string>, $caseSensitive: <boolean>, $diacriticSensitive: <boolean> } } # Create index first - You can index multiple fields for the text index: db.reviews.createIndex( { subject: "text", comments: "text" } )
  • 46. Schema Validation 17/07/2019 Big Data class by Alexandre Bergere 47 Implement data governance without sacrificing the agility that comes from a dynamic schema. With schema validation, developers and operations spend less time defining data quality controls in their applications, and instead delegate these tasks to the database.
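For example, a collection can be created with a $jsonSchema validator so the server rejects documents that miss required fields; a minimal sketch with pymongo (the contacts collection and its fields are invented for illustration):

from pymongo import MongoClient

db = MongoClient()["crunchbase"]

# documents that fail the schema are rejected on insert/update
db.create_collection("contacts", validator={
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name"],
        "properties": {
            "name":  {"bsonType": "string"},
            "phone": {"bsonType": "string"}
        }
    }
})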
  • 47. Aggregation 17/07/2019 Big Data class by Alexandre Bergere 48 Swiss Army knife Executes in native code o Written in C++ o JSON parameter Flexible, functional, simple o Operation pipeline o Computational expressions
  • 48. Pipeline operators 17/07/2019 Big Data class by Alexandre Bergere 49 Operator Description $match Filter documents $project Reshape documents $group Summarize documents $unwind Expand arrays in documents $sort Order documents $limit / $skip Paginate documents $redact Restrict documents $geoNear Proximity sort documents $let, $map Define variables
  • 49. $match 17/07/2019 Big Data class by Alexandre Bergere 50 # Matching field values > {$match: { language: "Russian" }} { title: "War and Peace", pages: 1440, language: "Russian" } # Matching with query operators > {$match: { pages: {$gt: 100} }} { title: "War and Peace", pages: 1440, language: "Russian" }, { title: "Atlas Shrugged", pages: 1088, language: "English" }
  • 50. $project 17/07/2019 Big Data class by Alexandre Bergere 51 # Renaming and computing fields > {$project: { avgChapterLength: { $divide: ["$pages", "$chapters"] }, lang: "$language" }} { _id: 375, avgChapterLength: 24.2222, lang: "English" } # Including & excluding fields > {$project: { _id: 0, title: 1, language: 1 }} { title: "Great Gatsby", language: "English" }
  • 51. $group 17/07/2019 Big Data class by Alexandre Bergere 52 # Collect distinct values > {$group: { _id: "$language", titles: {$addToSet: "$title"} }} { _id: "English", titles: ["Atlas Shrugged", "The Great Gatsby"] }, { _id: "Russian", titles: ["War and Peace"] } # Calculating averages, summing fields… > {$group: { _id: "$language", pages: {$sum: "$pages"}, books: {$sum: 1}, avgPages: {$avg: "$pages"} }} { _id: "Russian", pages: 1440, books: 1, avgPages: 1440 }
  • 52. $unwind 17/07/2019 Big Data class by Alexandre Bergere 53 # Expand an array into one document per element > {$unwind: "$subjects"} Input: { title: "The Great Gatsby", ISBN: "9762832930920323", subjects: [ "Long Island", "New York", "1920s" ] } Output: { title: "The Great Gatsby", ISBN: "9762832930920323", subjects: "Long Island" }, { title: "The Great Gatsby", ISBN: "9762832930920323", subjects: "New York" }, { title: "The Great Gatsby", ISBN: "9762832930920323", subjects: "1920s" }
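Putting the previous operators together, a full pipeline might look like this in pymongo (a sketch; the books collection is invented to mirror the documents used in the $match/$group/$unwind slides):

from pymongo import MongoClient

books = MongoClient()["crunchbase"]["books"]

pipeline = [
    {"$match":  {"pages": {"$gt": 100}}},          # filter documents
    {"$unwind": "$subjects"},                      # one document per subject
    {"$group":  {"_id": "$language",
                 "books":    {"$sum": 1},
                 "avgPages": {"$avg": "$pages"}}},  # summarize per language
    {"$sort":   {"books": -1}},                    # order by count, descending
]
for doc in books.aggregate(pipeline):
    print(doc)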
  • 53. 17/07/2019 Big Data class by Alexandre Bergere 54 LABS
  • 54. Installation 17/07/2019 Big Data class by Alexandre Bergere 55 Download & Install
  • 55. Instance 17/07/2019 Big Data class by Alexandre Bergere 56 Launch as a service: mongod --dbpath C:\Users\alexa\Documents\MongoDB\data Launch as a connection: mongo Options Shortcut --db -d --collection -c --username -u --password -p --host -h
  • 56. Request practice 17/07/2019 Big Data class by Alexandre Bergere 57 # 1.0 Load artists.json > mongoimport -d crunchbase -c artists --file C:\Users\alexa\Documents\Cours\MongoDB\2017-2018\src\artists.json --jsonArray --port 27018 # 1.1 Return first_name and birth_date for all artists born in 1964 > db.artists.find({"birth_date": "1964"},{"_id":0,"first_name":1, "birth_date":1}) # 1.2 Return all artists born after 1980 or whose first name begins with 'Chri' > db.artists.find({$or:[{"birth_date": {$gte:"1980"}},{"first_name":/^Chri/}]},{}) > db.artists.find({$or:[{"birth_date": {$gte:"1980"}},{"first_name":{$regex : /^Chri/}}]},{}) # 1.3 Return the 6th to the 9th artist, sorted by last name descending > db.artists.find().pretty().sort({"last_name":-1}).skip(5).limit(4) # 1.4 Insert the following artist (replace the id): {"_id": "artist:282", "last_name": "Bergere", "first_name": "Alexandre", "birth_date": "1992"} > db.artists.insert({ "_id": "artist:282", "last_name": "Bergere", "first_name": "Alexandre", "birth_date": "1992" })
  • 57. Request practice 17/07/2019 Big Data class by Alexandre Bergere 58 # 1.5 Set the first_name of the artist with id artist:266 to "Jonathan" > db.artists.update({"_id": "artist:266"},{$set:{"first_name":"Jonathan"}}) # 1.6 Add "golf" to artist:280's hobbies > db.artists.update({"_id": "artist:280"},{$push:{"hobbies":"golf"}}) # 1.7 Add "yoga" to artist:282's hobbies > db.artists.update({"_id": "artist:282"},{$push:{"hobbies":"yoga"}}) # 1.8 Remove "poney" and "photo" from artist:280's hobbies > db.artists.update({"_id": "artist:280"},{$pull:{"hobbies": {$in:["poney","photo"]}}})
  • 58. Request practice 17/07/2019 Big Data class by Alexandre Bergere 59 # Convert string to integer > db.artists.find({birth_date: {$exists: true}}).forEach(function(obj) { obj.birth_date = new NumberInt(obj.birth_date); db.artists.save(obj); });
  • 59. 17/07/2019 Big Data class by Alexandre Bergere 60 Go Deeper
  • 60. Support MongoDB in action, 2nd Edition docs.mongodb.com 17/07/2019 MongoDB class by Alexandre Bergere 61
  • 63. 17/07/2019 Big Data class by Alexandre Bergere 64
  • 64. 17/07/2019 Big Data class by Alexandre Bergere 65 Graph database
  • 65. What is a graph database? 17/07/2019 Big Data class by Alexandre Bergere 66 A graph database is an online database management system with Create, Read, Update and Delete (CRUD) operations working on a graph data model. Graph databases are generally built for use with online transaction processing (OLTP) systems. Accordingly, they are normally optimized for transactional performance, and engineered with transactional integrity and operational availability in mind. ~ Neo4j Unlike other databases, relationships take first priority in graph databases.
  • 66. The case for graph databases 17/07/2019 Big Data class by Alexandre Bergere 67
  • 67. What is a Graph? 17/07/2019 Big Data class by Alexandre Bergere 68 A graph is just a collection of vertices and edges—or, in less intimidating language, a set of nodes and the relationships that connect them.
  • 68. Definitions 17/07/2019 Big Data class by Alexandre Bergere 69 • Nodes o Nodes are the main data elements o Nodes are connected to other nodes via relationships o Nodes can have one or more properties (i.e., attributes stored as key/value pairs) o Nodes have one or more labels that describe their role in the graph o Example: Person nodes vs Car nodes • Relationships o Relationships connect two nodes o Relationships are directional o Nodes can have multiple, even recursive relationships o Relationships can have one or more properties (i.e., attributes stored as key/value pairs) • Properties o Properties are named values where the name (or key) is a string o Properties can be indexed and constrained o Composite indexes can be created from multiple properties • Labels o Labels are used to group nodes into sets o A node may have multiple labels o Labels are indexed to accelerate finding nodes in the graph o Native label indexes are optimized for speed
  • 69. Modelling relational to graph 17/07/2019 Big Data class by Alexandre Bergere 70 Similarities (Relational → Graph): Rows → Nodes; Joins → Relationships; Table names → Labels; Columns → Properties. How the relational model differs from the graph model: each column must have a field value → nodes with the same label aren't required to have the same set of properties; joins are calculated at query time → relationships are stored on disk when they are created; a row can belong to one table → a node can have many labels.
  • 70. RDBMS vs graph 17/07/2019 Big Data class by Alexandre Bergere 71
  • 71. 17/07/2019 Big Data class by Alexandre Bergere 72 Neo4j
  • 72. Neo4j Graph Platform 17/07/2019 Big Data class by Alexandre Bergere 73 The Neo4j Graph Platform includes out-of-the-box tooling that enables you to access graphs in Neo4j Databases. In addition, Neo4j provides APIs and drivers that enable you to create applications and custom tooling for accessing and visualizing graphs.
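As an illustration of the driver route mentioned above, here is a minimal sketch with the official neo4j Python driver (the bolt URI, credentials, and sample Person data are assumptions):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # placeholder credentials

with driver.session() as session:
    # parameters ($name, $born) are passed separately from the Cypher text
    session.run("CREATE (:Person {name: $name, born: $born})",
                name="Ada", born=1815)
    result = session.run("MATCH (p:Person) WHERE p.born < 1900 "
                         "RETURN p.name AS name")
    for record in result:
        print(record["name"])

driver.close()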
  • 73. Dev env. 17/07/2019 Big Data class by Alexandre Bergere 75 Neo4j Desktop: o Neo4j Database server o graph engine o kernel (Cypher execution) o Neo4j Browser o additional libraries and drivers for accessing the Neo4j database Neo4j Sandbox: o temporary, cloud-based instance of a Neo4j Server with its associated graph that you can access from any Web browser o available for three days, but you can extend it for up to 10 days o you can use Neo4j Browser Sync to save Cypher scripts from your sandbox
  • 74. Neo4j Browser 17/07/2019 Big Data class by Alexandre Bergere 76
  • 75. 17/07/2019 Big Data class by Alexandre Bergere 77 Introduction to Cypher
  • 76. What’s Cypher? 17/07/2019 Big Data class by Alexandre Bergere 78 Cypher is a declarative query language that allows for expressive and efficient querying and updating of graph data. Cypher uses ASCII art and focuses on the clarity of expressing what to retrieve from a graph. Cypher was inspired by SPARQL, SQL, Python, and Haskell.
  • 77. Node & Label 17/07/2019 Big Data class by Alexandre Bergere 79 () // anonymous node not be referenced later in the query (p) // variable p, a reference to a node used later (:Person) // anonymous node of type Person (p:Person) // p, a reference to a node of type Person (p:Actor:Director) // p, a reference to a node of types Actor and Director Examining the data model CALL db.schema
  • 78. Using MATCH to retrieve nodes 17/07/2019 Big Data class by Alexandre Bergere 80 MATCH (n) // returns all nodes in the graph RETURN n MATCH (p:Person) // returns all Person nodes in the graph RETURN p When you specify a pattern for a MATCH clause, you should always specify a node label if possible. In doing so, the graph engine uses an index to retrieve the nodes which will perform better than not using a label for the MATCH.
  • 79. Properties 17/07/2019 Big Data class by Alexandre Bergere 81 A property is defined for a node and not for a type of node. All nodes of the same type need not have the same properties. // Query the database for all property keys CALL db.propertyKeys MATCH (variable:Label {propertyKey: propertyValue, propertyKey2: propertyValue2}) RETURN variable MATCH (m:Movie {released: 2003, tagline: 'Free your mind'}) RETURN m
  • 80. Filtering queries using property values 17/07/2019 Big Data class by Alexandre Bergere 82 // Retrieve all Movie nodes that have a released property value of 2003. MATCH (m:Movie {released:2003}) RETURN m // Retrieve all Movies released in 2006, returning their titles MATCH (m:Movie {released: 2006}) RETURN m.title // Display title, released, and tagline values for every Movie node in the graph MATCH (m:Movie) RETURN m.title AS `movie title`, m.released AS released, m.tagline AS tagLine
  • 81. Relationships 17/07/2019 Big Data class by Alexandre Bergere 83 A relationship is a directed connection between two nodes that has a relationship type (name). In addition, a relationship can have properties, just like nodes. () // a node ()--() // 2 nodes have some type of relationship ()-->() // the first node has a relationship to the second node ()<--() // the second node has a relationship to the first node Here is how Cypher uses ASCII art to specify the path used for a query. Querying using relationships: MATCH (node1)-[:REL_TYPE]->(node2) RETURN node1, node2 MATCH (node1)-[:REL_TYPEA | :REL_TYPEB]->(node2) RETURN node1, node2 node1 is a specification of a node where you may include node labels and property values for filtering. :REL_TYPE is the type (name) for the relationship. For this syntax the relationship is from node1 to node2. :REL_TYPEA , :REL_TYPEB are the relationships from node1 to node2. The nodes are returned if at least one of the relationships exists. node2 is a specification of a node where you may include node labels and property values for filtering.
  • 82. Relationships 17/07/2019 Big Data class by Alexandre Bergere 84 Using patterns for queries: MATCH (p:Person)-[:FOLLOWS]->(:Person {name:'Angela Scope'}) RETURN p MATCH (p:Person)<-[:FOLLOWS]-(:Person {name:'Angela Scope'}) RETURN p
  • 83. Relationships 17/07/2019 Big Data class by Alexandre Bergere 85 Using patterns for queries: // Querying by any direction of the relationship MATCH (p1:Person)-[:FOLLOWS]-(p2:Person {name:'Angela Scope'}) RETURN p1, p2
  • 84. Relationships 17/07/2019 Big Data class by Alexandre Bergere 86 Using patterns for queries: // Traversing relationships : query to return all followers of the followers of Jessica Thompson. MATCH (p:Person)-[:FOLLOWS]->(:Person)-[:FOLLOWS]->(:Person {name:'Jessica Thompson'}) RETURN p // Traversing relationships : return each person along the path by specifying variables for the nodes and returning them MATCH path = (:Person)-[:FOLLOWS]->(:Person)-[:FOLLOWS]->(:Person {name:'Jessica Thompson'}) RETURN path
  • 85. Relationships 17/07/2019 Big Data class by Alexandre Bergere 87 Using a relationship in a query: MATCH (p:Person)-[rel:ACTED_IN]->(m:Movie {title: 'The Matrix'}) RETURN p, rel, m Variables: o p represents the Person nodes during the query o m represents the Movie node retrieved o rel represents the relationship for the relationship type, ACTED_IN Querying by multiple relationships: MATCH (p:Person {name: 'Tom Hanks'})-[:ACTED_IN|:DIRECTED]->(m:Movie) RETURN p.name, m.title
  • 86. Relationships 17/07/2019 Big Data class by Alexandre Bergere 88 Using anonymous nodes in a query: MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'The Matrix'}) RETURN p.name A best practice is to place named nodes (those with variables) before anonymous nodes in a MATCH clause. Using an anonymous relationship for a query: // find all people who are in any way connected to the movie MATCH (p:Person)-->(m:Movie {title: 'The Matrix'}) RETURN p, m MATCH (p:Person)--(m:Movie {title: 'The Matrix'}) RETURN p, m
  • 87. Relationships 17/07/2019 Big Data class by Alexandre Bergere 89 Retrieving the relationship types: MATCH (p:Person)-[rel]->(:Movie {title:'The Matrix'}) RETURN p.name, type(rel) Retrieving properties for relationships: MATCH (p:Person)-[:REVIEWED {rating: 65}]->(:Movie {title: 'The Da Vinci Code'}) RETURN p.name
  • 88. Filtering queries using relationships 17/07/2019 Big Data class by Alexandre Bergere 90 // Retrieve all people who wrote the movie Speed Racer MATCH (p:Person)-[:WROTE]->(:Movie {title: 'Speed Racer'}) RETURN p.name // Retrieve all movies that are connected to the person, Tom Hanks MATCH (m:Movie)<--(:Person {name: 'Tom Hanks'}) RETURN m.title or MATCH(:Person {name: 'Tom Hanks'})-->(m:Movie) RETURN m.title // Retrieve information about the relationships Tom Hanks has with the set of movies retrieved earlier MATCH (m:Movie)-[rel]-(:Person {name: 'Tom Hanks'}) RETURN m.title, type(rel) // Retrieve information about the roles that Tom Hanks acted in MATCH (m:Movie)-[rel:ACTED_IN]-(:Person {name: 'Tom Hanks'}) RETURN m.title, rel.roles
  • 89. Cypher style recommendations 17/07/2019 Big Data class by Alexandre Bergere 91 Here are the Neo4j-recommended Cypher coding standards: o Node labels are CamelCase and begin with an upper-case letter (examples: Person, NetworkAddress). Note that node labels are case-sensitive. o Property keys, variables, parameters, aliases, and functions are camelCase and begin with a lower-case letter (examples: businessAddress, title). Note that these elements are case-sensitive. o Relationship types are in upper-case and can use the underscore. (examples: ACTED_IN, FOLLOWS). Note that relationship types are case-sensitive and that you cannot use the “-” character in a relationship type. o Cypher keywords are upper-case (examples: MATCH, RETURN). Note that Cypher keywords are case-insensitive, but a best practice is to use upper-case. o String constants are in single quotes, unless the string contains a quote or apostrophe (examples: ‘The Matrix’, “Something’s Gotta Give”). Note that you can also escape single or double quotes within strings that are quoted with the same using a backslash character. o Specify variables only when needed for use later in the Cypher statement. o Place named nodes and relationships (that use variables) before anonymous nodes and relationships in your MATCH clauses when possible. o Specify anonymous relationships with -->, --, or <-- MATCH (:Person {name: 'Diane Keaton'})-[movRel:ACTED_IN]-> (:Movie {title:"Something's Gotta Give"}) RETURN movRel.roles Follow the Cypher Style Guide when writing your Cypher statements.
  • 90. 17/07/2019 Big Data class by Alexandre Bergere 92 Getting More Out of Queries
  • 91. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 93 MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released = 2008 RETURN p, m // complex conditions MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released >= 2003 AND m.released <= 2004 RETURN p, m // same as previous MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE 2003 <= m.released <= 2004 RETURN p.name, m.title, m.released // equivalent to the first query, filtering inline on the node pattern MATCH (p:Person)-[:ACTED_IN]->(m:Movie {released: 2008}) RETURN p, m
  • 92. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 94 // Testing labels MATCH (p:Person) RETURN p.name MATCH (p:Person)-[:ACTED_IN]->(:Movie {title: 'The Matrix'}) RETURN p.name // the same queries, testing labels in the WHERE clause instead: MATCH (p) WHERE p:Person RETURN p.name MATCH (p)-[:ACTED_IN]->(m) WHERE p:Person AND m:Movie AND m.title='The Matrix' RETURN p.name
  • 93. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 95 // Testing the existence of a property MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE p.name='Jack Nicholson' AND exists(m.tagline) RETURN m.title, m.tagline // Testing strings : You can specify STARTS WITH, ENDS WITH, and CONTAINS MATCH (p:Person)-[:ACTED_IN]->() WHERE toLower(p.name) STARTS WITH 'michael' RETURN p.name // Testing with regular expressions; You use the syntax =~ MATCH (p:Person) WHERE p.name =~'Tom.*' RETURN p.name
  • 94. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 96 // Testing with patterns // exclude people who directed that movie MATCH (p:Person)-[:WROTE]->(m:Movie) WHERE NOT exists( (p)-[:DIRECTED]->() ) RETURN p.name, m.title // find Gene Hackman and the movies that he acted in with another person who also directed the movie MATCH (gene:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(other:Person) WHERE gene.name= 'Gene Hackman' AND exists( (other)-[:DIRECTED]->() ) RETURN gene, other, m
  • 95. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 97 // Testing with list values : elements of the list have to be the same type of data MATCH (p:Person) WHERE p.born IN [1965, 1970] RETURN p.name as name, p.born as yearBorn // You can also compare a value to an existing list in the graph. MATCH (p:Person)-[r:ACTED_IN]->(m:Movie) WHERE 'Neo' IN r.roles AND m.title='The Matrix' RETURN p.name There are a number of syntax elements of Cypher that we have not covered in this training. For example, you can specify CASE logic in your conditional testing for your WHERE clauses. You can learn more about these syntax elements in the Neo4j Cypher Manual and the Cypher Refcard.
  • 96. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 98 // Retrieve all actors that were born in the 70’s MATCH (a:Person) WHERE a.born >= 1970 AND a.born < 1980 RETURN a.name as Name, a.born as `Year Born` // Retrieve all movies released in 2000 by testing the node label and the released property, returning the movie titles MATCH (m) WHERE m:Movie AND m.released = 2000 and exists(m.released) RETURN m.title // Retrieve all people that wrote movies by testing the relationship between two nodes MATCH (a)-[rel]->(m) WHERE a:Person AND type(rel) = 'WROTE' AND m:Movie RETURN a.name as Name, m.title as Movie // Retrieve all people in the graph that do not have the property ‘born’ MATCH (a:Person) WHERE NOT exists(a.born) RETURN a.name as Name
  • 97. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 99 // Retrieve all people related to movies where the relationship has the rating property, then return their name, movie title, and the rating. MATCH (a:Person)-[rel]->(m:Movie) WHERE exists(rel.rating) RETURN a.name as Name, m.title as Movie, rel.rating as Rating // Retrieve all REVIEW relationships from the graph where the summary of the review contains the string fun, returning the movie title reviewed and the rating and summary of the relationship. MATCH (:Person)-[r:REVIEWED]->(m:Movie) WHERE toLower(r.summary) CONTAINS 'fun' RETURN m.title as Movie, r.summary as Review, r.rating as Rating // Retrieve all people who have produced a movie, but have not directed a movie MATCH (a:Person)-[:PRODUCED]->(m:Movie) WHERE NOT ((a)-[:DIRECTED]->(:Movie)) RETURN a.name, m.title // Retrieve the movies and their actors where one of the actors also directed the movie MATCH (a1:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(a2:Person) WHERE exists( (a2)-[:DIRECTED]->(m) ) RETURN a1.name as Actor, a2.name as `Actor/Director`, m.title as Movie
  • 98. Filtering queries using WHERE 17/07/2019 Big Data class by Alexandre Bergere 100 // Retrieve the movies that have an actor’s role that is the name of the movie MATCH (a:Person)-[r:ACTED_IN]->(m:Movie) WHERE m.title in r.roles RETURN m.title as Movie, a.name as Actor
  • 99. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 101 MATCH (a:Person)-[:ACTED_IN]->(m:Movie), (m:Movie)<-[:DIRECTED]-(d:Person) WHERE m.released = 2000 RETURN a.name, m.title, d.name Specifying multiple MATCH patterns This MATCH clause includes a pattern specified by two paths separated by a comma: MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person) WHERE m.released = 2000 RETURN a.name, m.title, d.name If possible, you should write the same query as follows:
  • 100. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 102 // retrieve the actors who acted in the same movies as Keanu Reeves, but not when Hugo Weaving acted in the same movie MATCH (keanu:Person)-[:ACTED_IN]->(movie:Movie)<-[:ACTED_IN]-(n:Person), (hugo:Person) WHERE keanu.name='Keanu Reeves' AND hugo.name='Hugo Weaving' AND NOT (hugo)-[:ACTED_IN]->(movie) RETURN n.name Specifying multiple MATCH patterns // Suppose we want to retrieve the movies that Meg Ryan acted in and their respective directors, as well as the other actors that acted in these movies. MATCH (meg:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person), (other:Person)-[:ACTED_IN]->(m) WHERE meg.name = 'Meg Ryan' RETURN m.title as movie, d.name AS director , other.name AS `co-actors`
  • 101. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 103 MATCH megPath = (meg:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person), (other:Person)-[:ACTED_IN]->(m) WHERE meg.name = 'Meg Ryan' RETURN megPath Setting path variables
  • 102. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 104 Specifying varying length paths // all of the followers of the followers of a Person MATCH (follower:Person)-[:FOLLOWS*2]->(p:Person) WHERE follower.name = 'Paul Blythe' RETURN p // Retrieve all paths of any length with the relationship, :RELTYPE from nodeA to nodeB and beyond: (nodeA)-[:RELTYPE*]->(nodeB) // Retrieve all paths of any length with the relationship, :RELTYPE from nodeA to nodeB or from nodeB to nodeA and beyond: (nodeA)-[:RELTYPE*]-(nodeB) // Retrieve the paths of length 3 with the relationship, (node1)-[:RELTYPE*3]->(node2) // Retrieve the paths of lengths 1, 2, or 3 with the relationship (node1)-[:RELTYPE*1..3]->(node2)
  • 103. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 105 Finding the shortest path MATCH p = shortestPath((m1:Movie)-[*]-(m2:Movie)) WHERE m1.title = 'A Few Good Men' AND m2.title = 'The Matrix' RETURN p A built-in function that you may find useful in a graph that has many ways of traversing the graph to get to the same node is the shortestPath() function. Using the shortest path between two nodes improves the performance of the query. When you use the shortestPath() function, the query editor will show a warning that this type of query could potentially run for a long time. You should heed the warning, especially for large graphs. Read the Graph Algorithms documentation about the shortest path algorithm. When you use shortestPath(), you can specify an upper limit for the length of the shortest path. In addition, you should aim to provide patterns for the from and to nodes that execute efficiently. For example, use labels and indexes.
  • 104. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 106 Specifying optional pattern matching MATCH (p:Person) WHERE p.name STARTS WITH 'James' OPTIONAL MATCH (p)-[r:REVIEWED]->(m:Movie) RETURN p.name, type(r), m.title OPTIONAL MATCH matches patterns with your graph, just like MATCH does. The difference is that if no matches are found, OPTIONAL MATCH will use NULLs for missing parts of the pattern. OPTIONAL MATCH could be considered the Cypher equivalent of the outer join in SQL.
  • 105. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 107 Collecting results // the list of movies that Tom Cruise acted in MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WHERE p.name ='Tom Cruise' RETURN collect(m.title) AS `movies for Tom Cruise` Cypher has a built-in function, collect() that enables you to aggregate a value into a list.
  • 106. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 108 Aggregation in Cypher // implicitly groups by a.name and d.name MATCH (a)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d) RETURN a.name, d.name, count(*) // count the paths retrieved where an actor and director collaborated in a movie MATCH (actor:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(director:Person) RETURN actor.name, director.name, count(m) AS collaborations, collect(m.title) AS movies Aggregation in Cypher is different from aggregation in SQL. In Cypher, you need not specify a grouping key. As soon as an aggregation function is used, all non-aggregated result columns become grouping keys. The grouping is implicitly done, based upon the fields in the RETURN clause. There are more aggregating functions such as min() or max() that you can also use in your queries. These are described in the Aggregating Functions section of the Neo4j Cypher Manual.
  • 107. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 109 Additional processing using WITH // only return actors that have 2 or 3 movies MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WITH a, count(a) AS numMovies, collect(m.title) as movies WHERE numMovies > 1 AND numMovies < 4 RETURN a.name, numMovies, movies During the execution of a MATCH clause, you can specify that you want some intermediate calculations or values that will be used for further processing of the query, or for limiting the number of results before further processing is done. You use the WITH clause to perform intermediate processing or data flow operations. You have to name all expressions with an alias in a WITH that are not simple variables. // find all actors who have acted in at least five movies, and find (optionally) the movies they directed and return the person and those movies MATCH (p:Person) WITH p, size((p)-[:ACTED_IN]->(:Movie)) AS movies WHERE movies >= 5 OPTIONAL MATCH (p)-[:DIRECTED]->(m:Movie) RETURN p.name, m.title
  • 108. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 110 Additional processing using WITH // retrieves all actors that acted in movies, and collects the list of movies for any actor that acted in more than five movies. MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WITH p, collect(m) AS movies WHERE size(movies) > 5 RETURN p.name, movies
  • 109. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 111 // Write a Cypher query that retrieves all movies that Gene Hackman has acted it, along with the directors of the movies. In addition, retrieve the actors that acted in the same movies as Gene Hackman. Return the name of the movie, the name of the director, and the names of actors that worked with Gene Hackman. MATCH (a:Person)-[:ACTED_IN]->(m:Movie)<-[:DIRECTED]-(d:Person), (a2:Person)-[:ACTED_IN]->(m) WHERE a.name = 'Gene Hackman' RETURN m.title as movie, d.name AS director , a2.name AS `co-actors` // Retrieve all Person nodes connected to James Thompson by the FOLLOWS relationship, in either direction MATCH (p1:Person)-[:FOLLOWS]-(p2:Person) WHERE p1.name = 'James Thompson' RETURN p1, p2 // Modify the query to retrieve nodes that are one and two hops away MATCH (p1:Person)-[:FOLLOWS*1..2]-(p2:Person) WHERE p1.name = 'James Thompson' RETURN p1, p2 // Modify the query to retrieve particular nodes that are connected no matter how many hops are required MATCH (p1:Person)-[:FOLLOWS*]-(p2:Person) WHERE p1.name = 'James Thompson' RETURN p1, p2
  • 110. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 112 // Retrieve all movie by collecting a list of all people who acted in it MATCH (p:Person)-[:ACTED_IN]->(m:Movie) RETURN p.name as actor, collect(m.title) AS `movie list` // Retrieve all movies that Tom Cruise has acted in and the co-actors that acted in the same movie by collecting a list MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(p2:Person) WHERE p.name ='Tom Cruise' RETURN m.title as movie, collect(p2.name) AS `co-actors` // Retrieve all people who reviewed a movie, returning the list of reviewers and how many reviewers reviewed the movie MATCH (p:Person)-[:REVIEWED]->(m:Movie) RETURN m.title as movie, count(p) as numReviews, collect(p.name) as reviewers // Retrieve all directors, their movies, and people who acted in the movies, returning the name of the director, the number of actors the director has worked with, and the list of actors. MATCH (d:Person)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a:Person) RETURN d.name AS director, count(a) AS `number actors` , collect(a.name) AS `actors worked with`
  • 111. Controlling query processing 17/07/2019 Big Data class by Alexandre Bergere 113 // Retrieve the movies that have at least 2 directors, and optionally the names of people who reviewed the movies. MATCH (m:Movie) WITH m, size((:Person)-[:DIRECTED]->(m)) AS directors WHERE directors >= 2 OPTIONAL MATCH (p:Person)-[:REVIEWED]->(m) RETURN m.title, p.name
  • 112. Controlling how results are returned 17/07/2019 Big Data class by Alexandre Bergere 114 Eliminating duplication MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie) WHERE p.name = 'Tom Hanks' RETURN m.released, collect(DISTINCT m.title) AS movies You have seen a number of query results where there is duplication in the results returned. In most cases, you want to eliminate duplicated results. You do so by using the DISTINCT keyword. Using WITH and DISTINCT to eliminate duplication MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie) WHERE p.name = 'Tom Hanks' WITH DISTINCT m RETURN m.released, m.title Another way that you can avoid duplication is to use WITH and DISTINCT together as follows:
  • 113. Controlling how results are returned 17/07/2019 Big Data class by Alexandre Bergere 115 Ordering results MATCH (p:Person)-[:DIRECTED | :ACTED_IN]->(m:Movie) WHERE p.name = 'Tom Hanks' RETURN m.released, collect(DISTINCT m.title) AS movies ORDER BY m.released DESC If you want the results to be sorted, you specify the expression to use for the sort using the ORDER BY keyword and whether you want the order to be descending using the DESC keyword. Ascending order is the default.
  • 114. Controlling how results are returned 17/07/2019 Big Data class by Alexandre Bergere 116 Limiting the number of results MATCH (m:Movie) RETURN m.title as title, m.released as year ORDER BY m.released DESC LIMIT 10 Although you can filter queries to reduce the number of results returned, you may also want to limit the number of results.
  • 115. Controlling results returned 17/07/2019 Big Data class by Alexandre Bergere 117 // write a query to retrieve all actors that acted in movies during the 1990s, where you return the released date, the movie title, and the collected actor names for the movie. For now do not worry about duplication. MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released >= 1990 AND m.released < 2000 RETURN DISTINCT m.released, m.title, collect(a.name) // modify the query so that the released date records returned are not duplicated. To implement this, you must add the collection of the movie titles to the results returned. MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released >= 1990 AND m.released < 2000 RETURN m.released, collect(m.title), collect(a.name) // The results returned from the previous query returns the collection of movie titles with duplicates. That is because there are multiple actors per released year. Next, modify the query so that there is no duplication of the movies listed for a year. MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE m.released >= 1990 AND m.released < 2000 RETURN m.released, collect(DISTINCT m.title), collect(a.name)
  • 116. Controlling results returned 17/07/2019 Big Data class by Alexandre Bergere 118 // Retrieve the top 5 ratings and their associated movies, returning the movie title and the rating. MATCH (:Person)-[r:REVIEWED]->(m:Movie) RETURN m.title AS movie, r.rating AS rating ORDER BY r.rating DESC LIMIT 5
  • 117. Working with Cypher data 17/07/2019 Big Data class by Alexandre Bergere 119 Unwinding lists // create a list with three elements, unwind the list and then return the values WITH [1, 2, 3] AS list UNWIND list AS row RETURN list, row There may be some situations where you want to perform the opposite of collecting results, but rather separate the lists into separate rows. This functionality is done using the UNWIND clause. The UNWIND clause is frequently used when importing data into a graph.
  • 118. Working with Cypher data 17/07/2019 Big Data class by Alexandre Bergere 120 Dates MATCH (actor:Person)-[:ACTED_IN]->(:Movie) WHERE exists(actor.born) // calculate the age with DISTINCT actor, date().year - actor.born as age RETURN actor.name, age as `age today` ORDER BY actor.born DESC Cypher has a built-in date() function, as well as other temporal values and functions that you can use to calculate temporal values. You use a combination of numeric, temporal, spatial, list and string functions to calculate values that are useful to your application. For example, suppose you wanted to calculate the age of a Person node, given a year they were born (the born property must exist and have a value).
  • 119. Working with Cypher data 17/07/2019 Big Data class by Alexandre Bergere 121 // Modify the query you just wrote so that before the query processing ends, you unwind the list of movies and then return the name of the actor and the title of the associated movie MATCH (p:Person)-[:ACTED_IN]->(m:Movie) WITH p, collect(m) AS movies WHERE size(movies) > 5 WITH p, movies UNWIND movies AS movie RETURN p.name, movie.title // retrieves all movies that Tom Hanks acted in, returning the title of the movie, the year the movie was released, the number of years ago that the movie was released, and the age of Tom when the movie was released MATCH (a:Person)-[:ACTED_IN]->(m:Movie) WHERE a.name = 'Tom Hanks' RETURN m.title, m.released, date().year - m.released as yearsAgoReleased, m.released - a.born AS `age of Tom` ORDER BY yearsAgoReleased
  • 120. 17/07/2019 Big Data class by Alexandre Bergere 122 Go further
  • 121. Neo4j Bookshelf 17/07/2019 Big Data class by Alexandre Bergere 123
  • 122. Resources 17/07/2019 Big Data class by Alexandre Bergere 124 resources: blog:
  • 123. Training & Certification 17/07/2019 Big Data class by Alexandre Bergere 125
  • 124. Labs 17/07/2019 Big Data class by Alexandre Bergere 126
  • 125. GraphGists 17/07/2019 Big Data class by Alexandre Bergere 127
  • 127. Azure Cosmos DB 17/07/2019 Big Data class by Alexandre Bergere 129 A globally distributed, massively scalable, multi-model database service
  • 128. Global Distribution 17/07/2019 Big Data class by Alexandre Bergere 130 Policy-based geo-fencing. Dynamically add and remove regions. Failover priorities. Dynamically configurable read and write regions. Geo-local reads and writes. 99.99% SLA for read availability. Database designed for modern web and mobile applications, which are (typically) global applications in nature.
  • 129. Multi-Master 17/07/2019 Big Data class by Alexandre Bergere 131 Improved write latency for end users Improved write scalability and write throughput Better support for disconnected environments (for example, edge devices) Load balancing
  • 130. Consistency 17/07/2019 Big Data class by Alexandre Bergere 133 Consistency Level – Guarantees:
Strong – Linearizability (once an operation is complete, it will be visible to all).
Bounded Staleness – Consistent prefix. Reads lag behind writes by at most k prefixes or t interval. Similar properties to strong consistency (except within the staleness window), while preserving 99.99% availability and low latency.
Session – Consistent prefix. Within a session: monotonic reads, monotonic writes, read-your-writes, write-follows-reads. Predictable consistency for a session, high read throughput + low latency.
Consistent Prefix – Reads will never see out-of-order writes (no gaps).
Eventual – Potential for out-of-order reads. Lowest cost for reads of all consistency levels.
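With the Python SDK, the consistency level can be relaxed per client relative to the account default; a sketch with the azure-cosmos package (the endpoint and key are placeholders):

from azure.cosmos import CosmosClient

# requests from this client use Session consistency
# (it can be equal to or weaker than the account's default level)
client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<primary-key>",
                      consistency_level="Session")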
  • 131. COMPREHENSIVE SLAs 17/07/2019 Big Data class by Alexandre Bergere 134 RUN YOUR APP ON WORLD-CLASS INFRASTRUCTURE Azure Cosmos DB is the only service with financially-backed SLAs for millisecond latency at the 99th percentile, 99.999% HA and guaranteed throughput and consistency. Latency: <10 ms at the 99th percentile. HA: 99.999%. Throughput: guaranteed. Consistency: guaranteed.
  • 132. Trust your data to industry-leading Security & Compliance 17/07/2019 Big Data class by Alexandre Bergere 135 Azure is the world’s most trusted cloud, with more certifications than any other cloud provider. • Enterprise grade security • Encryption at Rest • Encryption is enabled automatically by default • Comprehensive Azure compliance certification
  • 133. Throughput 17/07/2019 Big Data class by Alexandre Bergere 136 Request unit calculator Request unit considerations: Item size, Item property count, Data consistency, Indexed properties, Document indexing, Script usage. The currency of Azure Cosmos DB is the request unit (RU). With request units, you don't need to reserve read/write capacities or provision CPU, memory, and IOPS.
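Throughput is provisioned when a container is created; a sketch with the azure-cosmos Python SDK (the database and container names are invented, and 400 RU/s is the provisioned minimum for a container):

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<primary-key>")
database = client.create_database_if_not_exists("demo")

# offer_throughput reserves request units per second for the container
container = database.create_container_if_not_exists(
    id="tickets",
    partition_key=PartitionKey(path="/id"),
    offer_throughput=400)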
  • 134. Serverless database 17/07/2019 Big Data class by Alexandre Bergere 137 Serverless computing is all about the ability to focus on individual pieces of logic that are repeatable and stateless. o no infrastructure management. o consume resources only for the seconds, or milliseconds, they run for. Azure Cosmos DB trigger to invoke an Azure Function Use an input binding to get data from Azure Cosmos DB Use an output binding to write data to Azure Cosmos DB
  • 136. Cosmos DB Change Feed 17/07/2019 Big Data class by Alexandre Bergere 140
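The change feed can also be read directly (the pull model) from the Python SDK; a sketch assuming the demo/tickets container created earlier, and hedging on the exact keyword arguments of query_items_change_feed:

from azure.cosmos import CosmosClient

container = (CosmosClient("https://<account>.documents.azure.com:443/",
                          credential="<primary-key>")
             .get_database_client("demo")
             .get_container_client("tickets"))

# reading the feed from the beginning yields items in change order
for item in container.query_items_change_feed(is_start_from_beginning=True):
    print(item["id"])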
  • 137. 17/07/2019 Big Data class by Alexandre Bergere 141 Use cases
  • 138. Top 10 reasons why customers use Azure Cosmos DB 17/07/2019 Big Data class by Alexandre Bergere 142 Different types of data; multi-tenancy and enterprise-grade security; turnkey global distribution; mission critical; massive storage/throughput; scalability to optimize for speed and cost; 5 well-defined consistency models; analytics-ready; event-driven architectures; single-digit millisecond latency at the 99th percentile worldwide; big data; high availability and reliability.
  • 139. Powering global solutions 17/07/2019 Big Data class by Alexandre Bergere 143 Azure Cosmos DB was built to support modern app patterns and use cases. It enables industry-leading organizations to unlock the value of data, and respond to global customers and changing business dynamics in real-time. Data distributed and available globally: puts data where your users are. Build real-time customer experiences: enable latency-sensitive personalization, bidding, and fraud detection. Ideal for gaming, IoT & eCommerce: predictable and fast service, even during traffic spikes. Simplified development with serverless architecture: fully-managed event-driven micro-services with elastic computing power. Run Spark analytics over operational data: accelerate insights from fast, global data. Lift and shift NoSQL data: lift and shift MongoDB and Cassandra workloads.
  • 140. Data distributed and available globally 17/07/2019 Big Data class by Alexandre Bergere 144 Put your data where your users are to give real-time access and uninterrupted service to customers anywhere in the world. o Turnkey global data replication across all Azure regions o Guaranteed low-latency experience for global users o Resiliency for high availability and disaster recovery
  • 141. Build Real-Time Customer experiences 17/07/2019 Big Data class by Alexandre Bergere 145 Offer latency-sensitive applications with personalization, bidding, and fraud-detection. o Machine learning models generate real-time recommendations across product catalogues o Product analysis in milliseconds o Low-latency ensures high app performance worldwide o Tunable consistency models for rapid insight (Diagram: online recommendations service on the hot path; offline recommendations engine on the cold path.)
  • 142. Ideal for gaming, IoT and ecommerce 17/07/2019 Big Data class by Alexandre Bergere 146 Maintain service quality during high-traffic periods requiring massive scale and performance. o Instant, elastic scaling handles traffic bursts o Uninterrupted global user experience o Low-latency data access and processing for large and changing user bases o High availability across multiple data centers
  • 143. Massive Scale Telemetry Stores for IOT 17/07/2019 Big Data class by Alexandre Bergere 147 Diverse and unpredictable IoT sensor workloads require a responsive data platform o Seamless handling of any data output or volume o Data made available immediately, and indexed automatically o High writes per second, with stable ingestion and query performance
  • 144. Simplified development with serverless architecture 17/07/2019 Big Data class by Alexandre Bergere 148 Experience decreased time-to-market, enhanced scalability, and freedom from framework management with event-driven micro-services. o Seamless handling of any data output or volume o Data made available immediately, and indexed automatically o High writes per second, with stable ingestion and query performance o Real-time, resilient change feeds logged forever and always accessible o Native integration with Azure Functions
  • 145. Run spark over operational data 17/07/2019 Big Data class by Alexandre Bergere 149 Accelerate analysis of fast-changing, high-volume, global data. o Real-time big data processing across any data model o Machine learning at scale over globally-distributed data o Speeds analytical queries with automatic indexing and push-down predicate filtering o Native integration with Spark Connector
  • 146. Lift and shift NoSQL apps 17/07/2019 Big Data class by Alexandre Bergere 150 Make data modernization easy with seamless lift and shift migration of NoSQL workloads to the cloud. o Azure Cosmos DB APIs for MongoDB and Cassandra bring app data from anywhere to Azure Cosmos DB o Leverage existing tools, drivers, and libraries, and continue using existing apps’ current SDKs o Turnkey geo-replication o No infrastructure or VM management required
  • 147. Retail and marketing 17/07/2019 Big Data class by Alexandre Bergere 151
  • 148. 17/07/2019 Big Data class by Alexandre Bergere 152 Model
  • 149. Document Data Model 17/07/2019 Big Data class by Alexandre Bergere 153 “Because at the end of the day, it’s all just keys and values – not just the key-value data model, but all these data models.” “When it comes to actually building applications – well, that’s the developer’s job, and this is where the decision of which data model to choose comes into play.” Document SQL API (JSON) MongoDB API Graph Gremlin API (graph traversal language) Key-Value Table API (replaces Azure Table Storage) Columnar Cassandra API
  • 150. Atom Record Sequence (ARS) 17/07/2019 Big Data class by Alexandre Bergere 154 Your data is always stored as ARS – or Atom Record Sequence – a Microsoft creation that defines the persistence layer for key-value pairs. Switching Between Data Models: choosing an API = choosing a data model.
  • 151. Switching Between Data Models 17/07/2019 Big Data class by Alexandre Bergere 155 Each data model is merely a projection of the same underlying ARS format, and so eventually you will be able to create a single account, and then switch freely between different APIs within the account. So that then, you’ll be able to access one database as graph, key-value, document, or columnar, all at once. Future release ?
  • 152. Resource Model 17/07/2019 Big Data class by Alexandre Bergere 156
  • 153. Resource Model 17/07/2019 Big Data class by Alexandre Bergere 157 Account Database Container Item User Permission
  • 155. Resource Model 17/07/2019 Big Data class by Alexandre Bergere 159 Account Database Container Item = Collection Graph Table
  • 156. Handle any data with no schema or indexing required 17/07/2019 Big Data class by Alexandre Bergere 160 Azure Cosmos DB’s schema-less service automatically indexes all your data, regardless of the data model, to deliver blazing-fast queries.
Item | Color | Microwave safe | Liquid capacity | CPU | Memory | Storage
Geek mug | Graphite | Yes | 16oz | ??? | ??? | ???
Coffee Bean mug | Tan | No | 12oz | ??? | ??? | ???
Surface book | Gray | ??? | ??? | 3.4 GHz Intel Skylake Core i7-6600U | 16GB | 1 TB SSD
o Automatic index management o Synchronous auto-indexing o No schemas or secondary indices needed o Works across every data model
  • 157. Index 17/07/2019 Big Data class by Alexandre Bergere 161 Schema-agnostic, automatic indexing o Automatically index every property of every record without having to define schemas and indices upfront. o No need for schema and index management o Works across every data model o Latch-free data structure for a highly write-optimized database engine o Multiple index types: Hash, range, and geospatial
  • 158. Index POLICIES 17/07/2019 Big Data class by Alexandre Bergere 162 CUSTOM INDEXING POLICIES Though all Azure Cosmos DB data is indexed by default, you can specify a custom indexing policy for your collections. Custom indexing policies allow you to design and customize the shape of your index while maintaining schema flexibility. o Define trade-offs between storage, write and query performance, and query consistency o Include or exclude documents and paths to and from the index o Configure various index types
{
  "automatic": true,
  "indexingMode": "Consistent",
  "includedPaths": [{
    "path": "/*",
    "indexes": [
      { "kind": "Hash",    "dataType": "String", "precision": -1 },
      { "kind": "Range",   "dataType": "Number", "precision": -1 },
      { "kind": "Spatial", "dataType": "Point" }
    ]
  }],
  "excludedPaths": [{ "path": "/nonIndexedContent/*" }]
}
  • 159. Resource Model in Cosmos DB 17/07/2019 Big Data class by Alexandre Bergere 163
  • 160. 17/07/2019 Big Data class by Alexandre Bergere 164 SQL QUERY SYNTAX
  • 161. SQL SYNTAX 17/07/2019 Big Data class by Alexandre Bergere 165 Using the popular query language SQL to access semi-structured JSON data. This module references querying in the context of the SQL API for Azure Cosmos DB.
  • 162. SQL QUERY SYNTAX 17/07/2019 Big Data class by Alexandre Bergere 166 BASIC QUERY SYNTAX The SELECT & FROM keywords are the basic components of every query. > SELECT tickets.id, tickets.pricePaid FROM tickets > SELECT t.id, t.pricePaid FROM tickets t
  • 163. SQL QUERY SYNTAX - WHERE 17/07/2019 Big Data class by Alexandre Bergere 167 FILTERING WHERE supports complex scalar expressions including arithmetic, comparison and logical operators > SELECT tickets.id, tickets.pricePaid FROM tickets WHERE tickets.pricePaid > 500.00 AND tickets.pricePaid <= 1000.00
  • 164. SQL QUERY SYNTAX - PROJECTION 17/07/2019 Big Data class by Alexandre Bergere 168 FILTERING If your workloads require a specific JSON schema, Azure Cosmos DB supports JSON projection within its queries > SELECT { "id": tickets.id, "flightNumber": tickets.assignedFlight.flightNumber, "purchase": { "cost": tickets.pricePaid }, "stops": [ tickets.assignedFlight.origin, tickets.assignedFlight.destination ] } AS ticket FROM tickets
  • 165. SQL QUERY SYNTAX - PROJECTION 17/07/2019 Big Data class by Alexandre Bergere 169 FILTERING If your workloads require a specific JSON schema, Azure Cosmos DB supports JSON projection within its queries > SELECT VALUE { "id": tickets.id, "flightNumber": tickets.assignedFlight.flightNumber, "purchase": { "cost": tickets.pricePaid }, "stops": [ tickets.assignedFlight.origin, tickets.assignedFlight.destination ] } FROM tickets
  • 166. INTRA-DOCUMENT JOIN 17/07/2019 Big Data class by Alexandre Bergere 170 Azure Cosmos DB supports intra-document JOIN’s for de-normalized arrays Let’s assume that we have two JSON documents in a collection: { "pricePaid": 575.5, "assignedFlight": { "number": "F125", "origin": "SEA", "destination": "JFK" }, "seat": “12A", "requests": [ "kosher_meal", "aisle_seat" ], "id": "6ebe1165836a" } { "pricePaid": 234.75, "assignedFlight": { "number": "F752", "origin": "SEA", "destination": "LGA" }, "seat": "14C", "requests": [ "early_boarding", "window_seat" ], "id": "c4991b4d2efc" }
  • 167. INTRA-DOCUMENT JOIN 17/07/2019 Big Data class by Alexandre Bergere 171 We can filter on a particular array index position without JOIN: > SELECT tickets.assignedFlight.number, tickets.seat, tickets.requests FROM tickets WHERE tickets.requests[1] = "aisle_seat" [ { "number":"F125","seat":"12A", "requests": [ "kosher_meal", "aisle_seat" ] } ]
  • 168. INTRA-DOCUMENT JOIN 17/07/2019 Big Data class by Alexandre Bergere 172 JOIN allows us to merge embedded documents or arrays across multiple documents and returned a flattened result set: > SELECT tickets.assignedFlight.number, tickets.seat, requests FROM tickets JOIN requests IN tickets.requests [ { "number":"F125","seat":"12A", "requests":"kosher_meal" }, { "number":"F125","seat":"12A", "requests":"aisle_seat" }, { "number":"F752","seat":"14C", "requests":"early_boarding" }, { "number":"F752","seat":"14C", "requests":"window_seat" } ]
• 169. INTRA-DOCUMENT JOIN 17/07/2019 Big Data class by Alexandre Bergere 173 Along with JOIN, we can also filter the cross products without knowing the array index position: > SELECT tickets.id, requests FROM tickets JOIN requests IN tickets.requests WHERE requests IN ("aisle_seat", "window_seat") [ { "id":"6ebe1165836a", "requests": "aisle_seat" }, { "id":"c4991b4d2efc", "requests": "window_seat" } ]
  • 170. 17/07/2019 Big Data class by Alexandre Bergere 174 Tools
• 171. Cosmos DB Emulator 17/07/2019 Big Data class by Alexandre Bergere 175 The Azure Cosmos DB Emulator provides a local environment that emulates the Azure Cosmos DB service for development purposes. Using the emulator, you can develop and test your application locally, without creating an Azure subscription or incurring any costs. When you're satisfied with how your application works in the emulator, you can switch to using an Azure Cosmos DB account in the cloud.
At this time, the Data Explorer in the emulator only fully supports SQL API collections and MongoDB collections. Table, Graph, and Cassandra containers are not fully supported.
The emulator provides a high-fidelity emulation of the Azure Cosmos DB service. It supports identical functionality to Azure Cosmos DB, including creating and querying JSON documents, provisioning and scaling collections, and executing stored procedures and triggers. You can develop and test applications against the emulator, then deploy them to Azure at global scale by changing a single configuration value: the connection endpoint for Azure Cosmos DB.
By default, the emulator runs on the local machine ("localhost"), listening on port 8081.
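A minimal connection sketch, assuming the @azure/cosmos SDK; the endpoint is the emulator default from the slide, and the key shown is the emulator's documented well-known development key (identical for every install, not a secret):
// Sketch: connect to the local Cosmos DB Emulator (assumes @azure/cosmos v3).
const { CosmosClient } = require("@azure/cosmos");

const client = new CosmosClient({
  endpoint: "https://localhost:8081",
  // Well-known emulator key published in the Microsoft docs.
  key: "C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw=="
});
// Note: the emulator uses a self-signed certificate, so Node may need the
// certificate imported (or TLS verification relaxed for local development only).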
• 172. Azure Cosmos DB: Data migration tools 17/07/2019 Big Data class by Alexandre Bergere 176 Data Migration Tools o SQL API o Mongo DB API o Table API o Graph API o Cassandra API
  • 173. Cosmos DB Explorer 17/07/2019 Big Data class by Alexandre Bergere 177 With Cosmos DB Explorer you can: o Take advantage of the full screen real estate for your queries and results. o Access your database account and collections with a connection string, without needing access to the Azure subscription or portal. o Share query results with authorized peers who do not have Azure portal access. o Work with Cosmos DB data without having to download any desktop tools locally. https://cosmos.azure.com/
  • 174. Azure Cosmos DB – Interface demo 17/07/2019 Big Data class by Alexandre Bergere 178
• 175. Azure Cosmos DB – SQL Query Exercise 17/07/2019 Big Data class by Alexandre Bergere 179 Add data using Data Explorer: https://docs.microsoft.com/en-ie/learn/modules/access-data-with-cosmos-db-and-sql-api/3-add-data Explore SQL query types: https://docs.microsoft.com/en-ie/learn/modules/access-data-with-cosmos-db-and-sql-api/4-query-types
• 176. 17/07/2019 Big Data class by Alexandre Bergere 180 Add Cosmos DB to your architecture
  • 177. Partitioning 17/07/2019 Big Data class by Alexandre Bergere 181
• 178. 17/07/2019 Big Data class by Alexandre Bergere 182 Stored procedures & UDFs
• 179. Stored Procedures 17/07/2019 Big Data class by Alexandre Bergere 183 BENEFITS o Familiar programming language o Atomic Transactions o Built-in Optimizations o Business Logic Encapsulation Stored procedures perform complex transactions on documents and properties. Stored procedures are written in JavaScript and are stored in a container on Azure Cosmos DB. By running stored procedures in the database engine, close to the data, you can improve performance over client-side programming. Stored procedures are the only way to achieve atomic transactions within Azure Cosmos DB; the client-side SDKs do not support transactions. Performing batch operations in stored procedures is also recommended because it reduces the need to create separate transactions.
• 180. Simple Stored Procedure 17/07/2019 Big Data class by Alexandre Bergere 184
function createSampleDocument(documentToCreate) {
  var context = getContext();
  var collection = context.getCollection();
  // Queue the write; the callback fires once the document is created.
  var accepted = collection.createDocument(
    collection.getSelfLink(),
    documentToCreate,
    function (error, documentCreated) {
      if (error) throw error;
      // Return the new document's id to the caller.
      context.getResponse().setBody(documentCreated.id);
    }
  );
  // If the server refused the request (time limit approaching), end gracefully.
  if (!accepted) return;
}
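A hedged sketch of registering and calling this procedure from the client (assuming the @azure/cosmos v3 scripts API; the document and partition key value are hypothetical examples):
// Sketch: register and execute the stored procedure (assumes @azure/cosmos v3).
async function runSproc(container) {
  await container.scripts.storedProcedures.create({
    id: "createSampleDocument",
    body: createSampleDocument.toString() // the function defined on the slide
  });
  const doc = { id: "sample-1", assignedFlight: { number: "F125" } };
  // A stored procedure executes inside a single partition, so the partition
  // key value of the target documents must be supplied.
  const { resource: createdId } = await container
    .scripts.storedProcedure("createSampleDocument")
    .execute("F125", [doc]);
  console.log("created document id:", createdId);
}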
• 181. Multi-DOCUMENT Transactions 17/07/2019 Big Data class by Alexandre Bergere 185 DATABASE TRANSACTIONS In a typical database, a transaction can be defined as a sequence of operations performed as a single logical unit of work. Each transaction provides ACID guarantees. In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence, requests made within stored procedures and triggers execute in the same scope of a database session. Within one transaction a stored procedure can: o Create new documents o Query the collection o Update existing documents o Delete existing documents Stored procedures utilize snapshot isolation to guarantee all reads within the transaction see a consistent snapshot of the data.
  • 182. Bounded Execution 17/07/2019 Big Data class by Alexandre Bergere 186 EXECUTION WITHIN TIME BOUNDARIES All Azure Cosmos DB operations must complete within the server-specified request timeout duration. If an operation does not complete within that time limit, the transaction is rolled back. HELPER BOOLEAN VALUE All functions under the collection object (for create, read, replace, and delete of documents and attachments) return a Boolean value that represents whether that operation will complete: o If true, the operation is expected to complete o If false, the time limit will soon be reached and your function should end execution as soon as possible.
• 183. Transaction Continuation Model 17/07/2019 Big Data class by Alexandre Bergere 187 CONTINUING LONG-RUNNING TRANSACTIONS o JavaScript functions can implement a continuation-based model to batch/resume execution o The continuation value can be any value of your own choosing. This value can then be used by your applications to resume a transaction from a new "starting point" Flow: bulk create documents → try to create each document → observe the return value → if the time limit is near, return a "pointer" to resume later; otherwise, done.
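A minimal sketch of this pattern, assuming the standard server-side JavaScript API (getContext/createDocument); the continuation value here is simply the index of the next document to insert:
// Sketch: continuation-based bulk create inside a stored procedure.
function bulkCreate(docs, startIndex) {
  var context = getContext();
  var collection = context.getCollection();
  var index = startIndex || 0;

  tryCreateNext();

  function tryCreateNext() {
    if (index >= docs.length) {
      // All documents created: no continuation needed.
      context.getResponse().setBody({ done: true, nextIndex: index });
      return;
    }
    var accepted = collection.createDocument(
      collection.getSelfLink(),
      docs[index],
      function (error) {
        if (error) throw error;
        index++;
        tryCreateNext();
      }
    );
    if (!accepted) {
      // Time limit approaching: return a "pointer" so the client can
      // call the procedure again with startIndex = nextIndex.
      context.getResponse().setBody({ done: false, nextIndex: index });
    }
  }
}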
• 184. Control Flow 17/07/2019 Big Data class by Alexandre Bergere 188 JAVASCRIPT CONTROL FLOW Stored procedures allow you to naturally express control flow, variable scoping, assignment, and integration of exception-handling primitives with database transactions, directly in the JavaScript programming language. ES6 PROMISES ES6 promises can be used within Azure Cosmos DB stored procedures. Unfortunately, promises "swallow" exceptions by default, so it is recommended to use callbacks instead of ES6 promises.
• 185. Stored Procedure Control Flow 17/07/2019 Big Data class by Alexandre Bergere 189
function createTwoDocuments(docA, docB) {
  var context = getContext();
  var coll = context.getCollection();
  var collLink = coll.getSelfLink();
  var firstDocId;
  var aAccepted = coll.createDocument(collLink, docA, docACallback);
  if (!aAccepted) return;

  function docACallback(error, created) {
    if (error) throw error;
    firstDocId = created.id; // remember doc A's id for the final response
    var bAccepted = coll.createDocument(collLink, docB, docBCallback);
    if (!bAccepted) return;
  }

  function docBCallback(error, created) {
    if (error) throw error;
    context.getResponse().setBody({ firstDocId: firstDocId, secondDocId: created.id });
  }
}
• 186. Rolling Back Transactions 17/07/2019 Big Data class by Alexandre Bergere 190 TRANSACTION ROLL-BACK Inside a JavaScript function, all operations are automatically wrapped under a single transaction: o If the function completes without any exception, all data changes are committed o If any exception is thrown from the script, Azure Cosmos DB's JavaScript runtime rolls back the whole transaction. Transaction scope: create new document, query collection, update existing document, delete existing document; if an exception occurs, all changes are undone.
• 187. Transaction ROLLBACK in Stored Procedure 17/07/2019 Big Data class by Alexandre Bergere 191
collection.createDocument(
  collection.getSelfLink(),
  documentToCreate,
  function (error, documentCreated) {
    if (error) throw "Unable to create document, aborting...";
  }
);
collection.replaceDocument(
  documentToReplace._self,
  replacementDocument,
  function (error, documentReplaced) {
    if (error) throw "Unable to update document, aborting...";
  }
);
• 188. User-defined Functions 17/07/2019 Big Data class by Alexandre Bergere 192 UDF User-defined functions (UDFs) are used to extend the Azure Cosmos DB SQL API's query language grammar and implement custom business logic. UDFs can only be called from inside queries; they do not have access to the context object and are meant to be used as compute-only code.
• 189. User-Defined Function Definition 17/07/2019 Big Data class by Alexandre Bergere 193
var taxUdf = {
  id: "tax",
  serverScript: function tax(income) {
    if (income == undefined) throw 'no input';
    if (income < 1000) return income * 0.1;
    else if (income < 10000) return income * 0.2;
    else return income * 0.4;
  }
}
  • 190. User-Defined Function USAGE in Queries 17/07/2019 Big Data class by Alexandre Bergere 194 > SELECT * FROM TaxPayers t WHERE udf.tax(t.income) > 20000
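As a sketch, the UDF defined above would be registered through the SDK before a query can reference it (assuming the @azure/cosmos v3 scripts API; taxUdf is the variable from the definition slide):
// Sketch: register the "tax" UDF, then call it as udf.tax(...) in a query (assumes @azure/cosmos v3).
async function registerAndQueryTaxUdf(container) {
  await container.scripts.userDefinedFunctions.create({
    id: "tax",
    body: taxUdf.serverScript.toString()
  });
  const { resources } = await container.items
    .query("SELECT * FROM TaxPayers t WHERE udf.tax(t.income) > 20000")
    .fetchAll();
  return resources;
}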
  • 191. Create multiple Cosmos DB triggers 17/07/2019 Big Data class by Alexandre Bergere 195
• 192. 17/07/2019 Big Data class by Alexandre Bergere 196 Modelling Data
• 193. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 197 Embedded data "The guiding premise when normalizing data is to avoid storing redundant data on each record and rather refer to data." Embedding data
• 194. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 198 Embedded data When to embed: o There are contains relationships between entities. o There are one-to-few relationships between entities. o There is embedded data that changes infrequently. o There is embedded data that won't grow without bound. o There is embedded data that is integral to data in a document.
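A hypothetical example of an embedded (denormalized) document: contact details live inside the person document because they are few, stable, and integral to it:
{
  "id": "person-1",
  "firstName": "Thomas",
  "lastName": "Andersen",
  "addresses": [
    { "line1": "100 Some Street", "city": "Seattle", "zip": "98012" }
  ],
  "contactDetails": [
    { "email": "thomas@andersen.com" },
    { "phone": "+1 555 555-5555", "extension": 5555 }
  ]
}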
  • 195. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 199 Referenced data The problem with this example is that the comments array is unbounded, meaning that there is no (practical) limit to the number of comments any single post can have. Referencing data
  • 196. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 200 Referenced data
  • 197. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 201 Referenced data When to reference: o Representing one-to-many relationships. o Representing many-to-many relationships. o Related data changes frequently. o Referenced data could be unbounded.
  • 198. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 202 Where do I put the relationship? We have dropped the unbounded collection on the publisher document. Instead we just have a reference to the publisher on each book document.
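A hypothetical sketch of that shape: the publisher document stays small, and each book carries a reference to its publisher instead of the publisher holding an unbounded list of book ids:
{ "id": "mspress", "type": "publisher", "name": "Microsoft Press" }
{ "id": "1", "type": "book", "name": "Modelling Documents", "publisherId": "mspress" }
{ "id": "2", "type": "book", "name": "Querying JSON", "publisherId": "mspress" }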
  • 199. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 203 The “Ladder” pattern
  • 200. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 204 How do I model many:many relationships?
• 201. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 205 Hybrid data models Pre-calculated aggregate values save expensive processing on read operations. In the example, some of the data embedded in the author document is calculated at run-time: every time a new book is published, a book document is created and the countOfBooks field is set to a calculated value based on the number of book documents that exist for that author. This optimization suits read-heavy systems, where we can afford to do computations on writes in order to optimize reads. We could have stuck with just the id and left the application to fetch any additional information it needed from the respective author document using the "link". But because our application displays the author's name and a thumbnail picture with every book, we can save a round trip to the server per book in a list by denormalizing some data from the author. Sure, if the author's name changed or they wanted to update their photo, we would have to go and update every book they ever published; but for our application, based on the assumption that authors don't change their names very often, this is an acceptable design decision.
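A hypothetical hybrid-model sketch: the author document carries the pre-calculated countOfBooks, and each book denormalizes only the author fields the UI needs (the names and URL are placeholders):
{ "id": "author-1", "name": "Thomas Andersen", "thumbnailUrl": "https://example.com/thomas.jpg", "countOfBooks": 3 }
{ "id": "book-1", "title": "Modelling Documents", "authors": [ { "id": "author-1", "name": "Thomas Andersen", "thumbnailUrl": "https://example.com/thomas.jpg" } ] }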
  • 202. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 206
  • 203. Modelling Data 17/07/2019 Big Data class by Alexandre Bergere 207
  • 204. 17/07/2019 Big Data class by Alexandre Bergere 208 Architectures
  • 205. Azure Cosmos DB - Change Feed Lab 17/07/2019 Big Data class by Alexandre Bergere 209
  • 206. Cosmos DB & Spark 17/07/2019 Big Data class by Alexandre Bergere 210
  • 207. Broadcast Real-time Updates from Cosmos DB with SignalR Service and Azure Functions 17/07/2019 Big Data class by Alexandre Bergere 211
  • 208. Advanced Analytics on big data architecture 17/07/2019 Big Data class by Alexandre Bergere 212
  • 209. STRIIM FOR AZURE COSMOS DB 17/07/2019 Big Data class by Alexandre Bergere 213 Continuous, Real-Time Data Movement
• 210. Querying An Azure Cosmos DB Database using the SQL API 17/07/2019 Big Data class by Alexandre Bergere 214 https://cosmosdb.github.io/labs/dotnet/technical_deep_dive/03-querying_the_database_using_sql.html Tools: Azure Data Factory, Azure Cosmos DB, Visual Studio Code
  • 211. 17/07/2019 Big Data class by Alexandre Bergere 215 Through examples
• 212. How Skype modernized its backend infrastructure using Azure Cosmos DB 17/07/2019 Big Data class by Alexandre Bergere 216 Lessons learned Looking back at the project, Kaduk recalls several "lessons learned." These include: o Use direct mode for better performance – How a client connects to Azure Cosmos DB has important performance implications, especially with respect to observed client-side latency. The team began by using the default Gateway Mode connection policy, but switched to a Direct Mode connection policy because it delivers better performance. o Learn how to write and handle stored procedures – With Azure Cosmos DB, transactions can only be implemented using stored procedures: pieces of application logic, written in JavaScript, that are registered and executed against a collection as a single transaction. (In Azure Cosmos DB, JavaScript is hosted in the same memory space as the database. Hence, requests made within stored procedures execute in the same scope of a database session, which enables Azure Cosmos DB to guarantee ACID for all operations that are part of a single stored procedure.) o Pay attention to query design – With Azure Cosmos DB, queries have a large impact on RU consumption. Developers didn't pay much attention to query design at first, but soon found that RU costs were higher than desired. This led to an increased focus on optimizing query design, such as using point document reads wherever possible and optimizing the query selections per API. o Use the Azure Cosmos DB SDK 2.x to optimize connection usage – Within Azure Cosmos DB, the data stored in each region is distributed across tens of thousands of physical partitions. To serve reads and writes, the Azure Cosmos DB client SDK must establish a connection with the physical node hosting the partition. The team started by using the Azure Cosmos DB SDK 1.x, but found that its lack of support for connection multiplexing led to excessive connection establishment and closing rates. Switching to the Azure Cosmos DB SDK 2.x, which supports connection multiplexing, helped solve the problem and also helped mitigate SNAT port exhaustion issues.
  • 213. 17/07/2019 Big Data class by Alexandre Bergere 217 Deeper
  • 214. Cosmic notes 17/07/2019 Big Data class by Alexandre Bergere 218
• 215. Become an Azure Cosmonaut 17/07/2019 Big Data class by Alexandre Bergere 219