An Overview of Data Management Paradigms: Relational, Document, and Graph


Marko A. Rodriguez
T-5, Center for Nonlinear Studies
Los Alamos National Laboratory
http://markorodriguez.com

February 15, 2010

Relational, Document, and Graph Database Data Models
Relational Database: MySQL, PostgreSQL, Oracle
Document Database: MongoDB, CouchDB
Graph Database: Neo4j, AllegroGraph, HyperGraphDB

Database models are optimized for solving particular types of problems. This is why different database
models exist: there are many types of problems in the world.

Data Management Workshop – Albuquerque, New Mexico – February 15, 2010

Finding the Right Solution to your Problem
1. Come to terms with your problem.
• “I have metadata for a massive number of objects and I don’t know
how to get at my data.”

2. Identify the solution to your problem.
• “I need to be able to find objects based on their metadata.”

3. Identify the type of database that is optimized for that type of solution.
• “A document database scales, stores metadata, and can be queried.”

4. Identify the database of that type that best meets your particular needs.
• “CouchDB has a REST web interface and all my developers are
good with REST.”


Relational Databases

• Relational databases have been the de facto data management solution
for many years.

MySQL is available at http://www.mysql.com
PostgreSQL is available at http://www.postgresql.org
Oracle is available at http://www.oracle.com


Relational Databases: The Relational Structure

• Relational databases require a schema before data can be inserted.

• Relational databases organize data according to relations, or tables.

[Figure: a table whose columns are attributes/properties and whose rows are tuples/objects; the cell in row i, column j holds the value x.]

Object i has the value x for property j.


Relational Databases: Creating a Table

• Relational databases organize data according to relations, or tables.

• Relational databases require a schema before data can be inserted.

• Let's create a table for Grateful Dead songs.

mysql> CREATE TABLE songs (
    name VARCHAR(255) PRIMARY KEY,
    performances INT,
    song_type VARCHAR(20));
Query OK, 0 rows affected (0.40 sec)


Relational Databases: Inserting Rows into a Table

• Let's insert song names, the number of times they were played in
concert, and whether they were an original or a cover.

mysql> INSERT INTO songs VALUES ('DARK STAR', 219, 'original');
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO songs VALUES (
    'FRIEND OF THE DEVIL', 304, 'original');

mysql> INSERT INTO songs VALUES (
    'MONKEY AND THE ENGINEER', 32, 'cover');

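The table and rows above can also be tried without a MySQL server. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; the slides use MySQL). The helper names build_songs_db and originals_played_more_than are hypothetical and chosen only for illustration.

```python
import sqlite3


def build_songs_db():
    """Create an in-memory songs table and load the three rows from the slides."""
    conn = sqlite3.connect(":memory:")
    # The schema has to exist before any row can be inserted.
    conn.execute(
        "CREATE TABLE songs ("
        " name VARCHAR(255) PRIMARY KEY,"
        " performances INT,"
        " song_type VARCHAR(20))"
    )
    rows = [
        ("DARK STAR", 219, "original"),
        ("FRIEND OF THE DEVIL", 304, "original"),
        ("MONKEY AND THE ENGINEER", 32, "cover"),
    ]
    # Parameterized inserts sidestep string-quoting problems.
    conn.executemany("INSERT INTO songs VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def originals_played_more_than(conn, n):
    """Return the names of original songs performed more than n times."""
    cursor = conn.execute(
        "SELECT name FROM songs"
        " WHERE song_type = 'original' AND performances > ?"
        " ORDER BY name",
        (n,),
    )
    return [name for (name,) in cursor]


if __name__ == "__main__":
    connection = build_songs_db()
    print(originals_played_more_than(connection, 200))
```

The query at the end is not part of the original slides; it is included only to show that rows are retrieved by filtering on column values.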

Document Databases
• Document databases store structured documents. Usually these
documents are organized according to a standard (e.g. JavaScript Object
Notation (JSON), XML, etc.).

• Document databases tend to be schema-less. That is, they do not
require the database engineer to specify a priori the structure of the data
to be held in the database.

MongoDB is available at http://mongodb.org and CouchDB is available at http://couchdb.org


Document Databases: JavaScript Object Notation

• A JSON document is a collection of key/value pairs, where a value can
be yet another collection of key/value pairs.
string: a string value (e.g. “marko”, “rodriguez”).
number: a numeric value (e.g. 1234, 67.012).
boolean: a true/false value (e.g. true, false)
null: a nonexistent value.
array: an array of values (e.g. [1,“marko”,true])
object: a key/value map (e.g. { “key” : 123 })

The JSON specification is very simple and can be found at http://www.json.org/.
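As a rough illustration of the six value types listed above, the sketch below parses one value of each type with Python's standard json module and reports the Python type it becomes. The sample document and the function name json_type_names are assumptions made for this example, not part of the slides.

```python
import json

# One value of each JSON type, reusing the sample values from the list above.
SAMPLE = """
{
  "string": "marko",
  "number": 67.012,
  "boolean": true,
  "null": null,
  "array": [1, "marko", true],
  "object": {"key": 123}
}
"""


def json_type_names(document_text):
    """Return the Python type name each top-level JSON value is parsed into."""
    parsed = json.loads(document_text)
    return {key: type(value).__name__ for key, value in parsed.items()}


if __name__ == "__main__":
    for json_type, python_type in json_type_names(SAMPLE).items():
        print(json_type, "->", python_type)
```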


Document Databases: JavaScript Object Notation
{
  _id : "D0DC29E9-51AE-4A8C-8769-541501246737",
  name : "Marko A. Rodriguez",
  homepage : "http://markorodriguez.com",
  age : 30,
  location : {
    country : "United States",
    state : "New Mexico",
    city : "Santa Fe",
    zipcode : 87501
  },
  interests : ["graphs", "hockey", "motorcycles"]
}


Document Databases: Handling JSON Documents

• Use object-oriented “dot notation” to access components.

> marko = eval({_id : "D0DC29E9...", name : "Marko..."})
> marko._id
D0DC29E9-51AE-4A8C-8769-541501246737
> marko.location.city
Santa Fe
> marko.interests[0]
graphs
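The same dot-notation access can be approximated outside the MongoDB shell. The sketch below, which assumes the document is available as strict JSON text with quoted keys, uses Python's json module with an object_hook so that nested objects become attribute-accessible. The name load_as_object is hypothetical.

```python
import json
from types import SimpleNamespace

# The document from the previous slide, written as strict JSON (quoted keys).
MARKO_JSON = """
{
  "_id": "D0DC29E9-51AE-4A8C-8769-541501246737",
  "name": "Marko A. Rodriguez",
  "homepage": "http://markorodriguez.com",
  "age": 30,
  "location": {
    "country": "United States",
    "state": "New Mexico",
    "city": "Santa Fe",
    "zipcode": 87501
  },
  "interests": ["graphs", "hockey", "motorcycles"]
}
"""


def load_as_object(text):
    """Parse JSON so that every nested object supports attribute access."""
    return json.loads(text, object_hook=lambda d: SimpleNamespace(**d))


marko = load_as_object(MARKO_JSON)

if __name__ == "__main__":
    print(marko._id)
    print(marko.location.city)
    print(marko.interests[0])
```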

All document database examples presented are using MongoDB [http://mongodb.org].


Document Databases: Inserting JSON Documents
• Let's insert a Grateful Dead document into the database.

> db.songs.insert({
    _id : "91",
    properties : {
      name : "TERRAPIN STATION",
      song_type : "original",
      performances : 302
    }
  })


Document Databases: Finding JSON Documents

• Searching is based on creating a “subset” document and pattern matching
it against the documents in the database.

• Find all songs where properties.name equals TERRAPIN STATION.

> db.songs.find({"properties.name" : "TERRAPIN STATION"})
{ "_id" : "91", "properties" :
{ "name" : "TERRAPIN STATION", "song_type" : "original",
"performances" : 302 }}
>


Document Databases: Finding JSON Documents

• You can also do comparison-type operations.

• Find all songs where properties.performances is greater than 200.

> db.songs.find({"properties.performances" : { $gt : 200 }})
{ "_id" : "104", "properties" :
{ "name" : "FRIEND OF THE DEVIL", "song_type" : "original",
"performances" : 304}}
{ "_id" : "122", "properties" :
{ "name" : "CASEY JONES", "song_type" :
"original", "performances" : 312}}
has more
>
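To make the "subset document" idea concrete, here is a small in-memory sketch of how such matching could work. It is not MongoDB's implementation: it handles only dotted keys, exact equality, and the $gt operator, and the names get_path, matches, and find are hypothetical. The three sample documents are the ones shown in the query results above.

```python
def get_path(document, dotted_key):
    """Follow a dotted key such as "properties.name" into nested dicts."""
    value = document
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(document, query):
    """Return True if the document satisfies every condition in the query."""
    for dotted_key, condition in query.items():
        actual = get_path(document, dotted_key)
        if isinstance(condition, dict) and "$gt" in condition:
            # Comparison operator: the stored value must exceed the bound.
            if actual is None or not actual > condition["$gt"]:
                return False
        elif actual != condition:
            # Plain value: the stored value must be equal.
            return False
    return True


def find(collection, query):
    """Return all documents in the collection that match the query."""
    return [doc for doc in collection if matches(doc, query)]


SONGS = [
    {"_id": "91", "properties": {"name": "TERRAPIN STATION",
                                 "song_type": "original", "performances": 302}},
    {"_id": "104", "properties": {"name": "FRIEND OF THE DEVIL",
                                  "song_type": "original", "performances": 304}},
    {"_id": "122", "properties": {"name": "CASEY JONES",
                                  "song_type": "original", "performances": 312}},
]

if __name__ == "__main__":
    print(find(SONGS, {"properties.name": "TERRAPIN STATION"}))
    print(find(SONGS, {"properties.performances": {"$gt": 302}}))
```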


Document Databases: Processing JSON Documents
• Sharding is the process of distributing a database’s data across multiple
machines. Each partition of the data is known as a shard.

• Document databases shard easily because there are no explicit references
between documents.

[Figure: a client application connects through a communication service to several shards, each holding a set of { _id : ... } documents.]
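Because documents carry no explicit references to one another, each can be placed on a shard by looking only at its own _id. The sketch below shows one possible placement rule, a stable hash of the _id modulo the number of shards. This rule is an assumption for illustration; the slides do not say how any particular database assigns documents to shards, and the names shard_for and distribute are hypothetical.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Map a document _id to a shard index using a stable hash."""
    digest = hashlib.sha256(str(doc_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def distribute(documents, num_shards):
    """Place every document on exactly one shard, based only on its _id."""
    shards = [[] for _ in range(num_shards)]
    for doc in documents:
        shards[shard_for(doc["_id"], num_shards)].append(doc)
    return shards


if __name__ == "__main__":
    docs = [{"_id": str(i)} for i in range(12)]
    for index, shard in enumerate(distribute(docs, 4)):
        print(index, [d["_id"] for d in shard])
```

A lookup by _id can then go straight to shard_for(doc_id, num_shards) without consulting any other shard.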



• Most document databases come with a Map/Reduce feature to allow for
the parallel processing of all documents in the database.
Map function: apply a function to every document in the database.
Reduce function: apply a function to the grouped results of the map.

$M : D \rightarrow (K, V)$,
where $D$ is the space of documents, $K$ is the space of keys, and $V$ is the
space of values.
$R : (K, V^n) \rightarrow (K, V)$,
where $V^n$ is the space of all possible combinations of values.


• Create a distribution of the Grateful Dead original song performances.

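> map = function() {
    // Emit (performance count, 1) for every original song.
    if (this.properties.song_type == "original")
      emit(this.properties.performances, 1);
  };

> reduce = function(key, values) {
    // Sum the 1s emitted for this performance count.
    var sum = 0;
    for (var i in values) {
      sum = sum + values[i];
    }
    return sum;
  };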


> results = db.songs.mapReduce(map, reduce)
{
"result" : "tmp.mr.mapreduce_1266016122_8",
"timeMillis" : 72,
"counts" : {
"input" : 809,
"emit" : 184,
"output" : 119
},
"ok" : 1,
}


{ _id : 122, properties : { name : "CASEY ...", performances : 312 }}
{ _id : 100, properties : { name : "PLAYIN...", performances : 312 }}
{ _id : 91, properties : { name : "TERRAP...", performances : 302 }}

map = function() {
  if (this.properties.song_type == "original")
    emit(this.properties.performances, 1);
};

key : value
312 : 1
312 : 1
302 : 1
...

key : values
312 : [1,1]
302 : [1]
...

reduce = function(key, values) {
  var sum = 0;
  for (var i in values) {
    sum = sum + values[i];
  }
  return sum;
};

{
  312 : 2
  302 : 1
  ...
}


> db[results.result].find()
{ "_id" : 0, "value" : 11 }
{ "_id" : 1, "value" : 14 }
{ "_id" : 2, "value" : 5 }
{ "_id" : 3, "value" : 8 }
{ "_id" : 4, "value" : 3 }
{ "_id" : 5, "value" : 4 }
...
{ "_id" : 554, "value" : 1 }
{ "_id" : 582, "value" : 1 }
{ "_id" : 583, "value" : 1 }
{ "_id" : 594, "value" : 1 }
{ "_id" : 1386, "value" : 1 }
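The map, group, and reduce steps walked through above can be imitated in a single process. The sketch below is only an illustration of the data flow, not of how MongoDB parallelizes the work. The three documents are the ones from the walkthrough, with song_type set to "original" as an assumption (the walkthrough elides that field), and the function names are hypothetical.

```python
from collections import defaultdict


def map_song(doc):
    """Map step: emit (performance count, 1) for every original song."""
    props = doc["properties"]
    if props["song_type"] == "original":
        yield props["performances"], 1


def reduce_counts(key, values):
    """Reduce step: sum the values that were grouped under one key."""
    return sum(values)


def map_reduce(documents, map_fn, reduce_fn):
    """Apply map_fn to every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


SONGS = [
    {"_id": 122, "properties": {"name": "CASEY ...",
                                "song_type": "original", "performances": 312}},
    {"_id": 100, "properties": {"name": "PLAYIN...",
                                "song_type": "original", "performances": 312}},
    {"_id": 91, "properties": {"name": "TERRAP...",
                               "song_type": "original", "performances": 302}},
]

if __name__ == "__main__":
    print(map_reduce(SONGS, map_song, reduce_counts))
```

Each key in the result is a performance count and each value is the number of original songs performed that many times, which is the distribution the slides compute.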


Graph Databases
• Graph databases store objects (vertices) and their relationships to one
another (edges). Usually these relationships are typed/labeled and
directed.

• Graph databases tend to be optimized for graph-based traversal
algorithms.

Neo4j is available at http://neo4j.org
AllegroGraph is available at http://www.franz.com/agraph/allegrograph
HyperGraphDB is available at http://www.kobrix.com/hgdb.jsp


Graph Databases: Property Graph Model
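[Figure: a property graph with six vertices and six directed, labeled, weighted edges.]

Vertices:
1: name = "marko", age = 29
2: name = "vadas", age = 27
3: name = "lop", lang = "java"
4: name = "josh", age = 32
5: name = "ripple", lang = "java"
6: name = "peter", age = 35

Edges:
7: 1 -knows-> 2, weight = 0.5
8: 1 -knows-> 4, weight = 1.0
9: 1 -created-> 3, weight = 0.4
10: 4 -created-> 5, weight = 1.0
11: 4 -created-> 3, weight = 0.4
12: 6 -created-> 3, weight = 0.2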

Graph data models vary. This section will use the data model popularized by Neo4j.


Graph Databases: Handling Property Graphs
• Gremlin is a graph-based programming language that can be used to
interact with graph databases.

• However, graph databases also come with their own APIs.


Gremlin is available at http://gremlin.tinkerpop.com.
All the examples in this section are using Gremlin and Neo4j.


Graph Databases: Moving Around a Graph in Gremlin
gremlin> $_ := g:key('name','marko')
==>v[1]
gremlin> ./outE
==>e[7][1-knows->2]
==>e[9][1-created->3]
==>e[8][1-knows->4]
gremlin> ./outE/inV
==>v[2]
==>v[3]
==>v[4]
gremlin> ./outE/inV/@name
==>vadas
==>lop
==>josh
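The three traversal steps in this session (start vertex, outgoing edges, vertices at the head of those edges) can be mimicked with plain data structures. The sketch below is not Gremlin or the Neo4j API; it is a hypothetical in-memory model holding only the vertices and edges that appear in the session output, with helper names (key, out_e, in_v) chosen to echo the Gremlin steps.

```python
# Vertex id -> properties, limited to the vertices seen in the session.
VERTICES = {
    1: {"name": "marko"},
    2: {"name": "vadas"},
    3: {"name": "lop"},
    4: {"name": "josh"},
}

# Each edge is (edge id, out vertex, label, in vertex), in session order.
EDGES = [
    (7, 1, "knows", 2),
    (9, 1, "created", 3),
    (8, 1, "knows", 4),
]


def key(prop, value):
    """Return ids of vertices whose property equals the value (like g:key)."""
    return [vid for vid, props in VERTICES.items() if props.get(prop) == value]


def out_e(vertex_id):
    """Return the edges leaving a vertex (like ./outE)."""
    return [edge for edge in EDGES if edge[1] == vertex_id]


def in_v(edge):
    """Return the id of the vertex an edge points to (like /inV)."""
    return edge[3]


if __name__ == "__main__":
    start = key("name", "marko")[0]
    for edge in out_e(start):
        print(VERTICES[in_v(edge)]["name"])
```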


Graph Databases: Inserting Vertices and Edges

• Let's create a Grateful Dead graph.

gremlin> $_g := neo4j:open('/tmp/grateful-dead')
==>neo4jgraph[/tmp/grateful-dead]
gremlin> $v := g:add-v(g:map('name','TERRAPIN STATION'))
==>v[0]
gremlin> $u := g:add-v(g:map('name','TRUCKIN'))
==>v[1]
gremlin> $e := g:add-e(g:map('weight',1),$v,'followed_by',$u)
==>e[2][0-followed_by->1]

You can batch load graph data as well: g:load('data/grateful-dead.xml') using the GraphML
specification [http://graphml.graphdrawing.org/].


Graph Databases: Inserting Vertices and Edges
• When all the data is in, you have a directed, weighted graph of the
concert behavior of the Grateful Dead. A song is followed by another
song if the second song was played next in concert. The weight of the
edge denotes the number of times this happened in concert over the 30
years that the Grateful Dead performed.


Graph Databases: Finding Vertices

• Find the vertex with the name TERRAPIN STATION.

• Find the name of all the songs that followed TERRAPIN STATION in
concert more than 3 times.

gremlin> $_ := g:key('name','TERRAPIN STATION')
==>v[0]
gremlin> ./outE[@weight > 3]/inV/@name
==>DRUMS
==>MORNING DEW
==>DONT NEED LOVE
==>ESTIMATED PROPHET
==>PLAYING IN THE BAND


Graph Databases: Processing Graphs
• Most graph algorithms are aimed at traversing a graph in some manner.

• The traverser makes use of vertex and edge properties in order to
guide its walk through the graph.


• Find all songs related to TERRAPIN STATION according to concert
behavior.

$e := 1.0
$scores := g:map()
repeat 75
  $_ := (./outE[@label='followed_by']/inV)[g:rand-nat()]
  if $_ != null()
    g:op-value('+',$scores,$_/@name,$e)
    $e := $e * 0.85
  else
    $_ := g:key('name','TERRAPIN STATION')
    $e := 1.0
  end
end


gremlin> g:sort($scores,'value',true())
==>PLAYING IN THE BAND=1.9949905250390623
==>THE MUSIC NEVER STOPPED=0.85
==>MEXICALI BLUES=0.5220420095726453
==>DARK STAR=0.3645706137191774
==>SAINT OF CIRCUMSTANCE=0.20585176856988666
==>ALTHEA=0.16745479118927242
==>ITS ALL OVER NOW=0.14224175713617204
==>ESTIMATED PROPHET=0.12657286655816163
...
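The traversal above can be read as a random walk that carries an "energy" value, deposits it on each vertex it reaches, decays it by 0.85 per step, and restarts from the source when it cannot continue. The sketch below follows that reading under simplifying assumptions: the next vertex is chosen uniformly from the out-neighbors, a restart happens only at a dead end, and the graph is a plain dictionary. The function name energy_walk and the tiny graph in the usage example are hypothetical.

```python
import random


def energy_walk(out_neighbors, start, steps=75, decay=0.85, rng=None):
    """Score vertices by the decaying energy a random walker deposits on them."""
    rng = rng or random.Random()
    scores = {}
    current = start
    energy = 1.0
    for _ in range(steps):
        neighbors = out_neighbors.get(current, [])
        if neighbors:
            # Step to a random out-neighbor and credit it with the energy.
            current = rng.choice(neighbors)
            scores[current] = scores.get(current, 0.0) + energy
            energy *= decay
        else:
            # Dead end: jump back to the start with full energy.
            current = start
            energy = 1.0
    # Highest score first, as in the sorted output of the slides.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical adjacency list, for illustration only.
    graph = {
        "TERRAPIN STATION": ["DRUMS", "PLAYING IN THE BAND"],
        "DRUMS": ["MORNING DEW"],
    }
    for name, score in energy_walk(graph, "TERRAPIN STATION", rng=random.Random(0)):
        print(name, score)
```

Songs that are reached often and early in a walk accumulate the highest scores, which is why they rank as most related to the starting song.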


Conclusions
• Relational Databases
    Stable, solid technology that has been used in production for decades.
    Good for storing inter-linked tables of data and querying within and across tables.
    They do not scale horizontally due to the interconnectivity of table keys and the cost of joins.
• Document Databases
    For JSON documents, there exists a one-to-one mapping from document to programming object.
    They scale horizontally and allow for parallel processing because sharding is forced at the document level.
    Performing complicated queries requires relatively sophisticated programming skills.
• Graph Databases
    Optimized for graph traversal algorithms and local neighborhood searches.
    Low impedance mismatch between a graph in a database and a graph of objects in object-oriented programming.
    They do not scale well horizontally due to the interconnectivity of vertices.




Fin.
Thank you for your time...

• My homepage: http://markorodriguez.com

• TinkerPop: http://tinkerpop.com

Acknowledgements: Peter Neubauer (Neo Technology) for comments and review.

