Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Nosql databases
1. Web Data Management Course (SCOM7348)
University of Birzeit, Palestine
January, 2014
Synthesis Paper Talk
Synthesis Paper: NoSQL Databases
Nader Abulhalaweh (Student)
Nagham Ghanim Hamad (Student)
Master of Computing, Birzeit University
Master of Computing, Birzeit University
abulhalaweh@gmail.com
naghamghanim@gmail.com
Fayez Shayeb (Student)
Dima Taji (Student)
Master of Computing, Birzeit University
Master of Computing, Birzeit University
Fayez.aauj@hotmail.com
dima.taji@gmail.com
This is a student talk, at Web Data Management course, each student is is asked
to present his/her synthesis paper.
Course Page: http://jarrar-courses.blogspot.com/2013/11/web-data-management.html
1
3. Introduction
•
What are NoSQL Databases
–
–
•
Not Only SQL Databases
Mostly distributed, non-relational storage systems
Where do NoSQL DBs prove effective
–
Effective for simple operations such as
•
•
–
Key lookups
Reads and writes of one record or a small number of records
The records usually are for
•
•
•
•
Electronic mail
Personal profiles
Wikis
Customer records
3
4. Introduction: NoSQL needs
• The need of NoSQL comes to fit the following issues:
–
–
–
–
–
–
–
–
–
–
–
Distributed (Large Data volume)
Scalability
Performance (Query needs to return answers quickly)
High Avalability (replications give tolerance-partioning)
Mostly Query, few updates
Asynchronous inserts, updates
Free-schema (schema-less)
ACID transactions not needed alternative is the BASE
CAP theorem in distributed systems
Open source development
Cheap
4
5. General Characteristics of NoSQL Databases
• The ability to horizontally scale simple operations throughput over
many servers.
• The ability to replicate and to distribute (partition) data over many
servers.
• A simple call level interface or protocol (in contrast to a SQL binding).
• A weaker concurrency model than the ACID transactions of most
relational (SQL) database systems.
• Efficient use of distributed indices and RAM for data storage.
• The ability to dynamically add new attributes to data records.
5
7. Types of NoSQL Databases
1.
2.
3.
4.
Key Value
Column
Document
Graph
7
8. Key- Value DataStores
• Definition: In simple words, it’s a Hash Table of Keys and it’s
values.
• We’ll call these systems key-value stores.
• Provide fast access for small data ! (MapReduce for big data)
• Subdivided into in-memory and disk-based solutions.
• Has the largest number of members
• No deal with highly connected data.
• Provide:
– replication, versioning, locking, transactions, sorting.
• The client interface provides:
– inserts, deletes, and index lookups.
• Examples: Riak, Voldemort
8
9. MapReduce
• It’s a technique for indexing and searching
large data volumes.
• Two phase:
– Map: Select a subset of data (key, value pair).
Potentially in parallel on multiple machines.
– Reduce: Merge the subsets of (key,value) pairs,
and then sort it.
• The results from MapReduce may be useful
for other searches.
9
10. Key- Value DataStores - Riak
• Riak: It is a “key-value store”
which written in Erlang by Basho. Riak
described as a “key-value store” and “document store” by Basho.
• Features:
•
Riak objects can be fetched and stored in JSON format
{
"bucket":"customers",
"key":"12345",
"object":{
"name":"Mr. Smith",
"phone":”415-555-6524” }
"links":[
["sales","Mr. Salesguy","salesrep"],
["cust-orders","12345","orders"] ]
"vclock":"opaque-riak-vclock",
"lastmod":"Mon, 03 Aug 2009 18:49:42 GMT"
} “ [2]
10
11. Key- Value DataStores - Riak
• Supports replication of objects and sharding by hashing on the
primary key.
• Support MVCC.
• Includes a map/reduce mechanism to split work over all the
nodes in a cluster.
• There is also a programmatic interface for Erlang, Java, and
other languages.
• One unique feature of Riak is that it can store “links” between
objects (documents).
11
12. Code Examples on KVS
• Riak is written in Java:
– Example on java code
IRiakClient client = RiakFactory.pbcClient(); // Note: Use this
line instead of the former if using a local devrel cluster//
IRiakClient client = RiakFactory.pbcClient("127.0.0.1", 10017);
Bucket myBucket = client.fetchBucket("test").execute();
int val1 = 1;
myBucket.store("one", val1).execute();
12
14. Key- Value DataStores - Project-Voldemort
• Project Voldemort: is an advanced key-value store, written in
Java.[2]. It is developed by LinkedIn
• To enable high performance and availability we allow only
very simple key-value data access. Both keys and values
can be complex compound objects including lists or maps,
but none-the-less the only supported queries are
effectively the following:
14
15. Key- Value DataStores - Project-Voldemort
• Features:
• Voldemort provides multi-version concurrency control (MVCC) for
updates.
• Supports optimistic locking for consistent multi-record updates.(in
conflict updates, they can be backed out)
• provide an ordering on versions using Vector clocks.
• Supports automatic sharding of data.Consistent hashing is used
to distribute data around a ring of nodes.
• Can store data in RAM, Storage engine (Berkeley-DB).
• Supports lists and records in addition to simple scalar values.
• Easy to add and remove a Node to the cluster.
• Support Lists and Maps for values.
15
17. Voldemort …
• Disadvantages
– no complex query filters
– all joins must be done in code
– no foreign key constraints
– no triggers (no cascading)
– Does not support MapReduce mechanism
17
19. Column Databases
•
•
•
•
What are Column Databases
Structure and Patterns
Problems With Column Databases
Comparison
– HBase
– Cassandra
19
20. What are Column Databases
• Also called Extensible Record Stores.
• Many consider Column databases to be Key-Value
databases
• A column can be part of a Column Family that resembles
at most a relational row, but it may appear in one row and
not in the others.
• Standard column families have a schema-less nature so
that each of their "row"s can contain a different number of
columns, and even different column names could be in
each row.
20
21. What are Column Databases (cont.)
• A column of a distributed data store is a NoSQL object of
the lowest level in a keyspace. It is a tuple (a key-value
pair) consisting of three elements:
– Unique name: Used to reference the column
– Value: The content of the column. It can have different types,
like AsciiType, LongType, TimeUUIDType, UTF8Type among
others.
– Timestamp: The system timestamp used to determine the valid
content.
21
22. What are Column Databases (Example)
UserProfile = {
Cassandra = { emailAddress:”casandra@apache.org” , age:”20”}
TerryCho = { emailAddress:”terry.cho@apache.org” , gender:”male”}
Cath = { emailAddress:”cath@apache.org” ,
age:”20”,gender:”female”,address:”Seoul”}
}
where "Cassandra", "TerryCho", "Cath" correspond to row keys; and
"emailAddress", "age", "gender", "address" correspond to the column
names.
22
23. Structure and Patterns
• Each column family is partitioned horizontally across different nodes,
each node hosting partitions for all columns.
• A single partition of a column family contains
– Time-stamped K/V pairs that are appended at the tail end of the file.
– The partition also includes an index which sorts the pairs lexicographically and
points to their current version.
• This allows for in-order traversal of the data and efficient location of the
most-recent version of the data.
• Since updates are appends to the data stores rather than in-place
over-writes, old versions with stale data or deleted values in the
column segments need to be periodically garbage-collected.
23
24. Problems with Column Databases
1. Inserted tuples have to be broken up into their
component attributes and each attribute must be
written separately
2. The dense-packed data layout makes moving
tuples within a page nearly impossible.
3. Most queries access more than one attribute
from an entity, but information about a logical
entity is stored in multiple locations.
24
25. HBase vs. Cassandra
HBase
Cassandra
High Availability
Yes
Distributed
Transaction Processing
Yes
No
Implementation language
Java
Java
Influence
Google's BigTable
Amazon's Dynamo
License
Apache 2.0
Apache 2.0
Persistence
Yes
CAP Theorem Focus
Major version upgrades require
imports
Uses HDFS, Amazon S3, or
Amazon Elastic Block Store
Facebook's Messaging
Platform
HBase Shell Commands or
Cloudera Impala Query Engine
Consistency & Availability
Optimized For
Reads
CQL (Cassandra Query
Language)
Availability and Partition
Tolerence
Writes
Partitioning Strategies
Ordered Partitions
Random Partitions
Replication
Major Use
Query Language
Yes
Facebook's Inbox Search
25
27. Comparison
Relational
Article
- id
- authorid
- title
- content
Author
- id
- name
- email
Comment
- id
- articleid
- message
Document db
Article
- _id
- title
- content
- author
- _id
- name
- email
- comments[]
28. Concepts
Joins
No joins
Joins at "design time", not at "query time“
Constraints
No foreign key constraints
Unique indexes
Transactions
No commit/rollback
Atomic operations
Multiple actions inside the same document
Incl. embedded documents
29. Benefits
Scalable: good for a lot of data / traffic
Horizontal scaling: to more nodes
Good for web-apps
Performance
No joins and constraints
user friendly
Data is modeled to how the app is going to use it
No conversion between object oriented > relational
No static schema = agile
30. One-to-many
Embedded array / array keys
Some queries get harder
You can index arrays!
Article
- _id
- content
- tags: {“foo”, “bar”}
- comments: {“id1”, “id2”}
31. many-to-many
Using array keys
No join table
References on both sides
Article
- _id
- content
- category_ids : {“id1”, “id2”}
Category
- _id
- name
- article_ids: {“id7”, “id8”}
Advantage: simple queries
articles.Where(p => p.CategoryIds.Contains(categoryId))
categories.Where(c => c.ArticleIds.Contains(articleId))
Disadvantage: duplication, update two docs
32. json
•
•
•
Stands for Javascript Object Notation
Derived from Javascript scripting language
Used for representing simple data
structures and associative arrays
{
{
"_id": "BCCD12CBB",
"type": "person",
"name": "Darth Vader",
"age": 63,
"headware":
["Helmet", "Sombrero"],
"dark_side": true
}
"_id": "BCCD12CBC",
"type": "person",
"name": "Luke",
"age": 35,
"powers":
["Pull", "Jedi Mind Trick"],
"dark_side": false
}
33. Replication
Only one server is active for writes (the primary, or
master) at a given time – this is to allow strong
consistent (atomic) operations. One can optionally
send read operations to the secondaries when
eventual consistency semantics are acceptable.
34. Sharding
Sharding is the partitioning of data among multiple
machines in an order-preserving manner.(horizontal
scaling )
Machine 1
Machine 2
Machine 3
Alabama → Arizona
Colorado → Florida
Arkansas → California
Indiana → Kansas
Idaho → Illinois
Georgia → Hawaii
Maryland → Michigan
Kentucky → Maine
Minnesota → Missouri
Montana → Montana
Nebraska → New Jersey
Ohio → Pennsylvania
New Mexico → North Dakota
Rhode Island → South Dakota
Tennessee → Utah
Vermont → West Virgina
Wisconsin → Wyoming
35. Replication & Sharding conclusion
sharding is the tool for scaling a system, and
replication is the tool for data safety, high availability,
and disaster recovery. The two work in tandem yet are
orthogonal concepts in the design.
37. What is MongoDB?
Creator: 10 gen, former doublick
Name: short for humongous ()عمالق
Language: C++
38. What is MongoDB?
Defination: MongoDB is an open source, document-
oriented database designed with both scalability and
developer agility in mind. Instead of storing your data
in tables and rows as you would with a relational
database, in MongoDB you store JSON-like
documents with dynamic schemas(schema-free,
schemaless).
39. What is MongoDB?
Goal: bridge the gap between key-value stores (which
are fast and scalable) and relational databases (which
have rich functionality).
42. MongoDB Data model
Data model: Using BSON (binary JSON), developers
can easily map to modern object-oriented languages
without a complicated ORM layer.
BSON is a binary format in which zero or more
key/value pairs are stored as a single entity.
lightweight, traversable, efficient
51. Graph databases
•
What is a Graph Database?
• is a model that describe a real data in the world and the whole
database shown as structured crowded with relationships between
nodes. Graph databases are generally built for use with transactional
(OLTP) systems.
51
52. Graph database
•
A Graph contains Nodes and Relationships
A Graph —records data in→ Nodes —which have→ Properties
•
Relationships organize the Graph
Nodes —are organized by→ Relationships —which also have→ Properties
•
Labels group the Nodes
Nodes —are grouped by→ Labels —into→ Sets
•
Query a Graph with a Traversal
A Traversal —navigates→ a Graph; it —identifies→ Paths —which order→ Nodes
•
Indexes look-up Nodes or Relationships
An Index —maps from→ Properties —to either→ Nodes or Relationships
52
55. Graph Databases Models- Neo4j
•
Neo4j is an open source implemented using Java as a graph databases that
allow representing complex and huge data in easy way using a key-value
property.
•
Both nodes and relationships can contain properties.
•
Nodes are often used to represent entities, but depending on the domain
relationships may be used for that purpose as well.
•
Neo4j's Cypher language is purpose built for working with graph data that
supports most algorithms used for searching and finding path in graph such as
shortest path and graph pattern matching
•
Schema-free
55
56. Graph Databases Models- Neo4j
• Architecture
•
The node and relationship stores are concerned only with the structure of the
graph, not its property data. Both stores use fixed-sized records so that any
individual record’s location within a store file can be rapidly computed given its
ID.
56
57. Graph Databases Models- AllegroGraph
• AllegroGraph is a semantic web technology using for storing RDF in
triples that depends on disk-based storage Store data and meta-data
as triples.
• One thing to note is that AllegroGraph doesn't restrict the contents of
its triples to pure RDF.
• represent any graph data-structure by treating its nodes as subjects
and objects, its edges as predicates and creating a triple for every
edge. The named-graph slot can be used to hold additional,
application-specific, information. Used this way, AllegroGraph
becomes a powerful graph-oriented database.
• Allegro Generates 12 byte unique identifier for each string in the triple
store.
• Query these triples through various query APIs like SPARQL.
• It has Java, Python, Lisp, Prolog interfaces.
• AllegroGraph's RDFS++ reasoning supports all the RDFS predicates
and some of OWL's.
57
58. Graph Databases Models- AllegroGraph
• Architecture
•
•
AllegroGraph provides a REST protocol architecture
AllegroGraph is designed to store RDF triples, a standard format for Linked Data
58
59. Graph Databases Models- AllegroGraph
HOW DOES IT WORK?
• AllegroGraph runs as a service.
• It can work with large variety of programming interfaces
like Java, Lisp.
• Although it’s called a “triple”, each triple has five fields.
–
–
–
–
–
Subject
Predicate
Object
Graph
ID – stands for triple identifier
59
60. Graph Databases Models- Virtuoso
•
Universal server is a multi-protocol RDBMS.
•
Virtuoso provides support for web servers which help in querying and loading
data over HTTP.
•
Support SPARQL query language which typecasting is differs from SQL where
Virtuoso provide a QUITCAST query hint that helps developers from writing
and complex cast expression in SQL.
•
Implements a quad (graph, subject, predicate, object)
60
61. Virtuoso Architecture
A high performance, scalable,
secure and operating-system
independent server designed to
handle contemporary challenges
associated with standards
compliant data access, data
integration, and data
management
61
63. indexing
• Performance is gained by creating indexes, which improve the speed
of looking up nodes in the databases.
• Neo4j does not support automatic indexing, you have specifies which
properties to index, since Neo4j schema optional, which you can use it
without any schema (schema-free).
• In the other hand Allegro and Virtuoso support a dynamic and
automatic indexing. By default Allegro AllegroGraph now supports any
index combination of S, P, O, G. The default indices default are
SPOGI, POSGI, OSPGI, GPOSI, GOSPI, and I. The graph indices,
namely GSPOI, GPOSI, GOSPI are used when the triple store is
divided into sub graphs. The I index is a special type of index which
lists all the triples by an id number.
63
64. Indexing – cont.
• In virtuoso the indexing of RDF data includes 2 full indices over RDF
quads and 3 partial indices. The indexing scheme in Virtuoso consists
of the following indices:
–
–
–
–
–
PSOG - primary key
POGS - bitmap index for lookups on object value.
SP - partial index for cases where only S is specified.
OP - partial index for cases where only O is specified.
GS - partial index for cases where only G is specified.
64
65. Query languages
• In Neo4j there is no supporting for sql query. It is support a Cypher
language.
• Virtuoso support SQL,SPARQL,XQuery,XPath and RDF++ resoning.
And in the other hand, allegro support a SPARQL as a query
language.
65
66. Storing
• In neo4j is a native graph databases implemented java. Which means
it’s internal database structure directly represent nodes, relationships
and properties as records in its database files. Neo4j represents nodes
and relationships as java objects in its embedded java-API, as ASCII
art in its query language cipher.
• Allegrograph and Virtuoso supports a native datatypes, which stores
wide range of data types directly in its low level triple represtation.
66
67. Loading and indexing
•
•
The author in [16] did the comparison between previous models in loading and indexing
triples and measure the time for each one.
The evaluation here focuses on centralized triple stores with a single client making
queries at a given time. The queries are executed one after the other.
triples
40377
374911
1075626
1809874
24711725
9279230
Allegro
1.608
15.505
45.135
77.627
1156.8
3198
Virtuoso
1.38
13.22
55.74
90.48
6408
4860
Neo4j
24.98
6.35
1347
2484
53496
175032
67
68. Loading and indexing-cont.
•
AllegroGraph and Virtuoso have similar best performance for smaller datasets.
For larger datasets, AllegroGraph seems to be performing somewhat better
than Virtuoso in our case where the maximum number of triples that were
tested was 50 million triples.
•
Neo4j shows poor 25 performance among the four triple stores, The
comparison of Neo4j shows though, that Neo4j has the longest loading times,
furthermore Neo4j scales worst.
68
77. NoSQL DBs vs. Relational DBs
• Relational Databases:
– New relational systems can do everything that a NoSQL system
can, with analogous performance and scalability, and with the
convenience of transaction and SQL.
– RDBMSs have taken and retained majority market share over other
competitors in the past 30 years.
– Successful RDBMSs have been build to handle other specific
application loads in the past
– SQL has a common interface that gives advantages in training,
continuity, and data interchange.
77
78. NoSQL DBs vs. Relational DBs (cont.)
• NoSQL Databases
– RDBMSs cannot achieve scaling in comparison with NoSQL
systems
– Key-value stores is adequate for a lookup of objects based on a
single key, and is probably easier to understand than an RDBMS.
(Learning curve for the level of complexity you require only)
– Provides flexible schema that allows each object in a collection to
have different attributes.
– NoSQL systems make expensive operation impossible or obviously
expensive for programmers.
– NoSQL products have established smaller but non-trivial markets
in areas where there is a need for particular capabilities.
78
79. NoSQL DBs vs. Relational DBs (cont.)
Data Model
Performance
Scalability
Flexibility
Complexity
Functionality
Key–value Stores
high
high
high
low
variable (none)
Column Store
high
high
moderate
low
minimal
Document Store
high
variable (high)
high
low
variable (low)
Graph Database
variable
variable
high
high
graph theory
Relational Database
variable
variable
low
moderate
relational algebra
Ben Scofield, Wikipedia, NoSQL (http://en.wikipedia.org/wiki/NoSQL)
79
80. The Future of NoSQL DBs
• NoSQL databases are popular enough so that many developers may
abandon globally-ACID transactions.
• NoSQL data stores will not be a passing fad, since they fill a market
niche.
• New RDBMSs will take a significant share of the scalable data storage
market.
• Scalable data stores will not prove to be enterprise ready for a while.
• There will be a major consolidation among some systems.
80
81. References 1
1.
Decker, S., Passant, A. and Breslin, J. (2009). The Social Semantic Web. Berlin,
Heidelberg: Springer.
2.
Grubar, T. (1993). Toward Principles for the Design of Ontologies Used for Knowledge
Sharing. International Journal Human-Computer Studies, 43, 907-928. Retrieved from
http://tomgruber.org/writing/onto-design.pdf
3.
Recordon, D. (2010). The Open Graph protocol Understanding the design decisions,
in WWW Conference 2010, April 29th, 2010,
http://www.scribd.com/doc/30715288/The-Open-Graph-Protocol-Design-Decisions
4.
Brunati, M. (2010). Facebook Open Graph and the Semantic Web [PowerPoint slides],
December, 2012 (http://www.slideshare.net/dagoneye/facebook-open-graph-and-thesemantic-web)
5.
Facebook Developers. Open Graph Protocol, Retrieved December 20, 2012 from
https://developers.facebook.com/docs/opengraphprotocol/
6.
Facebook Developers. Open Graph Concepts, Retrieved December 20, 2012 from
https://developers.facebook.com/docs/concepts/opengraph/
7.
Facebook. Open Graph protocol, Retrieved December 20, 2012 from http://ogp.me/
81
82. References 2
8.
Berners-Lee, Tim; Fischetti, Mark,Weaving the Web. HarperSanFrancisco. chapter 12.
(1999). ISBN 978-0-06-251587-2.
9.
Victoria Shannon "A 'more revolutionary' Web". International Herald Tribune. (June 26,
2006). (http://www.nytimes.com/2006/05/23/technology/23iht-web.html). Retrieved
December, 2012.
10.
"W3C Semantic Web Activity". World Wide Web Consortium (W3C). November 7,
2011. http://www.w3.org/2001/sw/. Retrieved December, 2012
11.
"OWL Web Ontology Language Overview". World Wide Web Consortium (W3C).
February 10, 2004. (http://www.w3.org/TR/owl-features/) . Retrieved December, 2012.
12.
Bizer, Christian; Heath, Tom; Berners-Lee, Tim,"Linked Data—The Story So Far".
International Journal on Semantic Web and Information Systems 5 (3): 1–22.
doi:10.4018/jswis.2009081901. (2009). ISSN 15526283. Retrieved Dec, 2012
13.
Heath and Christian Bizer ,Linked Data: Evolving the Web into a Global Data Space,
Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan &
Claypool. (2011)
14.
linked open data:
(http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData)
(Visited December 2012)
82
83. References 3
15.
Erwin Folmer, Paul Oude Luttighuis,Jos van Hillegersberg, Do semantic standards
lack quality? A survey among 34 semantic standards, June 2011
16.
[9] W3C: Linked Data. Webpage, Copyright
2012(http://www.w3.org/standards/semanticweb/daa)(Visited December 2012)
17.
XHTML Metain formation Attributes Module,( XHTML Metainformation Attributes
Module)(Visited in Dec, 2012)
18.
RDFa in XHTML: Syntax and Processing,( http://www.w3.org/TR/2007/WD-rdfasyntax-20071018/) (Visited in Jan, 2013)
19.
Zollmann, Johannes. "NoSQL Databases." SigMod , 2011.
20.
Schindler, Jiri. "I/O Characteristics of NoSQL Databases." VLDB Endowment 5, no. 12
(August 2012): 2020-2021.
21.
Cattell, Rick. "Scalable SQL and NoSQL Data Stores." SigMod Record 39, no. 4
(December 2010): 12-27.
22.
Rabl, Tilmann, Mohammad Sadoghi, Hans-Arno Jacobsen, Sergio Gόmez-Villamor,
Victor Muntés-Mulero, and Serge Mankovskii. "Solving Big Data Challenges for
Enterprise Application Performance Management." VLDB Endownment 5, no. 12
(2012): 1724-1735.
83
84. References 4
15.
Abadi, Daniel J., Peter A. Boncz, and Stavros Harizopoulos. "Column-oriented
Database Systems." VLDB Endowment, 2009.
16.
Atzeni, Paolo, Chrstian S. Jensen, Giorgio Orsi, Sudha Ram, Letizia Tanca, and
Riccardo Torlone. "The relational model is dead, SQL is dead, and I don't feel so good
myself." SigMod Record 42, no. 2 (June 2013): 64-68.
17.
Stonebraker, M., S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P.
Helland. "The end of an architectural era: (it's time for a complete rewrite)." VLDB
Endowment , 2007: 1150-1160.
84