1. TOWARDS AN ARCHITECTURE FOR MANAGING BIG SEMANTIC DATA IN REAL-TIME
Carlos E. Cuesta, VorTIC3, URJC, Spain
Miguel A. Martínez-Prieto, UVa, Spain
Javier D. Fernández, UVa, Spain & UChile, Chile
Montpellier, France, 02/07/2013
2. CONTENTS
Introduction
Problem Statement
Context: the RDF world
Proposal: SOLID Architecture
Unfolding in five Layers
SOLID in Practice
The RDF/HDT format
The SOLID/HDT Architecture
Conclusions & Future work
3. INTRODUCTION
Big Data has become an important topic
When the size of the data itself becomes part of the
problem (Loukides)
Characterized by the “three Vs”
Volume: large amounts of data gathered and stored
The challenge is storage, but also computing
Volume is relative: depends on available resources
Velocity: different flows of data at different rates
Variety: the kind of structures within the data
Each source has its own semantics
Need for a logical model to allow data integration
An architecture for Big Data must consider all three
4. INTRODUCTION
One of the dimensions always becomes critical
E.g. storage in mobile applications, velocity in real-time applications (vs. batch processes)
We promote variety
The dataset value is increased when multiple sources
are integrated, achieving more knowledge
This also influences velocity and volume
We choose a graph-based model
Allows managing higher levels of variety
Data can be linked and queried together
In practice, this means using RDF as data model
The cornerstone of the “practical” Semantic Web
The basis of the emergent Web of Data
5. PROBLEM STATEMENT
Most solutions to manage Big Data aim to
maximize the volume dimension
Therefore promoting efficient storage
Datastores able to cope with large datasets
Indexing strategies to achieve high space efficiency
Datastores must be assumed to be stable
In spite of the assumed immutability property
But the volume of incoming data is also big
Datastores must be periodically updated & reindexed
This is very complex in a Real-Time context
Data must be received and integrated in real time
No time to process the flow of incoming data
6. OUR PROPOSAL: SOLID ARCHITECTURE
We propose a specific architecture to manage
Real-Time flows in this context
A multi-tiered architecture
Separate consumption of Big Semantic Data…
… from the complexities of Real-Time operation
Data must be kept compact
It is stored and indexed in a compressed way
Data & Index Layers
Needs to efficiently cope with data updates
The reason for the Online Layer
Needs to query all of this together
The reason for the Service Layer
7. CONTEXT: RDF
RDF: Resource Description Framework
Data described as (subject, predicate, object) triples
An RDF dataset is a graph of knowledge
Entities linked to values via labelled edges
Essential for Linked Open Data
Adopted in many different contexts
Simple integration: everything has a URI
[Diagram: example triple — (John, owns, Car)]
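The triple model can be sketched in plain Python; the identifiers and the `match` helper are illustrative, not part of any RDF library:

```python
# An RDF dataset as a set of (subject, predicate, object) triples.
# Prefixed names stand in for full URIs; this is an illustrative sketch.
triples = {
    ("ex:John", "ex:owns", "ex:Car"),
    ("ex:John", "foaf:name", '"John"'),
    ("ex:Car", "ex:color", '"red"'),
}

# Matching a triple pattern: None acts as a wildcard.
def match(pattern, dataset):
    s, p, o = pattern
    return [t for t in dataset
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# All triples whose subject is John:
johns_facts = match(("ex:John", None, None), triples)
```

Every SPARQL basic graph pattern ultimately reduces to such wildcard matches, joined on shared variables.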
8. CONTEXT: RDF
The origin of the Web of Data
Two datasets can become connected by a single triple
<“Station #123”, location, “Canal Street”>
The web becomes data-centric
Every unit is a small piece of data
“The Big Data’s long tail”
But their integration in large contexts becomes
complex: Big Semantic Data
A variety of sources become easily integrated
RDF is not a serialization format
Describes what data is stored, not how this is done
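A minimal sketch of how a single bridging triple connects two datasets; all identifiers are made up for illustration:

```python
# Two independent datasets (illustrative prefixed names, not real vocabularies).
stations = {("st:123", "st:type", "st:Station")}
streets  = {("geo:CanalStreet", "geo:type", "geo:Street")}

# One bridging triple is enough to connect both graphs into a single one.
bridge = ("st:123", "geo:location", "geo:CanalStreet")

merged = stations | streets | {bridge}

# After the merge, data from both sources can be traversed together:
location = next(o for (s, p, o) in merged
                if s == "st:123" and p == "geo:location")
```

This is the Web of Data effect in miniature: integration happens at the data level, with no schema negotiation required.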
9. SOLID ARCHITECTURE
[Diagram: the five SOLID layers — the Online Layer receives new data; a batch Merge Layer (parallelizable processing) dumps it into the Data Layer's Big Data datastore; the Index Layer provides read access; the Service Layer resolves queries with joins]
10. SOLID ARCHITECTURE
[Diagram: the same layered architecture, annotated with the data model (RDF) and the query language (SPARQL)]
11. SOLID ARCHITECTURE
Online Layer
Receives incoming new data
Deals with real-time needs
Data Layer
The core of the architecture
The main datastore: the Big Data repository
Stores compressed RDF
Index Layer
Provides an index for the Data Layer, to make
possible high-speed access
Most accesses to the repository are made through it
12. SOLID ARCHITECTURE
Service Layer
The façade to the external user
Able to issue federated SPARQL queries to the
separate datastores in different layers
Every query is distributed, and the different answers
are joined
Merge Layer
Makes it possible to integrate the two datastores
Receives a dump of data of the online layer
Integrates that with the existing repository
Producing a fresh copy of the Data Layer
Immutability properties are preserved
13. SOLID IN PRACTICE
This abstract architecture becomes feasible when
applied to existing technology
In particular, the RDF/HDT binary format
Decisions must be taken, layer by layer, about
how to actually implement it
Other alternatives would also be possible (and some
of them are also being implemented)
Data-Centric Layers
Do not use a textual RDF representation
Inefficient, prevents some potential uses
RDF/HDT is a binary format
Conceived specifically for serialization purposes
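The space savings of a binary format rest on ideas such as dictionary encoding; a toy sketch of that idea follows (the actual RDF/HDT encoding is considerably more elaborate, with separate dictionaries and succinct bitmap structures):

```python
# Sketch of dictionary-based encoding, one of the ideas behind compact
# RDF serializations such as HDT. Each long term is stored exactly once;
# triples become small integer tuples.
def encode(triples):
    terms = sorted({t for triple in triples for t in triple})
    dictionary = {term: i for i, term in enumerate(terms)}
    id_triples = [(dictionary[s], dictionary[p], dictionary[o])
                  for s, p, o in triples]
    return dictionary, id_triples

triples = [
    ("ex:John", "ex:owns", "ex:Car"),
    ("ex:Mary", "ex:owns", "ex:Car"),
]
dictionary, ids = encode(triples)
```

Repeated terms (here the predicate and object) cost one dictionary entry regardless of how many triples use them, which is where most of the compression comes from.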
14. SOLID IN PRACTICE
RDF/HDT format
Designed for machine processing
About 15 times less space than equivalent formats
Uses compact (compressed) data structures
Data Layer
Big Semantic Data in RDF/HDT
Space savings and guaranteed immutability
Instant mapping to memory
Allows querying without decompressing
Index Layer
Implements the HDT/FoQ proposal
Lightweight index on top of the HDT binary format
Efficient SPARQL retrieval without decompressing
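A plain-Python analogue of indexing ID-encoded triples for fast pattern access; HDT-FoQ actually builds succinct structures over the binary format rather than hash maps, so this only conveys the idea:

```python
from collections import defaultdict

# Toy subject -> [(predicate, object), ...] index over ID-encoded triples.
# Resolving a pattern with a bound subject becomes a single lookup,
# with no decompression of the underlying data.
def build_spo_index(id_triples):
    index = defaultdict(list)
    for s, p, o in id_triples:
        index[s].append((p, o))
    return index

index = build_spo_index([(0, 2, 1), (3, 2, 1)])
```

Real SPARQL resolution needs several access orders (by subject, predicate, or object), which is why HDT-FoQ adds complementary indexes on top of the base serialization.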
15. SOLID IN PRACTICE
Online Layer
Copes with the incoming flow of real-time data
HDT is inadequate (designed for read-only)
Must resolve SPARQL efficiently
Choose a general-purpose NoSQL technology
Still able to dump data in an RDF format
Service Layer
Resolves any potential queries
SPARQL considered expressive enough
Queries are forwarded to Online and Index Layers
Their results are retrieved and combined
Using a (scalable) Pipe-Filter approach
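The query-forwarding and result-combination step can be sketched as a small generator pipeline; the two source functions below are hypothetical stand-ins for the Online and Index Layers:

```python
import itertools

# Illustrative stand-ins: each source answers the same triple pattern.
def query_online(pattern):
    yield ("ex:Sensor9", "ex:reading", '"42"')   # fresh data, not merged yet

def query_index(pattern):
    yield ("ex:Sensor1", "ex:reading", '"17"')   # historical, compressed data

# Pipe-Filter style: chain the per-source result streams and
# deduplicate lazily, so results flow as soon as any source answers.
def federated(pattern, sources):
    seen = set()
    for triple in itertools.chain(*(src(pattern) for src in sources)):
        if triple not in seen:
            seen.add(triple)
            yield triple

results = list(federated((None, "ex:reading", None),
                         [query_online, query_index]))
```

Streaming the combination, rather than materializing each source's full answer first, is what keeps the approach scalable as sources grow.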
16. SOLID IN PRACTICE
Merge Layer
Able to combine incoming data from the Online Layer
with the existing datastore in the Data Layer
The data dump is merged into a copy of the datastore
Then the fresh datastore replaces the previous one
Periodical process, can also be manually triggered
Requires high-performance computation
In practice, this means a Map/Reduce approach
Raw RDF data from Online Layer is converted
Then ordered for internal merging
The cost depends on the size of the smaller store
Also triggers reindexing of the Index Layer
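The batch step can be sketched as a sorted, duplicate-free merge of the online dump into the existing store; the real process distributes this work with Map/Reduce on Hadoop, but the core operation is the same:

```python
import heapq

# Merge Layer sketch: both inputs are sorted, then merged into a fresh,
# duplicate-free datastore that replaces the previous one. The original
# store is never modified in place, preserving immutability.
def merge_stores(existing, dump):
    merged = []
    for triple in heapq.merge(sorted(existing), sorted(dump)):
        if not merged or merged[-1] != triple:   # drop duplicates
            merged.append(triple)
    return merged

existing = [("s1", "p", "o1"), ("s2", "p", "o2")]
dump     = [("s2", "p", "o2"), ("s3", "p", "o3")]
fresh = merge_stores(existing, dump)
```

Sorting is exactly what Map/Reduce shuffles provide for free, which is why the batch merge maps naturally onto Hadoop.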
17. SOLID ARCHITECTURE IN PRACTICE
[Diagram: SOLID in practice — the Online Layer on NoSQL, the Data Layer in RDF/HDT, the batch Merge Layer on Hadoop, and the Service Layer resolving SPARQL with a Pipe-Filter (P/F) approach over the Semantic Data]
18. CONCLUSIONS & FUTURE WORK
We propose SOLID as a generic architecture for
managing Big Semantic Data
Our particular implementation relies on HDT
Also NoSQL for real-time incoming data
Cassandra, but (still) not the only choice
Map/Reduce (Hadoop) for intensive processing
Highly effective in terms of space & time
Initial empirical results are very significant
Currently developing an optimized prototype
Already working on variants of the architecture
Limited version for mobile devices
The Merge Layer is not directly required