1. TOWARDS AN ARCHITECTURE FOR MANAGING BIG SEMANTIC DATA IN REAL-TIME
Carlos E. Cuesta, VorTIC3, URJC, Spain
Miguel A. Martínez-Prieto, UVa, Spain
Javier D. Fernández, UVa, Spain & UChile, Chile
Montpellier, France, 02/07/2013
2. CONTENTS
Introduction
Problem Statement
Context: the RDF world
Proposal: SOLID Architecture
Unfolding in five Layers
SOLID in Practice
The RDF/HDT format
The SOLID/HDT Architecture
Conclusions & Future work
3. INTRODUCTION
Big Data has become an important topic
When the size of the data itself becomes part of the
problem (Loukides)
Characterized by the “three Vs”
Volume: large amounts of data gathered and stored
The challenge is storage, but also computing
Volume is relative: depends on available resources
Velocity: different flows of data at different rates
Variety: the kind of structures within the data
Each source has its own semantics
Need for a logical model to allow data integration
An architecture for Big Data must consider all three
4. INTRODUCTION
One of the dimensions always becomes critical
E.g. storage in mobile applications, velocity in real-time applications (vs. batch processes)
We promote variety
The dataset value is increased when multiple sources
are integrated, achieving more knowledge
This also influences velocity and volume
We choose a graph-based model
Allows managing higher levels of variety
Data can be linked and queried together
In practice, this means using RDF as data model
The cornerstone of the “practical” Semantic Web
The basis of the emergent Web of Data
5. PROBLEM STATEMENT
Most solutions to manage Big Data aim to
maximize the volume dimension
Therefore promoting efficient storage
Datastores able to cope with large datasets
Indexing strategies to achieve high space efficiency
Datastores must be assumed to be stable
In spite of the assumed immutability property
But the volume of incoming data is also big
Datastores must be periodically updated & reindexed
This is very complex in a Real-Time context
Data must be received and integrated in real time
No time to process the flow of incoming data
6. OUR PROPOSAL: SOLID ARCHITECTURE
We propose a specific architecture to manage
Real-Time flows in this context
A multi-tiered architecture
Separate consumption of Big Semantic Data…
… from the complexities of Real-Time operation
Data must be kept compact
It is stored and indexed in a compressed way
Data & Index Layers
Needs to efficiently cope with data updates
The reason for the Online Layer
Needs to query all of this together
The reason for the Service Layer
7. CONTEXT: RDF
RDF: Resource Description Framework
Data described as (subject, predicate, object) triples
An RDF dataset is a graph of knowledge
Entities linked to values via labelled edges
Essential for Linked Open Data
Adopted in many different contexts
Simple integration: everything has a URI
[Diagram: example triple — (John, owns, Car)]
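The triple model can be sketched in plain Python; the identifiers and the `match` helper are illustrative, not part of any RDF library:

```python
# An RDF dataset as a set of (subject, predicate, object) triples.
# Prefixed names stand in for full URIs; this is an illustrative sketch.
triples = {
    ("ex:John", "ex:owns", "ex:Car"),
    ("ex:John", "foaf:name", '"John"'),
    ("ex:Car", "ex:color", '"red"'),
}

# Matching a triple pattern: None acts as a wildcard.
def match(pattern, dataset):
    s, p, o = pattern
    return [t for t in dataset
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# All triples whose subject is John:
johns_facts = match(("ex:John", None, None), triples)
```

Every SPARQL basic graph pattern ultimately reduces to such wildcard matches, joined on shared variables.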
8. CONTEXT: RDF
The origin of the Web of Data
Two datasets can become connected by a single triple
<“Station #123”, location, “Canal Street”>
The web becomes data-centric
Every unit is a small piece of data
“The Big Data’s long tail”
But their integration in large contexts becomes
complex: Big Semantic Data
A variety of sources become easily integrated
RDF is not a serialization format
Describes what data is stored, not how this is done
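A minimal sketch of how a single bridging triple connects two datasets; all identifiers are made up for illustration:

```python
# Two independent datasets (illustrative prefixed names, not real vocabularies).
stations = {("st:123", "st:type", "st:Station")}
streets  = {("geo:CanalStreet", "geo:type", "geo:Street")}

# One bridging triple is enough to connect both graphs into a single one.
bridge = ("st:123", "geo:location", "geo:CanalStreet")

merged = stations | streets | {bridge}

# After the merge, data from both sources can be traversed together:
location = next(o for (s, p, o) in merged
                if s == "st:123" and p == "geo:location")
```

This is the Web of Data effect in miniature: integration happens at the data level, with no schema negotiation required.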
9. SOLID ARCHITECTURE
[Diagram: the five SOLID layers — the Online Layer receives new data; a batch Merge Layer (parallelizable processing) dumps it into the Data Layer's Big Data datastore; the Index Layer provides read access; the Service Layer resolves queries with joins]
10. SOLID ARCHITECTURE
[Diagram: the same layered architecture, annotated with the data model (RDF) and the query language (SPARQL)]
11. SOLID ARCHITECTURE
Online Layer
Receives incoming new data
Deals with real-time needs
Data Layer
The core of the architecture
The main datastore: the Big Data repository
Stores compressed RDF
Index Layer
Provides an index for the Data Layer, to make
possible high-speed access
Most accesses to the repository are made through it
12. SOLID ARCHITECTURE
Service Layer
The façade to the external user
Able to issue federated SPARQL queries to the
separate datastores in different layers
Every query is distributed, and the different answers
are joined
Merge Layer
Makes it possible to integrate the two datastores
Receives a dump of data of the online layer
Integrates that with the existing repository
Producing a fresh copy of the Data Layer
Immutability properties are preserved
13. SOLID IN PRACTICE
This abstract architecture becomes feasible when
applied to existing technology
In particular, the RDF/HDT binary format
Decisions must be taken, layer by layer, about
how to actually implement it
Other alternatives would also be possible (and some
of them are also being implemented)
Data-Centric Layers
Do not use a textual RDF representation
Inefficient, prevents some potential uses
RDF/HDT is a binary format
Conceived specifically for serialization purposes
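The space savings of a binary format rest on ideas such as dictionary encoding; a toy sketch of that idea follows (the actual RDF/HDT encoding is considerably more elaborate, with separate dictionaries and succinct bitmap structures):

```python
# Sketch of dictionary-based encoding, one of the ideas behind compact
# RDF serializations such as HDT. Each long term is stored exactly once;
# triples become small integer tuples.
def encode(triples):
    terms = sorted({t for triple in triples for t in triple})
    dictionary = {term: i for i, term in enumerate(terms)}
    id_triples = [(dictionary[s], dictionary[p], dictionary[o])
                  for s, p, o in triples]
    return dictionary, id_triples

triples = [
    ("ex:John", "ex:owns", "ex:Car"),
    ("ex:Mary", "ex:owns", "ex:Car"),
]
dictionary, ids = encode(triples)
```

Repeated terms (here the predicate and object) cost one dictionary entry regardless of how many triples use them, which is where most of the compression comes from.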
14. SOLID IN PRACTICE
RDF/HDT format
Designed for machine processing
About 15 times less space than equivalent formats
Uses compact (compressed) data structures
Data Layer
Big Semantic Data in RDF/HDT
Space savings and guaranteed immutability
Instant mapping to memory
Allows querying without decompressing
Index Layer
Implements the HDT/FoQ proposal
Lightweight index on top of the HDT binary format
Efficient SPARQL retrieval without decompressing
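A plain-Python analogue of indexing ID-encoded triples for fast pattern access; HDT-FoQ actually builds succinct structures over the binary format rather than hash maps, so this only conveys the idea:

```python
from collections import defaultdict

# Toy subject -> [(predicate, object), ...] index over ID-encoded triples.
# Resolving a pattern with a bound subject becomes a single lookup,
# with no decompression of the underlying data.
def build_spo_index(id_triples):
    index = defaultdict(list)
    for s, p, o in id_triples:
        index[s].append((p, o))
    return index

index = build_spo_index([(0, 2, 1), (3, 2, 1)])
```

Real SPARQL resolution needs several access orders (by subject, predicate, or object), which is why HDT-FoQ adds complementary indexes on top of the base serialization.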
15. SOLID IN PRACTICE
Online Layer
Copes with the incoming flow of real-time data
HDT is inadequate (designed for read-only)
Must resolve SPARQL efficiently
Choose a general-purpose NoSQL technology
Still able to dump data in an RDF format
Service Layer
Resolves any potential queries
SPARQL considered expressive enough
Queries are forwarded to Online and Index Layers
Their results are retrieved and combined
Using a (scalable) Pipe-Filter approach
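The query-forwarding and result-combination step can be sketched as a small generator pipeline; the two source functions below are hypothetical stand-ins for the Online and Index Layers:

```python
import itertools

# Illustrative stand-ins: each source answers the same triple pattern.
def query_online(pattern):
    yield ("ex:Sensor9", "ex:reading", '"42"')   # fresh data, not merged yet

def query_index(pattern):
    yield ("ex:Sensor1", "ex:reading", '"17"')   # historical, compressed data

# Pipe-Filter style: chain the per-source result streams and
# deduplicate lazily, so results flow as soon as any source answers.
def federated(pattern, sources):
    seen = set()
    for triple in itertools.chain(*(src(pattern) for src in sources)):
        if triple not in seen:
            seen.add(triple)
            yield triple

results = list(federated((None, "ex:reading", None),
                         [query_online, query_index]))
```

Streaming the combination, rather than materializing each source's full answer first, is what keeps the approach scalable as sources grow.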
16. SOLID IN PRACTICE
Merge Layer
Able to combine incoming data from the Online Layer
with the existing datastore in the Data Layer
The data dump is merged into a copy of the datastore
Then the fresh datastore replaces the previous one
Periodical process, can also be manually triggered
Requires high-performance computation
In practice, this means a Map/Reduce approach
Raw RDF data from Online Layer is converted
Then ordered for internal merging
The cost depends on the size of the smaller store
Also triggers reindexing of the Index Layer
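The batch step can be sketched as a sorted, duplicate-free merge of the online dump into the existing store; the real process distributes this work with Map/Reduce on Hadoop, but the core operation is the same:

```python
import heapq

# Merge Layer sketch: both inputs are sorted, then merged into a fresh,
# duplicate-free datastore that replaces the previous one. The original
# store is never modified in place, preserving immutability.
def merge_stores(existing, dump):
    merged = []
    for triple in heapq.merge(sorted(existing), sorted(dump)):
        if not merged or merged[-1] != triple:   # drop duplicates
            merged.append(triple)
    return merged

existing = [("s1", "p", "o1"), ("s2", "p", "o2")]
dump     = [("s2", "p", "o2"), ("s3", "p", "o3")]
fresh = merge_stores(existing, dump)
```

Sorting is exactly what Map/Reduce shuffles provide for free, which is why the batch merge maps naturally onto Hadoop.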
17. SOLID ARCHITECTURE IN PRACTICE
[Diagram: SOLID in practice — the Online Layer on NoSQL, the Data Layer in RDF/HDT, the batch Merge Layer on Hadoop, and the Service Layer resolving SPARQL with a Pipe-Filter (P/F) approach over the Semantic Data]
18. CONCLUSIONS & FUTURE WORK
We propose SOLID as a generic architecture for
managing Big Semantic Data
Our particular implementation relies on HDT
Also NoSQL for real-time incoming data
Cassandra, but (still) not the only choice
Map/Reduce (Hadoop) for intensive processing
Highly effective in terms of space & time
Initial empirical results are very significant
Currently developing an optimized prototype
Already working on variants of the architecture
Limited version for mobile devices
The Merge Layer is not directly required