TOWARDS AN ARCHITECTURE FOR MANAGING BIG SEMANTIC DATA IN REAL-TIME
Carlos E. Cuesta, VorTIC3, URJC, Spain
Miguel A. Martínez-Prieto, UVa, Spain
Javier D. Fernández, UVa, Spain & UChile, Chile
ECSA 2013, Montpellier, France, 02/07/2013
CONTENTS
 Introduction
 Problem Statement
 Context: the RDF world
 Proposal: SOLID Architecture
 Unfolding in five Layers
 SOLID in Practice
 The RDF/HDT format
 The SOLID/HDT Architecture
 Conclusions & Future work
INTRODUCTION
 Big Data has become an important topic
 When the size of the data itself becomes part of the
problem (Loukides)
 Characterized by the “three Vs”
 Volume: large amounts of data gathered and stored
 The challenge is storage, but also computing
 Volume is relative: depends on available resources
 Velocity: different flows of data at different rates
 Variety: the kind of structures within the data
 Each source has its own semantics
 A logical model is needed to allow data integration
 An architecture for Big Data must consider all three dimensions
INTRODUCTION
 One of the dimensions always becomes critical
 E.g. storage in mobile applications, velocity in
real-time applications (vs. batch processes)
 We promote variety
 The dataset value is increased when multiple sources
are integrated, achieving more knowledge
 This also influences velocity and volume
 We choose a graph-based model
 Allows managing higher levels of variety
 Data can be linked and queried together
 In practice, this means using RDF as data model
 The cornerstone of the “practical” Semantic Web
 The basis of the emergent Web of Data
PROBLEM STATEMENT
 Most solutions for managing Big Data aim to
maximize the volume dimension
 Therefore promoting efficient storage
 Datastores able to cope with large datasets
 Indexing strategies to achieve high space
 Datastores must be assumed to be stable
 In spite of the assumed immutability property
 But the volume of incoming data is also big
 Datastores must be periodically updated & reindexed
 This is very complex in a Real-Time context
 Data must be received and integrated in real time
 No time to process the flow of incoming data
OUR PROPOSAL: SOLID ARCHITECTURE
 We propose a specific architecture to manage
Real-Time flows in this context
 A multi-tiered architecture
 Separates the consumption of Big Semantic Data…
 … from the complexities of Real-Time operation
 Data must be kept compact
 It is stored and indexed in a compressed way
 Data & Index Layers
 Needs to cope efficiently with data updates
 The reason for the Online Layer
 Needs to query all of this together
 The reason for the Service Layer
CONTEXT: RDF
 RDF: Resource Description Framework
 Data described as (subject, predicate, object) triples
 An RDF dataset is a graph of knowledge
 Entities linked to values via labelled edges
 Essential for Linked Open Data
 Adopted in many different contexts
 Simple integration: everything has a URI
[Figure: example RDF graph with the triple (John, owns, Car)]
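The triple model on this slide can be sketched in a few lines of Python. This is only an illustration and not part of the original deck: the `Triple` type, the `match` helper and the second sample triple are assumptions made for the example, and plain strings stand in for URIs.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF statement: a labelled edge from subject to object."""
    subject: str
    predicate: str
    obj: str


def match(graph: Iterable[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield the triples that fit the pattern; None acts as a wildcard."""
    for t in graph:
        if ((s is None or t.subject == s)
                and (p is None or t.predicate == p)
                and (o is None or t.obj == o)):
            yield t


# The graph is just a set of triples.
graph = {
    Triple("John", "owns", "Car"),            # the triple from the slide
    Triple("John", "livesIn", "SomeCity"),    # hypothetical extra triple
}

owned = [t.obj for t in match(graph, s="John", p="owns")]
print(owned)
```

A pattern with wildcards is also the basic building block of a SPARQL query, which is why the same helper shape can serve as a stand-in for query evaluation in small examples.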
CONTEXT: RDF
 The origin of the Web of Data
 Two datasets can become connected by a single triple
<Station #123, location, Canal Street>
 The web becomes data-centric
 Every unit is a small piece of data
 “The Big Data’s long tail”
 But their integration in large contexts becomes
complex: Big Semantic Data
 A variety of sources become easily integrated
 RDF is not a serialization format
 Describes what data is stored, not how this is done
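The claim that a single triple is enough to connect two datasets can be shown with a small sketch. Everything except the linking triple (Station #123, location, Canal Street) is hypothetical sample data, and the helper name `city_of_station` is an assumption for the example.

```python
# Two independently published datasets, each a set of (s, p, o) tuples.
# Their contents are hypothetical and only serve the illustration.
stations = {
    ("Station #123", "type", "BikeStation"),
    ("Station #123", "capacity", "20"),
}
streets = {
    ("Canal Street", "type", "Street"),
    ("Canal Street", "city", "Example City"),
}

# The single triple from the slide that links both datasets.
link = ("Station #123", "location", "Canal Street")

merged = stations | streets | {link}


def city_of_station(graph, station):
    """Follow station -location-> place -city-> city across the datasets."""
    for s, p, o in graph:
        if s == station and p == "location":
            for s2, p2, o2 in graph:
                if s2 == o and p2 == "city":
                    return o2
    return None


print(city_of_station(merged, "Station #123"))
```

Without the linking triple the two graphs stay disconnected, so the same question has no answer.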
SOLID ARCHITECTURE
[Figure: SOLID architecture diagram. Labels: Index Layer, New Data, Dump, Rd, DataStore, Data Layer, Big Data, Merge Layer (Batch), Query, Join, Service Layer, Online Layer, Parallelizable Processing]
SOLID ARCHITECTURE
[Figure: SOLID architecture diagram annotated with RDF and SPARQL. Labels: Index Layer, New Data, Dump, Rd, DataStore, Data Layer, Big Data, Merge Layer (Batch), Query, Join, Service Layer, Online Layer, Parallelizable Processing]
SOLID ARCHITECTURE
 Online Layer
 Receives incoming new data
 Deals with real-time needs
 Data Layer
 The core of the architecture
 The main datastore: the Big Data repository
 Stores compressed RDF
 Index Layer
 Provides an index for the Data Layer to enable
high-speed access
 Most accesses to the repository are made through it
SOLID ARCHITECTURE
 Service Layer
 The façade to the external user
 Able to issue federated SPARQL queries against the
separate datastores in different layers
 Every query is distributed, and the different answers
are joined
 Merge Layer
 Makes it possible to integrate the two datastores
 Receives a data dump from the Online Layer
 Integrates it with the existing repository
 Producing a fresh copy of the Data Layer
 Immutability properties are preserved
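How the five layers cooperate can be sketched as a toy, in-memory Python model. This is a reading of the slides under strong simplifications, not the authors' implementation: all class and method names are hypothetical, triple patterns stand in for SPARQL, and a `frozenset` stands in for the compressed, immutable repository.

```python
from typing import FrozenSet, Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def matches(triple: Triple, pattern: Pattern) -> bool:
    """None in the pattern is a wildcard."""
    return all(q is None or q == v for v, q in zip(triple, pattern))


class OnlineLayer:
    """Small mutable store that receives the incoming data."""

    def __init__(self) -> None:
        self._store: Set[Triple] = set()

    def insert(self, triple: Triple) -> None:
        self._store.add(triple)

    def query(self, pattern: Pattern) -> Set[Triple]:
        return {t for t in self._store if matches(t, pattern)}

    def dump(self) -> FrozenSet[Triple]:
        """Hand the accumulated data to the Merge Layer and start empty."""
        data = frozenset(self._store)
        self._store.clear()
        return data


class DataLayer:
    """Main repository; never modified after construction."""

    def __init__(self, triples: Iterable[Triple] = ()) -> None:
        self.triples: FrozenSet[Triple] = frozenset(triples)


class IndexLayer:
    """Read-only index (here: by subject) over one Data Layer snapshot."""

    def __init__(self, data: DataLayer) -> None:
        self._by_subject: dict = {}
        for t in data.triples:
            self._by_subject.setdefault(t[0], []).append(t)

    def query(self, pattern: Pattern) -> Set[Triple]:
        subject = pattern[0]
        if subject is not None:
            candidates = self._by_subject.get(subject, [])
        else:
            candidates = [t for ts in self._by_subject.values() for t in ts]
        return {t for t in candidates if matches(t, pattern)}


class MergeLayer:
    """Batch step: builds a fresh Data Layer instead of updating in place."""

    @staticmethod
    def merge(data: DataLayer, dump: FrozenSet[Triple]) -> DataLayer:
        return DataLayer(data.triples | dump)


class ServiceLayer:
    """Facade: sends each query to both stores and joins the answers."""

    def __init__(self, online: OnlineLayer, index: IndexLayer) -> None:
        self.online = online
        self.index = index

    def query(self, pattern: Pattern) -> Set[Triple]:
        return self.online.query(pattern) | self.index.query(pattern)


online = OnlineLayer()
data = DataLayer({("John", "owns", "Car")})
service = ServiceLayer(online, IndexLayer(data))

online.insert(("Mary", "owns", "Bike"))          # real-time arrival
before = service.query((None, "owns", None))

old_data = data
data = MergeLayer.merge(data, online.dump())     # periodic batch merge
service.index = IndexLayer(data)                 # reindex the fresh copy
after = service.query((None, "owns", None))
```

The user sees the same answer before and after the merge; only the place where the new triple lives has changed. In this sequential toy the dump and the index swap happen back to back; a real deployment would have to coordinate them so that no query misses data in between.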
SOLID IN PRACTICE
 This abstract architecture becomes practical when
applied to existing technology
 In particular, the RDF/HDT binary format
 Decisions must be made, layer by layer, about
how to actually implement it
 Other alternatives would also be possible (and some
of them are also being implemented)
 Data-Centric Layers
 Do not use a textual RDF representation
 Inefficient, prevents some potential uses
 RDF/HDT is a binary format
 Conceived specifically for serialization purposes
SOLID IN PRACTICE
 RDF/HDT format
 Designed for machine processing
 About 15 times less space than equivalent formats
 Uses compact (compressed) data structures
 Data Layer
 Big Semantic Data in RDF/HDT
 Data saving and guaranteed immutability
 Instant mapping to memory
 Allows querying without decompressing
 Index Layer
 Implements the HDT/FoQ proposal
 Lightweight index on top of the HDT binary format
 Efficient SPARQL retrieval without decompressing
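One way to picture how a compact representation can still be queried directly is dictionary encoding: each distinct term is stored once, triples become tuples of small integers kept in sorted order, and lookups use binary search instead of scanning text. The sketch below illustrates that general idea only; it is not the HDT binary layout or the HDT/FoQ index, and every name in it is hypothetical.

```python
from bisect import bisect_left


def encode(triples):
    """Map every term to an integer ID and keep the ID-triples sorted."""
    terms = sorted({term for t in triples for term in t})
    ids = {term: i for i, term in enumerate(terms)}
    encoded = sorted((ids[s], ids[p], ids[o]) for s, p, o in triples)
    return terms, ids, encoded


def by_subject(terms, ids, encoded, subject):
    """Return all triples of one subject, working on the encoded form."""
    sid = ids.get(subject)
    if sid is None:
        return []
    # (sid,) sorts before every (sid, p, o), so this finds the first match.
    i = bisect_left(encoded, (sid,))
    result = []
    while i < len(encoded) and encoded[i][0] == sid:
        s, p, o = encoded[i]
        result.append((terms[s], terms[p], terms[o]))  # decode on output only
        i += 1
    return result


sample = [
    ("John", "owns", "Car"),
    ("John", "knows", "Mary"),
    ("Mary", "owns", "Bike"),
]
terms, ids, encoded = encode(sample)
johns = by_subject(terms, ids, encoded, "John")
print(johns)
```

Terms are decoded only when results are returned, so the stored structure is never expanded back to text as a whole.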
SOLID IN PRACTICE
 Online Layer
 Copes with the incoming flow of real-time data
 HDT is inadequate (designed for read-only)
 Must resolve SPARQL efficiently
 Choose a general-purpose NoSQL technology
 Still able to dump data in an RDF format
 Service Layer
 Resolves any potential queries
 SPARQL considered expressive enough
 Queries are forwarded to Online and Index Layers
 Their results are retrieved and combined
 Using a (scalable) Pipe-Filter approach
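The Pipe-Filter style of combining results can be sketched with Python generators, where each filter consumes a stream and yields a new one. This is a minimal sketch under assumptions: triple patterns replace SPARQL, plain lists replace the Online and Index Layers, and all function names are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def source(store: Iterable[Triple]) -> Iterator[Triple]:
    """Pipe entry point: stream the triples of one store."""
    yield from store


def pattern_filter(stream: Iterable[Triple], pattern: Pattern) -> Iterator[Triple]:
    """Filter: keep triples that fit the pattern (None is a wildcard)."""
    for t in stream:
        if all(q is None or q == v for v, q in zip(t, pattern)):
            yield t


def union(*streams: Iterable[Triple]) -> Iterator[Triple]:
    """Filter: concatenate the partial answers of several branches."""
    for stream in streams:
        yield from stream


def distinct(stream: Iterable[Triple]) -> Iterator[Triple]:
    """Filter: drop triples that were already seen in another branch."""
    seen = set()
    for t in stream:
        if t not in seen:
            seen.add(t)
            yield t


def answer(pattern: Pattern, online_store, indexed_store) -> Iterator[Triple]:
    """Forward the query to both stores and combine the results lazily."""
    branches = [pattern_filter(source(store), pattern)
                for store in (online_store, indexed_store)]
    return distinct(union(*branches))


online_store = [("Mary", "owns", "Bike"), ("John", "owns", "Car")]
indexed_store = [("John", "owns", "Car"), ("John", "knows", "Mary")]

results = list(answer((None, "owns", None), online_store, indexed_store))
print(results)
```

Because every stage is lazy, results flow through one at a time, and further filters (ordering, limits, projections) could be appended to the pipe without touching the existing ones.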
SOLID IN PRACTICE
 Merge Layer
 Able to combine incoming data from the Online Layer
with the existing datastore in the Data Layer
 The data dump is merged into a copy of the datastore
 Then the fresh datastore replaces the previous one
 Periodic process, which can also be triggered manually
 Requires high-performance computation
 In practice, this means a Map/Reduce approach
 Raw RDF data from Online Layer is converted
 Then ordered for internal merging
 Depends on the size of the smaller store
 Also triggers reindexing of the Index Layer
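The "convert, then order for internal merging" steps can be illustrated on a single machine. The sketch below uses `heapq.merge` from the Python standard library in place of a Map/Reduce job, and tuples of strings in place of the binary representation; the function name and the sample data are assumptions for the example.

```python
import heapq
from itertools import groupby


def merge_stores(big_sorted, dump):
    """Merge an unsorted dump into an already sorted store.

    Only the (smaller) dump has to be sorted; the big store is read once,
    in order, and is never modified. The result is a fresh sorted list.
    """
    small_sorted = sorted(dump)
    merged = heapq.merge(big_sorted, small_sorted)
    # Equal triples are adjacent after the merge, so groupby removes duplicates.
    return [triple for triple, _ in groupby(merged)]


# Hypothetical data: the existing store is already in sorted order.
big_store = [("a", "p", "1"), ("c", "p", "3")]
online_dump = [("b", "p", "2"), ("a", "p", "1")]

fresh_store = merge_stores(big_store, online_dump)
print(fresh_store)
```

The old store stays untouched until the fresh copy replaces it, which matches the immutability requirement of the Data Layer.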
SOLID ARCHITECTURE IN PRACTICE
[Figure: SOLID architecture diagram with implementation choices. Labels: Index Layer, New Data, Dump, Rd, NoSQL, Data Layer (RDF/HDT), Merge Layer (Batch, Hadoop), SPARQL, SPARQL + P/F, Service Layer, Online Layer, Semantic Data]
CONCLUSIONS & FUTURE WORK
 We propose SOLID as a generic architecture for
managing Big Semantic Data
 Our particular implementation relies on HDT
 Also NoSQL for real-time incoming data
 Cassandra, but (still) not the only choice
 Map/Reduce (Hadoop) for intensive processing
 Highly effective in terms of space & time
 Initial empirical results are very significant
 Currently developing an optimized prototype
 Already working on variants of the architecture
 Limited version for mobile devices
 The Merge Layer is not directly required
THANKS FOR YOUR ATTENTION
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptxMusic 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
Music 9 - 4th quarter - Vocal Music of the Romantic Period.pptx
 
Food processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture honsFood processing presentation for bsc agriculture hons
Food processing presentation for bsc agriculture hons
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
ENG 5 Q4 WEEk 1 DAY 1 Restate sentences heard in one’s own words. Use appropr...
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 

ECSA 2013 (Cuesta)

  • 1. TOWARDS AN ARCHITECTURE FOR MANAGING BIG SEMANTIC DATA IN REAL-TIME Carlos E. Cuesta, VorTIC3, URJC, Spain Miguel A. Martínez-Prieto, UVa, Spain Javier D. Fernández, UVa, Spain & UChile, Chile Montpellier, France, 02/07/2013
  • 2. CONTENTS  Introduction  Problem Statement  Context: the RDF world  Proposal: SOLID Architecture  Unfolding in five Layers  SOLID in Practice  The RDF/HDT format  The SOLID/HDT Architecture  Conclusions & Future work
  • 3. INTRODUCTION  Big Data has become an important topic  When the size of the data itself becomes part of the problem (Loukides)  Characterized by the “three Vs”  Volume: large amounts of data gathered and stored  The challenge is storage, but also computing  Volume is relative: depends on available resources  Velocity: different flows of data at different rates  Variety: the kind of structures within the data  Each source has its own semantics  Need for a logical model to allow data integration  Architecture for Big Data must consider all three
  • 4. INTRODUCTION  One of the dimensions always becomes critical  E.g. storage in mobile applications, velocity in real-time applications (vs. batch processes)  We promote variety  The dataset value is increased when multiple sources are integrated, achieving more knowledge  This also influences velocity and volume  We choose a graph-based model  Allows managing higher levels of variety  Data can be linked and queried together  In practice, this means using RDF as the data model  The cornerstone of the “practical” Semantic Web  The basis of the emergent Web of Data
  • 5. PROBLEM STATEMENT  Most solutions to manage Big Data aim to maximize the volume dimension  Therefore promoting efficient storage  Datastores able to cope with large datasets  Indexing strategies to achieve high space  Datastores must be assumed to be stable  In spite of the assumed immutability property  But the volume of incoming data is also big  Datastores must be periodically updated & reindexed  This is very complex in a Real-Time context  Data must be received and integrated in real time  No time to process the flow of incoming data
  • 6. OUR PROPOSAL: SOLID ARCHITECTURE  We propose a specific architecture to manage Real-Time flows in this context  A multi-tiered architecture  Separate consumption of Big Semantic Data…  … from the complexities of Real-Time operation  Data must be kept compact  It is stored and indexed in a compressed way  Data & Index Layers  Needs to efficiently cope with data updates  The reason for the Online Layer  Needs to query all of this together  The reason for the Service Layer
  • 7. CONTEXT: RDF  RDF: Resource Description Framework  Data described as (subject, predicate, object) triples  An RDF dataset is a graph of knowledge  Entities linked to values via labelled edges  Essential for Linked Open Data  Adopted in many different contexts  Simple integration: everything has a URI [Figure: example triple, John -owns-> Car]
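The triple model on this slide can be illustrated with a minimal sketch in plain Python. The `ex:` identifiers, the sample data and the `match` helper are assumptions made for illustration; they are not part of the talk, and a real system would use an RDF library and full URIs.

```python
# Hypothetical example data: each triple is (subject, predicate, object).
GRAPH = [
    ("ex:John", "ex:owns", "ex:Car"),
    ("ex:John", "ex:livesIn", "ex:Montpellier"),
    ("ex:Mary", "ex:owns", "ex:Bike"),
]

def match(graph, s=None, p=None, o=None):
    """Yield the triples that fit a pattern; None acts as a wildcard."""
    for ts, tp, to in graph:
        if s in (None, ts) and p in (None, tp) and o in (None, to):
            yield (ts, tp, to)

# Every subject that owns something, in the order the triples are stored.
owners = [s for s, _, _ in match(GRAPH, p="ex:owns")]
```

A pattern with wildcards is the basic building block that query languages such as SPARQL combine into larger graph patterns.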
  • 8. CONTEXT: RDF  The origin of the Web of Data  Two datasets can become connected by a single triple <Station #123, location, Canal Street>  The web becomes data-centric  Every unit is a small piece of data  “The Big Data’s long tail”  But their integration in large contexts becomes complex: Big Semantic Data  A variety of sources become easily integrated  RDF is not a serialization format  Describes what data is stored, not how this is done
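How a single triple connects two datasets can be sketched as follows. The two datasets, the `ex:` identifiers and the helper functions are hypothetical; only the station/location/Canal Street link follows the slide.

```python
# Two independently published datasets (hypothetical content).
stations = {("ex:Station123", "ex:name", "Station #123")}
streets = {("ex:CanalStreet", "ex:district", "ex:Downtown")}

# The single linking triple from the slide.
link = ("ex:Station123", "ex:location", "ex:CanalStreet")

# Integration is just the union of the triple sets.
merged = stations | streets | {link}

def objects(graph, subject, predicate):
    """Return all objects reachable from subject via predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}

def station_districts(graph, station):
    """Follow location, then district: a path that spans both datasets."""
    return {
        district
        for street in objects(graph, station, "ex:location")
        for district in objects(graph, street, "ex:district")
    }
```

Without the linking triple the same question has no answer, which is why one added edge is enough to let the two sources be queried together.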
  • 9. SOLID ARCHITECTURE [Figure: architecture diagram showing the Online Layer, Data Layer, Index Layer, Merge Layer (batch) and Service Layer; labels: New Data, Dump, Rd, DataStore, Big Data, Query, Join, Parallelizable Processing]
  • 10. SOLID ARCHITECTURE [Figure: the same architecture diagram, annotated with RDF and SPARQL; labels: New Data, Dump, Rd, DataStore, Big Data, Query, Join, Parallelizable Processing]
  • 11. SOLID ARCHITECTURE  Online Layer  Receives incoming new data  Deals with real-time needs  Data Layer  The core of the architecture  The main datastore: the Big Data repository  Stores compressed RDF  Index Layer  Provides an index for the Data Layer, to enable high-speed access  Most accesses to the repository are made through it
  • 12. SOLID ARCHITECTURE  Service Layer  The façade to the external user  Able to issue federated SPARQL queries to the separate datastores in different layers  Every query is distributed, and the different answers are joined  Merge Layer  Makes it possible to integrate the two datastores  Receives a dump of data from the Online Layer  Integrates that with the existing repository  Producing a fresh copy of the Data Layer  Immutability properties are preserved
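The interplay of the five layers described on the last two slides can be sketched as a toy in-memory model. The class name `Solid`, its methods and the predicate-only index are assumptions for illustration; the real layers are separate stores, not attributes of one object.

```python
class Solid:
    """Toy model of the five layers, holding triples in memory."""

    def __init__(self, triples=()):
        self._publish(frozenset(triples))
        self.online = set()              # Online Layer: mutable, receives new data

    def _publish(self, data):
        self.data = data                 # Data Layer: immutable repository
        index = {}
        for triple in data:
            index.setdefault(triple[1], []).append(triple)
        self.index = index               # Index Layer: lookup by predicate

    def insert(self, triple):
        self.online.add(triple)          # new data never touches the Data Layer

    def query(self, predicate):
        # Service Layer: ask both stores and join the answers.
        indexed = self.index.get(predicate, [])
        recent = [t for t in self.online if t[1] == predicate]
        return set(indexed) | set(recent)

    def merge(self):
        # Merge Layer: build a fresh copy, then reindex and empty the online store.
        fresh = self.data | frozenset(self.online)
        self._publish(fresh)
        self.online = set()
```

Because `merge` builds a new frozen set instead of updating the old one, any reader still holding the previous Data Layer sees it unchanged, which mirrors the immutability property named on the slide.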
  • 13. SOLID IN PRACTICE  This abstract architecture is made possible by applying existing technology  In particular, the RDF/HDT binary format  Decisions must be made, layer by layer, about how to actually implement it  Other alternatives would also be possible (and some of them are also being implemented)  Data-Centric Layers  Do not use a textual RDF representation  Inefficient, prevents some potential uses  RDF/HDT is a binary format  Conceived specifically for serialization purposes
  • 14. SOLID IN PRACTICE  RDF/HDT format  Designed for machine processing  About 15 times less space than equivalent formats  Uses compact (compressed) data structures  Data Layer  Big Semantic Data in RDF/HDT  Data saving and guaranteed immutability  Instant mapping to memory  Allows querying without decompressing  Index Layer  Implements the HDT/FoQ proposal  Lightweight index on top of the HDT binary format  Efficient SPARQL retrieval without decompressing
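One idea behind compact RDF representations such as HDT is to store each term once in a dictionary and keep the triples as small integer IDs. The sketch below illustrates only that idea; it is an assumption-laden simplification (a single shared dictionary, no bit-level compression) and not the actual HDT layout or API.

```python
def encode(triples):
    """Return (terms, id_triples): each distinct term is stored once, IDs start at 1."""
    terms = sorted({term for triple in triples for term in triple})
    ids = {term: i for i, term in enumerate(terms, start=1)}
    id_triples = sorted((ids[s], ids[p], ids[o]) for s, p, o in triples)
    return terms, id_triples

def decode(terms, id_triples):
    """Map integer triples back to their terms."""
    return [(terms[s - 1], terms[p - 1], terms[o - 1]) for s, p, o in id_triples]

def subjects_of(terms, id_triples, predicate, obj):
    """Answer a pattern on the encoded form: only the final answers are decoded."""
    p_id = terms.index(predicate) + 1
    o_id = terms.index(obj) + 1
    return [terms[s - 1] for s, p, o in id_triples if (p, o) == (p_id, o_id)]
```

Long URIs that repeat across many triples are stored once, and a pattern can be matched by comparing integers, which gives a rough intuition for querying without decompressing.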
  • 15. SOLID IN PRACTICE  Online Layer  Copes with the incoming flow of real-time data  HDT is inadequate (designed for read-only)  Must resolve SPARQL efficiently  Choose a general-purpose NoSQL technology  Still able to dump data in an RDF format  Service Layer  Resolves any potential queries  SPARQL considered expressive enough  Queries are forwarded to Online and Index Layers  Their results are retrieved and combined  Using a (scalable) Pipe-Filter approach
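The Pipe-Filter combination of results can be sketched with Python generators, where each filter consumes a stream of triples and emits another. The function names and the in-memory lists standing in for the Index and Online Layers are assumptions; the slides do not describe the actual filters used.

```python
from itertools import chain

def source(store, pattern):
    """Filter 1: emit the triples of one store that match the pattern (None = wildcard)."""
    for triple in store:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple

def dedup(stream):
    """Filter 2: drop triples already seen, since both stores may hold the same one."""
    seen = set()
    for triple in stream:
        if triple not in seen:
            seen.add(triple)
            yield triple

def service_query(indexed, online, pattern):
    """Pipe both sources into a single deduplicated answer stream."""
    return dedup(chain(source(indexed, pattern), source(online, pattern)))

# Hypothetical usage: the second store repeats one triple held by the first.
indexed = [("a", "p", "b")]
online = [("c", "p", "d"), ("a", "p", "b"), ("c", "q", "e")]
answers = list(service_query(indexed, online, (None, "p", None)))
```

Since every stage is lazy, results start flowing before either store has been fully scanned, and further filters can be appended without changing the existing ones.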
  • 16. SOLID IN PRACTICE  Merge Layer  Able to combine incoming data from the Online Layer with the existing datastore in the Data Layer  The data dump is merged into a copy of the datastore  Then the fresh datastore replaces the previous one  Periodic process, can also be manually triggered  Requires high-performance computation  In practice, this means a Map/Reduce approach  Raw RDF data from Online Layer is converted  Then ordered for internal merging  Depends on the size of the smaller store  Also triggers reindexing the Index Layer
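The "order, then merge into a fresh copy" step can be sketched on a single machine as follows. This is a stand-in for the Map/Reduce job, not a reproduction of it; the function name and the tuple-based stores are assumptions.

```python
import heapq

def merge_stores(datastore, dump):
    """Merge an unordered dump into a sorted datastore, returning a fresh copy.

    datastore: tuple of triples, already sorted (the existing Data Layer)
    dump:      iterable of triples in arrival order (from the Online Layer)
    """
    ordered_dump = sorted(set(dump))          # only the smaller store is sorted
    fresh = []
    for triple in heapq.merge(datastore, ordered_dump):
        if not fresh or fresh[-1] != triple:  # skip triples present in both stores
            fresh.append(triple)
    return tuple(fresh)                       # the old datastore is left untouched

# Hypothetical usage.
old = (("a", "p", "b"), ("c", "p", "d"))
new = merge_stores(old, [("b", "p", "c"), ("a", "p", "b")])
```

In this sketch the large datastore is only read sequentially, while the sorting cost falls on the dump, which gives one possible reading of why the slide says the process depends on the size of the smaller store.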
  • 17. SOLID ARCHITECTURE IN PRACTICE [Figure: the architecture diagram instantiated with concrete technology; labels: Online Layer, NoSQL, Data Layer, RDF/HDT, Index Layer, Merge Layer (batch), Hadoop, Service Layer, SPARQL, SPARQL + P/F, New Data, Dump, Rd, Semantic Data]
  • 18. CONCLUSIONS & FUTURE WORK  We propose SOLID as a generic architecture for managing Big Semantic Data  Our particular implementation relies on HDT  Also NoSQL for real-time incoming data  Cassandra, but (still) not the only choice  Map/Reduce (Hadoop) for intensive processing  Highly effective in terms of space & time  Initial empirical results are very significant  Currently developing an optimized prototype  Already working on variants of the architecture  Limited version for mobile devices  The Merge Layer is not directly required
  • 19. THANKS FOR YOUR ATTENTION