This document discusses an empirical study of RDF stream processing systems. The study aimed to understand why different systems can produce different outputs for the same inputs. Through experiments, the study found that differences could be explained by parameters like the starting time (t0) of windows in continuous queries. A more detailed model called SECRET was then developed to describe stream processing and help predict system outputs. This led to the CSR-bench benchmark for evaluating and comparing RDF stream reasoning systems.
1. An Experience on Empirical Research about RDF Stream Processing
Dipartimento di Elettronica, Informazione e Bioingegneria
Daniele Dell’Aglio – daniele.dellaglio@polimi.it
Joint work with: Jean-Paul Calbimonte, Marco Balduini, Oscar Corcho and Emanuele Della Valle
2. RDF Stream Processing in a nutshell
Continuous queries over RDF streams (infinite sequences of time-stamped RDF statements)
Brings together the DSMS/CEP and Semantic Web research fields
Several prototypes, with similar models, are available today
Growing trend towards evaluating and comparing the existing systems
26 May 2014 - EMPIRICAL@ESWC2014
Daniele Dell'Aglio - Experimental research about RSPs
3. The CQL model for RSPs
Input stream → S2R operator → R2R operator → R2S operator → Output stream
S2R operator: converts the infinite stream of RDF elements into a finite set of mappings
– The window operators: time-based, tuple-based, …
R2R operator: transforms a set of mappings into another set of mappings
– SPARQL 1.0/1.1 queries
R2S operator: each set of mappings produced by the R2R operator is transformed and appended to the output stream
– Operators: RStream, DStream, IStream
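The three operator classes can be illustrated with a minimal Python sketch. It is not taken from any of the systems discussed here: the function names, the left-closed/right-open window bounds, and the toy R2R step (which only projects the object of `:isIn` triples instead of running SPARQL) are all assumptions made for illustration. The R2S part follows the usual CQL reading: RStream emits the whole current result, IStream what was added since the previous evaluation, DStream what was removed.

```python
from typing import Callable, List, Set, Tuple

Triple = Tuple[str, str, str]
Timed = Tuple[int, Triple]  # (timestamp, triple)

# Toy input stream (hypothetical data, used only to exercise the sketch).
STREAM: List[Timed] = [
    (1, (":alice", ":isIn", ":hall")),
    (3, (":bob", ":isIn", ":hall")),
    (6, (":alice", ":isIn", ":kitchen")),
    (9, (":bob", ":isIn", ":kitchen")),
]

def s2r(stream: List[Timed], start: int, end: int) -> Set[Triple]:
    """S2R: time-based window keeping triples with start <= timestamp < end (assumed bounds)."""
    return {triple for ts, triple in stream if start <= ts < end}

def r2r(window: Set[Triple]) -> Set[str]:
    """R2R: stand-in for a SPARQL query; returns the objects of :isIn triples."""
    return {o for _, p, o in window if p == ":isIn"}

def rstream(current: Set[str], previous: Set[str]) -> Set[str]:
    """R2S: emit the whole current result."""
    return set(current)

def istream(current: Set[str], previous: Set[str]) -> Set[str]:
    """R2S: emit only what is new with respect to the previous result."""
    return current - previous

def dstream(current: Set[str], previous: Set[str]) -> Set[str]:
    """R2S: emit only what disappeared since the previous result."""
    return previous - current

def run(stream: List[Timed], omega: int, beta: int, t0: int, until: int,
        r2s: Callable[[Set[str], Set[str]], Set[str]]) -> List[Tuple[int, str]]:
    """Chain S2R, R2R and R2S over every window that closes by `until`."""
    output: List[Tuple[int, str]] = []
    previous: Set[str] = set()
    start = t0
    while start + omega <= until:
        current = r2r(s2r(stream, start, start + omega))
        for item in sorted(r2s(current, previous)):
            output.append((start + omega, item))  # stamped with the window close time
        previous = current
        start += beta
    return output

if __name__ == "__main__":
    for op in (rstream, istream, dstream):
        print(op.__name__, run(STREAM, omega=10, beta=5, t0=0, until=15, r2s=op))
```

With overlapping windows (width 10, slide 5) the three R2S operators produce visibly different output streams from the same window contents, which is the point of keeping R2S as a separate step.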
4. S2R - Time-based sliding window
[Figure: the stream S with elements S1 to S12 along the time axis t; a window W(ω,β) with width ω and slide β selects the elements passed to the R2R operator]
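As a rough illustration of W(ω,β), the following Python sketch enumerates the bounds of successive windows. The function name `window_scopes`, the explicit start instant `t0`, and the convention that a window covers `open <= t < close` are assumptions of this sketch, not definitions given on the slide.

```python
from typing import Iterator, Tuple

def window_scopes(omega: int, beta: int, t0: int, until: int) -> Iterator[Tuple[int, int]]:
    """Yield the (open, close) bounds of the windows W(omega, beta) that close by `until`.

    The first window opens at t0; every following window opens `beta` time units later.
    """
    opening = t0
    while opening + omega <= until:
        yield (opening, opening + omega)
        opening += beta

if __name__ == "__main__":
    # Sliding window: width 4, slide 2, so consecutive windows overlap.
    print(list(window_scopes(omega=4, beta=2, t0=0, until=10)))
    # Tumbling window: width equals slide, so windows do not overlap.
    print(list(window_scopes(omega=5, beta=5, t0=1, until=11)))
```

When ω is larger than β, one stream element falls into several windows; when ω equals β (a tumbling window), each element belongs to exactly one.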
7. Implementations (oversimplified!)
C-SPARQL: RDF store + stream processor
– [Diagram: continuous query → translator → stream processor and RDF store → continuous results]
CQELS: implemented from scratch, with a focus on performance
– [Diagram: continuous query → native RSP → continuous results]
SPARQLstream: ontology-based stream query answering
– [Diagram: continuous query → rewriter (using R2RML mappings) → DSMS/CEP → continuous results]
8. Same inputs, different outputs…
Let’s consider the following stream S:
Element   t   Content
S1        1   :alice :isIn :hall
S2        3   :bob :isIn :hall
S3        6   :alice :isIn :kitchen
S4        9   :bob :isIn :kitchen
And the continuous query:
– Where are Alice and Bob, when they are together?
– With a tumbling window W(ω=β=5), where ω is the width and β is the slide
After 4 executions:
Execution   1st answer   2nd answer
1           :hall [6]    :kitchen [11]
2           :hall [5]    :kitchen [10]
3           :hall [6]    :kitchen [11]
4           - [7]        - [12]
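The query itself can be read as a join on the room, roughly the pattern `:alice :isIn ?room . :bob :isIn ?room`. The Python function below is a hypothetical stand-in for that R2R step, evaluated on the content of a single window; the name `together` and the representation of triples as tuples are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

def together(window: Iterable[Triple], people=(":alice", ":bob")) -> Set[str]:
    """Return the rooms in which all `people` are located, according to the window content."""
    occupants = defaultdict(set)  # room -> people seen in that room
    for subject, predicate, obj in window:
        if predicate == ":isIn":
            occupants[obj].add(subject)
    return {room for room, who in occupants.items() if set(people) <= who}

if __name__ == "__main__":
    # Window holding S1 and S2: both are in the hall.
    print(together([(":alice", ":isIn", ":hall"), (":bob", ":isIn", ":hall")]))
    # Window holding S2 and S3: they are in different rooms, so there is no answer.
    print(together([(":bob", ":isIn", ":hall"), (":alice", ":isIn", ":kitchen")]))
```

The answer therefore depends entirely on which stream elements end up in the same window, which is why the window boundaries matter in the executions above.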
9. The first hypothesis
All three systems show similar behaviours
Intuition: there are one or more parameters that the model does not take into account
As a consequence, the implementations can output different correct answers
10. The first hypothesis
HP1: it is possible to have a unique correct answer if we can control the time instant at which the sliding window operator starts to work (t0)
[Figure: the stream S (S1 to S4 at t = 1, 3, 6, 9, carrying the :alice and :bob triples) with the tumbling windows obtained for t0=0, t0=1 and t0=2]
11. The experiment
We vary the difference between the time instant at which the stream starts (ts) and the query registration time (tq)
– At each execution, we check the result
– We estimate the delay between tq and t0
Black box approach
– we work on inputs/outputs
– the source code of all the systems
[Figure: timeline showing ts, tq and t0 for the RSP under test]
12. Observation and explanation
As a result, for each system:
– We identified the value of the t0 parameter
– We are able to produce the different results for each t0 value
Is it enough to claim that hypothesis 1 holds?
Exec   1st answer   2nd answer
1      :hall [6]    :kitchen [11]
2      :hall [5]    :kitchen [10]
3      :hall [6]    :kitchen [11]
4      - [7]        - [12]
Window   1st answer   2nd answer
t0=0     :hall [5]    :kitchen [10]
t0=1     :hall [6]    :kitchen [11]
t0=2     - [7]        - [12]
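The relation between the two tables can be checked with a small simulation. The sketch below is a hypothetical reconstruction, not the code used in the experiment: it assumes the stream S1 to S4 at t = 1, 3, 6, 9, windows covering `open <= t < close`, and answers reported when a window closes. Under those assumptions it computes the expected answers for a given t0 and looks for the t0 values that explain each observed execution.

```python
# Stream from the running example: (timestamp, person, room).
STREAM = [
    (1, ":alice", ":hall"),
    (3, ":bob", ":hall"),
    (6, ":alice", ":kitchen"),
    (9, ":bob", ":kitchen"),
]

def answers(t0, omega=5, beta=5, windows=2):
    """Expected (room or None, report time) for the first `windows` windows starting at t0."""
    result = []
    for i in range(windows):
        opening = t0 + i * beta
        closing = opening + omega
        content = [(who, room) for ts, who, room in STREAM if opening <= ts < closing]
        alice = {room for who, room in content if who == ":alice"}
        bob = {room for who, room in content if who == ":bob"}
        shared = sorted(alice & bob)
        # None stands for the "-" entries of the tables (window closed with no answer).
        result.append((shared[0] if shared else None, closing))
    return result

# Observed executions, copied from the first table.
OBSERVED = {
    1: [(":hall", 6), (":kitchen", 11)],
    2: [(":hall", 5), (":kitchen", 10)],
    3: [(":hall", 6), (":kitchen", 11)],
    4: [(None, 7), (None, 12)],
}

def infer_t0(observed, candidates=range(5)):
    """Return every candidate t0 whose expected answers equal the observed ones."""
    return [t0 for t0 in candidates if answers(t0) == observed]

if __name__ == "__main__":
    for execution, observed in OBSERVED.items():
        print(execution, infer_t0(observed))
```

Each execution is matched by exactly one candidate t0 in this toy setting, which is consistent with HP1; it does not by itself show that the hypothesis holds in general.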
13. Some considerations on the experiment
Comparison:
– We ran the experiment multiple times to collect instances and check them
Reproducibility: can other researchers reproduce the experiment?
– We released both the code and the data used for the experiment (see http://streamreasoning.org/Benchmarks/)
Repeatability: is the result universally valid?
– We changed the inputs (streams and queries) and the OS/JVM to verify whether the hypothesis holds
– We repeated the experiment with different implementations (C-SPARQL, CQELS, etc.)
14. Something more on repeatability…
We made some assumptions on the setting
[Figure: the S2R → R2R → R2S pipeline, annotated with the dimensions along which the setting can be extended]
– From single to multi window
– From single to multi stream
– Multiple queries (q2)
– Static knowledge
– Reasoning
15. A more complex problem…
As a “side effect” of the first experiment, we discovered that the results of different systems are not the same:
C-SPARQL
Exec   1st answer   2nd answer
1      :hall [6]    :kitchen [11]
2      :hall [5]    :kitchen [10]
3      :hall [6]    :kitchen [11]
4      - [7]        - [12]
CQELS
Exec   1st answer   2nd answer
1      :hall [3]    :kitchen [9]
2      No answers
3      :hall [3]    :kitchen [9]
4      No answers
Intuition: t0 is not the only parameter our model lacks
16. The SECRET framework
[Figure: the stream S (S1 to S12) along the time axis t, with a window W(ω,β) of width ω and slide β feeding the R2R operator, annotated with the parameters below]
t0: when does the window start? (internal window parameter)
TICK: when are data stream elements added to the window?
– Triple-based vs graph-based
REPORT: when is the window content made available to the R2R operator?
– Non-empty content, Content-change, Window-close, Periodic
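To see how the REPORT parameter alone can change the output, the sketch below simulates two of the listed strategies on the running example (stream elements at t = 1, 3, 6, 9, tumbling window with ω=β=5). It is a deliberately simplified reading of SECRET: the policy names, the `open <= t < close` window bounds, and the choice to emit only answers not yet reported in the current window are assumptions of this sketch, and it covers only tumbling windows. TICK is not modelled.

```python
# Stream from the running example: (timestamp, person, room).
STREAM = [
    (1, ":alice", ":hall"),
    (3, ":bob", ":hall"),
    (6, ":alice", ":kitchen"),
    (9, ":bob", ":kitchen"),
]

def together(content):
    """Rooms shared by :alice and :bob in a list of (person, room) pairs."""
    alice = {room for who, room in content if who == ":alice"}
    bob = {room for who, room in content if who == ":bob"}
    return alice & bob

def simulate(t0, report, omega=5, beta=5, until=15):
    """Return [(room, report time)] for REPORT = 'window-close' or 'content-change'."""
    output = []
    opening = t0
    while opening < until:
        closing = opening + omega
        inside = [(ts, who, room) for ts, who, room in STREAM if opening <= ts < closing]
        if report == "window-close":
            # Evaluate once, when the window closes.
            for room in sorted(together([(who, room) for _, who, room in inside])):
                output.append((room, closing))
        elif report == "content-change":
            # Evaluate every time an element enters the window; emit only new answers.
            seen, content = set(), []
            for ts, who, room in inside:
                content.append((who, room))
                now = together(content)
                for new_room in sorted(now - seen):
                    output.append((new_room, ts))
                seen |= now
        else:
            raise ValueError(f"unknown report policy: {report}")
        opening += beta
    return output

if __name__ == "__main__":
    print(simulate(0, "window-close"))
    print(simulate(0, "content-change"))
    print(simulate(2, "content-change"))
```

Under these assumptions, window-close reporting yields answers stamped at the window boundaries (5 and 10 for t0=0), while content-change reporting yields them at the arrival of the element that completes the join (3 and 9), and a different t0 can leave the content-change run with no answers at all. These are the two patterns seen in the C-SPARQL and CQELS tables; the slides themselves do not state which policy each system implements.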
17. SECRET and RSPs
HP2: given an input stream, a query, the value of t0 and a description of the RSP w.r.t. SECRET, we can determine the answer that will be provided by the system
To investigate it, we built a piece of software that computes the expected answer in batch and matches it against the one produced by the RSP
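The matching step can be pictured as a sequence comparison between the expected answers and the output of each run. The sketch below is hypothetical and is not the code of the actual validator: the function names and the representation of an answer as a `(binding, timestamp)` pair are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Answer = Tuple[Optional[str], int]  # (binding or None, report timestamp)

def first_mismatch(expected: List[Answer], observed: List[Answer]) -> Optional[int]:
    """Return the index of the first differing answer, or None when the sequences match."""
    for index, (exp, obs) in enumerate(zip(expected, observed)):
        if exp != obs:
            return index
    if len(expected) != len(observed):
        # One sequence is a strict prefix of the other.
        return min(len(expected), len(observed))
    return None

def check_runs(expected: List[Answer], runs: Dict[str, List[Answer]]) -> Dict[str, Optional[int]]:
    """Map each run id to None (match) or to the index of its first mismatch."""
    return {run_id: first_mismatch(expected, observed) for run_id, observed in runs.items()}

if __name__ == "__main__":
    expected = [(":hall", 5), (":kitchen", 10)]
    runs = {
        "run-a": [(":hall", 5), (":kitchen", 10)],   # matches
        "run-b": [(":hall", 5)],                     # second answer missing
        "run-c": [(":hall", 5), (":kitchen", 11)],   # wrong timestamp
    }
    print(check_runs(expected, runs))
```

Reporting the position of the first mismatch, rather than a plain yes/no, makes it easier to inspect the runs that do not match.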
18. Observation and analysis
We prepared a set of seven queries (to stress different parts of the sliding window)
We ran each query multiple times
Most of the time, we can foresee the answer that will be provided
[Figure: match results for queries Q1 to Q7 on CQELS, C-SPARQL and SPARQLstream]
19. Observation and analysis
We investigated the observations where there is no match, and we discovered that they were caused by errors in the implementations, such as:
– Initialization
– Slide parameter
– Window contents
– Internal timestamp management
Conclusion: HP2 seems to be valid in the considered setting
20. CSR-bench
The main outcome of our experience is CSR-bench, an extension of the SR benchmark (SRBench)
– More info at http://www.w3.org/wiki/CSRBench
Three main components:
– A common model for the RDF stream processor operational semantics
– An oracle (an automatic correctness validator), available at https://github.com/dellaglio/csrbench-oracle
– A test suite
21. References
Dell'Aglio, D., Balduini, M., Della Valle, E.: On the need to include functional testing in RDF stream engine benchmarks. In: 1st International Workshop on Benchmarking RDF Systems (BeRSys 2013)
Dell'Aglio, D., Calbimonte, J.P., Balduini, M., Corcho, O., Della Valle, E.: On Correctness in RDF Stream Processor Benchmarking. In: ISWC (2). (2013) 326–342
Barbieri, D.F., Braga, D., Ceri, S., Della Valle, E., Grossniklaus, M.: C-SPARQL: A continuous query language for RDF data streams. IJSC 4(1) (2010) 3–25
Calbimonte, J.P., Jeung, H., Corcho, O., Aberer, K.: Enabling Query Technologies for the Semantic Sensor Web. IJSWIS 8(1) (2012) 43–63
Le-Phuoc, D., Dao-Tran, M., Xavier Parreira, J., Hauswirth, M.: A native and adaptive approach for unified processing of linked streams and linked data. In: ISWC. (2011) 370–388
22. Thank you! Questions?
An Experience on Empirical Research about RDF Stream Processing
Daniele Dell’Aglio (DEIB, Politecnico di Milano)
daniele.dellaglio@polimi.it