An Analytical Approach for Query Optimization Based on Hypergraph

Sangeeta Sen, NIT Durgapur, Email: sangeetaaec@gmail.com
Anisha Agrawal, NIT Durgapur, Email: anishaagrawal23@gmail.com
Ankit Rathi, NIT Durgapur, Email: ankitrathi94@it.nitdgp.ac.in
Animesh Dutta, NIT Durgapur, Email: animesh.dutta@it.nitdgp.ac.in
Biswanath Dutta, ISI Bangalore, Email: bisu@drtc.isibang.ac.in
Abstract—In the Semantic Web, the Resource Description Framework (RDF) plays an important role in storing data. The growth of RDF data poses a challenge for data management and for evaluating SPARQL queries in optimized time. In this paper we propose a hypergraph based data management system and SPARQL query optimization technique. We use the concept of a hypergraph, a generalization of a graph in which an edge can connect more than two vertices. We propose algorithms for storing RDF data as a hypergraph and for querying it. We compare the performance of our algorithms with 3 other systems (RDF-3X, Apache Jena and AllegroGraph) on SP2Bench, a SPARQL performance benchmark dataset, and its SPARQL queries.
I. INTRODUCTION
The Resource Description Framework1 (RDF) is a core data model for storing information on the Semantic Web [2]. RDF data is stored as triples (subject or resource, predicate or property, and object or property value). With the growth of RDF data, optimized processing of SPARQL2 queries over RDF datasets has become a challenging issue. SPARQL is a graph based query language. The main task of SPARQL is to match patterns, each consisting of a subject, a predicate and an object, against the underlying RDF graph.
In this work, we propose a hypergraph [6], [16] based query optimization and data management technique that needs less computation time than the existing techniques discussed in RDF-3X [1], AllegroGraph [13] and Jena [14]. A hypergraph is a generalization of a graph in which an edge can connect more than two vertices; such edges are called hyper-edges.
To illustrate the novelty of the proposed work, we provide an experimental study that contrasts the performance of 4 systems (including ours) on the SP2Bench benchmark [9] with triple datasets of different sizes and SPARQL queries. The main contributions of this paper are:
• Hypergraph based database of RDF data: All subjects and objects connected with a particular predicate reside under a particular hyper-edge (the predicate itself), so searching RDF triples becomes easy to execute.
• Query path based query processing: To process a query, the model first generates a query path according to the rearranged query predicates. The query is then processed in the hypergraph along this path.
1www.w3.org/RDF/
2www.w3.org/2001/sw/wiki/SPARQL
The rest of the paper is organized as follows. Section II highlights the current state of the art. Section III provides an overview of the problem of SPARQL query optimization, and section IV presents the system architecture. In section V we define the terms used throughout the paper. In section VI, we discuss a method for solving the problem of query optimization on RDF graphs. In Section VII, we discuss the results of the proposed method. Section VIII concludes the work and mentions future work.
II. STATE-OF-THE-ART
There has been a lot of research in the field of RDF data storage and SPARQL query optimization. Among the many approaches to SPARQL query processing, GRIN Index [12], PIG [11] and gStore [18] focus on reducing the search space of the graph traversal algorithm using a graph index. RG-Index [7] also uses a graph index but focuses on reducing the joins in a relation based RDF store. The model in paper [8] presents an efficient RDF query engine for evaluating SPARQL queries; it employs an inverted index structure for indexing the RDF triples, and a tree-shaped optimization algorithm is developed that transforms a SPARQL query graph into an optimal query plan by effectively reducing the search space. In paper [15], the authors present TripleBit, a fast and compact system for storing and accessing RDF data. The compact design of TripleBit reduces both the size of the stored data and the size of its indexes. They develop a query processor which dynamically generates an optimal execution ordering for join queries, leading to fast query execution and an effective reduction of intermediate results. The system SQUIN, in paper [4], implements a novel query execution approach whose main idea is to integrate the traversal of data links into the construction of results. The authors of paper [5] introduce a scalable RDF data management system with techniques for partitioning the data across nodes in a manner that helps to accelerate query processing. The authors of paper [17] introduce Trinity.RDF, a distributed, memory-based graph engine for web-scale data; the paper also introduces a new cost model, a novel cardinality estimation technique and an optimization algorithm for distributed query plan generation. A new join ordering algorithm that performs a SPARQL-tailored query simplification is introduced in paper [3]. That paper also presents a novel RDF statistical synopsis that accurately estimates cardinalities for SPARQL queries, and provides a dynamic programming approach that decomposes the SPARQL graph into disjoint star-shaped sub-queries.
In the current work, we propose a hypergraph based query
optimization technique as discussed in the following sections.
III. PROBLEM SCENARIO
SPARQL, a query language, is developed to provide a so-
lution to the problem of RDF data querying. In 2008 SPARQL
becomes a W3C recommendation, but query optimization is a
big challenge [[7], [8]]. To explain the problems we take an
exemplary RDF graph as shown in figure 1.
Fig. 1: RDF Graph (vertices a1-a10 connected by directed predicate edges such as type, playsfor, born, position, region, manages and population)
Figure 1 shows subject and object vertices connected by predicate edges; the direction of an edge is from subject to object. Here a4, a5, a9 are subjects; a1, a2, a3, a7, a10 are objects; and a6, a8 are both subjects and objects. The edges playsfor, born, position, etc. are the predicates of the graph. A SPARQL query contains match patterns as conditions, and these match patterns have to be matched against the RDF database. Query1 contains one match pattern, extracting information about a player's birth place.
query1: Qp1 ⇒ {?player born ?region}.
Although the above query has only a single match pattern, matching it becomes challenging when the RDF graph is very large, because finding the particular edge labeled with the query's predicate becomes complex and time consuming. As the number of match patterns increases, the complexity and execution time also increase. For instance, the following query2 contains 5 match patterns.
query2: Qp1 ⇒ {?player playsfor ?club},
Qp2 ⇒ {?player type a2},
Qp3 ⇒ {?player born ?region},
Qp4 ⇒ {?club region ?region},
Qp5 ⇒ {?region population ?pop}.
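Written out in full SPARQL syntax (the form used for query3-query5 in Section VII), query2 could look as follows. This is only a sketch: the ex: prefix and the IRI forms are hypothetical, since the example graph uses bare predicate labels.

PREFIX ex: <http://example.org/>

SELECT ?player ?club ?region ?pop
WHERE {
  ?player ex:playsfor ?club .
  ?player ex:type ex:a2 .
  ?player ex:born ?region .
  ?club ex:region ?region .
  ?region ex:population ?pop .
}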
In the current work our aim is to match these patterns against the database in optimized time. In our earlier work [10] we introduced the idea of using a hypergraph on top of the RDF graph. More specifically, we applied the concept of an undirected hypergraph, placing the subjects and objects of a particular predicate into a single hyperedge, and provided some algorithms towards the goal of SPARQL query optimization. In the current work we implement those algorithms with some refinements to improve the performance of the system.
IV. SYSTEM ARCHITECTURE
Figure 2 presents the architecture of our proposed system.
It consists of 3 layers: (1) user interface layer; (2) query
processing layer; and (3) storage layer.
1) User Interface Layer : In this layer the user invokes queries. After executing a query, our system returns the answer to the user.
2) Query Processing Layer : This layer contains 3 sub-components.
Query parser, for extracting the match patterns from the query.
Query optimizer, for rearranging the match patterns according to the size of the hyperedges.
Query processor, for executing the query against the stored information.
3) Storage Layer : This layer contains 2 sub-components.
RDF data, represented as a graph of triples.
Indexer and Storage, for building the index by taking the subjects and objects of each particular predicate as a single unit and storing the data as key-value pairs {the predicate as key and its subject-object pairs as value}.
Fig. 2: System Architecture
V. APPROACH DEFINITION
In this section we provide formal definitions of the basic terms used in this paper.
Definition 1 : RDF graph: an RDF graph is G=(V, E), where V={v | v ∈ S ∪ O} and E={e1, e2, ...} with each e={u, v}, u, v ∈ V. There are two labeling functions, le and lv: le is the edge-labeling function, with le(S, O)=P, and lv is the node-labeling function, with lv(vt)=t, where t ∈ (S ∪ O). Here S=Subject (URI ∪ BLANKS), P=Predicate (URI) and O=Object (URI ∪ BLANKS ∪ LIT).
Definition 2 : Hypergraph: a hypergraph is HG=(V, E), where V={v1, v2, ..., vn} with V={v | v ∈ S ∪ O ∪ P}, and E={e1, e2, ..., en}, where each edge ei is a non-empty subset of V and ∪ni=1 ei = V.
Definition 3 : Proposed Hypergraph: the proposed hypergraph is Hy(G)=(V, E, lev), where V={v1, v2, ..., vn} with v ∈ (S ∪ O ∪ P), E={e1, e2, ..., en} with each e a set of vertices {vi, ..., vj}, and lev is the node-labeling function, with lev(v)=X, where X ∈ {subject, object, predicate}.
Definition 4 : SPARQL query: a SPARQL query QR mainly contains <Qq, Qs, Qp>, where Qq is the query form and Qp is the match pattern; if ?x ∈ var(Qq) then ?x ∈ var(Qp). Qs represents constraints over the query, such as FILTER and OPTIONAL.
VI. HYPERGRAPH CREATION AND QUERY PROCESSING
In this section, we describe our proposed approach, which consists of 3 steps.
Step 1: Transformation of the RDF graph into a hypergraph: we convert the RDF graph into a hypergraph by collecting the subjects and objects connected with each distinct predicate.
Step 2: Query parsing and rearrangement of the SPARQL query according to predicate size: we rearrange the match patterns of a given query according to hyperedge size and build a query path.
Step 3: Query processing: we process and execute the query based on the rearranged query of step 2. This step involves 3 sub-steps:
1) Loop over the match patterns;
2) process(t, res, d, tempdict);
3) FindVariables(p, tempdict, d, t).
The advantage of the proposed approach is that we eliminate the cost of joining tables as in a relational database, since all the subjects and objects connected with a particular predicate are grouped together.
A. Transformation of RDF graph into a hypergraph
To transform the RDF graph into a hypergraph, we deploy
algorithm 1.
Algorithm 1 Creating Hypergraph using Dictionary for storing
triples
Input: G: Graph containing triples
Output: data: Dictionary storing the subject-object pairs for each predicate
1: data ← ∅ , arr ← ∅
2: for all Gj where 0≤j≤G.length-1 do
3: arr ← arr ∪ Gj {predicate}
4: j ← j + 1
5: end for
6: remove duplicate entries from arr
7: for all arri where 0≤i≤arr.length-1 do
8: for all Gj where 0≤j≤G.length-1 do
9: if arri == Gj {predicate} then
10: data[arri] ← data[arri] ∪ Gj {subject} ∪ Gj {object}
11: end if
12: j ← j + 1
13: end for
14: i ← i + 1
15: end for
As shown above, we store the subject-object pairs for each predicate in a dictionary. Here dictionary refers to a Python dictionary (a.k.a. mapping)3. A dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be of any immutable type; strings and numbers can always be keys. Figure 3 is the pictorial representation of the dictionary as a hypergraph generated from the RDF graph of figure 1. In figure 3, each predicate is a hyperedge which connects the subjects and objects associated with it.
3https://docs.python.org/2/tutorial/datastructures.html#dictionaries
Fig. 3: Hypergraph
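A minimal Python sketch of Algorithm 1 follows (the function name build_hypergraph is ours). It assumes the input graph is available as an iterable of (subject, predicate, object) triples, e.g. an RDFLib Graph; a single pass with a dictionary replaces the two scans over G in the pseudocode.

from collections import defaultdict

def build_hypergraph(graph):
    # One hyperedge per predicate: the dictionary maps each predicate to the
    # flat list [s1, o1, s2, o2, ...] of its subject-object pairs.
    data = defaultdict(list)
    for s, p, o in graph:
        data[p].extend((s, o))
    return dict(data)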
B. Query parsing and rearrangement of the SPARQL query according to predicate size
When a SPARQL query is submitted to the system, we parse it, arrange its match patterns in ascending order of hyperedge size, and use the sorted order to build a query path. To rearrange the predicate list for query execution we deploy the following algorithm 2.
Algorithm 2 Rearranging match patterns in query according to hyperedge size
Input: data, q: list storing the match patterns of the input query.
Output: sizeEdge: dictionary storing the size of each hyperedge; predicatelist: list storing the match patterns and predicates rearranged according to hyperedge size.
1: sizeEdge ← ∅ , predicate ← ∅ , predicatelist ← ∅ , line ← ∅ , pred ← ∅
2: for all predicate in data.keys do
3: sizeEdge[predicate] ← data[predicate].length/2
4: end for
5: for each line in q do
6: pred ← line {predicate}
7: predicatelist ← predicatelist ∪ {line, pred, sizeEdge[pred]}
8: end for
9: sort predicatelist based on sizeEdge[pred]
10: return sizeEdge, predicatelist
Here we take the input query as a list q. We store the size of each hyperedge, i.e. the number of subject-object pairs of each predicate, in the dictionary sizeEdge. From query q we extract the predicates, store each predicate along with its match pattern and size in predicatelist, and sort the list by predicate size.
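The following Python sketch mirrors Algorithm 2 (names as in the pseudocode; the tuple layout of predicatelist entries is our assumption). Each match pattern line is taken to be a (subject, predicate, object) tuple.

def rearrange(data, q):
    # Size of a hyperedge = number of subject-object pairs under its predicate.
    sizeEdge = {p: len(pairs) // 2 for p, pairs in data.items()}
    predicatelist = [(line, line[1], sizeEdge[line[1]]) for line in q]
    predicatelist.sort(key=lambda entry: entry[2])  # ascending hyperedge size
    return sizeEdge, predicatelist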
C. Query processing
Using the rearranged SPARQL query, we extract the required subjects and objects from the data dictionary. We achieve this using the steps below.
1) Loop over the match patterns: To loop over the match patterns in the rearranged query, we deploy algorithm 3. For each match pattern, 3 cases are possible: (1) the string at the subject position is constant and the string at the object position is a variable; (2) the string at the subject position is a variable and the string at the object position is constant; (3) the strings at both the subject and object positions are variables. For the first two cases we use the function process(t, res, d, tempdict), where res stores the answers for the variable in the current triple. The third case has three further subcases: (1) both the subject and the object have been encountered in match patterns already processed in the query; (2) either the subject or the object has been encountered in match patterns already processed; (3) neither the subject nor the object has been encountered in the triples already processed. For the first two subcases, we store in p all the variables associated with the subject (and/or object) in match patterns already processed, and the answers of these variables in tempdict. We then use the function FindVariables(p, tempdict, d, t): we find the answers common to the answers of the subject (and/or object) variable in the current match pattern and the answers of the same variable previously encountered, and accordingly modify the answers of the variables in p. We store the final answers of all the variables in d. For the last subcase, we simply find the subjects and objects of the predicate of the current match pattern and store them in d.
2) process(t, res, d, tempdict): This method handles the cases where either the subject position string or the object position string is a variable.
Algorithm 4 describes the function process. If the subject (or object) variable is not present in d (i.e. it has not been encountered before), then we simply store in d all the subjects associated, through the predicate of the current match pattern, with the object position string. Otherwise we store in p all the variables associated with the subject (or object) through predicates in match patterns already processed, and the answers of these variables in tempdict. We then use the function FindVariables(p, tempdict, d, t), find the answers common to the answers of the subject (or object) variable in the current match pattern and the answers of the same variable previously encountered, and accordingly modify the answers of the variables in p.
3) FindVariables(p, tempdict, d, t): In this step we find the variables whose answers will be affected by the current match pattern and will undergo modification.
Using algorithm 5, we find the variables associated, through match patterns already processed in the current query, with the variables stored in p. Here t stores the subject (or object) of the current match pattern being processed, and curvar stores the variables in the current triple.
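To make the case analysis concrete, the sketch below shows the top-level dispatch of Algorithm 3 in Python. It is deliberately simplified: the index-aligned propagation of answers to other variables via tempdict and FindVariables (Algorithms 4 and 5) is replaced by a plain intersection, so it illustrates the control flow rather than the full method.

def process(t, res, d, tempdict):
    # Simplified Algorithm 4: keep only answers consistent with earlier ones.
    d[t] = res if t not in d else [v for v in res if v in d[t]]

def execute(data, predicatelist):
    d = {}                              # answers for every variable seen so far
    for pattern, pred, size in predicatelist:
        subobj = data[pred]             # flat [s1, o1, s2, o2, ...] hyperedge list
        pairs = list(zip(subobj[0::2], subobj[1::2]))
        x, y = pattern[0], pattern[2]   # strings at subject and object positions
        if not x.startswith('?'):       # case 1: constant subject, variable object
            process(y, [o for s, o in pairs if s == x], d, {})
        elif not y.startswith('?'):     # case 2: variable subject, constant object
            process(x, [s for s, o in pairs if o == y], d, {})
        elif x not in d and y not in d: # subcase 3: neither variable seen before
            d[x] = [s for s, o in pairs]
            d[y] = [o for s, o in pairs]
        else:                           # subcases 1-2: intersect with earlier answers
            kept = [(s, o) for s, o in pairs
                    if (x not in d or s in d[x]) and (y not in d or o in d[y])]
            d[x] = [s for s, o in kept]
            d[y] = [o for s, o in kept]
    return d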
VII. EVALUATION AND RESULT
In this section, we evaluate the performance of our algorithms in terms of both dataset loading time and query execution time. For this purpose, we use the SP2Bench benchmark datasets [9] and apply the following 3 SPARQL queries. SP2Bench is set in the DBLP scenario and comprises both a data generator for creating arbitrarily large DBLP-like documents and a set of carefully designed benchmark queries; the datasets are provided in N3 format. RDFLib-3.2.0, a Python library, is used to implement our algorithms.
query3: SELECT ?article WHERE { ?article <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://localhost/vocabulary/bench/Article> }
query4: SELECT ?yr WHERE { ?journal <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://localhost/vocabulary/bench/Journal> . ?journal <http://purl.org/dc/terms/issued> ?yr }
query5: SELECT ?inproc ?booktitle ?title WHERE { ?inproc <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://localhost/vocabulary/bench/Inproceedings> . ?inproc <http://purl.org/dc/elements/1.1/creator> ?author . ?inproc <http://localhost/vocabulary/bench/booktitle> ?booktitle . ?inproc <http://purl.org/dc/elements/1.1/title> ?title }
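Before running these queries, the datasets must be loaded. For illustration, loading one of the SP2Bench N3 files with RDFLib and handing it to the hypergraph builder sketched in Section VI could look like this (the file name is hypothetical):

from rdflib import Graph

g = Graph()
g.parse("sp2bench.n3", format="n3")  # hypothetical file name; SP2Bench emits N3
print(len(g), "triples loaded")
data = build_hypergraph(g)           # predicate -> subject-object pairs dictionary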
Our proposed algorithms, as discussed above, are implemented in Python and compared with apache-jena-2.12.1, rdf3x-0.3.7 and AllegroGraph WebView 4.14.1. Jena is implemented in Java, rdf-3x in C++. All tests are run on a laptop with an Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz, 4GB RAM and a 465GB 5400RPM hard disk, running Fedora 20. We use NetBeans IDE 8.0.2 for running Jena, rdf-3x and our system.
Algorithm 3 Loop over the match patterns
Input: data, predicatelist
Output: d: Dictionary storing answers for all variables in the query
1: d ← ∅
2: for c = 0 to predicatelist.length-1 do
3: subobj ← fetch and store subject object pairs of cth predicate in predicatelist
from data dictionary
4: res ← ∅ , tempdict ← ∅
5: x ← subject position string in cth match pattern, y← object position string in
cth match pattern
6: if subject position string does not start with ’?’ then
7: res ← objects from subobj associated with subject position string
8: process(y, res, d, tempdict)
9: else if object position string does not start with ’?’ then
10: res ← subjects from subobj associated with object position string
11: process(x, res, d, tempdict)
12: else
13: if key x and key y are in d then
14: psub ← d[x], pobj ← d[y], d[x] ← ∅ , d[y] ← ∅ , p← ∅
15: p ← list of variables associated with x and y through predicates in match
pattern already processed in the query
16: FindVariables(p, tempdict, d, t)
17: for all subobji where 0≤i≤subobj.length-1 do
18: if subobji is in psub and subobji+1 is in pobj then
19: ind1 ← index of subobji in psub, ind2 ← index of subobji+1
in pobj
20: if ind1 == ind2 then
21: d[x] ← d[x] ∪subobji
22: d[y] ← d[y] ∪subobji+1
23: var ← ∅
24: for all var in tempdict.keys do
25: d[var] ← d[var] ∪tempdict[var][ind1]
26: end for
27: end if
28: end if
29: i ← i + 2
30: end for
31: else if key x is in d or key y is in d then
32: i ← 0, counter ← 1
33: if y in d then
34: swap x and y
35: i ← 1, counter ← -1
36: end if
37: psub ← d[x], d[x] ← ∅, d[y] ← ∅
38: p ← list of variables associated with x through predicates in match
patterns already processed in current query
39: FindVariables (p, tempdict, d, t)
40: while i < subobj.length do
41: if subobji in psub then
42: d[x] ← d[x] ∪subobji
43: d[y] ← d[y] ∪subobji+counter
44: ind ← index of subobji in psub
45: var ← ∅
46: for all var in tempdict.keys do
47: d[var] ← d[var] ∪tempdict[var][ind]
48: end for
49: end if
50: i ← i+2
51: end while
52: else
53: d[x] ← all subjects from subobj
54: d[y] ← all objects from subobj
55: end if
56: end if
57: c ← c+1
58: end for
59: return d
A. Dataset loading and hypergraph creation time analysis
In this section we present statistics on the dataset loading time and hypergraph creation time of our proposed system, using the line diagrams shown in figures 4(a) and 4(b) respectively. Figure 4(a) shows the loading time for datasets of various sizes. Our system takes 0.23 sec. to load 691 triples, which increases to 0.44 sec. when we roughly double the load to 1285 triples. As we increase the number of triples, the load time gradually increases; for 32330 triples, the total load time is 11.12 sec. We use RDFLib to load the datasets into our program.
Fig. 4: (a) Loading time of dataset; (b) Hypergraph creation time from dataset
Algorithm 4 process(t, res, d, tempdict)
Input: t: variable in current match pattern
res: storing answer of variable in current match pattern
Output: tempdict , d
1: if key t is not in d then
2: d[t] ← res
3: else
4: new ← d[t]
5: d[t]← ∅, p ← ∅
6: p ← list of variables associated with t through predicates in match patterns
already processed in current query
7: FindVariables(p, tempdict, d, t)
8: for all resi where 0≤i≤res.length-1 do
9: if resi is present in new then
10: d[t] ← d[t] ∪ resi
11: ind ← index of resi in new
12: var ← ∅
13: for all var in tempdict.keys do
14: d[var] ← d[var] ∪ tempdict[var][ind]
15: end for
16: end if
17: i ← i + 1
18: end for
19: end if
Algorithm 5 FindVariables(p, tempdict, d, t)
Input: p: list of variables associated with the subject or object in the current match pattern being processed; d
Output: tempdict
1: curvar ← ∅
2: curvar ← curvar ∪ t
3: for all pi where 0≤i≤p.length-1 do
4: if pi not in tempdict and pi not in curvar then
5: tempdict[pi] ← d[pi]
6: d[pi] ← ∅
7: if pi has any variable associated with it through predicates in match patterns
already processed in current query then
8: p ← p ∪ variables associated with pi
9: end if
10: end if
11: i ← i + 1
12: end for
Figure 4(b) shows the hypergraph creation time for datasets of various sizes. As can be seen from figure 4(b), the hypergraph creation time also gradually increases with the number of triples: for 691 triples the system takes 0.20 sec., for 1285 triples 0.52 sec., and for 2006 triples 0.57 sec. We ran this experiment up to a total of 16040 triples, for which the hypergraph creation time is 12.21 sec.
B. Query execution time analysis
In this section we present the query execution times, using bar diagrams comparing our proposed system with the three external systems apache-jena-2.12.1, rdf3x-0.3.7 and AllegroGraph WebView 4.14.1. Figure 5(a) shows the comparison of query evaluation time between our system and the other three systems for a dataset containing 691 triples, running the SPARQL queries listed at the beginning of this section. As figure 5(a) shows, our system takes less time to run the three queries than the other 3 systems: 0.00048s for query3, 0.00040s for query4 and 0.00093s for query5.
Figure 5(b) shows the comparison for a dataset containing 1285 triples: our system takes 0.00043s for query3, 0.00058s for query4 and 0.00099s for query5. In a similar manner, we increase the dataset size and run the above queries. As can be seen from figures 5(a)-5(d), our proposed system executes the three queries in less time than the other systems, with the following exceptions. For the dataset containing 8023 triples, rdf-3x takes less time for query4, and both rdf-3x and AllegroGraph take less time for query5 (figure 5(e)). For the dataset containing 16040 triples, rdf-3x takes less time than our system for query4 and query5 (figure 5(f)).
VIII. CONCLUSION
In this paper we provide a framework for hypergraph based SPARQL query optimization over RDF data. We focus on storing RDF data as a hypergraph and then processing SPARQL queries against the hypergraph. Based on the performance analysis of query processing time against other existing systems, we find that our system yields better results in less time.
Fig. 5: Query processing times for datasets of 691 (a), 1285 (b), 2006 (c), 4021 (d), 8023 (e) and 16040 (f) triples
In the present work, we have considered simple queries. As a continuation of the current work, the presented approach can be extended to handle more complex queries.
REFERENCES
[1] Thomas Neumann and Gerhard Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91-113, 2010.
[2] Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web. Scientific American, 2001.
[3] Andrey Gubichev and Thomas Neumann. Exploiting the query structure for efficient join ordering in SPARQL queries. In EDBT'14, pages 439-450, 2014.
[4] Olaf Hartig. SQUIN: a traversal based query execution system for the web of linked data. 2013.
[5] Jiewen Huang, Daniel J. Abadi, and Kun Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11):1123-1134, 2011.
[6] Dong-Hyuk Im, Nansu Zong, Eung-Hee Kim, Seokchan Yun, and Hong-Gee Kim. A hypergraph-based storage policy for RDF version management system. 2012.
[7] Kisung Kim, Bongki Moon, and Hyoung-Joo Kim. RG-index: an RDF graph index for efficient SPARQL query processing. 2014.
[8] Chang Liu, Haofen Wang, Yong Yu, and Linhao Xu. Towards efficient SPARQL query processing on RDF data. 2010.
[9] Michael Schmidt, Thomas Hornung, Georg Lausen, and Christoph Pinkel. SP2Bench: a SPARQL performance benchmark.
[10] Sangeeta Sen, Moumita Ghosh, Animesh Dutta, and Biswanath Dutta. Hypergraph based query optimization. January 2015.
[11] Thanh Tran and Günter Ladwig. Structure index for RDF data. 2010.
[12] Octavian Udrea, Andrea Pugliese, and V. S. Subrahmanian. GRIN: a graph based RDF index. In AAAI, pages 1465-1470. AAAI Press, 2007.
[13] W3C. AllegroGraph RDFStore: web 3.0's database, September 2009.
[14] Kevin Wilkinson, Craig Sayers, Harumi Kuno, and Dave Reynolds. Efficient RDF storage and retrieval in Jena2. In Proc. First International Workshop on Semantic Web and Databases, 2003.
[15] Pingpeng Yuan, Pu Liu, Buwen Wu, Hai Jin, Wenya Zhang, and Ling Liu. TripleBit: a fast and compact system for large scale RDF data. PVLDB, 6(7):517-528, 2013.
[16] Pingpeng Yuan, Wenya Zhang, Hai Jin, and Buwen Wu. Statement hypergraph as partitioning model for RDF data processing. Pages 138-145, 2012.
[17] Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, and Zhongyuan Wang. A distributed graph engine for web scale RDF data. In Proceedings of the 39th International Conference on Very Large Data Bases, PVLDB'13, pages 265-276. VLDB Endowment, 2013.
[18] Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, and Dongyan Zhao. gStore: answering SPARQL queries via subgraph matching. PVLDB, 4(8):482-493, 2011.
Weitere ähnliche Inhalte

Was ist angesagt?

Towards efficient processing of RDF data streams
Towards efficient processing of RDF data streamsTowards efficient processing of RDF data streams
Towards efficient processing of RDF data streamsAlejandro Llaves
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_LibrarySaurabh Chauhan
 
Data Structure
Data StructureData Structure
Data Structuresheraz1
 
Stacks in algorithems & data structure
Stacks in algorithems & data structureStacks in algorithems & data structure
Stacks in algorithems & data structurefaran nawaz
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on RAjay Ohri
 
LarKC Tutorial at ISWC 2009 - Data Model
LarKC Tutorial at ISWC 2009 - Data ModelLarKC Tutorial at ISWC 2009 - Data Model
LarKC Tutorial at ISWC 2009 - Data ModelLarKC
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeNational Institute of Informatics
 
3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
how can implement a multidimensional Data Warehouse using NoSQL
how can implement a multidimensional Data Warehouse using NoSQLhow can implement a multidimensional Data Warehouse using NoSQL
how can implement a multidimensional Data Warehouse using NoSQLMohammed El malki
 
Enhancing Big Data Analysis by using Map-reduce Technique
Enhancing Big Data Analysis by using Map-reduce TechniqueEnhancing Big Data Analysis by using Map-reduce Technique
Enhancing Big Data Analysis by using Map-reduce TechniquejournalBEEI
 
5. Queue - Data Structures using C++ by Varsha Patil
5. Queue - Data Structures using C++ by Varsha Patil5. Queue - Data Structures using C++ by Varsha Patil
5. Queue - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patil4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environmentizahn
 
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...DBOnto
 
Running complex data queries in a distributed system
Running complex data queries in a distributed systemRunning complex data queries in a distributed system
Running complex data queries in a distributed systemArangoDB Database
 
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
An efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence apiAn efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence apiJoão Gabriel Lima
 

Was ist angesagt? (19)

Towards efficient processing of RDF data streams
Towards efficient processing of RDF data streamsTowards efficient processing of RDF data streams
Towards efficient processing of RDF data streams
 
Technical_Report_on_ML_Library
Technical_Report_on_ML_LibraryTechnical_Report_on_ML_Library
Technical_Report_on_ML_Library
 
Data Structure
Data StructureData Structure
Data Structure
 
Stacks in algorithems & data structure
Stacks in algorithems & data structureStacks in algorithems & data structure
Stacks in algorithems & data structure
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
LarKC Tutorial at ISWC 2009 - Data Model
LarKC Tutorial at ISWC 2009 - Data ModelLarKC Tutorial at ISWC 2009 - Data Model
LarKC Tutorial at ISWC 2009 - Data Model
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
 
3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil3. Stack - Data Structures using C++ by Varsha Patil
3. Stack - Data Structures using C++ by Varsha Patil
 
how can implement a multidimensional Data Warehouse using NoSQL
how can implement a multidimensional Data Warehouse using NoSQLhow can implement a multidimensional Data Warehouse using NoSQL
how can implement a multidimensional Data Warehouse using NoSQL
 
Data Structure
Data StructureData Structure
Data Structure
 
Enhancing Big Data Analysis by using Map-reduce Technique
Enhancing Big Data Analysis by using Map-reduce TechniqueEnhancing Big Data Analysis by using Map-reduce Technique
Enhancing Big Data Analysis by using Map-reduce Technique
 
5. Queue - Data Structures using C++ by Varsha Patil
5. Queue - Data Structures using C++ by Varsha Patil5. Queue - Data Structures using C++ by Varsha Patil
5. Queue - Data Structures using C++ by Varsha Patil
 
4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patil4. Recursion - Data Structures using C++ by Varsha Patil
4. Recursion - Data Structures using C++ by Varsha Patil
 
Introduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing EnvironmentIntroduction to the R Statistical Computing Environment
Introduction to the R Statistical Computing Environment
 
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ...
 
Running complex data queries in a distributed system
Running complex data queries in a distributed systemRunning complex data queries in a distributed system
Running complex data queries in a distributed system
 
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil1. Fundamental Concept - Data Structures using C++ by Varsha Patil
1. Fundamental Concept - Data Structures using C++ by Varsha Patil
 
H1076875
H1076875H1076875
H1076875
 
An efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence apiAn efficient data mining framework on hadoop using java persistence api
An efficient data mining framework on hadoop using java persistence api
 

Andere mochten auch

HyperGraphDb
HyperGraphDbHyperGraphDb
HyperGraphDbborislav
 
Graph All the Things: An Introduction to Graph Databases
Graph All the Things: An Introduction to Graph DatabasesGraph All the Things: An Introduction to Graph Databases
Graph All the Things: An Introduction to Graph DatabasesNeo4j
 
Building Apps with MongoDB
Building Apps with MongoDBBuilding Apps with MongoDB
Building Apps with MongoDBNate Abele
 
Neo4j - The Benefits of Graph Databases (OSCON 2009)
Neo4j - The Benefits of Graph Databases (OSCON 2009)Neo4j - The Benefits of Graph Databases (OSCON 2009)
Neo4j - The Benefits of Graph Databases (OSCON 2009)Emil Eifrem
 
OSCON 2012 MongoDB Tutorial
OSCON 2012 MongoDB TutorialOSCON 2012 MongoDB Tutorial
OSCON 2012 MongoDB TutorialSteven Francia
 

Andere mochten auch (9)

First Step in NoSql
First Step in NoSqlFirst Step in NoSql
First Step in NoSql
 
Hypergraphs
HypergraphsHypergraphs
Hypergraphs
 
HyperGraphDb
HyperGraphDbHyperGraphDb
HyperGraphDb
 
Introduction to Hypergraphs
Introduction to HypergraphsIntroduction to Hypergraphs
Introduction to Hypergraphs
 
Graph All the Things: An Introduction to Graph Databases
Graph All the Things: An Introduction to Graph DatabasesGraph All the Things: An Introduction to Graph Databases
Graph All the Things: An Introduction to Graph Databases
 
HypergraphDB
HypergraphDBHypergraphDB
HypergraphDB
 
Building Apps with MongoDB
Building Apps with MongoDBBuilding Apps with MongoDB
Building Apps with MongoDB
 
Neo4j - The Benefits of Graph Databases (OSCON 2009)
Neo4j - The Benefits of Graph Databases (OSCON 2009)Neo4j - The Benefits of Graph Databases (OSCON 2009)
Neo4j - The Benefits of Graph Databases (OSCON 2009)
 
OSCON 2012 MongoDB Tutorial
OSCON 2012 MongoDB TutorialOSCON 2012 MongoDB Tutorial
OSCON 2012 MongoDB Tutorial
 

Ähnlich wie final_copy_camera_ready_paper (7)

Sparkr sigmod
Sparkr sigmodSparkr sigmod
Sparkr sigmodwaqasm86
 
Towards efficient processing of RDF data streams
Towards efficient processing of RDF data streamsTowards efficient processing of RDF data streams
Towards efficient processing of RDF data streamsAlejandro Llaves
 
Knowledge Discovery Query Language (KDQL)
Knowledge Discovery Query Language (KDQL)Knowledge Discovery Query Language (KDQL)
Knowledge Discovery Query Language (KDQL)Zakaria Zubi
 
SPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONS
SPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONSSPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONS
SPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONSIJNSA Journal
 
Rdf Processing On The Java Platform
Rdf Processing On The Java PlatformRdf Processing On The Java Platform
Rdf Processing On The Java Platformguestc1b16406
 
Sparql semantic information retrieval by
Sparql semantic information retrieval bySparql semantic information retrieval by
Sparql semantic information retrieval byIJNSA Journal
 
Optimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the webOptimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the webMahdi Atawneh
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkMila, Université de Montréal
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkSupriya .
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r publishedDipendra Kusi
 
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...Journal For Research
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
IRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description FrameworkIRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description FrameworkIRJET Journal
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRathachai Chawuthai
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...Debraj GuhaThakurta
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...Debraj GuhaThakurta
 
Database Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wiDatabase Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wiOllieShoresna
 

Ähnlich wie final_copy_camera_ready_paper (7) (20)

Sparkr sigmod
Sparkr sigmodSparkr sigmod
Sparkr sigmod
 
Towards efficient processing of RDF data streams
Towards efficient processing of RDF data streamsTowards efficient processing of RDF data streams
Towards efficient processing of RDF data streams
 
Knowledge Discovery Query Language (KDQL)
Knowledge Discovery Query Language (KDQL)Knowledge Discovery Query Language (KDQL)
Knowledge Discovery Query Language (KDQL)
 
SPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONS
SPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONSSPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONS
SPARQL: SEMANTIC INFORMATION RETRIEVAL BY EMBEDDING PREPOSITIONS
 
Rdf Processing On The Java Platform
Rdf Processing On The Java PlatformRdf Processing On The Java Platform
Rdf Processing On The Java Platform
 
Sparql semantic information retrieval by
Sparql semantic information retrieval bySparql semantic information retrieval by
Sparql semantic information retrieval by
 
Optimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the webOptimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the web
 
p27
p27p27
p27
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
 
Hot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark frameworkHot-Spot analysis Using Apache Spark framework
Hot-Spot analysis Using Apache Spark framework
 
Big data analysis using spark r published
Big data analysis using spark r publishedBig data analysis using spark r published
Big data analysis using spark r published
 
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
IRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description FrameworkIRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description Framework
 
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as KnowledgeRDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
RDF4U: RDF Graph Visualization by Interpreting Linked Data as Knowledge
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
 
Web Spa
Web SpaWeb Spa
Web Spa
 
Spark
SparkSpark
Spark
 
Database Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wiDatabase Integrated Analytics using R InitialExperiences wi
Database Integrated Analytics using R InitialExperiences wi
 

final_copy_camera_ready_paper (7)

  • 1. An Analytical Approach for Query Optimization Based on Hypergraph Sangeeta Sen NIT, DURGAPUR Email: sangeetaaec@ gmail.com Anisha Agrawal NIT, DURGAPUR Email: anishaagrawal23@ gmail.com Ankit Rathi NIT DURGAPUR Email: ankitrathi94@ it.nitdgp.ac.in Animesh Dutta NIT DURGAPUR Email: animesh.dutta@ it.nitdgp.ac.in Biswanath Dutta ISI Bangalore Email: bisu@drtc .isibang.ac.in Abstract—In Semantic web, Resource Description Framework (RDF) plays an important role for storing the data. Growth of RDF data throws a challenge for data management and evaluation of SPARQL in optimized time. In this paper we propose a hypergraph based data management system and SPARQL query optimization technique. We use the concept of hypergraph which is a generalization of the graph where the edges connect more than two vertices. We propose some algorithms for storing RDF data as hypergraph and make query on it. We compare the performance of our algorithms with 3 other systems (Rdf-3x, Apache-Jena and AllegroGraph) based on SP2 Bench a SPARQL performance benchmark dataset and SPARQL queries. I. INTRODUCTION Resource Description Framework 1 (RDF) is a core data model for storing information on the Semantic Web [2]. RDF data is stored as triplets (subject or resource, predicate or property and object or property value). Due to the development of RDF data, optimized query processing of SPARQL2 queries over RDF datasets has become an challenging issue. SPARQL is a graph based query language. The main task of SPARQL is to match the pattern, which consists of subject, predicate and object, in the underlying RDF graph. In this work, we propose a hypergraph [6], [16] based query optimization and data management technique that needs less computation time than the existing techniques discussed in rdf-3x [1], AllegroGraph [13], Jena [14]. A hypergraph is a generalization of the graph where the edges connect more than two vertices and such edges are called hyper-edges. To illustrate the novelty of our proposed work, we provide an experimental study that contrast the performance of 4 systems (including ours) against SP2 Bench benchmark [9] with different triples datasets and SPARQL queries. Following are the main contributions of this paper: • Hypergraph based database of RDF data: All sub- jects and objects connected with a particular predicate resides under a particular hyper-edge (predicate itself), so searching of RDF triples become easy to execute. • Query path based query Processing: To process a query, first the model will generates a query path according to rearranged query predicates. According to this path, query is processed in the hypergraph. 1www.w3.org/RDF/ 2www.w3.org/2001/sw/wiki/SPARQL Rest of the paper is organized as follows. In section II, current state of the art is highlighted. Section III provides an overview of the problem related to SPARQL query optimization and section IV presents the system architecture. In section V we describe the approach definition of the terms used throughout the paper. In section VI, we discuss a method for solving the problem of query optimization in RDF graph. In Section VII, we discuss the results of the proposed method. Section VIII concludes the work mentioning our future work. II. STATE-OF-THE-ART There has been a lot of research work in the field of RDF data storage and SPARQL query optimization. 
There are many approaches for SPARQL query processing, GRIN Index [12], PIG [11], and g-store [18] focuses on reducing the search space of the graph traversing algorithm using graph index. RG-Index [7] also uses graph index but focuses on reducing the joins in relation based RDF store. The model of the paper [8] represents an efficient RDF query engine to evaluate SPARQL query. This paper employs inverted index structure for indexing the RDF triple. A tree shaped optimized algorithm is developed that transforms a SPARQL query graph into optimal query plan by effectively reducing the searching space. In paper [15], the authors present TripleBit, a fast and compact system for storing and accessing RDF data. The compact design of TripleBit reduces both the size of stored data and size of its indexes. They develop a query processor which dynamically generates an optimal execution ordering for join queries, leading to fast query execution and effective reduction of intermediate result. System SQUIN, in paper [4], implements a novel query execution approach. The main idea is to integrate the traversal of data links into the construction of result. The authors of paper [5] introduce a scalable RDF data management system. They introduce techniques for partitioning the data across nodes in a manner that helps to accelerate query processing. Authors of the paper [17] introduces Trinity. RDF, a distributed, memory based graph engine for web scale data. The paper also introduces a new cost model, novel cardinality estimation technique and optimization algorithm for distributed query plan generation. A new join ordering algorithm that performs a SPARQL tailored query simplification is introduced by paper [3]. The paper also represents a novel RDF statistical synopsis that accurately estimates cardinality targeting SPARQL queries. They provide dynamic programming that decomposes the SPARQL graph into disjoint star shaped sub-queries.978-1-4799-7961-5/15/$31.00 c 2015 IEEE
  • 2. In the current work, we propose a hypergraph based query optimization technique as discussed in the following sections. III. PROBLEM SCENARIO SPARQL, a query language, is developed to provide a so- lution to the problem of RDF data querying. In 2008 SPARQL becomes a W3C recommendation, but query optimization is a big challenge [[7], [8]]. To explain the problems we take an exemplary RDF graph as shown in figure 1. a4a1 a7 type playsfor bornPosition type a2 a3 a5 position born playsfor region manages a9 a10population a8 a6 Fig. 1: RDF Graph The above figure 1 shows that the vertices subject and object are connected by edge predicates. Direction of the edges are from subject to object. Here a4, a5, a9 are subjects, a1, a2, a3, a7, a10 are objects and a6, a8 represent both subjects and objects. The edges playsfor, born, position, etc. are the predicates of the graph. SPARQL query contains match patterns as conditions. These match patterns are required to be matched in RDF database. Query1 contains one match pattern to extract information about player’s birth place. query1 : Qp 1 ⇒ {?player born ?region}. Although there is a single predicate match pattern in the above query, it becomes challenging when RDF graph is too large. Because to find the particular edge labeled as predicate of SPARQL query from the graph becomes complex and time consuming. We can understand that as the number of match pattern increases, the complexity and execution time also increases. For instance, the following query 2 contains 5 match patterns. query2 : Qp 1 ⇒ {?player playesfor ?club}, Qp 2 ⇒ {?player type a2}, Qp 3 ⇒ {?player born ?region} , Qp 4 ⇒ {?club region ?region}, Qp 5 ⇒ {?region population ?pop}. In the current work our aim is to match these patterns in the database in optimized time. In our earlier work in [10] we have provided the idea of using hypergraph on top of the RDF graph. To be more specific, we have applied the concept of undirected hypergraph with an idea of placing the subjects and objects of a particular predicate into a single hyperedge. The paper provides some algorithms to achieve the goal of SPARQL query optimization. In the current work we implement the algorithms with some refinements to improve the performance of the system. IV. SYSTEM ARCHITECTURE Figure 2 presents the architecture of our proposed system. It consists of 3 layers: (1) user interface layer; (2) query processing layer; and (3) storage layer. 1) User Interface Layer : In this layer user invokes their queries. After executing the queries our system provides the answer to the user. 2) Query Processing Layer : This layer contains 3 sub components. Query parser, for extracting the match patterns from the query. Query optimizer, for rearranging the match patterns according to the size of hyperedges. Query processor, for executing the query against the stored information. 3) Storage Layer : This layer also contains 2 sub components. RDF data, represented as a graph of triplets. Indexer and Storage, for building the index by tak- ing subjects and objects of particular predicate in a single unit and storing the data as a key-value pair {predicate as key and subject and object pairs as value}. USER QUERY PARSER SPARQL QUERY RESULT RDF DATA OF EACH PREDICATE) MATCH PATTERNS (TAKING SUBJECTS, OBJECTS USER INTERFACE LAYER QUERY PROCESSING LAYER STORAGE LAYER QUERY PROCESSOR QUERY OPTIMIZER STORAGEINDEXER AND (STORE AS KEY−VALUE PAIR) Fig. 2: System Architecture V. 
APPROACH DEFINITION In this section we provide formal definition of the basic terms used in this paper. Definition 1 : RDF graph: a RDF graph G=(V, E) where V={v|v∈ S ∪ O} and E={ e1, e2..} ∃ e={u, v} where u, v∈V. There are two function le and lv. le is an edge-labeling function. le(S, O)=P and lv is the node labeling function. lv(vt)=t where t ∈ (S ∪ O) and S=Subject (URI ∪ BLANKS), P=Predicate (URI), O=Object(URI ∪ BLANKS ∪ LIT). Definition 2 : Hypergraph: a Hypergraph HG=(V, E) where node V= { v1, v2.., vn } and E={ e1, e2,.., en } where V={v|v∈ S ∪ O ∪ P} and each edge E is a non-empty set of V. E=∪n i=1V. Definition 3 : Proposed Hypergraph: a Proposed Hy- pergraph Hy(G)=(V, E, lev) where V= { v1, v2.., vn }|v ∈ (S, O, P), E={ e1, e2,.., en }|e ∈ {vi, vj} and lev is the node labeling function. lev = X where X ∈ (subject, object and predicate). Definition 4 : SPARQL query: a SPARQL query (QR ) mainly contains <Qq , Qs , Qp > where Qq is the query form and Qp is the match pattern. If ?x∈var(Qq ) then ?x∈varQp and Qs represents constrains over the query. It contains constraints like FILTER, OPTIONAL.
  • 3. VI. HYPERGRAPH CREATION AND QUERY PROCESSING In this section, we describe our proposed approach. The approach consists of 3 steps. Steps 1: Transformation of RDF graph into a hypergraph: In this step we convert RDF graph into a hypergraph by calculating the subjects and objects connected with the distinct predicate. Steps 2: Query parsing and Rearrangement of SPARQL query according to predicate size: In this step we rearrange the match patterns of a given query according to hyperedge size and build a query path. Steps 3: Query processing: In this step we process and execute the query based on the rearranged query in step 2. This step involves 3 sub-steps: 1) Loop over the match patterns; 2) process(t, res, d, tempdict); 3) FindVariables(p, tempdict, d, t). The advantage of the proposed approach is we eliminate the joining cost of tables of a relational database. Here all the subjects and objects connected with a particular predicate are grouped together. A. Transformation of RDF graph into a hypergraph To transform the RDF graph into a hypergraph, we deploy algorithm 1. Algorithm 1 Creating Hypergraph using Dictionary for storing triples Input: G: Graph containing triples Output: data: Dictionary storing subjects objects pairs for each predicate 1: data ← ∅ , arr ← ∅ 2: for all Gj where 0≤j≤G.length-1 do 3: arr ← arr ∪ Gj {predicate} 4: j ← j + 1 5: end for 6: remove duplicate entries from arr 7: for all arri where 0≤i≤arr.length-1 do 8: for all Gj where 0≤j≤G.length-1 do 9: if arri == Gj {predicate} then 10: data[arri] ← data[arri] ∪ Gj {subject} ∪ Gj {object} 11: end if 12: j ← j + 1 13: end for 14: i ← i + 1 15: end for As shown above, we store the subject object pairs for each predicate in a dictionary. Here dictionary refers to the idea of python dictionary (a.k.a. mappings) 3 . Dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). Unlike sequences, which are indexed by a range of numbers, dictionaries are indexed by keys, which can be any immutable type; strings and numbers can always be keys. Figure 3 is the pictorial representation of dictionary as a hypergraph that is generated from the RDF graph (figure 1). Here, in this figure 3, each predicate is a hyperedge which connects subjects and objects associated with it. 3https://docs.python.org/2/tutorial/datastructutes.html dictionaries Fig. 3: Hypergraph B. Query parsing and Rearrangement of SPARQL query ac- cording to predicate size When a SPARQL query is invoked into the system, we parse the query and arrange the match patterns in the query in ascending order of hyperedge size and use the sorted order to build a query path. To rearrange the predicate list for query execution we deploy the following algorithm 2. Algorithm 2 Rearranging match patterns in query according to hyperedge size Input: data, q: list storing the match patterns of input query. Output: sizeEdge: dictionary storing size of each hyperedge, predicatelist: list storing rearranged match patterns and predicate according to hyperedge size. sizeEdge ← ∅ , predicate ← ∅ , predicatelist ← ∅ , line ← ∅ , pred ← ∅ 2: for all predicate in data.keys do sizeEdge[predicate] ← data[predicate].length/2 4: end for for each line in q do 6: pred ← line {predicate} predicatelist ← predicatelist ∪ {line, pred, sizeEdge[pred]} 8: end for sort predicatelist based on sizeEdge[pred] 10: return sizeEdge, predicatelist Here we take the input query in a list q. We store the size i.e. 
C. Query processing

Using the rearranged SPARQL query, we extract the required subjects and objects from the data dictionary. We achieve this using the steps below.

1) Loop over the match patterns: To loop over the match patterns in the rearranged query, we deploy Algorithm 3. For each match pattern, 3 cases are possible: (1) the string at the subject position is constant and the string at the object position is a variable; (2) the string at the subject position is a variable and the string at the object position is constant; (3) the strings at both the subject and object positions are variables. For the first two cases we use the function process(t, res, d, tempdict), where res stores the answers of the variable in the current match pattern. The third case has three further subcases: (1) both the subject and the object have been encountered before in the match patterns already processed in the query; (2) either the subject or the object has been encountered before; (3) neither the subject nor the object has been encountered before. For the first two subcases, we store in p all the variables associated with the subject (and/or object) in the match patterns already processed, and the answers of these variables in tempdict; we then call the function FindVariables(p, tempdict, d, t). We find the answers common to the answers of the subject (and/or object) variable in the current match pattern and the answers of the same variable previously encountered, and accordingly modify the answers of the variables in p. We store the final answers of all the variables in d. For the last subcase, we simply find the subjects and objects of the predicate of the current match pattern and store them in d.
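A compact sketch of this case analysis is given below; the full details are in Algorithm 3. The pattern format is our assumed (subject, predicate, object) tuple, and the callables process and handle_two_vars stand in for the helpers described above:

```python
def is_var(term):
    # SPARQL variables start with '?'
    return term.startswith("?")

def loop_match_patterns(data, predicate_list, process, handle_two_vars):
    """Skeleton of Algorithm 3: dispatch each rearranged match pattern
    to one of the three cases described above."""
    d = {}                                   # answers for every variable
    for pattern, pred, _size in predicate_list:
        subobj = data[pred]                  # flat [s1, o1, s2, o2, ...]
        x, y = pattern[0], pattern[2]
        tempdict = {}
        if not is_var(x):                    # case 1: constant subject
            res = [subobj[i + 1] for i in range(0, len(subobj), 2)
                   if subobj[i] == x]
            process(y, res, d, tempdict)
        elif not is_var(y):                  # case 2: constant object
            res = [subobj[i] for i in range(0, len(subobj), 2)
                   if subobj[i + 1] == y]
            process(x, res, d, tempdict)
        else:                                # case 3: two variables;
            # the three subcases are delegated (Algorithm 3, lines 13-55)
            handle_two_vars(x, y, subobj, d, tempdict)
    return d
```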
2) process(t, res, d, tempdict): This method handles the cases where either the subject-position string or the object-position string is a variable, and is described by Algorithm 4. If the subject (or object) variable is not present in d (i.e., it has not been encountered before), then we simply store in d all the subjects (or objects) associated with the constant string through the predicate of the current match pattern. Otherwise, we store in p all the variables associated with the subject (or object) through predicates in the match patterns already processed, and the answers of these variables in tempdict; we then call FindVariables(p, tempdict, d, t). Finally, we find the answers common to the answers of the subject (or object) variable in the current match pattern and the answers of the same variable previously encountered, and accordingly modify the answers of the variables in p.

3) FindVariables(p, tempdict, d, t): In this step we find the variables whose answers will be affected by the current match pattern and will undergo modification. Using Algorithm 5, we find the variables associated, through the match patterns already processed in the current query, with the variables stored in p. Here t stores the subject (or object) variable of the current match pattern being processed, and curvar stores the variables in the current match pattern.

VII. EVALUATION AND RESULT

In this section, we evaluate the performance of our algorithms in terms of both dataset loading time and query execution time. For this purpose, we use the SP²Bench benchmark datasets [9] and the following 3 SPARQL queries. SP²Bench is set in the DBLP scenario and comprises both a data generator for creating arbitrarily large DBLP-like documents and a set of carefully designed benchmark queries; the datasets are provided in N3 format. RDFLib 3.2.0, a Python library, is used to implement our algorithms.

query3:
SELECT ?article
WHERE { ?article <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  <http://localhost/vocabulary/bench/Article> }

query4:
SELECT ?yr
WHERE { ?journal <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  <http://localhost/vocabulary/bench/Journal> .
  ?journal <http://purl.org/dc/terms/issued> ?yr }

query5:
SELECT ?inproc ?booktitle ?title
WHERE { ?inproc <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
  <http://localhost/vocabulary/bench/Inproceedings> .
  ?inproc <http://purl.org/dc/elements/1.1/creator> ?author .
  ?inproc <http://localhost/vocabulary/bench/booktitle> ?booktitle .
  ?inproc <http://purl.org/dc/elements/1.1/title> ?title }

Our proposed algorithms, as discussed above, are implemented in Python and compared with apache-jena-2.12.1, rdf3x-0.3.7 and AllegroGraph WebView 4.14.1; Jena is implemented in Java and rdf-3x in C++. All tests are run on a laptop with an Intel(R) Core(TM)2 Duo P8600 CPU @ 2.40 GHz, 4 GB RAM and a 465 GB 5400 RPM disk, running Fedora 20. We use NetBeans IDE 8.0.2 for running Jena, rdf-3x and our system.
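For reference, loading an SP²Bench N3 file with RDFLib and feeding it to the hypergraph builder can be sketched as follows; the file name is a placeholder, and build_hypergraph is the earlier sketch rather than the system's published code:

```python
from rdflib import Graph

# Parse an SP2Bench-generated N3 file (path is a placeholder).
g = Graph()
g.parse("sp2bench_output.n3", format="n3")

# An RDFLib graph iterates as (subject, predicate, object) tuples,
# which is exactly the input form the Algorithm 1 sketch expects.
triples = [(str(s), str(p), str(o)) for s, p, o in g]
data = build_hypergraph(triples)
print(len(g), "triples loaded,", len(data), "hyperedges (distinct predicates)")
```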
Algorithm 3 Loop over the match patterns
Input: data, predicatelist
Output: d: dictionary storing the answers for all variables in the query
1: d ← ∅
2: for c = 0 to predicatelist.length−1 do
3:   subobj ← fetch and store the subject-object pairs of the cth predicate in predicatelist from the data dictionary
4:   res ← ∅, tempdict ← ∅
5:   x ← subject-position string in the cth match pattern, y ← object-position string in the cth match pattern
6:   if the subject-position string does not start with '?' then
7:     res ← objects from subobj associated with the subject-position string
8:     process(y, res, d, tempdict)
9:   else if the object-position string does not start with '?' then
10:    res ← subjects from subobj associated with the object-position string
11:    process(x, res, d, tempdict)
12:  else
13:    if keys x and y are both in d then
14:      psub ← d[x], pobj ← d[y], d[x] ← ∅, d[y] ← ∅, p ← ∅
15:      p ← list of variables associated with x and y through predicates in match patterns already processed in the query
16:      FindVariables(p, tempdict, d, t)
17:      for all subobji where 0 ≤ i ≤ subobj.length−1 do
18:        if subobji is in psub and subobji+1 is in pobj then
19:          ind1 ← index of subobji in psub, ind2 ← index of subobji+1 in pobj
20:          if ind1 == ind2 then
21:            d[x] ← d[x] ∪ subobji
22:            d[y] ← d[y] ∪ subobji+1
23:            var ← ∅
24:            for all var in tempdict.keys do
25:              d[var] ← d[var] ∪ tempdict[var][ind1]
26:            end for
27:          end if
28:        end if
29:        i ← i + 2
30:      end for
31:    else if key x is in d or key y is in d then
32:      i ← 0, counter ← 1
33:      if y is in d then
34:        swap x and y
35:        i ← 1, counter ← −1
36:      end if
37:      psub ← d[x], d[x] ← ∅, d[y] ← ∅
38:      p ← list of variables associated with x through predicates in match patterns already processed in the current query
39:      FindVariables(p, tempdict, d, t)
40:      while i < subobj.length do
41:        if subobji is in psub then
42:          d[x] ← d[x] ∪ subobji
43:          d[y] ← d[y] ∪ subobji+counter
44:          ind ← index of subobji in psub
45:          var ← ∅
46:          for all var in tempdict.keys do
47:            d[var] ← d[var] ∪ tempdict[var][ind]
48:          end for
49:        end if
50:        i ← i + 2
51:      end while
52:    else
53:      d[x] ← all subjects from subobj
54:      d[y] ← all objects from subobj
55:    end if
56:  end if
57:  c ← c + 1
58: end for
59: return d
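The heart of the two-variable subcase (Algorithm 3, lines 13-30) is an index-aligned intersection: a subject-object pair survives only if the subject and object occur at the same position in the previously stored answer lists, i.e. they came from the same partial solution. A minimal sketch, with names of our own choosing:

```python
def join_two_seen_vars(subobj, d, x, y, tempdict):
    """Sketch of the subcase where both x and y were seen before:
    keep a pair (s, o) only when s and o sit at the same index in the
    previous answer lists; answers of the dependent variables held in
    tempdict are filtered by the same index."""
    psub, pobj = d[x], d[y]
    d[x], d[y] = [], []
    for v in tempdict:
        d[v] = []                       # rebuilt below, index-aligned
    for i in range(0, len(subobj), 2):
        s, o = subobj[i], subobj[i + 1]
        if s in psub and o in pobj and psub.index(s) == pobj.index(o):
            ind = psub.index(s)
            d[x].append(s)
            d[y].append(o)
            for v in tempdict:          # propagate aligned answers
                d[v].append(tempdict[v][ind])
    return d
```

Note that list.index returns the first occurrence, which is how we read the pseudocode's "index of"; answer lists with duplicates would need a more careful multi-index treatment, which we elide here.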
Algorithm 4 process(t, res, d, tempdict)
Input: t: variable in the current match pattern; res: answers of the variable in the current match pattern
Output: tempdict, d
1: if key t is not in d then
2:   d[t] ← res
3: else
4:   new ← d[t]
5:   d[t] ← ∅, p ← ∅
6:   p ← list of variables associated with t through predicates in match patterns already processed in the current query
7:   FindVariables(p, tempdict, d, t)
8:   for all resi where 0 ≤ i ≤ res.length−1 do
9:     if resi is present in new then
10:      d[t] ← d[t] ∪ resi
11:      ind ← index of resi in new
12:      var ← ∅
13:      for all var in tempdict.keys do
14:        d[var] ← d[var] ∪ tempdict[var][ind]
15:      end for
16:    end if
17:    i ← i + 1
18:  end for
19: end if

Algorithm 5 FindVariables(p, tempdict, d, t)
Input: p: list of variables associated with the subject or object in the match patterns already processed; d
Output: tempdict
1: curvar ← ∅
2: curvar ← curvar ∪ t
3: for all pi where 0 ≤ i ≤ p.length−1 do
4:   if pi is not in tempdict and pi is not in curvar then
5:     tempdict[pi] ← d[pi]
6:     d[pi] ← ∅
7:     if pi has any variable associated with it through predicates in match patterns already processed in the current query then
8:       p ← p ∪ variables associated with pi
9:     end if
10:  end if
11:  i ← i + 1
12: end for

A. Dataset loading and hypergraph creation time analysis

In this section we present statistics on the dataset loading time and the hypergraph creation time of our proposed system, using the line diagrams shown in figures 4(a) and 4(b) respectively.

Fig. 4: Hypergraph creation and loading time of datasets: (a) loading time of dataset; (b) hypergraph creation time from dataset.

Figure 4(a) shows the loading time for datasets of various sizes. Our system takes 0.23 sec. to load 691 triples, which increases to 0.44 sec. when we roughly double the load to 1285 triples. As we increase the number of triples, the load time grows gradually; for 32330 triples, the total load time is 11.12 sec. We use RDFLib to load the datasets into our program.

Figure 4(b) shows the hypergraph creation time for datasets of various sizes. As can be seen from figure 4(b), the hypergraph creation time also grows gradually with the number of triples: for 691 triples the system takes 0.20 sec., for 1285 triples 0.52 sec., and for 2006 triples 0.57 sec. We ran this experiment with up to 16040 triples, for which the hypergraph creation time is 12.21 sec.

B. Query execution time analysis

In this section we present the query execution times, using bar diagrams comparing our proposed system with the three external systems apache-jena-2.12.1, rdf3x-0.3.7 and AllegroGraph WebView 4.14.1.

Fig. 5: Query processing (datasets from 691 to 16040 triples): (a) 691, (b) 1285, (c) 2006, (d) 4021, (e) 8023 and (f) 16040 triples.

Figure 5(a) shows the comparison of query evaluation time between our system and the other three systems for the dataset containing 691 triples, running the three SPARQL queries listed at the beginning of this section. As can be seen from figure 5(a), our system takes less time to run these queries than the other 3 systems: 0.00048 s for query3, 0.00040 s for query4 and 0.00093 s for query5. Figure 5(b) shows the comparison for the dataset containing 1285 triples; here our system takes 0.00043 s for query3, 0.00058 s for query4 and 0.00099 s for query5. In a similar manner, we increase the dataset size and run the same queries. As figures 5(a)-5(d) show, our proposed system executes the three queries in less time than the other systems, with the following exceptions: for the dataset containing 8023 triples, rdf-3x takes less time for query4, and both rdf-3x and AllegroGraph take less time for query5 (figure 5(e)); for the dataset containing 16040 triples, rdf-3x takes less time for query4 and query5 (figure 5(f)).
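For completeness, the kind of timing harness behind such measurements can be sketched as follows; run-specific names (rearrange, loop_match_patterns) refer to the earlier sketches, and time.perf_counter is simply one reasonable clock choice, not necessarily the one we used:

```python
import time

def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call to fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical usage: rearrange the parsed query, then time the
# match-pattern loop that answers it against the hypergraph.
# _, predicate_list = rearrange(data, parsed_query)
# answers, elapsed = timed(loop_match_patterns, data, predicate_list, ...)
# print("query answered in %.5f s" % elapsed)
```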
VIII. CONCLUSION

In this paper we have provided a framework for hypergraph-based SPARQL query optimization over RDF data. We focus on storing RDF data as a hypergraph and then processing SPARQL queries directly on that hypergraph. Based on the performance analysis of query processing time against other existing systems, our system yields results in less time for most of the tested queries and datasets.
In the present work, we have considered only simple queries. As a continuation of the current work, the presented approach can be extended to more complex queries.

REFERENCES

[1] Thomas Neumann and Gerhard Weikum. The RDF-3X engine for scalable management of RDF data. VLDB J., 19(1):91–113, 2010.
[2] Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web. Scientific American, 2001.
[3] Andrey Gubichev and Thomas Neumann. Exploiting the query structure for efficient join ordering in SPARQL queries. In EDBT'14, pages 439–450, 2014.
[4] Olaf Hartig. SQUIN: A traversal based query execution system for the Web of Linked Data. 2013.
[5] Jiewen Huang, Daniel J. Abadi, and Kun Ren. Scalable SPARQL querying of large RDF graphs. PVLDB, 4(11):1123–1134, 2011.
[6] Dong-Hyuk Im, Nansu Zong, Eung-Hee Kim, Seokchan Yun, and Hong-Gee Kim. A hypergraph-based storage policy for RDF version management system. 2012.
[7] Kisung Kim, Bongki Moon, and Hyoung-Joo Kim. RG-index: An RDF graph index for efficient SPARQL query processing. 2014.
[8] Chang Liu, Haofen Wang, Yong Yu, and Linhao Xu. Towards efficient SPARQL query processing on RDF data. 2010.
[9] Michael Schmidt, Thomas Hornung, Georg Lausen, and Christoph Pinkel. SP²Bench: A SPARQL performance benchmark. In ICDE, 2009.
[10] Sangeeta Sen, Moumita Ghosh, Animesh Dutta, and Biswanath Dutta. Hypergraph based query optimization. January 2015.
[11] Thanh Tran and Günter Ladwig. Structure index for RDF data. 2010.
[12] Octavian Udrea, Andrea Pugliese, and V. S. Subrahmanian. GRIN: A graph based RDF index. In AAAI, pages 1465–1470. AAAI Press, 2007.
[13] W3C. AllegroGraph RDFStore: Web 3.0's database, September 2009.
[14] Kevin Wilkinson, Craig Sayers, Harumi Kuno, and Dave Reynolds. Efficient RDF storage and retrieval in Jena2. In Proc. First International Workshop on Semantic Web and Databases, 2003.
[15] Pingpeng Yuan, Pu Liu, Buwen Wu, Hai Jin, Wenya Zhang, and Ling Liu. TripleBit: a fast and compact system for large scale RDF data. PVLDB, 6(7):517–528, 2013.
[16] Pingpeng Yuan, Wenya Zhang, Hai Jin, and Buwen Wu. Statement hypergraph as partitioning model for RDF data processing. pages 138–145, 2012.
[17] Kai Zeng, Jiacheng Yang, Haixun Wang, Bin Shao, and Zhongyuan Wang. A distributed graph engine for web scale RDF data. In Proceedings of the 39th International Conference on Very Large Data Bases, PVLDB'13, pages 265–276. VLDB Endowment, 2013.
[18] Lei Zou, Jinghui Mo, Lei Chen, M. Tamer Özsu, and Dongyan Zhao. gStore: Answering SPARQL queries via subgraph matching. PVLDB, 4(8):482–493, 2011.