We present a new version of the data model stRDF and the query language stSPARQL for the representation and querying of geospatial data. The new versions of stRDF and stSPARQL use OGC standards to represent geometries where the original version of stSPARQL used linear constraints. In this sense stSPARQL is a subset of the recent standard GeoSPARQL proposed by OGC. We discuss the implementation of the system Strabon which is a storage and query evaluation module for stRDF/stSPARQL and the corresponding subset of GeoSPARQL. We study the performance of Strabon experimentally and show that it scales to very large data volumes.
1. Strabon
A Semantic Geospatial DBMS
Kostis Kyzirakos, Manos Karpathiotakis, and
Manolis Koubarakis
Dept. of Informatics and Telecommunications,
National and Kapodistrian University of Athens, Greece
2. Outline
Introduction
The data model stRDF
The query language stSPARQL
and a comparison to GeoSPARQL
The system Strabon for
stSPARQL and GeoSPARQL
Experimental Evaluation
Related Work & Conclusions
3. Main idea
How do we represent and query geospatial
information in the Semantic Web?
Develop appropriate vocabularies and
ontologies
Extend RDF to take into account the
geospatial dimension
Extend SPARQL to query the new kinds of
data
Use Open Geospatial Consortium (OGC) and
other geospatial industry standards
6. Introduction
The data model
The data model
stRDF
The query language
stSPARQL and a
stRDF
comparison to
GeoSPARQL
The system Strabon
for stSPARQL and
GeoSPARQL
Experimental
Evaluation
Related Work &
Conclusions
7. The Data Model stRDF
stRDF extends RDF with:
Spatial literals encoded by Boolean combinations of
linear constraints
New datatype for spatial literals (strdf:geometry)
Valid time of triples encoded by Boolean
combinations of temporal constraints
stRDF (most recent version)
Spatial literals encoded in Well-Known Text/GML
(OGC standards)
Valid time of triples ignored for the time being
9. The stRDF Data Model
strdf:geometry rdf:type rdfs:Datatype;
rdfs:subClassOf rdfs:Literal.
strdf:WKT rdf:type rdfs:Datatype;
rdfs:subClassOf strdf:geometry.
strdf:GML rdf:type rdfs:Datatype;
rdfs:subClassOf strdf:geometry.
10. Introduction
The data model
stRDF
The query
The query
language language
stSPARQL
stSPARQL and a
comparison to
GeoSPARQL
The system
Strabon for
stSPARQL and
and a comparison
GeoSPARQL
Experimental
to GeoSPARQL
Evaluation
Related Work &
Conclusions
11. stSPARQL: Geospatial SPARQL 1.1
We define a SPARQL extension function for each function
defined in the OpenGIS Simple Features Access standard
Basic functions
Get a property of a geometry (e.g., strdf:srid)
Get the desired representation of a geometry (e.g., strdf:AsText)
Test whether a certain condition holds (e.g., strdf:IsEmpty,
strdf:IsSimple)
Functions for testing topological spatial relationships
(e.g., strdf:equals, strdf:intersects)
OGC Simple Features Access, Egenhofer, RCC-8
Spatial analysis functions
Construct new geometric objects from existing geometric objects (e.g.,
strdf:buffer, strdf:intersection, strdf:convexHull)
Spatial metric functions (e.g., strdf:distance, strdf:area)
Spatial aggregate functions (e.g., strdf:union, strdf:extent)
12. stSPARQL: Geospatial SPARQL 1.1
Select clause
Construction of new geometries (e.g., strdf:buffer(?geo, 0.1))
Spatial aggregate functions (e.g., strdf:union(?geo))
Metric functions (e.g., strdf:area(?geo))
Filter clause
Functions for testing topological relationships between spatial terms (e.g.,
strdf:contains(?G1, strdf:union(?G2, ?G3)))
Numeric expressions involving spatial metric functions
(e.g., strdf:area(?G1) ≤ 2*strdf:area(?G2))
Boolean combinations
Having clause
Boolean expressions involving spatial aggregate functions and spatial metric
functions or functions testing for topological relationships between spatial
terms (e.g., strdf:area(strdf:union(?geo))>1)
Updates
13. stSPARQL: An example (1/2)
Find coniferous forests that have been affected by
fires
SELECT ?forest ?burntArea
WHERE {
?burntArea rdf:type noa:BurntArea;
noa:hasGeometry [
noa:hasSerialization ?baGeo].
?forest rdf:type noa:Region;
clc:hasLandCover noa:coniferousForest;
Spatial clc:hasGeometry [
Function clc:hasSerialization ?fGeom].
FILTER(strdf:intersects(?baGeom,?fGeom))
}
14. stSPARQL: An example (2/2)
Isolate the parts of the burnt areas that lie in
coniferous forests. Spatial
SELECT ?burntArea Aggregate
(strdf:intersection(?baGeom,
strdf:union(?fGeom))
AS ?burntForest)
WHERE {
?burntArea rdf:type noa:BurntArea;
noa:hasGeometry [
noa:hasSerialization ?baGeo].
?forest rdf:type noa:Region;
clc:hasLandCover noa:coniferousForest;
Spatial clc:hasGeometry [
Function clc:hasSerialization ?fGeom].
FILTER(strdf:intersects(?baGeom,?fGeom))
}
GROUP BY ?burntArea ?baGeom
15. The OGC Standard GeoSPARQL
Core
Topology Geometry Parameters
Vocabulary Extension
Extension
- relation family
- serialization
- version • Serialization
• WKT
• GML
Geometry Topology
Extension
- serialization
• Relation Family
- version
- relation family
• Simple
Features
• RCC-8
Query Rewrite RDFS Entailment
Extension Extension • Egenhofer
- serialization - serialization
- version - version
- relation family - relation family
16. Introduction
The data model
stRDF The system
The query
language
stSPARQL and a
Strabon for
comparison to
GeoSPARQL stSPARQL and
GeoSPARQL
The system
Strabon for
stSPARQL and
GeoSPARQL
Experimental
Evaluation
strabon.di.uoa.gr
Related Work &
Conclusions
18. Storage Scheme uri_values
type_2
Triples ID VALUE
Dictionary
SUBJEC OBJECT
SUBJECT PREDICATE OBJECT
PREDICATE 1
OBJECT VALUE
ID noa:ba_15
T
noa:ba_15 3 rdf:type
noa:ba_15
1 rdf:type 12 noa:ba_15
rdf:type
noa:BurntArea .
1
noa:ba_15 2 noa:hasGeometry
3
noa:ba_15 6 noa:hasGeometry
5 noa:ba_15g. rdf:type
noa:ba_15g noa:BurntArea
23
1
noa:ba_15g4 rdf:type 5
noa:ba_15g rdf:type
hasgeom_4
noa:Geometry .
34 noa:BurntArea
noa:Geometrynoa:hasGeometry
5
noa:ba_15g2 noa:hasSerialization
6 "MULTIPOLYGON(...)"^^
noa:ba_15g OBJECT
SUBJECT noa:hasserialization 45 noa:hasGeometry
noa:ba_15g
"MULTIPOLYGON(...)"^^
5 7 8 strdf:WKT .
strdf:WKT noa:ba_15g
56 noa:Geometry
1 5
67 noa:Geometry
noa:hasSerialization
hasserial_7
7label_values
noa:hasSerialization
SUBJECT OBJECT
8 ID "MULTIPOLYGON(...)"^^
VALUE
5 8 geo_values strdf:WKT
8 "MULTIPOLYGON(...)"^^
ID VALUE strdf:WKT
8 datatype_values
ID VALUE
8 strdf:WKT
19. Query Processing
Strabon • Parser generates
stSPARQL/ abstract syntax tree
Query Engine
GeoSPARQL
query Parser
• Abstract syntax tree
Optimizer mapped to internal
Evaluator algebra of Sesame
Transaction
Manager • Standard optimizations
performed
• Evaluator produces
corresponding SQL
GeneralDB query
• DBMS evaluates the
SQL query
PostGIS
• Post-processing
20. Query Processing (cont’d)
• Deviate from the evaluation strategy of Sesame
for SPARQL extension functions
• Push the evaluation of extension functions to
underlying DBMS
• Spatial predicates evaluated by PostGIS
• Spatial joins now affect query plan
• Avoid Cartesian products
• Results may be returned in well-known industry
formats
• KML/KMZ
• GeoJSON
• GML
21. Introduction
The data model
Experimental
stRDF
The query
language
Evaluation
stSPARQL and a
comparison to
GeoSPARQL
The system
Strabon for
stSPARQL and
GeoSPARQL
Experimental
Evaluation
Related Work &
Conclusions
22. Experimental Evaluation
• Goal: Evaluate the performance of
Strabon vs other systems
• Real workload based on geospatial
linked datasets
150 million triples
• Synthetic workload
Half a billion triples
23. Strabon vs other systems
• Strabon over PostgreSQL
(Strabon-PG)
• Strabon over System X
(Strabon-X)
• Implementation over RDF-3X
(Brodt et. al)
• Parliament (BBN Technologies)
• Naive Implementation over Sesame
25. Real world workload: Queries
Commonly Used Spatial Selection Spatial Join
Query 1 X
Query 2 X
Query 3 X
Query 4 X
Query 5 X
Query 6 X X
Query 7 X X
Query 8 X
Queries with Spatial Joins Points Lines Polygons
Query 5 X X
X X
Query 6 X X X
Query 7 X
Query 8 X X X
27. Synthetic Workload
• Workload based on a synthetic dataset
Dataset based on OpenStreetMaps
10 million triples (2 GB) up to half a
billion triples (50GB)
Implemented custom bulk loader
Triples with spatial literals: 1 up to 46
million triples
• Response time of queries with various
thematic and spatial selectivities
30. Real-world Workload:
100 million triples – warm caches
Response time (sec)
Response time (sec)
number of Nodes in query region number of Nodes in query region
Tag 1 Tag 1024
31. Real-world Workload:
100 million triples – cold caches
number of Nodes in query region number of Nodes in query region
Tag 1 Tag 1024
34. Findings
• Strabon over PostgreSQL outperforms
other systems in case of warm caches
• Results in case of cold caches mixed
• PostreSQL optimizer needs to take into
account spatial selectivity
• PostGIS 2.0 moves towards it
35. System Language Index Geometries CRS support Geospatial
Function
Support
Strabon stSPARQL/ R-tree-over- WKT / GML Yes • OGC-SFA
GeoSPARQL* GiST support • Egenhofer
• RCC-8
Parliament GeoSPARQL* R-Tree WKT / GML Yes •OGC-SFA
support •Egenhofer
•RCC-8
Oracle 12c GeoSPARQL R-Tree, WKT Yes •OGC-SFA
Quadtree
Brodt et al. SPARQL R-Tree WKT support No OGC-SFA
(RDF-3X)
Perry SPARQL-ST R-Tree GeoRSS GML Yes RCC-8
AllegroGraph Extended Distribution 2D point Partial •Buffer
SPARQL sweeping geometries •Bounding Box
technique •Distance
OWLIM Extended Custom 2D point No •Point-in-polygon
•Buffer
SPARQL geometries
•Distance
Virtuoso SPARQL R-Tree 2D point Yes SQL/MM (subset)
geometries
uSeekM SPARQL R-tree-over WKT support No OGC-SFA
GiST
36. Introduction
The data model
stRDF
The query
Related Work &
language
stSPARQL and a
comparison to
Conclusions
GeoSPARQL
The system
Strabon for
stSPARQL and
GeoSPARQL
Experimental
Evaluation
Related Work &
Conclusions
37. Future Work
Use even larger datasets
Test with 1 billion triples successful
Implement the temporal part of stSPARQL
Develop a benchmark for geospatial RDF
stores
Consider more systems
GeoSPARQL implementation of Oracle
Virtuoso
stSPARQL query processing in MonetDB
Go distributed!
Federated queries
38. Thanks! Any Questions?
Strabon
Manolis Koubarakis, Kostis Kyzirakos, Manos Karpathiotakis,
Charalampos Nikolaou, Giorgos Garbis, Konstantina Bereta,
Kallirroi Dogani, Stella Giannakopoulou and Panayiotis Smeros.
Web site: http://strabon.di.uoa.gr
Mercurial repository: http://hg.strabon.di.uoa.gr
Trac: http://bug.strabon.di.uoa.gr
Mailing list: http://cgi.di.uoa.gr/~mailman/listinfo/strabon-users
Real Time Fire Monitoring Service,
National Obervatory of Athens
http://papos.space.noa.gr/fend_static
Greek Linked Open Data
http://www.linkedopendata.gr
TELEIOS EU Project
http://www.earthobservatory.eu
SemsorGrid4Env EU Project
http://www.semsorgrid4env.eu
Hinweis der Redaktion
Hello, I am KostisKyzirakos, and I will be presenting you the work done with my colleagues Manos Karpathiotakis and ManolisKoubarakis, on the development of a geospatial RDF store nick-named Strabon.
Our extension of RDF for the representation of geospatial and temporal information is stRDF.stRDFextens RDF with a new kind of literals, called spatial literals, which are expresssionsof Boolean combinations of linear constraints.stRDF defines also a new datatype for spatial literals [which is a supertype of the respective datatypes for WKT and GML (standardized formats of the Open Geospatial Consortium for the encoding of geometries)]stRDF employs also a forth component to a triple to represent the valid time of triples.(???WHAT EXPRESSIONS CAN BE WRITTEN???)Constraints: theoretical representationIn the most recent version of stRDF, and to be compliant with geospatial industry standards,spatial literals are encoded in the WKT/GML formats, which are widely adopted standards of the Open Geospatial Consortium.The reason for moving to OGC standards was being realistic. No one had any spatial data expressedwith linear constraints! But all the nice work of the theoretical model remained valid. We only had to change the encoding used for spatial literals.(We are currently working on this feature, so this presentation does not consider time anymore.) We will describe our model through an example.
Filter expressions are...stSPARQL supports also the construction of new spatial terms and spatial aggregate functions in SELECT clauses.(spatial aggregate functions are implemented in Strabon and we do not use the implementation of PostGIS to overcome issues with transformatios between cordinate reference systems, something that PostGIS does not take into account)OpenGIS Simple Feature Access specification defines a standard SQL schema that supports storage, retrieval, query and update of collections of spatial features using SQL
Strabon extends the well-known RDF store Sesame, allowing it to manage both thematic and spatial data expressed in stRDF. In the case when PostGIS is used as the relational backend of Strabon,Strabon uses an R-tree-over-GiST (Generalised Search Tree) offered by PostGIS as a spatial index.Strabon is implemented by creating a layer that is included in Sesame's software stackin a transparent way so that it does not affect its range of functionalities, whilebenefitting from new versions of Sesame. Strabon 3.0 uses Sesame 2.6.3 andcomprises three modules: the storage manager, the query engine and the underlying databaseWe have extended Sesame as follows: Modified the storage scheme, optimizer, and evaluator to take into account geometries and not treat them as plain literals Added a new RDBMS abstract layer to Storage Manager so as any spatially-enabled database can be used with Strabon.Currently, we have been using PostGIS and MonetDB.
“Decomposition Storage Model”
The query engine works as follows. First, the parser generates an abstractsyntax tree. Then, this tree is mapped to the internal algebra of Sesame, re-sulting in a query tree. The query tree is then processed by the optimizer thatprogressively modies it, implementing the various optimization techniques ofStrabon. Afterwards, the query tree is passed to the evaluator to produce thecorresponding SQL query that will be evaluated by PostgreSQL. After the SQLquery has been posed, the evaluator receives the results and performs any post-processing actions needed. The nal step involves formatting the results. Besidesthe standard formats oered by RDF stores, Strabonoers KML and GeoJSONencodings, which are widely used in the mapping industry.We now discuss how the optimizer works. First, it applies all the Sesameoptimizations that deal with the standard SPARQL part of an stSPARQL query(e.g., it pushes down FILTERs to minimize intermediate results etc.). Then, twooptimizations specic to stSPARQL are applied. The rst optimization has todo with the extension functions of stSPARQL. By default, Sesame evaluatesthese after all bindings for the variables present in the query are retrieved. InStrabon, we modify the behaviour of the Sesame optimizer to incorporate allextension functions present in the SELECT and FILTER clause of an stSPARQLquery into the query tree prior to its transformation to SQL. In this way, theseextension functions will be evaluated using PostGIS spatial functions instead ofrelying on external libraries that would add an unneeded post-processing cost.The second optimization makes the underlying DBMS aware of the existence ofspatial joins in stSPARQL queries so that they would be evaluated eciently.Let us consider the query of Example 2. The rst three triple patterns of thequery are related to the rest via the topological function strdf:intersects.The query tree that is produced by the Sesame optimizer, fails to deal withthis spatial join appropriately. It will generate a Cartesian product for the thirdand the fth triple pattern of the query, and the evaluation of the spatial predi-catestrdf:intersects will be wrongly postponed after the calculation of thisCartesian product. Using a query graph as an intermediate representation of thequery, we identify such spatial joins and modify the query tree with appropriatenodes so that Cartesian products are avoided. For the query of Example 2, themodied query tree will result in a SQL query that contains a -join where isthe spatial function ST Intersects.At this stage, the query tree is a plain translation ofthe initial query to the algebra of Sesame. The query tree is then processed bythe optimizer that progressively restructures and enriches it, implementing thevarious optimization techniques of Strabon. Afterwards, the query tree is passedto the evaluator to produce the corresponding SQL query that will be evaluatedby PostgreSQL. After the SQL query has been posed, the evaluator receivesthe results and performs any post-processing actions needed.
First, it applies all the Sesameoptimizations that deal with the standard SPARQL part of an stSPARQL query(e.g., it pushes down FILTERs to minimize intermediate results etc.). Then, twooptimizations specic to stSPARQL are applied. The rst optimization has todo with the extension functions of stSPARQL. By default, Sesame evaluatesthese after all bindings for the variables present in the query are retrieved. InStrabon, we modify the behaviour of the Sesame optimizer to incorporate allextension functions present in the SELECT and FILTER clause of an stSPARQLquery into the query tree prior to its transformation to SQL. In this way, theseextension functions will be evaluated using PostGIS spatial functions instead ofrelying on external libraries that would add an unneeded post-processing cost.The second optimization makes the underlying DBMS aware of the existence ofspatial joins in stSPARQL queries so that they would be evaluated eciently.Let us consider the query of Example 2. The rst three triple patterns of thequery are related to the rest via the topological function strdf:intersects.The query tree that is produced by the Sesame optimizer, fails to deal withthis spatial join appropriately. It will generate a Cartesian product for the thirdand the fth triple pattern of the query, and the evaluation of the spatial predi-catestrdf:intersects will be wrongly postponed after the calculation of thisCartesian product. Using a query graph as an intermediate representation of thequery, we identify such spatial joins and modify the query tree with appropriatenodes so that Cartesian products are avoided. For the query of Example 2, themodied query tree will result in a SQL query that contains a -join where isthe spatial function ST Intersects.At this stage, the query tree is a plain translation ofthe initial query to the algebra of Sesame. The query tree is then processed bythe optimizer that progressively restructures and enriches it, implementing thevarious optimization techniques of Strabon. Afterwards, the query tree is passedto the evaluator to produce the corresponding SQL query that will be evaluatedby PostgreSQL. After the SQL query has been posed, the evaluator receivesthe results and performs any post-processing actions needed.Query processing in Strabon is performed by the query engine which consistsof a parser, an optimizer, an evaluator and a transaction manager. The parserand the transaction manager are identical to the ones in Sesame. The optimizerand the evaluator have been implemented by modifying the corresponding com-ponents of Sesame as we describe below.The query engine works as follows. First, the parser generates an abstractsyntax tree. Then, this tree is mapped to the internal algebra of Sesame, re-sulting in a query tree. The query tree is then processed by the optimizer thatprogressively modies it, implementing the various optimization techniques ofStrabon. Afterwards, the query tree is passed to the evaluator to produce thecorresponding SQL query that will be evaluated by PostgreSQL. After the SQLquery has been posed, the evaluator receives the results and performs any post-processing actions needed. The nal step involves formatting the results. Besidesthe standard formats oered by RDF stores, Strabonoers KML and GeoJSONencodings, which are widely used in the mapping industry.We now discuss how the optimizer works. First, it applies all the Sesameoptimizations that deal with the standard SPARQL part of an stSPARQL query(e.g., it pushes down FILTERs to minimize intermediate results etc.). Then, twooptimizations specic to stSPARQL are applied. The rst optimization has todo with the extension functions of stSPARQL. By default, Sesame evaluatesthese after all bindings for the variables present in the query are retrieved. InStrabon, we modify the behaviour of the Sesame optimizer to incorporate allextension functions present in the SELECT and FILTER clause of an stSPARQLquery into the query tree prior to its transformation to SQL. In this way, theseextension functions will be evaluated using PostGIS spatial functions instead ofrelying on external libraries that would add an unneeded post-processing cost.The second optimization makes the underlying DBMS aware of the existence ofspatial joins in stSPARQL queries so that they would be evaluated eciently.Let us consider the query of Example 2. The rst three triple patterns of thequery are related to the rest via the topological function strdf:intersects.The query tree that is produced by the Sesame optimizer, fails to deal withthis spatial join appropriately. It will generate a Cartesian product for the thirdand the fth triple pattern of the query, and the evaluation of the spatial predi-catestrdf:intersects will be wrongly postponed after the calculation of thisCartesian product. Using a query graph as an intermediate representation of thequery, we identify such spatial joins and modify the query tree with appropriatenodes so that Cartesian products are avoided. For the query of Example 2, themodied query tree will result in a SQL query that contains a -join where isthe spatial function ST Intersects.At this stage, the query tree is a plain translation ofthe initial query to the algebra of Sesame. The query tree is then processed bythe optimizer that progressively restructures and enriches it, implementing thevarious optimization techniques of Strabon. Afterwards, the query tree is passedto the evaluator to produce the corresponding SQL query that will be evaluatedby PostgreSQL. After the SQL query has been posed, the evaluator receivesthe results and performs any post-processing actions needed.
We designed a simple benchmark
This section presents a detailed evaluation of the system Strabon using two different workloads: a workload based on linked data and a synthetic workload. Forboth workloads, we compare the response time of Strabon on top of PostgreSQL (called Strabon PG from now on) with our closest competitor implementationin [3], the naive, baseline implementation described in Section 4, and the RDF store Parliament. To identify potential benets from using a dierent DBMS asa relational backend for Strabon, we also executed the SQL queries produced by Strabon in a proprietary spatially-enabled DBMS (which we will call System X,and Strabon X the resulting combination). [3] presents an implementation which enhances the RDF-3X triple store withthe ability to perform spatial selections using an R-tree index. The implementation of [3] is not a complete system like Strabon and does not support afull-edged query language such as stSPARQL. In addition, the only way to load data in the system is the use of a generator which has been especially designed for the experiments of [3] thus it cannot be used to load other datasets in the implementation. Moreover, the geospatial indexing support of this implementation is limited to spatial selections. Spatial selections are pushed down inthe query tree (i.e., they are evaluated before other operators). Parliament is an RDF storage engine recently enhanced with GeoSPARQL processing capabilities [2], which is coupled with Jena (an RDF query processor) to provide a complete RDF system.
We have used a number of linked geospatial datasets. I would like to stress the heterogeneity of the dataset.We have included datasets from many application domains (e.g., sensors, land use, administrative areas, gazeteer).They also vary significantly in terms of size and complexity of the spatial geometries. DBpediaGeoNamesLDG (LinkedGeoData, OpenStreetMaps)Pachube (“patch-bay”): Pachube ("patch-bay") connects people to devices, applications, and the Internet of Things. As a web-based service built to manage the world's real-time data, Pachube gives people the power to share, collaborate, and make use of information generated from the world around them.SwissEx (Swiss Experiment): http://www.swiss-experiment.chA platform to enable real-time environmental experiments through wireless sensor networks and a common, modern, generic cyber-infrastructure. This infrastructure will be used to enable interdisciplinary environmental research, allowing scientists to work efficiently and collaboratively to find the key mechanisms in the triggering of natural hazards and to efficiently distribute the information to increase public awareness.CLC (Corine Land Cover)GADM (Global Administrative Areas): http://www.gadm.org/ GADM is a spatial database of the location of the world's administrative areas (or adminstrative boundaries) for use in GIS and similar software. Administrative areas in this database are countries and lower level subdivisions such as provinces, departments, bibhag, bundeslander, daerahistimewa, fivondronana, krong, landsvæðun, opština, sous-préfectures, counties, and thana. GADM describes where these administrative areas are (the "spatial features"), and for each area it provides some attributes, such as the name and variant names.
Queries 1,2: no spatial dimension
Cold caches vs work cachesGreen is the winning colorRed is the loosing colorThe various versions of Strabon are the winning onesNo cases where Parliament is the winning systemThere are many interesting things we found in this experiment. For example,Parliament: reference implementation of GeoSPARQLIn queries Q1 and Q2, the baseline implementation outperforms all other systems as these queries do not include any spatial function. As mentioned earlier, the native store of Sesame outperforms Sesame implementations on top of a DBMS. All other systems producecomparable result times for these non-spatial queries. In all other queries the DBMS-based implementations outperform Parliament and the naive implemen-tation. Strabon PG outperforms the naive implementation since it incorporates the two stSPARQL-specic optimizations discussed in Section 4. Thus, spatialoperations are evaluated by PostGIS using a spatial index, instead of being evaluated after all the results have been retrieved. The naive implementation andParliament fail to execute Q3 as this query involves a spatial join that is very expensive for systems using a naive approach. The only exception in the behaviorof the naive implementation is Q4 in the case of warm caches, where the non-spatial part of the query produces very few results and the file blocks needed forquery evaluation are cached in main memory. In this case, the non-spatial part of the query is executed rapidly while the evaluation of the spatial function overthe results thus far is not signicant. All queries except Q8 are executed significantly faster when run using Strabon on warm caches. Q8 involves many triplepatterns and spatial functions which result in the production of a large number of intermediate results. As these do not fit in the system's cache, the responsetime is unaffected by the cache contents. System X decides to ignore the spatial index in queries Q3, Q6-Q8 and evaluate any spatial predicate exhaustively overthe results of a thematic join. In queries Q6-Q8, it also uses Cartesian products in the query plan as it considers their evaluation more profitable. These decisionsare correct in the case of Q8 where it outperforms all other systems significantly, but very costly in the cases of Q3, Q6 and Q7.
Για το synthetic dataset, έχουμε τρέξει πειράματα με 1 billion με τον Strabon. Το naive δεν μπορούσε να αποτιμήσει queries σε τόσο μεγάλο dataset οπότε παρουσιάζουμε γράφους μέχρι 500 millions.
The data of the synthetic dataset follows a general version of the schema of Open Street Map. Each node has a spatial extent (the location of the node) and is placed uniformly on a grid with step 0.001 (min 0, max 10.24).In addition, each node is assigned a number of tags each of which consist of a key-value pair of strings.Every node is tagged with key 1, every second node with key 2, every fourth node with key 4, and so on up to key 1024.By modifying the step of the grid, we produce datasets of arbitrary size.
In this query THIS_VALUE (value “1”) allows us to select a known fraction of the generated nodes. Since every node is tagged with key 1, this triple pattern will produce a binding for every node of the dataset.By altering this value to “2”, we select half the nodes of the dataset. By further changing it to “4”, we select half thenodes we previously did, and so on.The extent of the POLYGON in the filter expression defines how selective the spatial predicate is. We modify the size of the polygon in order to select 10, 100, 1000, etc. spatial nodes.
After consulting the query logs of PostgreSQL, we no-ticed that the vast majority of the SQL queries that were posed and derivedfrom the query template adhered to one of the query plans of Figures 4,5. Ac-cording to plan A, query evaluation starts by evaluating the thematic selectionover table key using the appropriate index. The results are retrieved and joinedwith the hasTag and hasGeography predicate tables using appropriate indices.Finally, the spatial selection is evaluated by scanning the spatial index and theresults are joined with the intermediate results of the previous operations. Onthe contrary, plan B starts with the evaluation of the spatial selection, leavingthe application of the thematic selection for the end.Unfortunately, in the current version of PostGIS spatial selectivities are notcomputed properly. The functions that estimate the selectivity of a spatial se-lection/join return a constant number regardless of the actual selectivity of theoperator. Thus, only the thematic selectivity aects the choice of a query plan.To state it in terms of our query template, altering the value PARAM A between1 and 1024 was the only factor inuencing the selection of a query plan.
In this case, as previously discussed, PostgreSQL does not select a good plan andthe response times are even higher than the times where PARAM A is equal to 1(Figure 6a), despite producing signicantly more intermediate results.
After consulting the query logs of PostgreSQL, we no-ticed that the vast majority of the SQL queries that were posed and derivedfrom the query template adhered to one of the query plans of Figures 4,5. Ac-cording to plan A, query evaluation starts by evaluating the thematic selectionover table key using the appropriate index. The results are retrieved and joinedwith the hasTag and hasGeography predicate tables using appropriate indices.Finally, the spatial selection is evaluated by scanning the spatial index and theresults are joined with the intermediate results of the previous operations. Onthe contrary, plan B starts with the evaluation of the spatial selection, leavingthe application of the thematic selection for the end.Unfortunately, in the current version of PostGIS spatial selectivities are notcomputed properly. The functions that estimate the selectivity of a spatial se-lection/join return a constant number regardless of the actual selectivity of theoperator. Thus, only the thematic selectivity aects the choice of a query plan.To state it in terms of our query template, altering the value PARAM A between1 and 1024 was the only factor inuencing the selection of a query plan.
Two different query plans.1. Evaluation of thematic selection first.2. Evaluation of spatial selection first.In Postgres, the functions that estimate the selectivity of a spatial selection or a spatial join return a constant number. In other words, the spatial selectivity of a query did not affect dynamically the decision whether the spatial selection should be the first operation to execute. Therefore, only the thematic selectivity affected the choice for a query plan. To state it in terms of our query template, altering the value for the key between 1 and 1024 was the only factor influencing the query plan.
So, what I would like you to keep for this slide is that Strabon not only has a good performance but it is also one of the richest systems available at the moment.