Validating statistical index data
represented in RDF using
SPARQL queries
WESO Research Group
University of Oviedo, Spain
Jose Emilio Labra Gayo, Jose María Álvarez Rodríguez
Motivation
The WebIndex Project
Measure impact of the Web in different countries
First publication: 2012, 2 web sites:
Visualization
http://thewebindex.org
Data portal
http://data.webfoundation.org
WebIndex Workflow
Diagram: Raw Data (Excel), Computed data (Excel), External sources, All data (RDF Datastore), Visualizations
Technical details
Index made from
61 countries (100 planned for 2013)
85 indicators:
51 Primary (questionnaires)
34 Secondary (external sources)
WebIndex RDF data
Modeled on top of RDF Data Cube
More than 1 million triples
Linked data: DBpedia, Organizations, etc.
Records data computation
WebIndex computation process (1)
Simplified with one indicator, 3 years and 4 countries
Steps shown: Raw Data, Imputed Data, Filtered Data (Indicator A), Normalized Data (z-scores)
Raw Data (contains missing values)
Country 2009 2010 2011
Spain 4 5 3
Finland 4 6
Armenia 1
Chile 6 8
Imputed Data
Country 2009 2010 2011
Spain 4 5 3
Finland 4 5 6
Armenia 1 1 1
Chile 6 8 10.6
Normalized Data (z-scores)
Country 2009 2010 2011
Spain -0.57 -0.57 -0.92
Finland -0.57 -0.57 -0.14
Chile 1.15 1.15 1.06
More details can be found here: http://thewebindex.org/about/methodology/computation/
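The normalization step can be reproduced from the imputed values. The following is a minimal sketch, not the project's implementation. It assumes that filtering drops Armenia (which is absent from the z-score table) and that the sample standard deviation (n - 1 denominator) is used, a choice that agrees with the slide's figures to within rounding. The function name z_scores is hypothetical.

```python
from statistics import mean, stdev

# Imputed values per year, after dropping Armenia (assumed filtering step).
filtered = {
    2009: {"Spain": 4, "Finland": 4, "Chile": 6},
    2010: {"Spain": 5, "Finland": 5, "Chile": 8},
    2011: {"Spain": 3, "Finland": 6, "Chile": 10.6},
}


def z_scores(values):
    """Return {country: z-score} for one year, using the sample standard deviation."""
    xs = list(values.values())
    mu = mean(xs)
    sigma = stdev(xs)  # n - 1 denominator (assumption, see lead-in)
    return {country: (v - mu) / sigma for country, v in values.items()}


normalized = {year: z_scores(values) for year, values in filtered.items()}

for year, scores in normalized.items():
    print(year, {country: round(z, 2) for country, z in scores.items()})
```

The square root inside stdev is the operation that standard SPARQL lacks as a built-in, which is why this step is harder to validate with portable queries.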
WebIndex computation process (2)
Simplified with one indicator, 3 years and 4 countries
Normalized Data (z-scores)
Country 2009 2010 2011
Spain -0.57 -0.57 -0.92
Finland -0.57 -0.57 -0.14
Chile 1.15 1.15 1.06
Adjusted data
Country A B C D ...
Spain 8 7 9.1 7.1 ...
Finland 7 8 7.1 8 ...
Chile 8 9 7.6 6 ...
Group indicators
Country Readiness Impact Web Composite
Spain 5.7 3.5 5.1 4.5
Finland 5.5 3.9 7.1 4.9
Chile 6.7 4.5 7.6 5.1
Rankings
Country Readiness Impact Web Composite
Spain 2 3 3 3
Finland 3 2 2 2
Chile 1 1 1 1
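The last step, from group indicators to rankings, can be sketched as follows. This is an illustration under assumptions rather than the WebIndex code: rank 1 goes to the highest score, ties are not handled, and the names scores and rank_column are hypothetical.

```python
# Group indicator scores as shown in the table above.
scores = {
    "Spain":   {"Readiness": 5.7, "Impact": 3.5, "Web": 5.1, "Composite": 4.5},
    "Finland": {"Readiness": 5.5, "Impact": 3.9, "Web": 7.1, "Composite": 4.9},
    "Chile":   {"Readiness": 6.7, "Impact": 4.5, "Web": 7.6, "Composite": 5.1},
}


def rank_column(scores, column):
    """Rank countries by one column, highest score first (rank 1 = best)."""
    ordered = sorted(scores, key=lambda country: scores[country][column], reverse=True)
    return {country: position for position, country in enumerate(ordered, start=1)}


columns = ["Readiness", "Impact", "Web", "Composite"]
rankings = {column: rank_column(scores, column) for column in columns}

for column in columns:
    print(column, rankings[column])
```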
Example of data
representation in RDF
Example:
obs:obsM23 a qb:Observation ;
cex:computation
[ a cex:Z-Score ;
cex:observation obs:obsA23 ;
cex:slice slice:sliceA09 ;
] ;
cex:value -0.57 ;
cex:md5-checksum "2917835203..." ;
cex:indicator indicator:A ;
cex:concept country:ESP ;
qb:dataSet dataset:A-Normalized ;
# ... other declarations omitted for brevity
It declares that the value of this observation was obtained as the
z-score of obs:obsA23 over slice:sliceA09.
Each observation follows the RDF Data Cube vocabulary, extended with
metadata about how it was obtained.
Computex
Vocabulary for statistical computations
Extends RDF Data Cube
Ontology at: http://purl.org/weso/computex
Some terms:
cex:Concept
cex:Indicator
cex:Computation
cex:WeightSchema
qb:Observation
...
Validation process
Last year (2012)
Template based validation
MD5 checksum of each observation and some fields
This year (2013)
SPARQL based validation
3 levels of validation
RDF Data Cube
Shapes of data
Computation process
Ultimate goal: automatically compute the index
Validation approach
SPARQL CONSTRUCT queries instead of ASK
IF no error THEN empty model
ELSE RDF graph with error information
CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam
      [ cex:name "obs" ; cex:value ?obs ] ,
      [ cex:name "value1" ; cex:value ?value1 ] ,
      [ cex:name "value2" ; cex:value ?value2 ] ;
    cex:msg "Observation has two different values"
  ] .
} WHERE {
  ?obs a qb:Observation .
  ?obs cex:value ?value1 .
  ?obs cex:value ?value2 .
  FILTER ( ?value1 != ?value2 )
}
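The "empty result means valid, otherwise return error descriptions" idea can be illustrated without a SPARQL engine. The sketch below is an assumption-laden toy: the graph is a plain list of (subject, predicate, object) tuples, the identifiers are invented, and check_single_value is a hypothetical name. Unlike the query, which yields one solution per ordered pair of differing values, it reports each offending observation once.

```python
from collections import defaultdict

# Toy graph as (subject, predicate, object) triples; identifiers are illustrative.
triples = [
    ("obs:o1", "rdf:type", "qb:Observation"),
    ("obs:o1", "cex:value", 4.0),
    ("obs:o2", "rdf:type", "qb:Observation"),
    ("obs:o2", "cex:value", 5.0),
    ("obs:o2", "cex:value", 6.0),
]


def check_single_value(triples):
    """Return one error record per observation that carries more than one distinct value."""
    observations = {s for s, p, o in triples if p == "rdf:type" and o == "qb:Observation"}
    values = defaultdict(set)
    for s, p, o in triples:
        if p == "cex:value" and s in observations:
            values[s].add(o)
    return [
        {"msg": "Observation has two different values", "obs": obs, "values": sorted(vals)}
        for obs, vals in sorted(values.items())
        if len(vals) > 1
    ]


errors = check_single_value(triples)
# An empty list plays the role of the empty model: no errors found.
print("valid" if not errors else errors)
```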
SPARQL queries
RDF Data Cube
We converted RDF Data Cube integrity
constraints from ASK to CONSTRUCT queries
Original ASK query:
ASK WHERE {
  ?dim a qb:DimensionProperty .
  FILTER NOT EXISTS { ?dim rdfs:range [] }
}
Converted CONSTRUCT query:
CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam [ cex:name "dim" ; cex:value ?dim ] ;
    cex:msg "Every Dimension must have a declared range"
  ] .
} WHERE {
  ?dim a qb:DimensionProperty .
  FILTER NOT EXISTS { ?dim rdfs:range [] }
}
SPARQL expressivity
SPARQL can express complex validation patterns.
Example: Mean
CONSTRUCT {
  [ a cex:Error ;
    cex:errorParam # ...omitted
    cex:msg "Mean value does not match" ] .
} WHERE {
  ?obs a qb:Observation ;
       cex:computation ?comp ;
       cex:value ?val .
  ?comp a cex:Mean .
  { SELECT (AVG(?value) as ?mean) ?comp WHERE {
      ?comp cex:observation ?obs1 .
      ?obs1 cex:value ?value .
    } GROUP BY ?comp }
  FILTER( abs(?mean - ?val) > 0.0001 )
}
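The logic of the mean check, recomputing the aggregate and comparing it with the declared value under a tolerance, can be sketched in a few lines. This is an illustration only: the dictionaries stand in for the RDF graph, the identifiers are invented, and check_means is a hypothetical name. The 0.0001 tolerance is taken from the query.

```python
from statistics import mean

# Declared observation values (illustrative data).
values = {"obs:a": 4.0, "obs:b": 5.0, "obs:c": 6.0, "obs:mean": 5.0, "obs:bad": 5.5}

# For each observation declared as a mean, the observations it was computed from.
mean_of = {
    "obs:mean": ["obs:a", "obs:b", "obs:c"],
    "obs:bad": ["obs:a", "obs:b", "obs:c"],
}


def check_means(values, mean_of, tolerance=0.0001):
    """Return an error record for each observation whose value differs from the recomputed mean."""
    errors = []
    for obs, sources in mean_of.items():
        expected = mean([values[s] for s in sources])
        if abs(expected - values[obs]) > tolerance:
            errors.append({
                "msg": "Mean value does not match",
                "obs": obs,
                "expected": expected,
                "found": values[obs],
            })
    return errors


mean_errors = check_means(values, mean_of)
print(mean_errors)
```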
Limitations of
SPARQL expressivity
Some built-in functions are not standardized
Example: z-score employs standard deviation.
It requires built-in function: sqrt
Available in some SPARQL implementations
Ranking of values was not obvious (*)
2 solutions:
Using GROUP_CONCAT
Check a value against all the other values
Neither solution is efficient
(*) http://stackoverflow.com/questions/17313730/how-to-rank-values-in-sparql
SPARQL expressivity
CONSTRUCT { # ... omitted for brevity
} WHERE {
  ?obs cex:computation [ a cex:AverageGrowth ; cex:observations ?ls ] ;
       cex:value ?val .
  ?ls rdf:first [ cex:value ?v1 ] .
  { SELECT ( SUM(?v_n / ?v_n1)/COUNT(*) as ?meanGrowth )
    WHERE {
      ?ls rdf:rest* [ rdf:first [ cex:value ?v_n ] ;
                      rdf:rest [ rdf:first [ cex:value ?v_n1 ] ] ] .
    }
  }
  FILTER ( abs(?meanGrowth * ?v1 - ?val) > 0.001 )
}
* This solution was provided by Joshua Taylor
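The arithmetic the query checks can be written out directly: average the ratios between consecutive list elements, then multiply by the first element. The sketch below assumes the list is ordered most recent first, which is how the query appears to read (?v1 is the head of the list and each ratio divides an element by its successor). The function name is hypothetical.

```python
def average_growth_forecast(values):
    """Return mean(v[i] / v[i+1]) * v[0] for a list ordered most recent first (assumed)."""
    if len(values) < 2:
        raise ValueError("need at least two values to compute a growth ratio")
    ratios = [newer / older for newer, older in zip(values, values[1:])]
    mean_growth = sum(ratios) / len(ratios)
    return mean_growth * values[0]


chile = [8, 6]  # 2010 value first, then 2009
forecast = average_growth_forecast(chile)
print(round(forecast, 2))
```

Applied to Chile's 2010 and 2009 values this gives about 10.67, close to the 10.6 shown for 2011 in the imputed table, although the slides do not state that the cell was filled this way.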
RDF Profiles
Descriptions (with constraints) of dataset families
Can be used to validate (and compute) RDF graphs
Examples:
Cube (Statistical data)
Normalization update queries + integrity constraints
Computex (statistical computations)
Adds more queries
Other user-defined profiles
Diagram of profiles: Cube, Computex, WebIndex, CloudIndex
Cube Profile
W3C Candidate Recommendation (http://www.w3.org/TR/vocab-data-cube/)
1 base ontology + RDF Schema inference
2 update queries (normalization algorithm)
21 integrity constraint queries
cex:cube a cex:ValidationProfile ;
cex:name "RDF Data Cube" ;
cex:ontologyBase cube:cube.ttl ;
cex:expandSteps (
[ cex:name "Closure" ; cex:uri cube:closure.ru ]
[ cex:name "Flatten" ; cex:uri cube:flatten.ru ]
) ;
cex:integrityQuery
[ cex:name "IC-1. Unique DataSet" ; cex:uri cube:q1.sparql ];
# ...
.
Computex Profile
1 base ontology
http://purl.org/weso/computex
Imports RDF Data Cube
cex:computex a cex:ValidationProfile ;
cex:name "Computex" ;
cex:ontologyBase cex:computex.ttl ;
cex:import profile:cube.ttl ;
cex:expandSteps
( [ cex:name "Copy Raw"; cex:uri cex:copyRaw.sparql ]
# Other UPDATE queries
) ;
cex:integrityQuery
[ cex:name "Adjusted" ; cex:uri cex:adjusted.sparql ] ,
# Other integrity queries
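One way to read a profile is as an ordered list of expand steps plus a set of integrity checks, with imported profiles contributing theirs first. The sketch below illustrates that reading only. It is not the Computex tool (which is written in Scala), and every name in it (Profile, collect, validate, the toy checks, the dictionary standing in for an RDF graph) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Graph = Dict[str, Optional[float]]     # stand-in for an RDF graph: observation -> value
Step = Callable[[Graph], Graph]        # expand step (plays the role of an UPDATE query)
Check = Callable[[Graph], List[str]]   # integrity check: returns error messages


@dataclass
class Profile:
    name: str
    expand_steps: List[Step] = field(default_factory=list)
    integrity_checks: List[Check] = field(default_factory=list)
    imports: List["Profile"] = field(default_factory=list)


def collect(profile: Profile) -> Tuple[List[Step], List[Check]]:
    """Gather steps and checks, imported profiles first."""
    steps: List[Step] = []
    checks: List[Check] = []
    for imported in profile.imports:
        imported_steps, imported_checks = collect(imported)
        steps += imported_steps
        checks += imported_checks
    return steps + profile.expand_steps, checks + profile.integrity_checks


def validate(profile: Profile, graph: Graph) -> List[str]:
    """Apply every expand step in order, then run every check; an empty list means valid."""
    steps, checks = collect(profile)
    for step in steps:
        graph = step(graph)
    return [error for check in checks for error in check(graph)]


def no_missing_values(graph: Graph) -> List[str]:
    return [f"{obs} has no value" for obs, v in graph.items() if v is None]


def non_negative(graph: Graph) -> List[str]:
    return [f"{obs} is negative" for obs, v in graph.items() if v is not None and v < 0]


cube = Profile("RDF Data Cube", integrity_checks=[no_missing_values])
computex = Profile("Computex", imports=[cube], integrity_checks=[non_negative])

report = validate(computex, {"obs:o1": 4.0, "obs:o2": None, "obs:o3": -1.0})
print(report)
```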
Computex Validation tool
Available at:
http://computex.herokuapp.com
Features
Informative error messages
Option to generate EARL report
2 validation profiles (Cube and Computex)
Source code in Scala available at:
http://github.com/weso/computex
Based on profiles
Conclusions
WebIndex = use case for RDF Validation
SPARQL queries as an implementation approach
Challenges:
Expressivity limits of SPARQL
Complexity of some queries
Concept of RDF Profiles
Declarative, Turtle syntax
Intermediate level between OWL and SPARQL
END OF PRESENTATION
COMPLEMENTARY SLIDES
Limits of
SPARQL expressivity
Ranking of values using GROUP_CONCAT
SELECT ?x ?v ?ranking {
  ?x :value ?v .
  { SELECT (GROUP_CONCAT(?x; separator="") as ?ordered) {
      { SELECT ?x {
          ?x :value ?v .
        } ORDER BY DESC(?v)
      }
    }
  }
  BIND (str(?x) as ?xName)
  BIND (strbefore(?ordered, ?xName) as ?before)
  BIND ((strlen(?before) / strlen(?xName)) + 1 as ?ranking)
} ORDER BY ?ranking
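The string trick behind this query is easier to see outside SPARQL. The sketch below mirrors it under the assumptions the trick appears to need: all names have the same length, and no name occurs across the boundary between two others in the concatenation. The country codes and the function name are illustrative.

```python
# Composite scores keyed by equal-length codes (illustrative identifiers).
values = {"ESP": 4.5, "FIN": 4.9, "CHL": 5.1}


def rank_by_concatenation(values):
    """Rank keys by descending value using only sorting, concatenation and string lengths."""
    ordered = "".join(sorted(values, key=values.get, reverse=True))  # GROUP_CONCAT of sorted names
    ranking = {}
    for name in values:
        before = ordered[: ordered.index(name)]          # strbefore(?ordered, ?xName)
        ranking[name] = len(before) // len(name) + 1     # strlen(?before) / strlen(?xName) + 1
    return ranking


concat_ranking = rank_by_concatenation(values)
print(concat_ranking)
```

The dependence on equal-length names is one reason this solution is fragile as well as inefficient.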
Limits of
SPARQL expressivity
Ranking of values comparing all values
SELECT ?x ?v (COUNT(*) as ?ranking) WHERE {
?x :value ?v .
[] :value ?u .
FILTER( ?v <= ?u )
}
GROUP BY ?x ?v
ORDER BY ?ranking
* This solution was provided by Joshua Taylor
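The same counting idea in plain code: the rank of an entry is the number of entries whose value is greater than or equal to its own, which takes a comparison against every other value. This is a sketch of the technique rather than of any particular implementation, and the function name is hypothetical. Tied entries share the lower rank, as the second call shows.

```python
def rank_by_counting(values):
    """Rank of x = number of entries whose value is >= x's value (quadratic in the number of entries)."""
    return {
        x: sum(1 for u in values.values() if v <= u)
        for x, v in values.items()
    }


composite = {"Spain": 4.5, "Finland": 4.9, "Chile": 5.1}
count_ranking = rank_by_counting(composite)

# With a tie at the top, both tied entries get rank 2 and nothing gets rank 1.
tied = rank_by_counting({"a": 2.0, "b": 2.0, "c": 1.0})

print(count_ranking, tied)
```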
