This document summarizes different approaches for managing web data and querying semi-structured data. It discusses challenges like lack of schemas, scale, and volatility of web data. It then describes approaches like property tables, binary tables, and graph-based approaches using the gStore and VS-Tree systems. The document concludes that graph-based approaches like VS-Tree have the best performance and that gStore is more efficient than other approaches for querying RDF triple stores on the web.
1. Web Data Management
Advanced Database Presentation
By:
Navid Sedighpour
Professor :
Dr. Alireza Bagheri
Nevember 2015
1
2. Interest
Lack of schema
Data is unstructured or at best “semi-structured”
Missing data, additional attributes, similar data but not identical
Volatility
May confirm to one schema now, but not later
Scale
How to capture everything?
Querying Difficulty
What is the user language?
What are the primitives?
Aren’t Search Engines sufficient?
2
3. Fusion Tables
Users contribute data in spreadsheet
Possible joins between multiple data sets
Extensive visualization
3
More Recent Approaches to Web Querying
4. More Recent Approaches to Web Querying
XML
Data exchange language
Tree based structure
4
5. More Recent Approaches to Web Querying
RDF
W3C Recommendation
Simple, self-descriptive model
5
6. RDF Data Volumes
90% of world's data generated over last two years
Data are growing fast
Size almost doubling every year
6
11. RDF Introduction
Everything is an uniquely named resource
Prefixes can be used to shorten names
Properties of resources can be defined
Relationships with other resources can be defined
Resource description can be contributed by different people/groups and can be located anywhere
in the web
Integrated web “database”
11
12. RDF Data Model
Triple : Subject, Predicate (Property) , Object
Subject : The entity that is described (URI or Blank Node)
Predicate : a feature of the entity
Object : value of the feature
Set of RDF Triples is called “RDF Graph”
12
21. Property Tables
Advantages :
Fewer Joins
Disadvantages :
Lots of NULLs
Clustering is not trivial
Multi-valued properties are complicated
21
22. Binary Tables
Grouping by Properties: for each property build a two column table containing both subject and
object, ordered by subjects
Also called “Vertically Partitioned Approach”
N two column tables (n is the number of unique properties in the data)
22
25. Two steps need to be done :
1. For each node of Q* get the lists of nodes in G* that include that node
2. Do a multi-way join to get the candidate list
Alternatives :
Sequential scan of G*
Both steps are inefficient
S-Tree
Height Balanced Tree over signatures
Run an inclusion query for each node of Q* and get lists of nodes in G* that include that node (q & s = q)
VS-Tree
Support both steps efficiently
Grouping by vertices
25
Graph-Based Approach
33. Conclusion
RDF Data seem to have considerable promise for web data management
We talked about four approaches to web data management including Naïve triple store design,
Property Tables, Binary Tables and Graph-Based approach
VS-Tree has the best performance in Graph-Base approaches
gStore is more efficient than other approaches
33
34. References
34
[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach, "Scalable semantic web data
management using vertical partitioning," in Proceedings of the 33rd international conference on Very large
data bases, 2007, pp. 411-422.
[2] L. Zou, J. Mo, L. Chen, M. T. Özsu, and D. Zhao, "gStore: answering SPARQL queries via
subgraph matching," Proceedings of the VLDB Endowment, vol. 4, pp. 482-493, 2011.
[3] L. Zou, M. T. Özsu, L. Chen, X. Shen, R. Huang, and D. Zhao, "gStore: a graph-based SPARQL
query engine," The VLDB Journal—The International Journal on Very Large Data Bases, vol. 23, pp. 565-
590, 2014.
[4] X. Shen, L. Zou, M. T. Ozsu, L. Chen, Y. Li, S. Han, et al., "A Graph-based RDF Triple Store."