Democratizing Big Semantic Data Management with HDT

Javier D. Fernández
WU Vienna, Austria
Complexity Science Hub Vienna, Austria
Privacy and Sustainable Computing Lab, Austria
STANFORD CENTER FOR BIOMEDICAL INFORMATICS
APRIL 25TH, 2018.
Or: how to query an RDF graph with 28 billion triples on a standard laptop

The Linked Open Data cloud (2018)
 ~10K datasets organized into 9
domains which include many and
varied knowledge fields.
 150B statements, including
entity descriptions and
(inter/intra-dataset) links between
them.
 >500 live endpoints serving this
data.
http://lod-cloud.net/
http://stats.lod2.eu/
http://sparqles.ai.wu.ac.at/

But what about Web-scale queries?
 E.g. retrieve all entities in LOD referring to the gene WBGene00000001 (aap-1)
 Solutions?
PREFIX dcterms: <http://purl.org/dc/terms/>
select distinct ?x {
?x dcterms:title "WBGene00000001 (aap-1)" .
}

Let’s fish in our Linked Data Ecosystem

The Web of Data Ecosystem
http://sparqles.ai.wu.ac.at/

A) Federated Queries!!
1. Get a list of potential SPARQL Endpoints
 datahub.io, LOV, other catalogs?
2. Query each SPARQL Endpoint
 Problems?
 Many SPARQL Endpoints have low availability
 SPARQL Endpoints are usually restricted (timeouts, result limits)
 Moreover, complex queries (joins) can be tricky due to intermediate
results, delays, etc.
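A minimal sketch of option A, assuming each endpoint accepts SPARQL Protocol queries over HTTP GET and returns SPARQL JSON results. The names `http_fetch` and `federated_select`, the 10-second timeout and the endpoint URLs in the usage note are illustrative assumptions, not part of the talk. The point it illustrates is that every endpoint can fail or time out independently, so failures have to be collected rather than trusted away.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

QUERY = """PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?x WHERE { ?x dcterms:title "WBGene00000001 (aap-1)" . }"""


def http_fetch(endpoint, query, timeout):
    """Send one query with an HTTP GET and parse the SPARQL JSON results."""
    request = Request(
        endpoint + "?" + urlencode({"query": query}),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def federated_select(endpoints, query, var, timeout=10, fetch=http_fetch):
    """Ask every endpoint in turn; return distinct values of `var` and the failures."""
    found, failed = set(), []
    for endpoint in endpoints:
        try:
            results = fetch(endpoint, query, timeout)
            for row in results["results"]["bindings"]:
                found.add(row[var]["value"])
        except (OSError, ValueError, KeyError) as error:
            # Unavailable endpoint, timeout, or a malformed answer.
            failed.append((endpoint, repr(error)))
    return found, failed
```

Usage would look like `federated_select(["http://endpoint-a.example/sparql", "http://endpoint-b.example/sparql"], QUERY, "x")`, where both URLs are placeholders. The `fetch` parameter exists only so the loop can be exercised without network access.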

B) Follow-your-nose
1. Follow self-descriptive IRIs and links
2. Filter the results you are interested in
 Problems?
 You need some initial seed
 DBpedia could be a good start
 It’s slow (fetching many documents)
 Where should I start for unbounded queries?
 ?x dcterms:title "WBGene00000001 (aap-1)"

C) Use the RDF dumps by yourself
1. Crawl the Web of Data
 Probably start with datahub.io, LOV, other catalogs?
2. Download datasets
 You’d better have some free space on your machine
3. Index the datasets locally
 You’d better be patient and survive parsing errors
4. Query all datasets
 You’d better still be alive by then
 Problems?
 Huge resources!
 + Messiness of the data
Rietveld, L., Beek, W., & Schlobach, S. (2015). LOD lab: Experiments at LOD scale. In ISWC.

 Publication, Exchange and Consumption of large RDF datasets
 Most RDF formats (N3, XML, Turtle) are text serializations, designed for
human readability (not for machines)
 Verbose = High costs to write/exchange/parse
 A basic offline search = (decompress) + index the file + search
The problem is in the roots
(Big tree = big roots)
Steve Garry

1) HDT
 Highly compact serialization of RDF
 Allows fast RDF retrieval in compressed space (without prior decompression)
 Includes internal indexes to solve basic queries with small (3%) memory footprint.
 Very fast on basic queries (triple patterns), 1.5x faster than Virtuoso, Jena, RDF3X.
 Supports FULL SPARQL as the compressed backend store of Jena, with efficiency on the
same scale as current, more optimized solutions
 Challenges:
 Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then
it is ready to consume efficiently!)
A Linked Data hacker toolkit
431M triples ~ 63 GB
NT + gzip: 5 GB
HDT: 6.6 GB (slightly more, but you can query!)
rdfhdt.org

HDT (Header-Dictionary-Triples) Overview
RDF = Header + Dictionary + Triples
Header: metadata describing the RDF dataset
Dictionary: mapping between IDs and the elements in the dataset (e.g. 1 = aa.., 2 = ab.., 3 = bu..)
Triples: structure of the data after the ID replacement

 Publication Metadata:
 Publisher, Public Endpoint, …
 Statistical Metadata:
 Number of triples, subjects, entities, histograms…
 Format Metadata:
 Description of the data structure, e.g. Triples Order.
 Additional Metadata:
 Domain-specific.
 … In RDF
Header

Dictionary+Triples partition
Terms of the example graph, each of which the dictionary maps to an ID:
<http://example.org/Vienna>
<http://example.org/Javier>
<http://example.org/Paul>
<http://example.org/Researcher>
<http://example.org/Stefan>
ex:birthPlace
ex:workPlace
foaf:mbox
foaf:name
rdf:type
"jfergar@example.org"
"jfergar@wu.ac.at"
"Vienna"@en
[Figure: the example graph with each term replaced by its dictionary ID]

 Mapping of strings to correlative IDs. {1..n}
 Lexicographically sorted, no duplicates.
 Prefix-Based compression in each section.
 Efficient ID <-> String operations
Dictionary
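A minimal sketch of the dictionary idea, assuming one single ID space for all terms (the slides mention several sections, which this deliberately ignores). The sample triples and the helper names `build_dictionary`, `encode_triples` and `decode_triples` are hypothetical; only the terms themselves are borrowed from the example.

```python
# Hypothetical sample triples built from the example terms.
TRIPLES = [
    ("ex:Javier", "rdf:type", "ex:Researcher"),
    ("ex:Javier", "ex:workPlace", "ex:Vienna"),
    ("ex:Javier", "foaf:mbox", '"jfergar@wu.ac.at"'),
    ("ex:Paul", "ex:birthPlace", "ex:Vienna"),
]


def build_dictionary(triples):
    """Sort the distinct terms lexicographically and number them 1..n."""
    terms = sorted({term for triple in triples for term in triple})
    term_to_id = {term: ident for ident, term in enumerate(terms, start=1)}
    return terms, term_to_id


def encode_triples(triples, term_to_id):
    """Replace every term by its ID."""
    return [tuple(term_to_id[term] for term in triple) for triple in triples]


def decode_triples(id_triples, terms):
    """Inverse mapping: ID i is stored at index i - 1 of the sorted term list."""
    return [tuple(terms[ident - 1] for ident in triple) for triple in id_triples]


terms, term_to_id = build_dictionary(TRIPLES)
id_triples = encode_triples(TRIPLES, term_to_id)
```

Because IDs are positions in a sorted, duplicate-free list, ID-to-string is an array access and string-to-ID can be a binary search; the dict above is just the shortest way to show it.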

 Prefix-Based compression used in each section
1. Each string is encoded with two values
 An integer representing the number of characters shared with the previous
string
 A sequence of characters representing the suffix that is not shared with the
previous string
Dictionary. Plain Front Coding (PFC)
Strings: A, An, Ant, Antivirus, Antivirus Software, Best
(0,A) (1,n) (2,t) (3,ivirus) (9, Software) (0,Best)
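The pair encoding of step 1 can be sketched in a few lines. This is an illustrative version that keeps the pairs as Python tuples rather than as a byte sequence; `front_code` and `front_decode` are names chosen here.

```python
def front_code(sorted_strings):
    """Encode each string as (chars shared with the previous string, remaining suffix)."""
    coded, previous = [], ""
    for current in sorted_strings:
        shared = 0
        limit = min(len(previous), len(current))
        while shared < limit and previous[shared] == current[shared]:
            shared += 1
        coded.append((shared, current[shared:]))
        previous = current
    return coded


def front_decode(coded):
    """Rebuild the strings by reusing the shared prefix of the previous one."""
    strings, previous = [], ""
    for shared, suffix in coded:
        previous = previous[:shared] + suffix
        strings.append(previous)
    return strings


words = ["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"]
print(front_code(words))
# [(0, 'A'), (1, 'n'), (2, 't'), (3, 'ivirus'), (9, ' Software'), (0, 'Best')]
```

Decoding string k needs all k-1 strings before it, which is why the next step splits the vocabulary into buckets.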

2. The vocabulary is split in buckets, each of them storing “b” strings
 The first string of each bucket (header) is coded explicitly (i.e. full string)
 The subsequent b-1 strings (internal strings) are coded differentially
Strings (IDs 1-6): A, An, Ant, Antivirus, Antivirus Software, Best
Bucket 1 (IDs 1-3): A (1,n) (2,t)
Bucket 2 (IDs 4-6): Antivirus (9, Software) (0,Best)

3. PFC is encoded with a byte sequence and an array of pointers (ptr) to
denote the first byte of each bucket
Strings (IDs 1-6): A, An, Ant, Antivirus, Antivirus Software, Best
ptr: 1 9
Bucket 1 (IDs 1-3): A (1,n) (2,t)
Bucket 2 (IDs 4-6): Antivirus (9, Software) (0,Best)

 Locate (string) performs a binary search in the headers + sequential decoding of
internal strings
 e.g. locate (Antivirus Software)=5
 Extract (id), finds the bucket id/b, and decodes until the given position
 E.g. extract (5) = Antivirus Software
 Example with b = 3: Bucket 1 = A (1,n) (2,t); Bucket 2 = Antivirus (9, Software) (0,Best); ptr = 1 9
More on Compressed Dictionaries: Martínez-Prieto, M. A., Brisaboa, N., Cánovas, R., Claude, F., &
Navarro, G. (2016). Practical compressed string dictionaries. Information Systems, 56, 73-108.
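The two operations can be sketched over a bucketed front-coded list as follows. This is a simplified model, assuming buckets are kept as Python lists instead of a byte sequence with a pointer array; the class name and the convention of returning 0 for an unknown string are choices made for this sketch.

```python
from bisect import bisect_right


class FrontCodedDictionary:
    """Bucketed front coding: one explicit header per bucket, the rest coded differentially."""

    def __init__(self, sorted_strings, bucket_size):
        self.b = bucket_size
        self.n = len(sorted_strings)
        self.headers = sorted_strings[::bucket_size]
        self.buckets = []
        for start in range(0, self.n, bucket_size):
            chunk = sorted_strings[start:start + bucket_size]
            coded = []
            for prev, cur in zip(chunk, chunk[1:]):
                shared = 0
                while shared < min(len(prev), len(cur)) and prev[shared] == cur[shared]:
                    shared += 1
                coded.append((shared, cur[shared:]))
            self.buckets.append(coded)

    def _decode(self, bucket):
        """Yield the strings of one bucket in order, starting from its header."""
        current = self.headers[bucket]
        yield current
        for shared, suffix in self.buckets[bucket]:
            current = current[:shared] + suffix
            yield current

    def extract(self, ident):
        """ID (1-based) to string: pick the bucket, then decode up to the position."""
        if not 1 <= ident <= self.n:
            raise KeyError(ident)
        bucket, offset = divmod(ident - 1, self.b)
        for position, string in enumerate(self._decode(bucket)):
            if position == offset:
                return string

    def locate(self, string):
        """String to ID: binary search on the headers, then sequential decoding."""
        bucket = bisect_right(self.headers, string) - 1
        if bucket < 0:
            return 0
        for offset, candidate in enumerate(self._decode(bucket)):
            if candidate == string:
                return bucket * self.b + offset + 1
        return 0


pfc = FrontCodedDictionary(
    ["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"], bucket_size=3)
print(pfc.locate("Antivirus Software"))  # 5
print(pfc.extract(5))                    # Antivirus Software
```

The bucket size b trades space for speed: larger buckets store fewer full strings but make each lookup decode more entries.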

Bitmap Triples Encoding
 Bitmap Triples:
[Figure: the subjects, predicates and objects of the ID triples laid out as Bitmap Triples]
 We index the bitsequences to provide an SPO index

 Represent and index large volumes of data
 ~ Theoretical minimum space while serving efficient operations:
 Mostly based on 3 operations:
 Access
 Rank
 Select
Remember... Succinct Data Structures

 Bitmap Sequence.
 Operations in constant time
 access(position) = value of the bit at that position.
 rank(position) = number of ones up to that position.
 select(i) = position of the i-th one.
 Implementation:
 n + o(n) bits
 Adjustable space overhead: in practice, 37.5% overhead
Bit Sequence Coding
bits:     1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 0
position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
rank(7) = 3
select(5) = 11
access(14) = 1
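To make the three operations concrete, here is a deliberately naive sketch over a plain Python list. It scans the bits, so it runs in linear time; the constant-time behaviour claimed for succinct structures needs the extra o(n)-bit directories, which are not shown. Positions are 1-based as in the example, and counting directly on the 16-bit sequence printed there gives rank(7) = 3 and select(5) = 11.

```python
BITS = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0]


def access(bits, position):
    """Value of the bit at a 1-based position."""
    return bits[position - 1]


def rank(bits, position):
    """Number of ones in positions 1..position (inclusive)."""
    return sum(bits[:position])


def select(bits, i):
    """1-based position of the i-th one."""
    seen = 0
    for position, bit in enumerate(bits, start=1):
        seen += bit
        if bit and seen == i:
            return position
    raise ValueError("fewer than %d ones in the sequence" % i)


print(rank(BITS, 7), select(BITS, 5), access(BITS, 14))  # 3 11 1
```

rank and select are inverses in the sense that `rank(bits, select(bits, i)) == i` for every valid i.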

Bitmap Triples Encoding
 Bitmap Triples:
 E.g. retrieve (2,5,?)
 Find the position of the second ‘1’-bit in Bp (select)
 Binary search on the list of predicates looking for 5
 Note that predicate 5 is in position 4 of Sp
 Find the position of the fourth ‘1’-bit in Bo (select)
Patterns resolved: SPO, SP?, S?O, S??, ???
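A hedged sketch of the (S,P,?) lookup. Several things here are assumptions: the seven sample ID triples are hypothetical, chosen so that predicate 5 of subject 2 lands in position 4 of Sp as in the example; subject IDs are taken to be 1..n without gaps; and a 1-bit is taken to close each list (the last predicate of a subject in Bp, the last object of a subject-predicate pair in Bo). The `select1` helper scans linearly instead of using a succinct index.

```python
from bisect import bisect_left

# Hypothetical ID triples, already free of duplicates.
TRIPLES = [(1, 4, 5), (2, 2, 1), (2, 3, 3), (2, 3, 4), (2, 5, 2), (3, 1, 1), (4, 1, 1)]


def build_bitmap_triples(triples):
    """Return (Sp, Bp, So, Bo) for SPO-sorted triples; a 1-bit closes each list."""
    sp, bp, so, bo = [], [], [], []
    prev = None
    for s, p, o in sorted(set(triples)):
        if prev is not None and prev[0] != s:
            bp[-1] = 1                      # previous subject's predicate list ends
        if prev is None or (s, p) != prev[:2]:
            if bo:
                bo[-1] = 1                  # previous (s, p) object list ends
            sp.append(p)
            bp.append(0)
        so.append(o)
        bo.append(0)
        prev = (s, p, o)
    if sp:
        bp[-1] = bo[-1] = 1
    return sp, bp, so, bo


def select1(bits, k):
    """0-based index of the k-th 1-bit, or -1 if there are fewer than k ones."""
    seen = 0
    for index, bit in enumerate(bits):
        seen += bit
        if bit and seen == k:
            return index
    return -1


def list_bounds(bits, k):
    """Half-open [start, stop) range of the k-th list (k is 1-based)."""
    last = select1(bits, k)
    if last < 0:
        return 0, 0
    start = select1(bits, k - 1) + 1 if k > 1 else 0
    return start, last + 1


def objects_of(bitmap_triples, s, p):
    """Resolve the pattern (s, p, ?)."""
    sp, bp, so, bo = bitmap_triples
    lo, hi = list_bounds(bp, s)             # predicates of subject s
    pos = bisect_left(sp, p, lo, hi)        # binary search for p in that list
    if pos == hi or sp[pos] != p:
        return []
    o_lo, o_hi = list_bounds(bo, pos + 1)   # objects of the pair at Sp position pos+1
    return so[o_lo:o_hi]


bt = build_bitmap_triples(TRIPLES)
print(objects_of(bt, 2, 5))  # [2]
```

With these sample triples Sp is 4 2 3 5 1 1, so the binary search for predicate 5 inside subject 2's list stops at position 4, matching the walk-through.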

Additional Predicate Index (resolves ? P ?)
 Encoded as a position list
Sl (positions in Sp, grouped by predicates 1-5): 5 6 2 3 1 4
Bl: 0 1 1 1 1 1
 E.g. retrieve (?,4,?)
 Get the location of the list, Bl.select(3)+1 = 4+1 = 5
 Retrieve the position: Sl[5]=1
 Get the associated subject, rank(1) = 1st subject
 Access the object as (1,4,?)
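A small sketch of how such a position list could be built and used. The arrays Sl = 5 6 2 3 1 4 and Bl = 0 1 1 1 1 1 are the ones from the example; the Sp and Bp arrays below are assumptions chosen to be consistent with them, with a 1-bit closing each list and predicate IDs 1..n all in use. Under that convention the subject owning Sp position pos is computed as the number of lists closed before pos, plus one.

```python
from collections import defaultdict

SP = [4, 2, 3, 5, 1, 1]   # assumed predicate sequence
BP = [1, 0, 0, 1, 1, 1]   # assumed bitmap: 1 closes each subject's list


def build_predicate_index(sp):
    """Group the 1-based positions of Sp by predicate: returns (Sl, Bl)."""
    positions = defaultdict(list)
    for pos, predicate in enumerate(sp, start=1):
        positions[predicate].append(pos)
    sl, bl = [], []
    for predicate in sorted(positions):
        sl.extend(positions[predicate])
        bl.extend([0] * (len(positions[predicate]) - 1) + [1])
    return sl, bl


def select1(bits, k):
    """0-based index of the k-th 1-bit, or -1 if there are fewer than k ones."""
    seen = 0
    for index, bit in enumerate(bits):
        seen += bit
        if bit and seen == k:
            return index
    return -1


def rank1(bits, count):
    """Number of ones among the first `count` bits."""
    return sum(bits[:count])


def subjects_with_predicate(index, bp, p):
    """Resolve (?, p, ?) down to the subjects that use predicate p."""
    sl, bl = index
    last = select1(bl, p)
    if last < 0:
        return []
    start = select1(bl, p - 1) + 1 if p > 1 else 0
    return [rank1(bp, pos - 1) + 1 for pos in sl[start:last + 1]]


index = build_predicate_index(SP)
print(index)                                   # ([5, 6, 2, 3, 1, 4], [0, 1, 1, 1, 1, 1])
print(subjects_with_predicate(index, BP, 4))   # [1]
```

Each returned subject can then be combined with the predicate to fetch the objects through the regular (S,P,?) access path.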

Additional Object Index (resolves ? ? O and ? P O)
 Encoded as a position list
Sl (positions grouped by objects 1-5): 2 5 6 4 3 3 1
Bl: 0 0 1 1 1 1 1

On-the-fly indexes: HDT-FoQ
 From the exchanged HDT to the functional HDT-FoQ:
 Publish and Exchange HDT
 At the consumer:
Process | Type of Index | Patterns
1. We index the bitsequences | Subject (SPO) | SPO, SP?, S??, S?O, ???
2. We index the position of each predicate (just a position list) | Predicate (PSO) | ?P?
3. We index the position of each object (just a position list) | Object (OPS) | ?PO, ??O
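The pattern-to-index assignment listed above can be written as a tiny dispatcher. This is only an illustration of the mapping; `index_for` is a hypothetical helper, and `None` stands for an unbound position (a `?`).

```python
def index_for(s, p, o):
    """Name of the index that resolves a triple pattern; None marks a variable."""
    if s is not None or (p is None and o is None):
        return "SPO"   # SPO, SP?, S??, S?O and the full scan ???
    if o is not None:
        return "OPS"   # ?PO and ??O
    return "PSO"       # ?P?


print(index_for(2, 5, None))     # SPO
print(index_for(None, 4, None))  # PSO
print(index_for(None, None, 7))  # OPS
```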

HDT acknowledged as W3C Member Submission:
http://www.w3.org/Submission/2011/03/

Some numbers on size
http://dataweb.infor.uva.es/projects/hdt-mr/
José M. Giménez-García, Javier D. Fernández, and Miguel A. Martínez-Prieto. HDT-MR: A
Scalable Solution for RDF Compression with HDT and MapReduce. In Proc. of
International Semantic Web Conference (ISWC), 2015
28,362,198,927 Triples

 Data is ready to be consumed 10-15x faster.
 HDT << any other RDF format || RDF engine
 Competitive query performance.
 Very fast on triple patterns, 1.5x faster than Virtuoso and RDF3x.
 Integration with Jena
 Joins on the same scale as existing solutions (Virtuoso, RDF3x).
Results

C++ and Java tools: https://github.com/rdfhdt
Only in the last two weeks…
HDT-cpp

2) LOD Laundromat
uses HDT as the main storage solution
 Challenges:
 Still you need to query 650K datasets
 Of course it does not contain all LOD, but “a good approximation”
http://lodlaundromat.org/
Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., & Schlobach, S. (2014, October). LOD laundromat: a
uniform way of publishing other people’s dirty data. In ISWC (pp. 213-228).

3) Linked Data Fragments
typically uses HDT as the main engine
 Challenges:
 Still room for optimization for complex federated queries (delays,
intermediate results, …)
Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014).
Querying datasets on the web with high availability. In ISWC (pp. 180-196).

Get more than 650K HDT datasets from
LOD Laundromat…

 Scalable storage
 Store and serve thousands of (large) datasets (e.g. LOD Laundromat)
 Archiving (reduce storage costs + foster smart consumption)
 Also with deltas (see https://aic.ai.wu.ac.at/qadlod/bear.html)
 Better consumer-centric publication
 Compress and share ready-to-consume RDF datasets
 Consumption with limited resources
 smartphones, standard laptops
 Fast, low-cost SPARQL Query Engine
 Via HDT-Jena
 Via Linked Data Fragments
Application scenarios (HDT)

 Storage + Light API
 http://lodlaundromat.org/
 http://linkeddatafragments.org/
 Storage + SPARQL Query Engine
 https://data.world/
 Advanced features
 Top k Shortest Path
 Query Answering over the Web of Data
 Others: Versioning, streaming,…
Application Examples

Application Examples
Publication in HDT (~1B triples): https://zenodo.org/record/1116889#.WuBt0C7FKpo
URI resolver based on HDT: https://github.com/pharmbio/urisolve
91,498,351 compounds from PubChem with predicted logD (water–octanol distribution coefficient) values at 90% confidence level

 E.g. retrieve all entities referring to the gene WBGene00000001 (aap-1)
 Solutions?
PREFIX dcterms: <http://purl.org/dc/terms/>
select distinct ?x {
?x dcterms:title "WBGene00000001 (aap-1)" .
}

LOD-a-lot
- flashback -

[Figure: Linked Open Data -> crawl -> LOD Laundromat -> clean -> Dataset 1 ... Dataset 650K (N-Triples, zip) -> index & store -> SPARQL endpoint (metadata); integrate -> LOD-a-lot (28B triples)]

 Disk size:
 HDT: 304 GB
 HDT-FoQ (additional indexes): 133 GB
 Memory footprint (to query):
 15.7 GB of RAM (3% of the size)
 144 seconds loading time
 8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS
 LDF page resolution in milliseconds.
LOD-a-lot (some numbers)
305€
(LOD-a-lot creation took 64 h & 170GB RAM. HDT-FoQ took 8 h & 250GB RAM)

https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot
LOD-a-lot

 Query resolution at Web scale
 Using LDF, Jena
 Evaluation and Benchmarking
 No excuse
 RDF metrics and analytics
LOD-a-lot (some use cases)

 Identity closure
 ?x owl:sameAs ?y
 Graph navigations
 E.g. shortest path, random walk
LOD-a-lot (some use cases)
More use cases: Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
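As an illustration of the identity-closure use case, the sketch below groups terms connected by owl:sameAs into equivalence classes with a union-find structure. It assumes the (?x, ?y) pairs have already been retrieved with the pattern ?x owl:sameAs ?y; the function name and the sample IRIs are hypothetical.

```python
def sameas_closure(pairs):
    """Group terms linked by owl:sameAs pairs into sorted equivalence classes."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]   # path halving
            term = parent[term]
        return term

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_left] = root_right

    classes = {}
    for term in list(parent):
        classes.setdefault(find(term), set()).add(term)
    return sorted(sorted(members) for members in classes.values())


pairs = [
    ("ex:a", "ex:b"),
    ("ex:b", "ex:c"),
    ("ex:d", "ex:e"),
]
print(sameas_closure(pairs))
# [['ex:a', 'ex:b', 'ex:c'], ['ex:d', 'ex:e']]
```

Union-find is used because it handles the symmetric and transitive parts of the closure in near-linear time without materializing every implied pair.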

 Update LOD-a-lot regularly
 More and newer datasets from the LOD Cloud
 Leverage the HDT indexes to support “data science”
 E.g. get links across datasets, study the topology of the network, optimize
query planning
 Support provenance of the triples (i.e. origin of each triple)
 Currently supported only via LOD Laundromat
 … implement the use cases and help the community to democratize
the access to LOD
Roadmap

 We are currently facing Big Linked Data challenges
 Generation, publication and consumption
 Archiving, evolution…
 Thanks to compression/HDT, the Big Linked Data
today will be the “pocket” data tomorrow
 HDT democratizes the access to Big Linked Data
= Cheap, scalable consumers
low-cost access to LOD = high-impact research
Take-home messages

Thank you!
javier.fernandez@wu.ac.at
Kudos to all the co-authors involved in the works presented here
Incomplete list of ACKs:
Miguel A. Martínez-Prieto
Mario Arias
Pablo de la Fuente
Claudio Gutierrez
Axel Polleres
Wouter Beek
Ruben Verborgh
… And many others
