Democratizing Big Semantic Data Management
or how to query an RDF graph with 28 billion triples on a standard laptop
Javier D. Fernández
WU Vienna, Austria
Complexity Science Hub Vienna, Austria
Privacy and Sustainable Computing Lab, Austria
STANFORD CENTER FOR BIOMEDICAL INFORMATICS
APRIL 25TH, 2018.
The Linked Open Data cloud (2018)
 ~10K datasets organized into 9
domains which include many and
varied knowledge fields.
 150B statements, including
entity descriptions and
(inter/intra-dataset) links between
them.
 >500 live endpoints serving this
data.
http://lod-cloud.net/
http://stats.lod2.eu/
http://sparqles.ai.wu.ac.at/
But what about Web-scale queries
 E.g. retrieve all entities in LOD referring to the gene WBGene00000001 (aap-1)
 Solutions?
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?x WHERE {
  ?x dcterms:title "WBGene00000001 (aap-1)" .
}
Let’s fish in our Linked Data Eco System
The Web of Data Eco System
http://sparqles.ai.wu.ac.at/
A) Federated Queries!!
1. Get a list of potential SPARQL Endpoints
 datahub.io, LOV, other catalogs?
2. Query each SPARQL Endpoint
 Problems?
 Many SPARQL Endpoints have low availability
 SPARQL Endpoints are usually restricted (timeout,#results)
 Moreover, it can be tricky with complex queries (joins) due to intermediary
results, delays, etc
B) Follow-your-nose
1. Follow self-descriptive IRIs and links
2. Filter the results you are interested in
 Problems?
 You need some initial seed
 DBpedia could be a good start
 It’s slow (fetching many documents)
 Where should I start for unbounded queries?
 ?x dcterms:title "WBGene00000001 (aap-1)"
C) Use the RDF dumps by yourself
1. Crawl the Web of Data
 Probably start with datahub.io, LOV, other catalogs?
2. Download datasets
 You had better have some free space on your machine
3. Index the datasets locally
 You had better be patient and survive parsing errors
4. Query all datasets
 You had better still be alive by then
 Problems?
 Huge resources!
 + Messiness of the data
Rietveld, L., Beek, W., & Schlobach, S. (2015). LOD
lab: Experiments at LOD scale. In ISWC
 Publication, Exchange and Consumption of large RDF datasets
 Most RDF formats (N3, XML, Turtle) are text serializations, designed for
human readability (not for machines)
 Verbose = High costs to write/exchange/parse
 A basic offline search = (decompress)+ index the file + search
The problem is in the roots
(Big tree = big roots)
Steve Garry
1) HDT
 Highly compact serialization of RDF
 Allows fast RDF retrieval in compressed space (without prior decompression)
 Includes internal indexes to solve basic queries with small (3%) memory footprint.
 Very fast on basic queries (triple patterns), about 1.5x faster than Virtuoso, Jena, RDF3X.
 Supports full SPARQL as the compressed backend store of Jena, with efficiency on the
same scale as the more optimized current solutions
 Challenges:
 Publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then
it is ready to consume efficiently!)
A Linked Data hacker toolkit
Size comparison for a dataset of 431 M triples (~63 GB):
 NT + gzip: 5 GB
 HDT: 6.6 GB (slightly more, but you can query!)
rdfhdt.org
HDT (Header-Dictionary-Triples) Overview
An RDF dataset is split into three components:
 Header: metadata describing the RDF dataset
 Dictionary: mapping between the elements in the dataset and IDs (e.g. 1 = aa.., 2 = ab.., 3 = bu..)
 Triples: structure of the data after the ID replacement
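The Dictionary/Triples split can be illustrated with a minimal Python sketch. It assumes a single lexicographically sorted ID space for all terms, which is a simplification for illustration only; the function names and the example triples are hypothetical and not taken from the HDT tools.

```python
def encode(triples):
    """Replace every term by an integer ID (1..n) taken from a sorted dictionary."""
    # Dictionary: all distinct terms, sorted, no duplicates.
    terms = sorted({term for triple in triples for term in triple})
    term_to_id = {term: i for i, term in enumerate(terms, start=1)}
    # Triples: the same graph, now expressed only with IDs and sorted (SPO order).
    id_triples = sorted(tuple(term_to_id[t] for t in triple) for triple in triples)
    return terms, id_triples


def decode(terms, id_triples):
    """Map ID triples back to their original terms."""
    return [tuple(terms[i - 1] for i in triple) for triple in id_triples]


# Hypothetical example graph (assumption, for illustration only).
example = [
    ("ex:Javier", "rdf:type", "ex:Researcher"),
    ("ex:Javier", "ex:workPlace", "ex:Vienna"),
    ("ex:Paul", "ex:birthPlace", "ex:Vienna"),
]

if __name__ == "__main__":
    terms, id_triples = encode(example)
    for i, term in enumerate(terms, start=1):
        print(i, term)
    print(id_triples)
```

Long strings are stored once in the dictionary, while the triples component only holds small integers, which is what makes the later compression steps possible.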
Header
 Publication Metadata: Publisher, Public Endpoint, …
 Statistical Metadata: Number of triples, subjects, entities, histograms…
 Format Metadata: Description of the data structure, e.g. Triples Order.
 Additional Metadata: Domain-specific.
 … In RDF
Dictionary+Triples partition
[Figure: an example RDF graph and its Dictionary+Triples partition. Each term (<http://example.org/Vienna>, <http://example.org/Javier>, <http://example.org/Paul>, <http://example.org/Researcher>, <http://example.org/Stefan>, ex:birthPlace, ex:workPlace, foaf:mbox, foaf:name, rdf:type, “jfergar@example.org”, “jfergar@wu.ac.at”, “Vienna”@en) is replaced by an integer ID.]
Dictionary
 Mapping of strings to consecutive IDs {1..n}
 Lexicographically sorted, no duplicates.
 Prefix-based compression in each section.
 Efficient ID-to-string and string-to-ID operations
 Prefix-Based compression used in each section
1. Each string is encoded with two values
 An integer representing the number of characters shared with the previous
string
 A sequence of characters representing the suffix that is not shared with the
previous string
Dictionary. Plain Front Coding (PFC)
Example, for the sorted strings A, An, Ant, Antivirus, Antivirus Software, Best:
(0,A) (1,n) (2,t) (3,ivirus) (9, Software) (0,Best)
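The differential encoding of step 1 can be sketched in a few lines of Python. This is only an illustration of the idea (no byte-level layout); the function names are hypothetical.

```python
def front_code(sorted_strings):
    """Encode each string as (length of prefix shared with the previous string, remaining suffix)."""
    encoded = []
    previous = ""
    for s in sorted_strings:
        shared = 0
        limit = min(len(previous), len(s))
        while shared < limit and previous[shared] == s[shared]:
            shared += 1
        encoded.append((shared, s[shared:]))
        previous = s
    return encoded


def front_decode(encoded):
    """Rebuild the original strings from the (shared, suffix) pairs."""
    strings = []
    previous = ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        strings.append(previous)
    return strings


if __name__ == "__main__":
    words = ["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"]
    print(front_code(words))
```

Because each string is coded relative to its predecessor, decoding a string requires decoding everything before it, which is the motivation for splitting the vocabulary into buckets.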
2. The vocabulary is split in buckets, each of them storing “b” strings
 The first string of each bucket (header) is coded explicitly (i.e. full string)
 The subsequent b-1 strings (internal strings) are coded differentially
Example (b = 3):
Bucket 1 (IDs 1-3): A (1,n) (2,t)
Bucket 2 (IDs 4-6): Antivirus (9, Software) (0,Best)
3. PFC is encoded with a byte sequence and an array of pointers (ptr) to
denote the first byte of each bucket
Example (b = 3):
ptr: 1 9
Bucket 1 (IDs 1-3): A (1,n) (2,t)
Bucket 2 (IDs 4-6): Antivirus (9, Software) (0,Best)
 Locate (string) performs a binary search in the headers + sequential decoding of
internal strings
 E.g. locate (Antivirus Software) = 5
 Extract (id) finds the bucket ⌈id/b⌉ and decodes until the given position
 E.g. extract (5) = Antivirus Software
More on Compressed Dictionaries: Martínez-Prieto, M. A., Brisaboa, N., Cánovas, R., Claude, F., &
Navarro, G. (2016). Practical compressed string dictionaries. Information Systems, 56, 73-108.
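The bucketed variant with locate and extract can be sketched as follows. This is a minimal in-memory illustration, assuming Python lists instead of the byte sequence and pointer array used by a real implementation; the class and method names are hypothetical.

```python
from bisect import bisect_right


def shared_prefix_length(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


class PFCDictionary:
    """Front-coded dictionary split into buckets of b strings (IDs are 1-based)."""

    def __init__(self, sorted_strings, b=3):
        self.b = b
        self.n = len(sorted_strings)
        self.headers = []  # first string of each bucket, stored in full
        self.buckets = []  # (shared, suffix) pairs for the internal strings
        for start in range(0, self.n, b):
            chunk = sorted_strings[start:start + b]
            self.headers.append(chunk[0])
            pairs, previous = [], chunk[0]
            for s in chunk[1:]:
                shared = shared_prefix_length(previous, s)
                pairs.append((shared, s[shared:]))
                previous = s
            self.buckets.append(pairs)

    def _decode(self, bucket):
        """Yield the strings of one bucket in order."""
        current = self.headers[bucket]
        yield current
        for shared, suffix in self.buckets[bucket]:
            current = current[:shared] + suffix
            yield current

    def locate(self, string):
        """Binary search on the headers, then sequential decoding inside the bucket."""
        bucket = bisect_right(self.headers, string) - 1
        if bucket < 0:
            return None
        for offset, candidate in enumerate(self._decode(bucket)):
            if candidate == string:
                return bucket * self.b + offset + 1
        return None

    def extract(self, string_id):
        """Find the bucket of the ID and decode up to the requested position."""
        if not 1 <= string_id <= self.n:
            raise IndexError(string_id)
        bucket, offset = divmod(string_id - 1, self.b)
        for position, candidate in enumerate(self._decode(bucket)):
            if position == offset:
                return candidate


if __name__ == "__main__":
    d = PFCDictionary(["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"], b=3)
    print(d.locate("Antivirus Software"), d.extract(5))
```

The bucket size b trades space for speed: larger buckets share more prefixes but require more sequential decoding per lookup.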
Bitmap Triples Encoding
[Figure: Bitmap Triples, with the subjects, predicates and objects levels]
 We index the bitsequences to provide an SPO index
Remember… Succinct Data Structures
 Represent and index large volumes of data
 ~ Theoretical minimum space while serving efficient operations
 Mostly based on 3 operations: Access, Rank, Select
Bit Sequence Coding
 Bitmap Sequence.
 Operations in constant time
 access(position) = value at that position.
 rank(position) = number of ones up to position.
 select(i) = position of the i-th one.
 Implementation:
 n + o(n) bits
 Adjustable space overhead: in practice, 37.5% overhead
Example:
Bits:     1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 0
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
rank(7) = 3
select(5) = 11
access(14) = 1
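The three operations can be tried out with a small Python sketch on the 16-bit sequence of the example. It is an illustration only: it keeps one counter per block to speed up rank and answers select by binary search over rank, so it is neither succinct nor constant-time. The class name and the block size are assumptions.

```python
class BitSequence:
    """Bit sequence with 1-based access, rank and select (illustrative only)."""

    def __init__(self, bits, block=4):
        self.bits = list(bits)
        self.block = block
        # ones_before[j] = number of ones in the first j * block bits
        self.ones_before = [0]
        for start in range(0, len(self.bits), block):
            ones = sum(self.bits[start:start + block])
            self.ones_before.append(self.ones_before[-1] + ones)

    def access(self, position):
        """Value of the bit at the given position."""
        return self.bits[position - 1]

    def rank(self, position):
        """Number of ones in positions 1..position."""
        full_blocks = position // self.block
        partial = sum(self.bits[full_blocks * self.block:position])
        return self.ones_before[full_blocks] + partial

    def select(self, i):
        """Position of the i-th one (binary search over rank)."""
        if i < 1 or i > self.rank(len(self.bits)):
            raise ValueError("there is no %d-th one" % i)
        low, high = 1, len(self.bits)
        while low < high:
            middle = (low + high) // 2
            if self.rank(middle) >= i:
                high = middle
            else:
                low = middle + 1
        return low


if __name__ == "__main__":
    sequence = BitSequence([1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
    print(sequence.rank(7), sequence.select(5), sequence.access(14))
```

Note that rank and select are inverse in the sense that rank(select(i)) = i for every valid i, which is what the triple lookups rely on.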
Bitmap Triples Encoding
[Figure: Bitmap Triples, with the subjects, predicates and objects levels]
 E.g. retrieve (2,5,?)
 Find the position of the second ‘1’-bit in Bp (select)
 Binary search on the list of predicates looking for 5
 Note that such predicate 5 is in position 4 of Sp
 Find the position of the fourth ‘1’-bit in Bo (select)
Patterns resolved: SPO, SP?, S?O, S??, ???
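The (S,P,?) lookup can be sketched in Python as follows. The sketch assumes that a 1-bit marks the first element of each list (so the s-th 1-bit in Bp opens the predicate list of subject s), uses 0-based positions internally, and runs on hypothetical ID triples; the function names are not taken from the HDT libraries.

```python
from bisect import bisect_left


def build_bitmap_triples(triples):
    """Build (Sp, Bp, So, Bo) from ID triples; a 1-bit opens a new list."""
    sp, bp, so, bo = [], [], [], []
    prev_s = prev_p = None
    for s, p, o in sorted(set(triples)):
        if s != prev_s or p != prev_p:
            sp.append(p)
            bp.append(1 if s != prev_s else 0)  # new subject starts a predicate list
            bo.append(1)                        # new (s, p) pair starts an object list
        else:
            bo.append(0)
        so.append(o)
        prev_s, prev_p = s, p
    return sp, bp, so, bo


def select(bits, i):
    """0-based position of the i-th 1-bit, or len(bits) if there is none."""
    seen = 0
    for position, bit in enumerate(bits):
        seen += bit
        if bit and seen == i:
            return position
    return len(bits)


def objects_of(index, s, p):
    """Resolve the pattern (s, p, ?). Assumes subject IDs 1..n, each with at least one triple."""
    sp, bp, so, bo = index
    if s < 1:
        return []
    low, high = select(bp, s), select(bp, s + 1)  # predicate list of subject s
    k = bisect_left(sp, p, low, high)             # binary search for p in that list
    if k == high or sp[k] != p:
        return []
    return so[select(bo, k + 1):select(bo, k + 2)]  # object list of the k-th pair


if __name__ == "__main__":
    # Hypothetical ID triples (assumption, for illustration only).
    index = build_bitmap_triples([(1, 7, 2), (1, 8, 4), (2, 5, 3), (2, 5, 6), (2, 7, 1)])
    print(objects_of(index, 2, 5))
```

The select helper scans the bitmap linearly to keep the sketch short; a real implementation would use a constant-time select structure.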
Additional Predicate Index (pattern ? P ?)
 Encoded as a position list: for each predicate (1 to 5 in the example), Sl lists the positions where it occurs in Sp, and Bl marks the last position of each list
Sl: 5 6 2 3 1 4
Bl: 0 1 1 1 1 1
 E.g. retrieve (?,4,?)
 Get the location of the list: Bl.select(3)+1 = 4+1 = 5
 Retrieve the position: Sl[5] = 1
 Get the associated subject: rank(1) = 1st subject
 Access the object as (1,4,?)
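The predicate position list can be rebuilt with a short Python sketch. The Sp sequence used here (4 2 3 5 1 1) is an assumption: it is not printed in the slides and was reconstructed from the Sl values 5 6 2 3 1 4. Positions are 1-based to match the example, and the function names are hypothetical.

```python
def build_predicate_index(sp):
    """Build (Sl, Bl): positions of each predicate in Sp, with a 1-bit closing each list."""
    positions = {}
    for position, predicate in enumerate(sp, start=1):
        positions.setdefault(predicate, []).append(position)
    sl, bl = [], []
    # Assumes every predicate ID from 1 to max(sp) occurs at least once.
    for predicate in range(1, max(sp) + 1):
        plist = positions[predicate]
        sl.extend(plist)
        bl.extend([0] * (len(plist) - 1) + [1])
    return sl, bl


def select1(bits, i):
    """1-based position of the i-th 1-bit; select1(bits, 0) is defined as 0."""
    if i == 0:
        return 0
    seen = 0
    for position, bit in enumerate(bits, start=1):
        seen += bit
        if bit and seen == i:
            return position
    raise ValueError("there is no %d-th one" % i)


def positions_of(sl, bl, predicate):
    """Positions in Sp where the predicate occurs (pattern ? P ?)."""
    start = select1(bl, predicate - 1) + 1  # first entry of the predicate's list
    end = select1(bl, predicate)            # last entry of the predicate's list
    return sl[start - 1:end]


if __name__ == "__main__":
    sl, bl = build_predicate_index([4, 2, 3, 5, 1, 1])
    print(sl, bl, positions_of(sl, bl, 4))
```

Each returned position identifies a (subject, predicate) pair in Sp, from which the subject is obtained with rank on Bp and the objects with select on Bo.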
Additional Object Index (patterns ? ? O and ? P O)
 Encoded as a position list
Objects: 1 2 3 4 5
Sl: 2 5 6 4 3 3 1
Bl: 0 0 1 1 1 1 1
On-the-fly indexes: HDT-FoQ
 From the exchanged HDT to the functional HDT-FoQ:
 Publish and Exchange HDT
 At the consumer:
1. Subject index (SPO): we index the bitsequences. Patterns: SPO, SP?, S??, S?O, ???
2. Predicate index (PSO): we index the position of each predicate (just a position list). Patterns: ?P?
3. Object index (OPS): we index the position of each object (just a position list). Patterns: ?PO, ??O
HDT acknowledged as a W3C Member Submission:
 http://www.w3.org/Submission/2011/03/
Some numbers on size
http://dataweb.infor.uva.es/projects/hdt-mr/
José M. Giménez-García, Javier D. Fernández, and Miguel A. Martínez-Prieto. HDT-MR: A
Scalable Solution for RDF Compression with HDT and MapReduce. In Proc. of
International Semantic Web Conference (ISWC), 2015
28,362,198,927 Triples
Results
 Data is ready to be consumed 10-15x faster.
 HDT is much smaller than any other RDF format or RDF engine.
 Competitive query performance.
 Very fast on triple patterns, about 1.5x faster than Virtuoso and RDF3x.
 Integration with Jena
 Joins on the same scale as existing solutions (Virtuoso, RDF3x).
rdfhdt.org community
https://github.com/rdfhdt,
C++ and Java tools
Only in the last two weeks…
HDT-cpp
2) LOD Laundromat
 Uses HDT as the main storage solution
 Challenges:
 You still need to query 650K datasets
 Of course it does not contain all LOD, but “a good approximation”
http://lodlaundromat.org/
Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., & Schlobach, S. (2014, October). LOD laundromat: a
uniform way of publishing other people’s dirty data. In ISWC (pp. 213-228).
3) Linked Data Fragments
 Typically uses HDT as the main engine
 Challenges:
 Still room for optimization for complex federated queries (delays,
intermediate results, …)
Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014).
Querying datasets on the web with high availability. In ISWC (pp. 180-196).
Get more than 650K HDT datasets from
LOD Laundromat…
And query them online
Application scenarios (HDT)
 Scalable storage
 Store and serve thousands of (large) datasets (e.g. LOD Laundromat)
 Archiving (reduce storage costs + foster smart consumption)
 Also with deltas (see https://aic.ai.wu.ac.at/qadlod/bear.html)
 Better consumer-centric publication
 Compress and share ready-to-consume RDF datasets
 Consumption with limited resources
 smartphones, standard laptops
 Fast, low-cost SPARQL query engine
 Via HDT-Jena
 Via Linked Data Fragments
Application Examples
 Storage + Light API
 http://lodlaundromat.org/
 http://linkeddatafragments.org/
 Storage + SPARQL Query Engine
 https://data.world/
 Advanced features
 Top k Shortest Path
 Query Answering over the Web of Data
 Others: Versioning, streaming,…
Publication in HDT (~1B triples): https://zenodo.org/record/1116889#.WuBt0C7FKpo
Uri resolver based on HDT: https://github.com/pharmbio/urisolve
91,498,351 compounds from PubChem with predicted logD (water–octanol distribution coefficient) values at 90% confidence level
But what about Web-scale queries
 E.g. retrieve all entities referring to the gene WBGene00000001 (aap-1)
 Solutions?
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?x WHERE {
  ?x dcterms:title "WBGene00000001 (aap-1)" .
}
But what about Web-scale queries
- flashback -
LOD-a-lot
[Figure: the LOD-a-lot pipeline. Linked Open Data is crawled and cleaned by LOD Laundromat into 650K datasets (Dataset 1 to Dataset 650K, N-Triples, zip), indexed and stored with a SPARQL endpoint for the metadata, and integrated into LOD-a-lot (28B triples).]
 Disk size:
 HDT: 304 GB
 HDT-FoQ (additional indexes): 133 GB
 Memory footprint (to query):
 15.7 GB of RAM (3% of the size)
 144 seconds loading time
 8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS
 LDF page resolution in milliseconds.
LOD-a-lot (some numbers)
305€
(LOD-a-lot creation took 64 h & 170GB RAM. HDT-FoQ took 8 h & 250GB RAM)
https://datahub.io/dataset/lod-a-lot
http://purl.org/HDT/lod-a-lot
LOD-a-lot
LOD-a-lot (some use cases)
 Query resolution at Web scale
 Using LDF, Jena
 Evaluation and Benchmarking
 No excuse :)
 RDF metrics and analytics
 Identity closure
 ?x owl:sameAs ?y
 Graph navigations
 E.g. shortest path, random walk
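As an illustration of the identity closure use case, the results of the ?x owl:sameAs ?y pattern can be grouped into equivalence classes with a union-find structure. This is a minimal sketch, assuming the pairs have already been retrieved as Python tuples; the function name and the example IRIs are hypothetical.

```python
def identity_closure(same_as_pairs):
    """Group terms into equivalence classes under the symmetric, transitive closure of sameAs."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for x, y in same_as_pairs:
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    classes = {}
    for term in list(parent):
        classes.setdefault(find(term), set()).add(term)
    return list(classes.values())


if __name__ == "__main__":
    # Hypothetical sameAs pairs (assumption, for illustration only).
    pairs = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:d", "ex:e")]
    print(identity_closure(pairs))
```

With a single-file dataset, the pairs can be streamed once from the triple pattern and processed in memory, without issuing one query per source dataset.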
Wouter Beek, Javier D.
Fernández and Ruben Verborgh.
LOD-a-lot: A Single-File Enabler
for Data Science. In Proc. of
SEMANTiCS 2017.
More use cases:
 Update LOD-a-lot regularly
 More and newer datasets from the LOD Cloud
 Leverage the HDT indexes to support “data science”
 E.g. get links across datasets, study the topology of the network, optimize
query planning
 Support provenance of the triples (i.e. origin of each triple)
 Currently supported only via LOD Laundromat
 … implement the use cases and help the community to democratize
the access to LOD
Roadmap
Take-home messages
 We are currently facing Big Linked Data challenges
 Generation, publication and consumption
 Archiving, evolution…
 Thanks to compression/HDT, the Big Linked Data of
today will be the “pocket” data of tomorrow
 HDT democratizes the access to Big Linked Data
= Cheap, scalable consumers
low-cost access to LOD = high-impact research
ACKs
Thank you!
javier.fernandez@wu.ac.at
Kudos to all the co-authors involved in the works presented here
Incomplete list of ACKs:
Miguel A. Martínez-Prieto
Mario Arias
Pablo de la Fuente
Claudio Gutierrez
Axel Polleres
Wouter Beek
Ruben Verborgh
… And many others

[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Democratizing Big Semantic Data Management with HDT

  • 1. Democratizing Big Semantic Data management Javier D. Fernández WU Vienna, Austria Complexity Science Hub Vienna, Austria Privacy and Sustainable Computing Lab, Austria STANFORD CENTER FOR BIOMEDICAL INFORMATICS APRIL 25TH, 2018. or how to query an RDF graph with 28 billion triples in a standard laptop
  • 2. The Linked Open Data cloud (2018): ~10K datasets organized into 9 domains covering many and varied knowledge fields; 150B statements, including entity descriptions and (inter/intra-dataset) links between them; >500 live endpoints serving this data. http://lod-cloud.net/ http://stats.lod2.eu/ http://sparqles.ai.wu.ac.at/
  • 3. But what about Web-scale queries? E.g. retrieve all entities in LOD referring to the gene WBGene00000001 (aap-1). Solutions? select distinct ?x { ?x dcterms:title "WBGene00000001 (aap-1)" . }
  • 4. Let’s fish in our Linked Data Eco System
  • 5. The Web of Data Eco System. A) Federated Queries!! 1. Get a list of potential SPARQL Endpoints: datahub.io, LOV, other catalogs? 2. Query each SPARQL Endpoint. Problems? Many SPARQL Endpoints have low availability (http://sparqles.ai.wu.ac.at/).
  • 6. The Web of Data Eco System. A) Federated Queries!! 1. Get a list of potential SPARQL Endpoints: datahub.io, LOV, other catalogs? 2. Query each SPARQL Endpoint. Problems? Many SPARQL Endpoints have low availability. SPARQL Endpoints are usually restricted (timeout, number of results). Moreover, complex queries (joins) can be tricky due to intermediate results, delays, etc.
  • 7. The Web of Data Eco System. B) Follow-your-nose 1. Follow self-descriptive IRIs and links. 2. Filter the results you are interested in. Problems? You need some initial seed (DBpedia could be a good start). It’s slow (fetching many documents). Where should I start for unbounded queries such as ?x dcterms:title "WBGene00000001 (aap-1)"?
  • 8. The Web of Data Eco System. C) Use the RDF dumps by yourself 1. Crawl the Web of Data: probably start with datahub.io, LOV, other catalogs? 2. Download datasets: you had better have some free space on your machine. 3. Index the datasets locally: you had better be patient and survive parsing errors. 4. Query all datasets: you had better be alive by then. Problems? Huge resources! + Messiness of the data. Rietveld, L., Beek, W., & Schlobach, S. (2015). LOD lab: Experiments at LOD scale. In ISWC.
  • 9. The problem is in the roots (Big tree = big roots): publication, exchange and consumption of large RDF datasets. Most RDF formats (N3, XML, Turtle) are text serializations, designed for human readability (not for machines). Verbose = high costs to write/exchange/parse. A basic offline search = (decompress) + index the file + search. Steve Garry
  • 10. A Linked Data hacker toolkit: 1) HDT (rdfhdt.org). Highly compact serialization of RDF. Allows fast RDF retrieval in compressed space (without prior decompression). Includes internal indexes to solve basic queries with a small (3%) memory footprint. Very fast on basic queries (triple patterns), 1.5x faster than Virtuoso, Jena, RDF3X. Supports FULL SPARQL as the compressed backend store of Jena, with an efficiency on the same scale as current more optimized solutions. Challenges: the publisher has to pay a bit of overhead to convert the RDF dataset to HDT (but then it is ready to consume efficiently!). Example: 431 M triples ~ 63 GB as NT; 5 GB as NT + gzip; 6.6 GB as HDT (slightly more, but you can query!).
  • 11. HDT (Header-Dictionary-Triples) Overview. Header: metadata describing the RDF dataset. Dictionary: mapping between IDs and elements in the dataset (e.g. 1 = aa.., 2 = ab.., 3 = bu..). Triples: structure of the data after the ID replacement.
  • 12. Header (in RDF). Publication metadata: publisher, public endpoint, … Statistical metadata: number of triples, subjects, entities, histograms… Format metadata: description of the data structure, e.g. triples order. Additional metadata: domain-specific. …
  • 16. Dictionary. Mapping of strings to correlative IDs {1..n}. Lexicographically sorted, no duplicates. Prefix-based compression in each section. Efficient ID↔String operations.
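The split between Dictionary and Triples can be illustrated with a minimal Python sketch. It assumes a single sorted dictionary for all terms (a simplification made here for illustration only), and the example triples and the helper names `build_dictionary`, `encode` and `decode` are hypothetical.

```python
def build_dictionary(triples):
    """Assign IDs 1..n to the distinct terms, in lexicographic order."""
    terms = sorted({term for triple in triples for term in triple})
    term_to_id = {term: i for i, term in enumerate(terms, start=1)}
    return terms, term_to_id


def encode(triples, term_to_id):
    """Replace every term by its ID and sort the resulting ID triples."""
    return sorted(tuple(term_to_id[t] for t in triple) for triple in triples)


def decode(id_triple, terms):
    """Map an ID triple back to its strings (IDs are 1-based)."""
    return tuple(terms[i - 1] for i in id_triple)


# Made-up example data, not taken from the slides.
triples = [
    ("ex:bob", "foaf:knows", "ex:alice"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:knows", "ex:bob"),
]
terms, term_to_id = build_dictionary(triples)
id_triples = encode(triples, term_to_id)
print(id_triples)
print(decode(id_triples[0], terms))
```

Each string is stored once in the dictionary, so the triples component only has to represent small integers.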
  • 17. Dictionary. Plain Front Coding (PFC). Prefix-based compression used in each section. 1. Each string is encoded with two values: an integer representing the number of characters shared with the previous string, and a sequence of characters representing the suffix that is not shared with the previous string. Example: A, An, Ant, Antivirus, Antivirus Software, Best → (0,a) (1,n) (2,t) (3,ivirus) (9, Software) (0,Best)
  • 18. Dictionary. Plain Front Coding (PFC). Prefix-based compression used in each section. 2. The vocabulary is split into buckets, each of them storing “b” strings. The first string of each bucket (header) is coded explicitly (i.e. full string). The subsequent b-1 strings (internal strings) are coded differentially. Example (IDs 1 to 6): A, An, Ant, Antivirus, Antivirus Software, Best → Bucket 1: a (1,n) (2,t); Bucket 2: Antivirus (9, Software) (0,Best)
  • 19. Dictionary. Plain Front Coding (PFC). Prefix-based compression used in each section. 3. PFC is encoded with a byte sequence and an array of pointers (ptr) that denote the first byte of each bucket. Example (IDs 1 to 6): A, An, Ant, Antivirus, Antivirus Software, Best → ptr = 1, 9; Bucket 1: a (1,n) (2,t); Bucket 2: Antivirus (9, Software) (0,Best)
  • 20. Dictionary. Plain Front Coding (PFC). Prefix-based compression used in each section. Locate(string) performs a binary search in the headers + sequential decoding of internal strings, e.g. locate(Antivirus Software) = 5. Extract(id) finds the bucket id/b and decodes until the given position, e.g. extract(5) = Antivirus Software. Example (IDs 1 to 6): A, An, Ant, Antivirus, Antivirus Software, Best → ptr = 1, 9; Bucket 1: a (1,n) (2,t); Bucket 2: Antivirus (9, Software) (0,Best). More on compressed dictionaries: Martínez-Prieto, M. A., Brisaboa, N., Cánovas, R., Claude, F., & Navarro, G. (2016). Practical compressed string dictionaries. Information Systems, 56, 73-108.
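The bucketed Plain Front Coding scheme and its locate/extract operations can be sketched in Python as follows. This is only an illustration under simplifying assumptions: buckets are kept as Python tuples rather than a byte sequence with a pointer array, and the function names are hypothetical. The word list and bucket size b = 3 follow the slide example.

```python
from bisect import bisect_right


def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pfc_encode(sorted_strings, b):
    """Buckets of b strings: header stored in full, the rest as (shared, suffix)."""
    buckets = []
    for start in range(0, len(sorted_strings), b):
        chunk = sorted_strings[start:start + b]
        internal = []
        for prev, cur in zip(chunk, chunk[1:]):
            k = common_prefix_len(prev, cur)
            internal.append((k, cur[k:]))
        buckets.append((chunk[0], internal))
    return buckets


def decode_bucket(bucket):
    """Yield the strings of one bucket in order."""
    header, internal = bucket
    yield header
    prev = header
    for shared, suffix in internal:
        prev = prev[:shared] + suffix
        yield prev


def locate(buckets, b, s):
    """1-based ID of s, or 0 if absent: binary search on headers, then scan."""
    headers = [header for header, _ in buckets]
    i = bisect_right(headers, s) - 1  # last bucket whose header is <= s
    if i < 0:
        return 0
    for offset, cur in enumerate(decode_bucket(buckets[i])):
        if cur == s:
            return i * b + offset + 1
    return 0


def extract(buckets, b, ident):
    """String with the given 1-based ID: pick the bucket, decode up to the offset."""
    i, offset = divmod(ident - 1, b)
    for pos, cur in enumerate(decode_bucket(buckets[i])):
        if pos == offset:
            return cur
    raise IndexError(ident)


words = ["A", "An", "Ant", "Antivirus", "Antivirus Software", "Best"]
buckets = pfc_encode(words, 3)
print(buckets)
print(locate(buckets, 3, "Antivirus Software"), extract(buckets, 3, 5))
```

A larger b compresses better because fewer headers are stored in full, at the price of longer sequential decoding inside a bucket.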
  • 21. Bitmap Triples Encoding (figure: subjects, predicates, objects). We index the bit sequences to provide an SPO index.
  • 22. Remember… Succinct Data Structures. Represent and index large volumes of data in ~ theoretical minimum space while serving efficient operations. Mostly based on 3 operations: access, rank, select.
  • 23. Bit Sequence Coding. Bitmap sequence. Operations in constant time: access(position) = value at that position; rank(position) = number of ones up to position; select(i) = position of the i-th one. Implementation: n + o(n) bits. Adjustable space overhead: in practice, 37.5% overhead. 1 1 0 0 0 1 0 0 1 0 1 1 0 1 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 rank(7) = 4 9 select(5) = 9 1 access(14) = 1
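The three operations can be sketched in plain Python as follows. This is an illustration only, not the constant-time succinct structure used by HDT: it stores per-block counts of ones (the extra space on top of the n bits), answers rank with one lookup plus a short scan, and answers select with a binary search over the block counts. The bitmap is a made-up example, and the class name `Bitmap` and the block size are assumptions.

```python
from bisect import bisect_left


class Bitmap:
    """Bit sequence with 1-based access, rank and select."""

    def __init__(self, bits, block=4):
        self.bits = list(bits)
        self.block = block
        # ones_before[k] = number of 1-bits in the first k * block positions
        self.ones_before = [0]
        for start in range(0, len(self.bits), block):
            chunk = self.bits[start:start + block]
            self.ones_before.append(self.ones_before[-1] + sum(chunk))

    def access(self, pos):
        """Value of the bit at position pos."""
        return self.bits[pos - 1]

    def rank(self, pos):
        """Number of 1-bits in positions 1..pos."""
        k = pos // self.block
        return self.ones_before[k] + sum(self.bits[k * self.block:pos])

    def select(self, i):
        """Position of the i-th 1-bit, or 0 if it does not exist."""
        if i < 1:
            return 0
        idx = bisect_left(self.ones_before, i)
        if idx == len(self.ones_before):
            return 0
        count = self.ones_before[idx - 1]
        start = (idx - 1) * self.block
        for pos in range(start, min(start + self.block, len(self.bits))):
            count += self.bits[pos]
            if count == i:
                return pos + 1
        return 0


bm = Bitmap([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
print(bm.access(3), bm.rank(5), bm.select(4))
```

rank and select are inverses on the 1-bits: `bm.rank(bm.select(i)) == i` for every existing i.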
  • 24. Bitmap Triples Encoding. Bitmap Triples: subjects, predicates, objects. E.g. retrieve (2,5,?): find the position of the second ‘1’-bit in Bp (select); binary search on the list of predicates looking for 5; note that such predicate 5 is in position 4 of Sp; find the position of the fourth ‘1’-bit in Bo (select). Supported patterns: SPO, SP?, S?O, S??, ???.
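The (S,P,?) lookup can be sketched in Python on a toy Bitmap Triples layout. The data below is made up for illustration (it is not the figure from the slide), and the sketch assumes that a 1-bit in Bp or Bo marks the last element of each adjacency list. The names `SP`, `BP`, `SO`, `BO`, `select1`, `list_bounds` and `objects_of` are hypothetical.

```python
from bisect import bisect_left

# Toy ID triples: (1,2,1) (1,2,4) (1,3,2) (2,1,3) (2,5,1) (2,5,2) (3,2,4)
SP = [2, 3, 1, 5, 2]        # predicates, grouped by subject
BP = [0, 1, 0, 1, 1]        # 1 = last predicate of a subject
SO = [1, 4, 2, 3, 1, 2, 4]  # objects, grouped by (subject, predicate) pair
BO = [0, 1, 1, 1, 0, 1, 1]  # 1 = last object of a pair


def select1(bits, i):
    """0-based index of the i-th 1-bit (i >= 1); linear scan for clarity."""
    seen = 0
    for idx, bit in enumerate(bits):
        seen += bit
        if bit and seen == i:
            return idx
    raise ValueError("fewer than i one-bits")


def list_bounds(bits, k):
    """0-based inclusive bounds of the k-th list (k >= 1)."""
    start = 0 if k == 1 else select1(bits, k - 1) + 1
    return start, select1(bits, k)


def objects_of(s, p):
    """Resolve the pattern (s, p, ?) and return the matching object IDs."""
    lo, hi = list_bounds(BP, s)          # predicate list of subject s
    j = bisect_left(SP, p, lo, hi + 1)   # binary search inside that list
    if j > hi or SP[j] != p:
        return []
    olo, ohi = list_bounds(BO, j + 1)    # object list of the pair at position j
    return SO[olo:ohi + 1]


print(objects_of(2, 5))
```

With this toy data, predicate 5 of subject 2 sits at the fourth position of `SP`, so the objects are delimited by the third and fourth 1-bits of `BO`, mirroring the steps listed on the slide.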
  • 25. Additional Predicate Index (?P?), encoded as a position list. E.g. retrieve (?,4,?): get the location of the list, Bl.select(3)+1 = 4+1 = 5; retrieve the position: Sl[5] = 1; get the associated subject, rank(1) = 1st subject; access the object as (1,4,?). (Figure: predicate position list Sl with bitmap Bl.)
  • 26. Additional Object Index (??O, ?PO), encoded as a position list. (Figure: object position list Sl with bitmap Bl, alongside the predicate index.)
  • 27. On-the-fly indexes: HDT-FoQ. From the exchanged HDT to the functional HDT-FoQ: publish and exchange HDT; then, at the consumer: 1. We index the bit sequences: Subject index (SPO), patterns SPO, SP?, S??, S?O, ???. 2. We index the position of each predicate (just a position list): Predicate index (PSO), pattern ?P?. 3. We index the position of each object (just a position list): Object index (OPS), patterns ?PO, ??O.
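The predicate position list can be sketched in Python as follows. The sketch keeps the index as a plain dictionary from predicate ID to positions in Sp, instead of the sequence plus bitmap (Sl, Bl) described on the slides; the toy data, the end-of-list bitmap convention and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Toy ID triples: (1,2,1) (1,2,4) (1,3,2) (2,1,3) (2,5,1) (2,5,2) (3,2,4)
SP = [2, 3, 1, 5, 2]        # predicates, grouped by subject
BP = [0, 1, 0, 1, 1]        # 1 = last predicate of a subject
SO = [1, 4, 2, 3, 1, 2, 4]  # objects, grouped by (subject, predicate) pair
BO = [0, 1, 1, 1, 0, 1, 1]  # 1 = last object of a pair


def build_predicate_index(sp):
    """For each predicate ID, the 0-based positions where it occurs in sp."""
    index = defaultdict(list)
    for pos, p in enumerate(sp):
        index[p].append(pos)
    return index


def rank1(bits, end):
    """Number of 1-bits in bits[0:end]."""
    return sum(bits[:end])


def select1(bits, i):
    """0-based index of the i-th 1-bit (i >= 1); linear scan for clarity."""
    seen = 0
    for idx, bit in enumerate(bits):
        seen += bit
        if bit and seen == i:
            return idx
    raise ValueError("fewer than i one-bits")


def triples_with_predicate(p, index):
    """Resolve the pattern (?, p, ?) using the position list of p."""
    out = []
    for pos in index.get(p, []):
        s = rank1(BP, pos) + 1  # subject lists closed before pos, plus one
        start = 0 if pos == 0 else select1(BO, pos) + 1
        end = select1(BO, pos + 1)
        out.extend((s, p, o) for o in SO[start:end + 1])
    return out


predicate_index = build_predicate_index(SP)
print(triples_with_predicate(2, predicate_index))
```

The index is derived entirely from Sp, which is why the consumer can build it on the fly after downloading the HDT file.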
  • 29. Some numbers on size: 28,362,198,927 triples. http://dataweb.infor.uva.es/projects/hdt-mr/ José M. Giménez-García, Javier D. Fernández, and Miguel A. Martínez-Prieto. HDT-MR: A Scalable Solution for RDF Compression with HDT and MapReduce. In Proc. of International Semantic Web Conference (ISWC), 2015.
  • 30. Results. Data is ready to be consumed 10-15x faster. HDT << any other RDF format || RDF engine. Competitive query performance: very fast on triple patterns, 1.5x faster (Virtuoso, RDF3x). Integration with Jena: joins on the same scale as existing solutions (Virtuoso, RDF3x).
  • 32. https://github.com/rdfhdt, C++ and Java tools Only in the last two weeks… HDT-cpp
  • 33. A Linked Data hacker toolkit: 2) LOD Laundromat (http://lodlaundromat.org/), which uses HDT as the main storage solution. Challenges: you still need to query 650K datasets. Of course it does not contain all LOD, but “a good approximation”. Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., & Schlobach, S. (2014, October). LOD laundromat: a uniform way of publishing other people’s dirty data. In ISWC (pp. 213-228).
  • 34. A Linked Data hacker toolkit: 3) Linked Data Fragments, which typically uses HDT as the main engine. Challenges: still room for optimization for complex federated queries (delays, intermediate results, …). Verborgh, R., Hartig, O., De Meester, B., Haesendonck, G., De Vocht, L., Vander Sande, M., ... & Van de Walle, R. (2014). Querying datasets on the web with high availability. In ISWC (pp. 180-196).
  • 35. Get more than 650K HDT datasets from LOD Laundromat…
  • 36. And query them online
  • 37. Application scenarios (HDT). Scalable storage: store and serve thousands of (large) datasets (e.g. LOD Laundromat); archiving (reduce storage costs + foster smart consumption), also with deltas (see https://aic.ai.wu.ac.at/qadlod/bear.html). Better consumer-centric publication: compress and share ready-to-consume RDF datasets. Consumption with limited resources: smartphones, standard laptops. Fast, low-cost SPARQL query engine: via HDT-Jena, via Linked Data Fragments.
  • 38. Application Examples. Storage + light API: http://lodlaundromat.org/, http://linkeddatafragments.org/. Storage + SPARQL query engine: https://data.world/. Advanced features: top-k shortest path; query answering over the Web of Data. Others: versioning, streaming, …
  • 39. Application Examples Publication in HDT (~1B triples): https://zenodo.org/record/1116889#.WuBt0C7FKpo Uri resolver based on HDT: https://github.com/pharmbio/urisolve 91,498,351 compounds from PubChem with predicted logD (water–octanol distribution coefficient) values at 90% confidence level
  • 40. But what about Web-scale queries? E.g. retrieve all entities referring to the gene WBGene00000001 (aap-1). Solutions? select distinct ?x { ?x dcterms:title "WBGene00000001 (aap-1)" . }
  • 41. But what about Web-scale queries (flashback): LOD-a-lot
  • 42. LOD-a-lot pipeline (figure): LOD Laundromat crawls, cleans, indexes and stores Linked Open Data as Dataset 1 … Dataset 650K (zipped N-Triples), with a SPARQL endpoint for metadata; LOD-a-lot integrates them into 28B triples.
  • 43. LOD-a-lot (some numbers). Disk size: HDT 304 GB; HDT-FoQ (additional indexes) 133 GB. Memory footprint (to query): 15.7 GB of RAM (3% of the size); 144 seconds loading time; 8 cores (2.6 GHz), RAM 32 GB, SATA HDD on Ubuntu 14.04.5 LTS. LDF page resolution in milliseconds. 305€. (LOD-a-lot creation took 64 h & 170GB RAM. HDT-FoQ took 8 h & 250GB RAM)
  • 45. LOD-a-lot (some use cases). Query resolution at Web scale: using LDF, Jena. Evaluation and benchmarking: no excuse. RDF metrics and analytics (subjects, predicates, objects).
  • 46. LOD-a-lot (some use cases). Identity closure: ?x owl:sameAs ?y. Graph navigations: e.g. shortest path, random walk. More use cases: Wouter Beek, Javier D. Fernández and Ruben Verborgh. LOD-a-lot: A Single-File Enabler for Data Science. In Proc. of SEMANTiCS 2017.
  • 47. Roadmap. Update LOD-a-lot regularly: more and newer datasets from the LOD Cloud. Leverage the HDT indexes to support “data science”: e.g. get links across datasets, study the topology of the network, optimize query planning. Support provenance of the triples (i.e. origin of each triple): currently supported only via LOD Laundromat. … implement the use cases and help the community to democratize the access to LOD.
  • 48. Take-home messages. We are currently facing Big Linked Data challenges: generation, publication and consumption; archiving, evolution… Thanks to compression/HDT, the Big Linked Data of today will be the “pocket” data of tomorrow. HDT democratizes the access to Big Linked Data = cheap, scalable consumers. Low-cost access to LOD = high-impact research.
  • 50. Thank you! javier.fernandez@wu.ac.at Kudos to all the co-authors involved in the works presented here Incomplete list of ACKs: Miguel A. Martínez-Prieto Mario Arias Pablo de la Fuente Claudio Gutierrez Axel Polleres Wouter Beek Ruben Verborgh … And many others