This document discusses the capabilities and performance of Virtuoso, an open-source database for managing and querying semantic data. It describes how Virtuoso uses techniques like column storage, vector execution, and structure awareness to achieve SQL and SPARQL query performance on par with specialized relational databases. The document also outlines several European Union-funded research projects aimed at further improving RDF database performance and scaling through benchmarks, geospatial extensions, and graph analytics.
2. Linked Data at Dawn
The Promise and the Practice
The Science of Speed
The Structure which Is
Ongoing Research
License CC-BY-SA 4.0 (International).
3. Linked Data Promises
RDF is a generic, minimalistic model for describing
things
RDF has global identifiers and data is self-describing
URIs may be dereferenceable
RDF is flexible to query, does not force a single
hierarchical view like XML
4. Linked Data Scenarios
RDF is used because of
schema flexibility
global identifiers
Inference, if present, is usually trivial
Subclass
Sub-property
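The kind of trivial inference meant here can be sketched in a few lines. This is a minimal illustration, not Virtuoso's implementation: triples are plain Python tuples, and the `ex:` names and helper functions are made up for the example.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"


def closure(pairs):
    """Map each term to all of its transitive ancestors."""
    parents = defaultdict(set)
    for child, parent in pairs:
        parents[child].add(parent)

    def ancestors(term, seen):
        for p in parents.get(term, ()):
            if p not in seen:
                seen.add(p)
                ancestors(p, seen)
        return seen

    return {t: ancestors(t, set()) for t in list(parents)}


def infer(triples):
    """Return the input triples plus subclass and sub-property entailments."""
    sub_cls = closure((s, o) for s, p, o in triples if p == SUBCLASS)
    sub_prop = closure((s, o) for s, p, o in triples if p == SUBPROP)
    out = set(triples)
    for s, p, o in triples:
        # A statement made with a sub-property also holds for each super-property.
        for sup in sub_prop.get(p, ()):
            out.add((s, sup, o))
        # An instance of a class is also an instance of each superclass.
        if p == RDF_TYPE:
            for sup in sub_cls.get(o, ()):
                out.add((s, RDF_TYPE, sup))
    return out


# Hypothetical example data.
triples = {
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:hasMother", SUBPROP, "ex:hasParent"),
    ("ex:rex", RDF_TYPE, "ex:Dog"),
    ("ex:rex", "ex:hasMother", "ex:bella"),
}
inferred = infer(triples)
```

Here `inferred` additionally states that `ex:rex` is an `ex:Mammal` and an `ex:Animal`, and that `ex:rex` has `ex:bella` as `ex:hasParent`.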
5. Where Triples Come From
Relational extracts or web content are converted to
and stored as triples
NLP extraction
New applications with RDF as primary data model
Running SPARQL against data in RDBs is possible
but rare, and does not deliver the flexibility
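As a rough sketch of the first route (relational extract to triples), one row can be turned into one triple per non-NULL column. The URI scheme, the `http://example.org/` base and the function name are assumptions for illustration only, not a standard mapping such as R2RML.

```python
def row_to_triples(table, key_column, row, base="http://example.org/"):
    """Turn one relational row (a dict) into (subject, predicate, object) triples."""
    subject = f"{base}{table}/{row[key_column]}"
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            # The key is folded into the subject URI; NULLs produce no triple.
            continue
        triples.append((subject, f"{base}{table}#{column}", value))
    return triples


# Hypothetical row from a "customer" table.
row = {"id": 7, "name": "Alice", "city": "Oslo", "fax": None}
triples = row_to_triples("customer", "id", row)
```

The NULL `fax` column simply disappears, which is one reason triples tolerate irregular data more easily than a fixed schema does.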
6. Linked Data Verticals and Patterns
Publishing: tagging and annotations, evolving vocabularies
Archives: self-description, long-term identifiers, many versions
of schema
Semantic search: structured, semi-structured and full text all in
one
Business intelligence: many sources, ease of adding sources,
no 6-month DW schema change cycle
E-science: often in life sciences: common interchange format,
nano-publications, NLP extracts, different users cook their data
differently, provenance
7. The Hopes and Perceptions
The age of ad hoc
Find insight in any data, when you need it, from any source,
any format
No data warehouse planning cycles, make your own from
the pieces you need, when you need it
Still, data integration remains hard work, quality and
coverage of sources vary
Flexibility may be there, but are performance and scalability
on the level?
8. Yes, But ...
Web and Big Data: everybody reinvents the triple: self-description,
long-term identifiers, key-value pairs in many non-RDF use cases
SPARQL and RDF would be the natural, standards-compliant
choice if they beat SQL, information retrieval, custom big data,
key-value and map-reduce solutions
Is this intrinsic to linked data or is this lack of engineering?
Linked data has unique advantages in breadth of coverage and
expressivity but performance must not lag behind.
9. What is the RDF Tax?
90% of bad performance comes from non-optimal
query plans
Some comes from indexing too much (e.g. SQL
bulk load with no indices is 50x faster than the
equivalent RDF load with everything indexed)
Some comes from string ops on URIs and literals
Some comes from having a join for every attribute.
Vectoring and the right plans help, though
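The "join for every attribute" cost can be illustrated with a toy comparison. The classes below are hypothetical and only count index probes; they say nothing about how Virtuoso actually stores or measures anything.

```python
from collections import defaultdict


class TripleIndex:
    """Toy predicate -> subject -> objects index that counts index probes."""

    def __init__(self, triples):
        self.pso = defaultdict(lambda: defaultdict(list))
        self.probes = 0
        for s, p, o in triples:
            self.pso[p][s].append(o)

    def lookup(self, predicate, subject):
        self.probes += 1
        return self.pso.get(predicate, {}).get(subject, [])


class RowStore:
    """Toy table keyed by primary key that counts row fetches."""

    def __init__(self, rows):
        self.rows = rows
        self.probes = 0

    def lookup(self, key):
        self.probes += 1
        return self.rows.get(key)


def fetch_from_triples(index, subject, predicates):
    # One index probe (one join step) per requested attribute.
    return {p: index.lookup(p, subject) for p in predicates}


# The same hypothetical order, stored both ways.
triples = [
    ("order:1", "price", 10.0),
    ("order:1", "qty", 3),
    ("order:1", "date", "2014-08-01"),
]
index = TripleIndex(triples)
table = RowStore({"order:1": {"price": 10.0, "qty": 3, "date": "2014-08-01"}})

from_triples = fetch_from_triples(index, "order:1", ["price", "qty", "date"])
from_table = table.lookup("order:1")
```

Reading three attributes costs three probes against the triple index and a single fetch against the table; with more attributes and more rows the gap in locality grows accordingly.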
10. The Bane of the Triple
When data is stored as triples:
There is structure still but it is harder to exploit: Schema re-emerges
as correlations
More joins make more possible query plans, bigger errors in
plan cost estimation
More joining reduces locality
Lack of schema causes needless indexing, data takes more
space
A URI for everything takes space and time
For the same workload, SQL can be 2-20x faster, even with
Virtuoso
11. The Question is Raised
LOD2 FP7, now ending: RDF performance parity
with relational?
SQL is the senior science; whoever ignores history is
bound to repeat it
Integral mastery of RDB science is a prerequisite,
but do not forget the subtle twists of schema-lessness
12. Virtuoso Leadership in Linked Data
2000-06: V1.x-4.x, SQL row store with SQL federation and
XML
2007-08: V5.x-6.x, SPARQL; adapted for RDF quads with
more compression, bitmap indices, special data types, RDF
awareness in query optimization
2009: V6.x, scale-out cluster capable
2010-13: V7.x, column store, vectored execution, 3x more
space efficient, 10+x more speed
2013: Star Schema Benchmark with SPARQL, 100x MySQL
SQL, 0.8x MonetDB SQL
2014: Top-of-the-line SQL analytics, 500 Gtriples, structure
awareness
13. Triples Are Done Right, so?
Column store techniques are a good fit, index based
triple storage does not get much better
RAM-only pointer based techniques can be faster
but cost 10-100x more to scale up
To take RDF to SQL parity, Virtuoso must first be
on the level with the best in SQL
TPC-H is the checklist for mastery of DW and query
optimization; whoever survives it shall not fear
Parity is achieved when running with triples is just like
running with tables
14. Structure is Everywhere
CWI in LOD2:
90% of triples in Common Crawl fall in 20 tables
All relational extractions are 100% tables
Even DBpedia is 90% covered by 500 tables, but is
unusually heterogeneous, albeit not very large
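A coverage figure of this kind can be approximated by grouping subjects by the exact set of properties they carry and counting how many triples fall into the most common groups. This is a simplified sketch under that assumption; the CWI analysis is not described in detail here and may well merge similar property sets. All names and data below are illustrative.

```python
from collections import Counter, defaultdict


def property_sets(triples):
    """Map each subject to the frozenset of properties it uses."""
    props = defaultdict(set)
    for s, p, _ in triples:
        props[s].add(p)
    return {s: frozenset(ps) for s, ps in props.items()}


def coverage(triples, k):
    """Fraction of triples whose subject has one of the k most common property sets."""
    sets_by_subject = property_sets(triples)
    triples_per_set = Counter(sets_by_subject[s] for s, _, _ in triples)
    covered = sum(n for _, n in triples_per_set.most_common(k))
    return covered / len(triples)


# Hypothetical data: three people, two books, one irregular person.
triples = []
for person in ("p1", "p2", "p3"):
    triples += [(person, "name", "x"), (person, "age", 1)]
for book in ("b1", "b2"):
    triples += [(book, "title", "t"), (book, "year", 2014)]
triples += [("p4", "name", "y"), ("p4", "age", 2), ("p4", "nickname", "z")]

top2 = coverage(triples, 2)
```

With this data, the two most common property sets ("person" and "book") account for 10 of the 13 triples.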
15. The Glorious Dawn:
Structure is the Servant, not the Tyrant
A set of subjects with all the same single valued properties is
in fact a table.
So, store it as a table
Allow exceptions, e.g. sometimes multiple values, different
values in different graphs, extra properties, etc.
If it is big, it has repeating structure
All RDF semantics are preserved, any triple is possible but the
common ones are SQL compact and SQL fast
With tables, query optimization returns to SQL complexity and
is much more reliable
So, more tricks from the SQL analytics bag become safe and
applicable
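The idea of "store it as a table, allow exceptions" can be sketched as a toy store in which declared properties go into a row per subject and anything that does not fit stays behind as a plain triple. The class and its names are hypothetical; this is not Virtuoso's storage layout, and graphs are left out for brevity.

```python
class StructureAwareStore:
    """Toy store: regular data goes to a table, exceptions stay as triples."""

    def __init__(self, columns):
        self.columns = tuple(columns)  # properties declared to occur together
        self.table = {}                # subject -> {property: single value}
        self.overflow = set()          # exception triples

    def load(self, triples):
        for s, p, o in triples:
            row = self.table.get(s, {})
            if p in self.columns and (p not in row or row[p] == o):
                row[p] = o             # first (or repeated) value fits the table
                self.table[s] = row
            else:
                # Undeclared property or an additional value: keep as a triple.
                self.overflow.add((s, p, o))

    def triples(self):
        """Yield every stored triple, so no RDF statement is lost."""
        for s, row in self.table.items():
            for p, o in row.items():
                yield (s, p, o)
        yield from self.overflow


# Hypothetical data: two regular subjects plus two exceptions.
data = [
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:age", 30),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:bob", "ex:age", 41),
    ("ex:bob", "ex:name", "Robert"),      # second value for a single-valued column
    ("ex:alice", "ex:nickname", "Al"),    # property outside the declared columns
]
store = StructureAwareStore(columns=["ex:name", "ex:age"])
store.load(data)
```

The common case lands in compact rows while `store.triples()` still returns every original statement, which is the sense in which any triple remains possible.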
16. Gains from Structure Awareness
3+x Load Speed
2x more space efficiency
Queries against regular data within 10-20% of SQL
speeds
Just declare which properties tend to occur
together, no strict schema first like with SQL
Later, self configuration
17. The Cycle of Adventure
Rebels: SQL not cool, too rigid, drop
ACID, go key-value, map-reduce, the
triple is all there is, semantic web
Pioneers: Life on the frontier is hard,
infrastructure missing or bad
Same everyday problems also in
Utopia
Recognizing the objective values, e.g.
schema freedom and identifiers, no
AI. Do the job, forget dogma
Reconciliation: schema-first and
schema-last converge in structure
awareness
18. Present FP7 Research
LDBC - Transparency and
Relevance for Graph DB, RDF
performance
GeoKnow - GeoData is everywhere,
how to carry the planet in your
pocket
LOD2 - Where no triple has gone
before (and come back)
Open PHACTS - A Data Platform
for Drug Discovery
19. LDBC - Linked Data Benchmark Council
Rebels: SQL not cool, too rigid, drop ACID,
go key-value, map-reduce, the triple is all
there is, semantic web
Pioneers: Life on the frontier is hard,
infrastructure missing or bad
Same everyday problems also in Utopia
Recognizing the objective values, e.g. schema
freedom and identifiers, no AI. Do the job,
forget dogma
Reconciliation: Some of the rebel thinking
becomes mainstream, e.g. schema-first and
schema-last converge in structure awareness
20. LDB Council, Independent Industry Forum
for Benchmarking
The TPC for the frontiers of database
Bootstrapped in the LDBC FP7, continues
as independent industry association
OpenLink, Ontotext, Neo Technology,
Sparsity as founding members
IBM, Oracle Labs, Systap, SPARQL City
already joined
DB superstars Peter Boncz and Thomas
Neumann as founders and scientific leads
21. LDBC Benchmarks
Social Network
Online - Lookups, updates, analysis of
social environment
Business Intelligence - Spotting trends, key
players, big query
Graph analytics - Community detection,
PageRank, graph metrics
Semantic Publishing
Modeled after the BBC linked data portal,
online lookups, drill downs and updates
22. GeoKnow - The Planet in your Pocket
Ms. Globe and Mr. Cube have a
thing going on:
Mr. Cube: Desiloization ...
integrated metadata ... explicit
semantics.
Ms. Globe: I can feel it... but are
you man enough ... you need to
show me.
23. Planet Scale Roadmap
Jan 2014:
Virtuoso SPARQL outperforms PostGIS in map lookups with planet-wide
OpenStreetMap
Virtuoso SQL adds 5x more power
24. Next: Jan 2015
Parity between SPARQL and SQL via structure
awareness
Geospatial data clustering
Graph analytics close to the data: Pregel, Giraph, etc.
in the DB itself
Adding a fine-grained geo dimension to the LDBC social
network benchmark
25. The LOD2 scaling adventures
Experiments at CWI's SciLens cluster:
150 Gtriples in Jan 2013 (8 x 256 GB RAM)
500 Gtriples in Aug 2014 (12 x 256 GB RAM)
Some trillion triple claims exist but do not
detail any query workload
BSBM explore and BI workloads
10x speed gains for BI queries from 2013 to
2014
Bulk load at 6M triples/s
All done in triples, structure awareness will
go further still
27. Virtuoso Now
Snapshot of RDF Linked Data customers in the Enterprise:
Data.Gov (U.S. Govt. Open
Linked Data initiative)
Bank of America
Booz Allen Hamilton
Northrop Grumman
Elsevier
French National Library
Samsung
Globo
Daimler Benz
Johnson & Johnson
Bayer
St. Jude Medical
Fujitsu
Syngenta
and many more
28. Virtuoso Availability
Most capabilities as open source
Commercial edition adds
Cluster scale-out
SQL Federation
Replication (SQL & RDF)
Advanced RDF security, ABAC & RBAC (ACLs)
Wide tables
and more
Up-to-the-minute tech previews via v7fasttrack on GitHub, e.g.
superfast TPC-H implementation
29. Virtuoso Future
Preview of structure aware RDF store in fall 2014
via v7fasttrack
Integrated graph analytics framework
Embed complex graph algos, e.g. community
detection, shortest path inside SPARQL/SQL
Comparison of SQL and SPARQL for big data
analytics
30. Linked Data Now
Adoption across major industries
Superior flexibility and time to solution
Dramatic performance gains in the last 5 years
Benchmarking will continue to drive progress, to the benefit of
users and vendors alike
Run circles around most open source SQL in SPARQL:
Virtuoso SPARQL beats MySQL in SSB by 100x
With structure awareness, SPARQL to match the best in SQL
for data warehousing, OLTP
Linked Data no longer a long shot but a technology that makes
sense
31. About OpenLink Software
OpenLink Software is a privately held company founded in 1992 by its President & CEO,
Kingsley Idehen. The company is an industry-acclaimed technology innovator in the
following areas:
ODBC, JDBC, ADO.NET, and OLE-DB compliant Data Access Drivers for Oracle,
SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL
High-Performance & Scalable Multi-Model (Relational & Graph) Database
Technology
Data Integration Middleware (Data Virtualization Technology across a wide variety of
Protocols & Formats)
Web Application Server Technology
Linked Data Deployment & Management
Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis,
Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.)
Identity Management.
32. Office Locations
USA
OpenLink Software, Inc
10 Burlington Mall Road
Suite 265
Burlington, MA 01803
Tel.: +1 781 273 0900
Fax: +1 781 229 8030
UK
OpenLink Software Ltd.
Airport House
Purley Way
Croydon, Surrey CR0 0XZ
Tel.: +44 (0)20 8681 7701
Fax: +44 (0)20 8681 7702
33. Additional Information
Web Sites
OpenLink Software
YouID – Digital Identity Card (Certificate) Generator
OpenLink Data Spaces – Semantically enhanced Personal & Enterprise Data Spaces &
Collaboration Platform
OpenLink Virtuoso - Hybrid Data Management, Integration, Application, and Identity Server
Universal Data Access Drivers - High-Performance ODBC, JDBC, ADO.NET, and OLE-DB
Drivers
LDAP and NetID-TLS – How to use LDAP scheme URIs with NetID-TLS Authentication
Social Media Data spaces
http://kidehen.blogspot.com (weblog)
http://www.openlinksw.com/blog/~kidehen/ (weblog)
https://plus.google.com/112399767740508618350/posts (Google+)
https://twitter.com/#!/kidehen (Twitter)
Hashtag: #LinkedData (Anywhere).