Virtuoso -- The Prometheus of RDF

Virtuoso, The
Prometheus of RDF
By Orri Erling
Virtuoso Program Manager, OpenLink Software

Linked Data at Dawn
 The Promise and the Practice
 The Science of Speed
 The Structure which Is
 Ongoing Research
License CC-BY-SA 4.0 (International).

Linked Data Promises
 RDF is a generic, minimalistic model for describing
things
 RDF has global identifiers and data is self-describing
 URIs may be dereferenceable
 RDF is flexible to query, does not force a single
hierarchical view like XML
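A minimal sketch of the points above, assuming made-up example.org identifiers and a hypothetical match helper: the data is a flat set of (subject, predicate, object) statements with global identifiers, and it can be queried from any direction without a fixed hierarchy.

```python
# Illustrative data only; the example.org URIs are invented for this sketch.
EX = "http://example.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = [
    (EX + "alice", RDF_TYPE, EX + "Person"),
    (EX + "alice", EX + "name", "Alice"),
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "bob", RDF_TYPE, EX + "Person"),
    (EX + "bob", EX + "name", "Bob"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# The same statements answer questions from any starting point.
about_alice = match(triples, s=EX + "alice")
all_people = [s for s, _, _ in match(triples, p=RDF_TYPE, o=EX + "Person")]
who_knows_bob = [s for s, _, _ in match(triples, p=EX + "knows", o=EX + "bob")]
```

A SPARQL engine does the same pattern matching, with indices and a query optimizer in place of the linear scan used here.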

Linked Data Scenarios
RDF is used because of
 schema flexibility
 global identifiers
Inference, if present, is usually trivial
 Subclass
 Sub-property
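To illustrate how small this kind of inference is, here is a sketch of subclass and sub-property expansion over plain tuples. The prefixed names (ex:Student and so on) and the helper functions are assumptions made for the example, not part of any particular engine.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"


def superiors(term, schema, relation):
    """Return term plus all its transitive superclasses or superproperties."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for s, p, o in schema:
            if p == relation and s == current and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def infer(data, schema):
    """Add the triples entailed by subclass and sub-property statements."""
    inferred = set(data)
    for s, p, o in data:
        if p == RDF_TYPE:
            for cls in superiors(o, schema, SUBCLASS):
                inferred.add((s, RDF_TYPE, cls))
        else:
            for prop in superiors(p, schema, SUBPROP):
                inferred.add((s, prop, o))
    return inferred


# Hypothetical schema and data.
schema = [
    ("ex:Student", SUBCLASS, "ex:Person"),
    ("ex:Person", SUBCLASS, "ex:Agent"),
    ("ex:advisor", SUBPROP, "ex:knows"),
]
data = [
    ("ex:ann", RDF_TYPE, "ex:Student"),
    ("ex:ann", "ex:advisor", "ex:bo"),
]
closure = infer(data, schema)
```

Real stores typically apply such rules at query time rather than materializing every entailed triple; the materializing version is used here only because it is the shortest to show.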

Where Triples Come From
 Relational extracts or web content are converted to
and stored as triples
 NLP extraction
 New applications with RDF as primary data model
 Doing SPARQL against data in RDBs is possible
but is rare and does not deliver the flexibility

Linked Data Verticals and Patterns
Publishing: tagging and annotations, evolving vocabularies
Archives: self description, long term identifiers, many versions
of schema
Semantic search: structured, semi-structured and full text all in
one
Business intelligence: many sources, ease of adding sources,
no 6 month DW schema change cycle
E-science: often in life sciences: common interchange format,
nano-publications, NLP extracts, different users cook their data
differently, provenance

The Hopes and Perceptions
The age of ad hoc
Find insight in any data, when you need it, from any source,
any format
No data warehouse planning cycles, make your own from
the pieces you need, when you need it
Still, data integration remains hard work, quality and
coverage of sources vary
Flexibility may be there, but are performance and scalability
on the level?

Yes, But ...
Web and Big Data: Everybody reinvents the triple: self-description,
long term identifiers, key-value pairs in many non-RDF
use cases
SPARQL and RDF would be the natural, standards compliant
choice if they did beat SQL, information retrieval, custom big data,
key value, map reduce solutions
Is this intrinsic to linked data or is this lack of engineering?
Linked data has unique advantages in breadth of coverage and
expressivity but performance must not lag behind.

What is the RDF Tax?
 90% of bad performance comes from non-optimal
query plans
 Some comes from indexing too much (e.g. SQL
bulk load with no indices is 50x faster than the
equivalent in RDF with all indexed)
 Some comes from string ops on URIs and literals
 Some comes from having a join for every attribute.
Vectoring and right plans help, though
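The "join for every attribute" point can be seen in a small sketch using Python's built-in sqlite3 module. The person table, the triples table and the sample rows are invented for illustration; this is not Virtuoso's storage layout. The same question needs no join against the table and two self-joins against the triples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id TEXT PRIMARY KEY, name TEXT, age INTEGER, city TEXT);
    CREATE TABLE triples (s TEXT, p TEXT, o);  -- o is untyped on purpose
    CREATE INDEX triples_pso ON triples (p, s, o);
""")

rows = [("p1", "Ann", 34, "Oslo"), ("p2", "Bo", 51, "Riga"), ("p3", "Cy", 29, "Oslo")]
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", rows)
for pid, name, age, city in rows:
    # One row in the table becomes one triple per attribute.
    conn.executemany(
        "INSERT INTO triples VALUES (?, ?, ?)",
        [(pid, "name", name), (pid, "age", age), (pid, "city", city)],
    )

# Relational form: every attribute of a row comes back from a single lookup.
table_result = conn.execute(
    "SELECT name, age FROM person WHERE city = 'Oslo' ORDER BY name"
).fetchall()

# Triple form: each further attribute costs one more self-join.
triple_result = conn.execute("""
    SELECT n.o, a.o
    FROM triples c
    JOIN triples n ON n.s = c.s AND n.p = 'name'
    JOIN triples a ON a.s = c.s AND a.p = 'age'
    WHERE c.p = 'city' AND c.o = 'Oslo'
    ORDER BY n.o
""").fetchall()
```

Both queries return the same rows. The triple form also gives the optimizer more join orders to choose from, which is where the plan quality issue named above comes in.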

The Bane of the Triple
When data is stored as triples:
 There is structure still but it is harder to exploit: Schema re-emerges
as correlations
 More joins make more possible query plans, bigger errors in
plan cost estimation
 More joining reduces locality
 Lack of schema causes needless indexing, data takes more
space
 A URI for everything takes space and time
For the same workload, SQL can be 2-20x faster, even with
Virtuoso

The Question is Raised
 LOD2 FP7, now ending: RDF Performance parity
with relational?
 SQL is the senior science; whoever ignores history is
bound to repeat it
 Integral mastery of RDB science is a prerequisite,
but do not forget the subtle twists of schema-lessness

Virtuoso Leadership in Linked Data
 2000 – 06 V1.x - 4.x SQL row store with SQL federation and
XML
 2007 – 08 V 5.x - 6.x SPARQL, Adapted for RDF quads with
more compression, bitmap indices, special data types, RDF
awareness in query optimization
 2009 - 6.x Scale out cluster capable
 2010 – 13 V7.x Column store, vectored execution, 3x more
space efficient, 10+x more speed
 2013 Star Schema benchmark with SPARQL 100x MySQL
SQL, 0.8x MonetDB SQL
 2014 - Top of the line SQL analytics, 500Gtriples, Structure
Awareness

Triples Are Done Right, so?
 Column store techniques are a good fit, index based
triple storage does not get much better
 RAM-only pointer based techniques can be faster
but cost 10-100x more to scale up
 To take RDF to SQL parity, Virtuoso must first be
on the level with the best in SQL
 TPC-H is the checklist for mastery of DW and query
optimization; who survives shall not fear
 Parity is achieved when running with triples just like
with tables

Structure is Everywhere
CWI in LOD2:
 90% of triples in Common Crawl fall in 20 tables
 All relational extractions are 100% tables
 Even DBpedia is 90% covered by 500 tables, but is
unusually heterogeneous, albeit not very large
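One simple way to measure this kind of coverage is sketched below: group subjects by the exact set of properties they carry and ask what share of all triples falls under the k most common sets. This is a simplification made for illustration and is not claimed to be the method CWI used; the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def coverage_by_top_shapes(triples, k):
    """Fraction of triples whose subject has one of the k most common property sets."""
    if not triples:
        return 0.0
    props = defaultdict(set)
    triples_per_subject = Counter()
    for s, p, _ in triples:
        props[s].add(p)
        triples_per_subject[s] += 1

    # Weight each property set ("shape") by the number of triples it accounts for.
    triples_per_shape = Counter()
    for s, shape in props.items():
        triples_per_shape[frozenset(shape)] += triples_per_subject[s]

    covered = sum(n for _, n in triples_per_shape.most_common(k))
    return covered / len(triples)


# Four regular subjects with properties a and b, plus two one-off subjects.
sample = []
for i in range(4):
    sample.append((f"s{i}", "a", i))
    sample.append((f"s{i}", "b", i))
sample.append(("x", "c", 0))
sample.append(("y", "d", 0))

top1 = coverage_by_top_shapes(sample, 1)
top3 = coverage_by_top_shapes(sample, 3)
```

In this toy data the single most common shape already covers 8 of the 10 triples.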

The Glorious Dawn:
Structure is the Servant, not the Tyrant
 A set of subjects with all the same single valued properties is
in fact a table.
So, store it as a table
 Allow exceptions, e.g. sometimes multiple values, different
values in different graphs, extra properties, etc.
 If it is big, it has repeating structure
 All RDF semantics are preserved, any triple is possible but the
common ones are SQL compact and SQL fast
 With tables, query optimization returns to SQL complexity and
is much more reliable
 So, more tricks from the SQL analytics bag become safe and
applicable
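The idea can be sketched as follows, under simplifying assumptions: subjects sharing the same set of single-valued properties become rows of a table, while multi-valued properties and rare shapes stay as plain triples. The function names and the min_rows threshold are hypothetical, and this is not a description of Virtuoso's implementation.

```python
from collections import defaultdict


def split_into_tables(triples, min_rows=2):
    """Return (tables, leftovers).

    tables maps a sorted tuple of property names to {subject: row dict};
    leftovers holds the triples that stay in plain triple form.
    """
    by_subject = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        by_subject[s][p].append(o)

    groups = defaultdict(dict)
    leftovers = []
    for s, props in by_subject.items():
        single = {p: vals[0] for p, vals in props.items() if len(vals) == 1}
        # Multi-valued properties are exceptions and stay as triples.
        for p, vals in props.items():
            if len(vals) > 1:
                leftovers.extend((s, p, o) for o in vals)
        if single:
            groups[tuple(sorted(single))][s] = single

    tables = {}
    for columns, rows in groups.items():
        if len(rows) >= min_rows:
            tables[columns] = rows
        else:
            # A shape seen too rarely is not worth a table.
            for s, row in rows.items():
                leftovers.extend((s, p, o) for p, o in row.items())
    return tables, leftovers


def to_triples(tables, leftovers):
    """Rebuild the full triple set, showing that no statement was lost."""
    out = list(leftovers)
    for rows in tables.values():
        for s, row in rows.items():
            out.extend((s, p, o) for p, o in row.items())
    return out


triples = [
    ("b1", "title", "Dune"), ("b1", "year", 1965),
    ("b2", "title", "Emma"), ("b2", "year", 1815),
    ("b3", "title", "Ulysses"), ("b3", "year", 1922),
    ("b3", "author", "Joyce"), ("b3", "author", "J. Joyce"),
    ("x1", "color", "red"),
]
tables, leftovers = split_into_tables(triples)
```

Here the three books form one table with columns title and year, the two author values of b3 remain triples, and the one-off subject x1 remains a triple as well. Converting back with to_triples yields exactly the original statements.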

Gains from Structure Awareness
 3+x Load Speed
 2x more space efficiency
 Queries against regular data within 10-20% of SQL
speeds
 Just declare which properties tend to occur
together, no strict schema first like with SQL
 Later, self configuration

The Cycle of Adventure
 Rebels: SQL not cool, too rigid, drop
ACID, go key-value, map-reduce, the
triple is all there is, semantic web
 Pioneers: Life on the frontier is hard,
infrastructure missing or bad
 Same everyday problems also in
Utopia
 Recognizing the objective values, e.g.
schema freedom and identifiers, no
AI. Do the job, forget dogma
 Reconciliation: schema-first and
schema-last converge in structure
awareness

Present FP7 Research
 LDBC - Transparency and
Relevance for Graph DB, RDF
performance
 GeoKnow - GeoData is everywhere,
how to carry the planet in your
pocket
 LOD2 - Where no triple has gone
before (and come back)
 Open PHACTs – A Data Platform
for Drug Discovery

LDBC - Linked Data Benchmark Council
 Rebels: SQL not cool, too rigid, drop ACID,
go key-value, map-reduce, the triple is all
there is, semantic web
 Pioneers: Life on the frontier is hard,
infrastructure missing or bad
 Same everyday problems also in Utopia
 Recognizing the objective values, e.g. schema
freedom and identifiers, no AI. Do the job,
forget dogma
 Reconciliation: Some of the rebel thinking
becomes mainstream, e.g. schema-first and
schema-last converge in structure awareness

LDB Council, Independent Industry Forum
for Benchmarking
 The TPC for the frontiers of database
 Bootstrapped in the LDBC FP7, continues
as independent industry association
 OpenLink, Ontotext, Neo Technology,
Sparsity as founding members
 IBM, Oracle Labs, Systap, SPARQL City
already joined
 DB superstars Peter Boncz and Thomas
Neumann as founders and scientific lead

LDBC Benchmarks
Social Network
 Online - Lookups, updates, analysis of
social environment
 Business Intelligence - Spotting trends, key
players, big query
 Graph analytics - Community detection,
Page rank, graph metrics
Semantic Publishing
 Modeled after the BBC linked data portal,
online lookups, drill downs and updates

GeoKnow - The Planet in your Pocket
Ms. Globe and Mr. Cube have a
thing going on:
 Mr. Cube: Desiloization ...
integrated metadata ... explicit
semantics.
 Ms. Globe: I can feel it... but are
you man enough ... you need to
show me.

Planet Scale Roadmap
Jan 2014:
 Virtuoso SPARQL outperforms PostGIS in map lookups with planet-wide
Open Street Map
 Virtuoso SQL adds 5x more power

Next: Jan 2015
 Parity between SPARQL and SQL via structure
awareness
 Geospatial data clustering
 Graph analytics close to the data: Pregel, Giraph, etc.
in the DB itself
 Adding fine grain geo dimension to LDBC social
network benchmark

The LOD2 scaling adventures
Experiments at CWI's SciLens cluster:
 150Gtriples in Jan 2013 (8x256G RAM)
 500Gtriples Aug 2014 (12 x 256G RAM)
 Some trillion triple claims exist but do not
detail any query workload
BSBM explore and BI workloads
 10x speed gains for BI queries from 2013 to
2014
Bulk load at 6M triples/s
 All done in triples, structure awareness will
go further still
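As a rough worked figure, and assuming (the slide does not say so) that the 6M triples/s bulk load rate held for the whole 500 Gtriple data set, the load time can be estimated as below. The variable names are illustrative.

```python
# Figures taken from the slide; applying the rate to the full data set is an assumption.
TRIPLES = 500_000_000_000          # 500 Gtriples
LOAD_RATE = 6_000_000              # triples per second

load_seconds = TRIPLES / LOAD_RATE
load_hours = load_seconds / 3600

print(f"estimated load time: {load_hours:.1f} hours")
```

Under that assumption the load takes a little under one day.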

Open PHACTs
Partners:

Virtuoso Now
Snapshot of RDF Linked Data customers in the Enterprise:
 Data.Gov (U.S. Govt. Open
Linked Data initiative)
 Bank of America
 Booz Allen Hamilton
 Northrop Grumman
 Elsevier
 French National Library
 Samsung
 Globo
 Daimler Benz
 Johnson & Johnson
 Bayer
 St. Jude Medical
 Fujitsu
 Syngenta
 and many more

Virtuoso Availability
 Most capabilities as open source
 Commercial adds
 Cluster scale-out
 SQL Federation
 Replication (SQL & RDF)
 Advanced RDF security, ABAC & RBAC (ACLs)
 Wide tables
 and more
 Up to the minute tech previews via v7fasttrack on GitHub, e.g.
superfast TPC-H implementation

Virtuoso Future
 Preview of structure aware RDF store in fall 2014
via v7fasttrack
Integrated graph analytics framework
 Embed complex graph algos, e.g. community
detection, shortest path inside SPARQL/SQL
 Comparison of SQL and SPARQL for big data
analytics
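For reference, the kind of algorithm meant by "shortest path" is sketched here as a breadth-first search over triples, run outside the database. The planned feature would run such logic inside the engine; this sketch does not show that, and the helper name and sample edges are hypothetical.

```python
from collections import deque


def shortest_path(triples, start, goal, predicate):
    """Fewest-hops path from start to goal along directed edges of one predicate."""
    adjacency = {}
    for s, p, o in triples:
        if p == predicate:
            adjacency.setdefault(s, []).append(o)

    previous = {start: None}       # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, []):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None                    # goal not reachable


edges = [
    ("a", "knows", "b"), ("b", "knows", "c"),
    ("a", "knows", "d"), ("d", "knows", "e"), ("e", "knows", "c"),
    ("c", "knows", "f"),
]
path = shortest_path(edges, "a", "f", "knows")
```

Running this next to the data avoids shipping the whole edge set to a client first, which is the motivation for embedding it in SPARQL/SQL.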

Linked Data Now
 Adoption across major industries
 Superior flexibility and time to solution
 Dramatic performance gains in the last 5 years
 Benchmarking will continue to drive progress, to the benefit of
users and vendors alike
 Run circles around most open source SQL in SPARQL:
Virtuoso SPARQL beats MySQL in SSB by 100x
 With structure awareness, SPARQL to match the best in SQL
for data warehousing, OLTP
 Linked Data no longer a long shot but a technology that makes
sense

About OpenLink Software
OpenLink Software is a privately-held company founded in 1992 by its President & CEO,
Kingsley Idehen. The company is an industry acclaimed technology innovator in the
following areas:
 ODBC, JDBC, ADO.NET, and OLE-DB compliant Data Access Drivers for Oracle,
SQL Server, Informix, Ingres, Sybase, Progress, MySQL, and PostgreSQL
 High-Performance & Scalable Multi-Model (Relational & Graph) Database
Technology
 Data Integration Middleware (Data Virtualization Technology across a wide variety of
Protocols & Formats)
 Web Application Server Technology
 Linked Data Deployment & Management
 Socially-enhanced Distributed Collaborative Applications Platforms (Weblogs, Wikis,
Feed Aggregation and Syndication, Web File Systems, Discussion Forums, etc.)
 Identity Management.

Office Locations
USA
OpenLink Software, Inc
10 Burlington Mall Road
Suite 265
Burlington, MA 01803
Tel.: +1 781 273 0900
Fax: +1 781 229 8030
UK
OpenLink Software Ltd.
Airport House
Purley Way
Croydon, Surrey CR0 0XZ
Tel.: +44 (0)20 8681 7701
Fax: +44 (0)20 8681 7702

Additional Information
Web Sites
OpenLink Software
YouID – Digital Identity Card (Certificate) Generator
OpenLink Data Spaces – Semantically enhanced Personal & Enterprise Data Spaces &
Collaboration Platform
OpenLink Virtuoso - Hybrid Data Management, Integration, Application, and Identity Server
Universal Data Access Drivers - High-Performance ODBC, JDBC, ADO.NET, and OLE-DB
Drivers
LDAP and NetID-TLS – How to use LDAP scheme URIs with NetID-TLS Authentication
Social Media Data spaces
http://kidehen.blogspot.com (weblog)
http://www.openlinksw.com/blog/~kidehen/ (weblog)
https://plus.google.com/112399767740508618350/posts (Google+)
https://twitter.com/#!/kidehen (Twitter)
Hashtag: #LinkedData (Anywhere).
