Graph databases & data integration v2
1. Graph Databases & data integration
Voxxed Days Athens 2018
Dimitris Kontokostas
Senior Knowledge Engineer @GeoPhy
2. About me
● Data geek, software engineer & open source enthusiast
● Involved in many R&D projects since 2003
● Participate(d) in graph-related standardization activities
● PhD in knowledge extraction and quality assessment
● Working on the GeoPhy Real Estate Knowledge Graph
3. Agenda
● Graphs
● RDF Graphs (*)
● Semantics & why they matter (*)
● Linked Data
● Who uses RDF
● How Google uses RDF
● How we (GeoPhy) use RDF
(*) Some concepts are simplified or skipped to make this talk easier to digest in the allocated time
4.
5. Heatmap for Graph Databases
(*) See also this
A Gartner study in 2013 found:
● many organizations find the variety dimension a greater challenge than volume or velocity.
Graph DBs to the rescue:
● Combine multiple sources with different structures
● Retain the flexibility to add new ones without adapting schemas
● Query combined data, or multiple sources at once
● Detect patterns in the data
7. Graphs
● A graph is a way of specifying relationships among a collection of items
● Items can be:
○ Nodes: Alice, Bob, …
○ Edges
■ undirected: knows, …
■ directed: follows, …
○ Attributes: name, age, type, since, ...
○ Values: 18, 2001/10/13, ...
Image source from wikimedia commons
8. Graph Data Models
Property graphs
● Industry standards
○ Cypher (mainly Neo4j)
○ Gremlin traversal API (Apache TinkerPop) => Most common
○ GraphQL
● Data import / export using Cypher, Gremlin or vendor-specific tools
● Usually optimized for specific operations / use cases
RDF Graphs
● W3C standards
○ Like XML, HTML: define once, run everywhere ™
● Standardised way for querying (SPARQL), exporting & importing (RDF)
Slide input from Andy Seaborne @VoxxedDays Bristol
10. Property Graphs
● Each node has
○ unique identifier
○ outgoing edges
○ incoming edges
○ key-value properties collection
● Each edge has
○ unique identifier
○ direction
○ label for the relationship
○ key-value properties collection
● Extreme flexibility
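A minimal sketch of the property graph model in Python, using only the standard library. The class and method names (`Node`, `Edge`, `PropertyGraph`, `outgoing`, `incoming`) and the sample data are illustrative assumptions, not the API of any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str                                   # unique identifier
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    id: str                                   # unique identifier
    source: str                               # id of the node the edge leaves
    target: str                               # id of the node the edge enters
    label: str                                # label for the relationship
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.edges = {}

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def add_edge(self, edge_id, source, target, label, **properties):
        self.edges[edge_id] = Edge(edge_id, source, target, label, properties)

    def outgoing(self, node_id):
        """Edges that start at the given node."""
        return [e for e in self.edges.values() if e.source == node_id]

    def incoming(self, node_id):
        """Edges that end at the given node."""
        return [e for e in self.edges.values() if e.target == node_id]


g = PropertyGraph()
g.add_node("n1", name="Alice", age=18)
g.add_node("n2", name="Bob")
g.add_edge("e1", "n1", "n2", "follows", since="2001/10/13")
```

Both nodes and edges carry their own free-form key-value collection, which is where the flexibility comes from.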
11. RDF - Resource Description Framework
● An RDF Graph is a set of RDF Triples
● An RDF triple consists of only three components (simplified):
○ the subject which is a Thing
○ the predicate which is a (special) Thing
○ the object that can be either a Thing or a Literal (Value)
● Things are represented with URIs
● Literals have a value and a value type or a language tag (defaults to string)
Subject Predicate Object
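The triple model can be sketched in a few lines of Python. This is a simplified illustration under assumptions: the type names (`URI`, `Literal`, `Triple`) and the `literal` helper are made up for this example and ignore blank nodes.

```python
from typing import NamedTuple, Optional, Union

XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
RDF_LANGSTRING = "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString"


class URI(NamedTuple):
    value: str


class Literal(NamedTuple):
    value: str
    datatype: str
    language: Optional[str]


class Triple(NamedTuple):
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]


def literal(value, datatype=None, language=None):
    """Build a literal; the datatype defaults to xsd:string."""
    if language is not None:
        # Language-tagged strings have the fixed datatype rdf:langString.
        return Literal(value, RDF_LANGSTRING, language)
    return Literal(value, datatype or XSD_STRING, None)


friends = URI("http://dbpedia.org/resource/Friends")
name = Triple(friends, URI("http://schema.org/name"), literal("Friends", language="en"))
genre = Triple(friends, URI("http://schema.org/genre"),
               URI("http://dbpedia.org/resource/Sitcom"))

# An RDF graph is a *set* of triples: adding the same triple twice changes nothing.
graph = {name, genre}
graph.add(Triple(friends, URI("http://schema.org/name"), literal("Friends", language="en")))
```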
13. RDF - Resource Description Framework
Depending on the serialization format, URIs can be abbreviated with namespaces
> just like XML
> Improves readability, e.g.
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix schema: <http://schema.org/> .
Subject Predicate Object
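Namespace abbreviation is plain string substitution. A small Python sketch, where `expand` and `shorten` are hypothetical helper names and the check for valid local names is deliberately skipped:

```python
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "schema": "http://schema.org/",
}


def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'schema:name' into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {name!r}")
    return prefixes[prefix] + local


def shorten(uri, prefixes=PREFIXES):
    """Abbreviate a full URI with the longest matching namespace, if any."""
    matches = [(p, ns) for p, ns in prefixes.items() if uri.startswith(ns)]
    if not matches:
        return f"<{uri}>"
    prefix, namespace = max(matches, key=lambda item: len(item[1]))
    return f"{prefix}:{uri[len(namespace):]}"
```

For example, `expand("dbpedia:Friends")` yields the full DBpedia URI, and `shorten` turns it back into the readable form.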
14. RDF is an abstract data model
Many different serialization formats…
Turtle, NTriples, JSON-LD, XML, RDFa, Microdata*
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
dbpedia:Friends
schema:name "Friends"@en ;
schema:datePublished "1994-09-22"^^xsd:date ;
schema:numberOfSeasons 10 ;
schema:genre dbpedia:Sitcom .
dbpedia:The_Office
schema:name "The Office"@en ;
schema:genre dbpedia:Sitcom .
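Because the data model is abstract, the same triples can be written out in another syntax. The sketch below emits N-Triples from Python tuples. The term encoding (URIs as `str`, literals as `(value, datatype, language)` tuples) and the function names are assumptions made for this example, and only the most common characters are escaped.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"
FRIENDS = "http://dbpedia.org/resource/Friends"

triples = [
    (FRIENDS, "http://schema.org/name", ("Friends", None, "en")),
    (FRIENDS, "http://schema.org/datePublished", ("1994-09-22", XSD + "date", None)),
    (FRIENDS, "http://schema.org/genre", "http://dbpedia.org/resource/Sitcom"),
]


def nt_term(term):
    """Render one term in N-Triples syntax."""
    if isinstance(term, str):
        return f"<{term}>"                       # a Thing (URI)
    value, datatype, language = term             # a Literal
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    if language:
        return f'"{escaped}"@{language}'
    if datatype:
        return f'"{escaped}"^^<{datatype}>'
    return f'"{escaped}"'


def to_ntriples(triples):
    """One line per triple, each terminated by ' .'."""
    return "\n".join(f"{nt_term(s)} {nt_term(p)} {nt_term(o)} ." for s, p, o in triples)
```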
19. [Fun fact]
What does RSS stand for?
Rich Site Summary but...
Original name was: RDF Site Summary
Based on first versions of RDF/XML
See https://en.wikipedia.org/wiki/RSS
22. You can store RDF ...
In simple (text) files: local, remote, HDFS, ...
Embedded in web documents
In graph databases
24. RDF & Graphs (merge)
File_all.ttl
Can you name any other format where files can be merged without losing data integrity?
CSV, SQL, XML, JSON, ...
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix schema: <http://schema.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
dbpedia:Friends
schema:name "Friends"@en ;
schema:numberOfSeasons 10 ;
schema:datePublished "1994-09-22"^^xsd:date ;
schema:genre dbpedia:Sitcom .
dbpedia:The_Office
schema:name "The Office"@en ;
schema:genre dbpedia:Sitcom .
/data/tvseries.ttl
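Why merging is safe can be sketched in Python: with global identifiers and graphs modelled as sets of triples, a merge is a set union. The two "files" below are illustrative assumptions, and blank nodes are ignored.

```python
DBP = "http://dbpedia.org/resource/"
SCHEMA = "http://schema.org/"

# Two independently produced "files", each a set of (subject, predicate, object).
file_a = {
    (DBP + "Friends", SCHEMA + "name", '"Friends"@en'),
    (DBP + "Friends", SCHEMA + "genre", DBP + "Sitcom"),
}
file_b = {
    (DBP + "Friends", SCHEMA + "genre", DBP + "Sitcom"),   # also stated in file_a
    (DBP + "The_Office", SCHEMA + "name", '"The Office"@en'),
    (DBP + "The_Office", SCHEMA + "genre", DBP + "Sitcom"),
}

# Merging is a set union: no column alignment, no key clashes, duplicates collapse.
merged = file_a | file_b
```

Nothing from either input is lost, and the statement both files share appears only once.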
26. RDF is persistent, wherever it’s stored
[Diagram: Input Files → Import → RDF DB (Query) → Export → Output Files. Output is exactly the same (*) as the input.]
(*) The proper term is isomorphic graphs, to cover some special cases called blank nodes
27. Big ecosystem
SPARQL: RDF query language
RDFS, OWL: RDF schema languages
SHACL, ShEx: RDF constraint languages
See http://book.validatingrdf.com (free online)
R2RML: Virtual RDF views on top of RDBMS (e.g. MySQL)
And many more specifications & tools...
28. Takeaway points, so far...
RDF is a graph data model
> can be serialized in many formats
> identifiers are persistent by design
Natively stores & integrates diverse data
RDF is kind of the new XML
> but it is much cooler...
> and you don’t need to write XML ;)
30. Semantics & RDF
● RDF is a core part of the Semantic Web vision
● Semantics is defined as:
○ the meaning of something (word, phrase, text, etc)
○ the branch of linguistics and logic concerned with meaning
● Too academic?
“A Little Semantics Goes a Long Way”
by prof. J. Hendler
Buzzword Alert!!!
31. RDF & Semantics
Ontologies are the results of modelling a specific domain
Some people prefer the terms: model, vocabulary, taxonomy, schema
(doesn’t make much difference)
Ontologies in RDF deal with classes & properties
> Some part is machine readable
> Some part is human readable
Can you tell which part is more important?
(... a more pragmatic view)
32. RDF Schema - Classes
Classes of Things
@prefix ex: <http://example.com/> .
ex:TVSeries
rdf:type rdfs:Class ;
rdfs:comment "Series dedicated to TV broadcast" ;
rdfs:subClassOf ex:CreativeWork .
ex:CreativeWork
rdf:type rdfs:Class ;
rdfs:comment "A generic kind of creative work, e.g. books, movies, etc." .
Machine-readable semantics: rdf:type, rdfs:subClassOf
Human-readable semantics: rdfs:comment
… and we can assign types to Things
(e.g. “Friends” is an instance of “TVSeries”)
dbpedia:Friends rdf:type ex:TVSeries .
33. RDF Schema - Properties
Relationships between subjects and objects
@prefix ex: <http://example.com/> .
ex:actor
rdf:type rdf:Property ;
rdfs:comment "The person that is the actor of a TVSeries." ;
rdfs:domain ex:TVSeries ;
rdfs:range ex:Person .
Machine-readable semantics: rdf:type, rdfs:domain, rdfs:range
Human-readable semantics: rdfs:comment
… and we can use this in RDF statements
dbpedia:Friends ex:actor dbpedia:Jennifer_Aniston .
34. to Infer or to Validate ?
Given only the following, what can we say about
dbpedia:Jennifer_Aniston and dbpedia:Friends ?
dbpedia:Jennifer_Aniston rdf:type ex:Person.
dbpedia:Friends rdf:type ex:TVSeries .
ex:actor
rdf:type rdf:Property ;
rdfs:domain ex:TVSeries ;
rdfs:range ex:Person.
dbpedia:Friends ex:actor dbpedia:Jennifer_Aniston .
35. to Infer or to Validate ?
Given only the following, what can we say ?
ex:actor
rdf:type rdf:Property ;
rdfs:domain ex:TVSeries ;
rdfs:range ex:Person.
ex:Dimitris rdf:type ex:Person .
ex:VoxxedDaysAthens rdf:type ex:Conference .
ex:VoxxedDaysAthens ex:actor ex:Dimitris .
Something is not right…
ex:VoxxedDaysAthens is not an ex:TVSeries
36. to Infer or to Validate ?
Given only the following, what can we say ?
ex:actor rdf:type rdf:Property ;
rdfs:domain ex:TVSeries ;
rdfs:range ex:Person.
ex:Dimitris rdf:type ex:Person .
dbpedia:Friends rdf:type ex:TVSeries .
dbpedia:Friends ex:actor ex:Dimitris .
Appears legit
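The two readings of the same schema can be sketched in Python. Under the inference reading, domain and range produce new type statements. Under the validation reading, a missing type is reported as a problem. The function names and the dictionary encoding of the schema are assumptions for this example; this is neither an RDFS reasoner nor a SHACL/ShEx engine.

```python
RDF_TYPE = "rdf:type"

# Domain and range of ex:actor, as declared on the slides.
actor_schema = {"ex:actor": {"domain": "ex:TVSeries", "range": "ex:Person"}}


def infer(data, schema):
    """Inference reading: using a property entails the types of its subject and object."""
    inferred = set(data)
    for s, p, o in data:
        rule = schema.get(p)
        if rule:
            inferred.add((s, RDF_TYPE, rule["domain"]))
            inferred.add((o, RDF_TYPE, rule["range"]))
    return inferred


def validate(data, schema):
    """Validation reading: report subjects or objects that lack the declared type."""
    types = {(s, o) for s, p, o in data if p == RDF_TYPE}
    problems = []
    for s, p, o in sorted(data):
        rule = schema.get(p)
        if not rule:
            continue
        if (s, rule["domain"]) not in types:
            problems.append(f"{s} is not a {rule['domain']}")
        if (o, rule["range"]) not in types:
            problems.append(f"{o} is not a {rule['range']}")
    return problems


conference = {
    ("ex:Dimitris", RDF_TYPE, "ex:Person"),
    ("ex:VoxxedDaysAthens", RDF_TYPE, "ex:Conference"),
    ("ex:VoxxedDaysAthens", "ex:actor", "ex:Dimitris"),
}
series = {
    ("ex:Dimitris", RDF_TYPE, "ex:Person"),
    ("dbpedia:Friends", RDF_TYPE, "ex:TVSeries"),
    ("dbpedia:Friends", "ex:actor", "ex:Dimitris"),
}
```

With `conference`, validation flags the conference as not being a TV series, while inference quietly concludes that it is one. With `series`, validation finds nothing to report, even though the statement is false in the real world.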
38. Schema stored & queried as Data
SELECT ?s WHERE {
?s rdf:type/rdfs:subClassOf* ex:CreativeWork }
(rdf:type/rdfs:subClassOf* navigates the class hierarchy)
Result: dbpedia:Friends, dbpedia:The_Office, dbpedia:Narnia
Hierarchy can be extended without breaking the query
ex:TVSeries
rdf:type rdfs:Class ;
rdfs:subClassOf ex:CreativeWork .
ex:BookSeries
rdf:type rdfs:Class ;
rdfs:subClassOf ex:CreativeWork .
ex:CreativeWork
rdf:type rdfs:Class .
dbpedia:Friends rdf:type ex:TVSeries.
dbpedia:The_Office rdf:type ex:TVSeries.
dbpedia:Narnia rdf:type ex:BookSeries.
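What the property path does can be imitated in plain Python: collect the class and all of its transitive subclasses, then return everything typed with one of them. The function names are assumptions, and `ex:Movie` / `dbpedia:Inception` in the usage note are hypothetical additions, not from the slides.

```python
triples = {
    ("ex:TVSeries", "rdf:type", "rdfs:Class"),
    ("ex:TVSeries", "rdfs:subClassOf", "ex:CreativeWork"),
    ("ex:BookSeries", "rdf:type", "rdfs:Class"),
    ("ex:BookSeries", "rdfs:subClassOf", "ex:CreativeWork"),
    ("ex:CreativeWork", "rdf:type", "rdfs:Class"),
    ("dbpedia:Friends", "rdf:type", "ex:TVSeries"),
    ("dbpedia:The_Office", "rdf:type", "ex:TVSeries"),
    ("dbpedia:Narnia", "rdf:type", "ex:BookSeries"),
}


def subclasses_of(cls, triples):
    """Return cls plus every class that reaches it via rdfs:subClassOf (zero or more steps)."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == "rdfs:subClassOf" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def instances_of(cls, triples):
    """Rough equivalent of: ?s rdf:type/rdfs:subClassOf* cls"""
    classes = subclasses_of(cls, triples)
    return sorted(s for s, p, o in triples if p == "rdf:type" and o in classes)
```

Adding `("ex:Movie", "rdfs:subClassOf", "ex:CreativeWork")` and `("dbpedia:Inception", "rdf:type", "ex:Movie")` to the data makes the new instance show up without touching `instances_of`, because the schema is just more data.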
39. Many Available free Schemas
Many existing free (as in beer) ontologies (or schemas)
model different domains
> General purpose (DBpedia, schema.org)
> Geographical (geo)
> Provenance (prov-o)
> Taxonomies / Classification (SKOS family)
> Organizations (org)
> Find ~600 entries at http://lov.okfn.org
40. Reusing Available (Free) schemas
Get part of your data modeling for free
> Groups of people already worked on modeling the domain
> Spent time defining human and machine-readable semantics
Makes data integration easier
> Data published with common schemas
> Data easier to be consumed
41. Mapping to Available (Free) schemas
Map when not reusing
> integrate data in a loosely coupled way
ex:TVSeries owl:equivalentClass schema:TVSeries .
ex:actor owl:equivalentProperty schema:actor .
42. RDF & Semantics - take away points
It’s all about Classes & Properties
Human-readable semantics
> Commonly accepted modelling conventions
Machine-readable semantics
> Can be used for inference and/or validation
> Can be queried together with data
Reusing [or linking to] common ontologies / schemas
> Integrating data with less variety
> Network effect (the more people/data use it the better)
> Developing reusable applications against schemas
44. Given only this, what can we do/say?
<https://voxxeddays.com/athens> <https://schema.org/attendee> <http://kontokostas.com>.
schema:Event (domain), schema:Person (range): A person attending the event.
HTTP GET
<https://voxxeddays.com/athens>
rdf:type schema:Event ;
schema:name "Voxxed Athens" ;
schema:startDate "2018-06-01" ;
schema:endDate "2018-06-02" ;
schema:inLanguage "English" ;
schema:description "..." .
HTTP GET
<http://kontokostas.com>
rdf:type schema:Person ;
schema:givenName "Dimitris" ;
schema:familyName "Kontokostas" ;
schema:birthPlace dbpedia:Greece ;
schema:jobTitle "Data Engineer" ;
schema:worksFor <https://geophy.com> .
HTTP GET
45. Follow your nose pattern
<http://kontokostas.com> <https://schema.org/birthPlace> <http://dbpedia.org/resource/Greece>.
schema:Person (domain), schema:Place (range): The place where the person was born.
HTTP GET
<http://kontokostas.com>
rdf:type schema:Person ;
schema:givenName "Dimitris" ;
schema:familyName "Kontokostas" ;
schema:birthPlace dbpedia:Greece ;
schema:jobTitle "Data Engineer" ;
schema:worksFor <https://geophy.com> .
HTTP GET
<http://dbpedia.org/resource/Greece>
rdf:type schema:Place, dbpedia:Country ;
dbo:capital dbpedia:Athens ;
dbo:currency dbpedia:Euro ;
geo:lat "39.0"^^xsd:float ;
geo:long "22.0"^^xsd:float .
HTTP GET
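The follow-your-nose pattern is a small crawl: dereference a URI, read the triples, then dereference the URIs found in them. The Python sketch below replaces real HTTP requests with an in-memory dictionary so that it runs offline. `FAKE_WEB`, `http_get` and `follow_your_nose` are assumptions for this example, and each "response" is cut down to a single triple.

```python
# Stand-in for the web: URI -> triples returned when that URI is dereferenced.
FAKE_WEB = {
    "https://voxxeddays.com/athens": [
        ("https://voxxeddays.com/athens", "schema:attendee", "http://kontokostas.com"),
    ],
    "http://kontokostas.com": [
        ("http://kontokostas.com", "schema:birthPlace", "http://dbpedia.org/resource/Greece"),
    ],
    "http://dbpedia.org/resource/Greece": [
        ("http://dbpedia.org/resource/Greece", "dbo:capital", "http://dbpedia.org/resource/Athens"),
    ],
}


def http_get(uri):
    """Pretend HTTP GET: return the triples that describe `uri` (empty if unknown)."""
    return FAKE_WEB.get(uri, [])


def follow_your_nose(start, max_rounds=3):
    """Dereference `start`, then the URIs it mentions, for at most `max_rounds` rounds."""
    collected, visited, frontier = set(), set(), {start}
    for _ in range(max_rounds):
        next_frontier = set()
        for uri in frontier - visited:
            visited.add(uri)
            for s, p, o in http_get(uri):
                collected.add((s, p, o))
                if o.startswith("http"):      # objects that are Things can be followed
                    next_frontier.add(o)
        frontier = next_frontier
    return collected
```

Starting from the event URI, one round only learns about the attendee. Three rounds reach the capital of the attendee's birthplace, with no central index involved.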
46. RDF & Linked Data
Things represented with http(s)-based URIs
can be self-published
HTTP GET requests on Things return RDF Triples
where the Thing is the subject (or an object)
Decentralized storage / access / semantics
(*) a.k.a. the Web of Data, see the TED talk by Tim Berners-Lee (creator of the WWW)
47. RDF & Linked Data (on the web)
[Diagram: kontokostas.com, example.com, voxxeddays.com/Athens and several DBpedia nodes (Wikipedia as RDF) linked together in the Web of Data]
48. RDF & Linked Data (on the enterprise)
[Diagram: RDF DB x, RDF DB y and RDF DB z expose Linked Data (LD x, LD y, LD z) that, together with LD w, form a Web of Data]
49. Linked Open Data Cloud
Diagram from 2014
v2018 is too big
1,184 datasets
15,993 links
https://lod-cloud.net/
50. Reusing available datasets / identifiers
Just like reusing schemas, referencing / reusing external identifiers facilitates:
Data integration
e.g. dbpedia:Friends represents the Friends TV series, not some friends
> use dbpedia:Friends directly
> link it: ex:tv_series_123 owl:sameAs dbpedia:Friends
Data enrichment
e.g. dbpedia:Friends may have more information about the series than our
database, and we can easily get it (over HTTP)
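Enrichment through a link can be sketched as copying the external statements onto the linked local identifier. The `enrich` function and the sample triples are assumptions for this example. A real system would more likely keep both identifiers and resolve owl:sameAs at query time.

```python
local = {
    ("ex:tv_series_123", "schema:name", "Friends"),
}
external = {
    ("dbpedia:Friends", "schema:numberOfSeasons", "10"),
    ("dbpedia:Friends", "schema:genre", "dbpedia:Sitcom"),
    ("dbpedia:The_Office", "schema:genre", "dbpedia:Sitcom"),
}
# ex:tv_series_123 owl:sameAs dbpedia:Friends
same_as = {"ex:tv_series_123": "dbpedia:Friends"}


def enrich(local, external, same_as):
    """Attach external statements to local identifiers that are linked via owl:sameAs."""
    enriched = set(local)
    for local_id, external_id in same_as.items():
        for s, p, o in external:
            if s == external_id:
                enriched.add((local_id, p, o))
    return enriched


enriched = enrich(local, external, same_as)
```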
51. RDF & Linked Data - take away points
Decentralisation of Data Management
Self-documented schemas & data
Scale your [local] graphs to the [Enterprise] Web
Big pool of stable identifiers (i.e. DBpedia)
52. Pay as you go data integration
You can get benefits with low effort
> RDF views on top of RDBMS with R2RML (mappings, SPARQL-to-SQL translation)
> Convert XML/JSON/CSV/… to RDF with RML
The more time you invest, the better the results
> Schema development, mapping & linking
> Semi-automatic link discovery with tools like Limes & Silk
e.g.: ex:tv_series_123 owl:sameAs dbpedia:Friends
RDF does not need to be your master dataset
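The low-effort end of the scale can be illustrated with a hand-rolled CSV-to-triples mapping in Python. This is not RML or R2RML, only the underlying idea: a subject template plus a column-to-property mapping. The CSV content, column names and `csv_to_triples` function are made-up assumptions.

```python
import csv
import io

CSV_TEXT = "id,name,seasons\n123,Friends,10\n456,The Office,\n"

# Which column becomes which property (hypothetical mapping).
MAPPING = {"name": "schema:name", "seasons": "schema:numberOfSeasons"}


def csv_to_triples(text, mapping, subject_template="ex:tv_series_{id}"):
    """Turn each CSV row into triples; empty cells produce no triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = subject_template.format(**row)
        for column, predicate in mapping.items():
            if row.get(column):
                triples.append((subject, predicate, row[column]))
    return triples


tv_triples = csv_to_triples(CSV_TEXT, MAPPING)
```

Missing values simply yield fewer triples, so sparse or irregular sources need no schema change.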
54. Structured data on the web (Nov 2017)
28% of TLD (or 39% of HTML pages)
> 3.7M Microdata
> 2.7M JSON-LD
> 1.2M RDFa
In total 9 billion Things & 38 billion RDF triples
Full report at http://webdatacommons.org/structureddata/#results-2017-1
56. RDF Ontology
> Less strict / formal
> Promotes JSON-LD
Funded & maintained by the major search engines
Drives many Google products...
57.
58. Schema.org && Google && Search
https://developers.google.com/search/docs/guides/search-features
59. Google is...
Using the RDF graph model to integrate diverse
data from webpages & emails
By using the concept of Linked Data
And this is all empowered by a
common ontology (or schema)
62. RDF @GeoPhy
We collect & integrate a lot of data
> on properties, their surroundings, and the market conditions
Master dataset on Real Estate (aka Knowledge Graph)
> driving our Machine Learning / Deep Learning models
Challenges...
> We have thousands of sources,
> Sources are updated at arbitrary intervals
> We get our data in CSV, on the good days
And, of course…
we are not Google
to make people
write RDF for us :-)
63. GeoPhy Data Management Platform
[Diagram: platform architecture. Inputs: CSV, PDF. Components shown: Data Wrangling & Extraction, Annotation & Provenance, Modeling, Mapping, GeoPhy Ontologies, Transform to RDF, Validate, Identify & Deduplicate, Conflict resolution, Data Fusion, CoreDB, Provenance (value-level), Data Indexing, Data Ingestion, Data Enrichment, Dependency Detection, Geo Enrichment, Trigger ML/DL, API.]
64. And the closing slide...
People think RDF is a pain because it is complicated.
The truth is even worse.
RDF is painfully simplistic, but it allows you to work with
real-world data and problems that are horribly complicated.
While you can avoid RDF, it is harder to avoid complicated
data and complicated computer problems.
Dan Brickley, Schema.org and Google
Libby Miller, BBC
65. Thank you for your attention
Questions?
Many thanks to Sander, Matt and the whole GeoPhy Eng. Team for their feedback