Knowledge graphs promise a novel platform for better holistic decision making and analytics. Many projects fail to reach their full potential because of the prohibitively high cost of integrating new knowledge from the required information sources.
The talk explains the concept of semantic similarity as a tool for efficient entity clustering and matching based on graph and text embeddings. It demonstrates Random Indexing, the underlying algorithm, which is scalable and easy to understand.
This work is part of the Ontotext Platform, which increases productivity in developing and maintaining large-scale knowledge graphs. The platform enables enterprises to build and operate mission-critical systems for decision support, information discovery and metadata management on top of such graphs.
Semantic similarity for faster Knowledge Graph delivery at scale
1. making sense of text and data
October, 2019
Connected Data London
2. Why Knowledge Graphs?
“Cross-industry studies show that on average, less than half of an
organization’s structured data is actively used in making decisions—and
less than 1% of its unstructured data is analyzed or used at all”
What’s Your Data Strategy? Leandro DalleMule and Thomas H. Davenport, Harvard Business Review
Top 5 USA Banks
4. What is a Knowledge Graph?
Graph, Semantics, Smart, Alive
5. Multiple Enterprise Data Management Systems
KG platforms combine capabilities of several enterprise systems:
o Master and reference data management
o Corporate/Enterprise Taxonomy
o Data warehouse
o Metadata management
o Digital asset management
o Enterprise search
6. Challenges in Enterprise Semantic Integration
IMDB
Type          Titles
TV Episodes   4’044’529
Short film    681’067
Feature film  516’726
Video         164’061
TV series     164’061
TV movies     126’206
…             …
Total *       5’838’514

WikiData
Type               Titles
film               235’707
silent short film  16’377
television film    15’345
short film         11’225
animated film      3’785
…                  …
Total              289’650

* Later the tests use only 5K crawled datasets
7. Challenges in Enterprise Semantic Integration
Multiple levels of inconsistencies:
o Types: “film” vs “TV movie”
o Meta-data: “science fiction”, “military science fiction” vs “Sci-Fi”
o Reference data: “US” vs. “United States”
o Manually curated cross-links (!) for testing purposes only
8. A Classical Approach
o Start with string matching of the Titles
“Harry Potter and the Deathly Hallows: Part II” vs.
“Harry Potter and the Deathly Hallows – Part 2”
“Perfume: The Story of a Murderer” vs “Perfume”
“Pirate Radio” vs. “The Boat That Rocked”
“Avatar” vs “Avatar” (4 movies)
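The title examples above can be reproduced with a minimal sketch of string matching. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever matcher was used in the talk; the `normalize` and `title_similarity` helpers are hypothetical names introduced only for this illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase the title and replace punctuation runs with single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Title pairs taken from the slide; each pair refers to the same film.
pairs = [
    ("Harry Potter and the Deathly Hallows: Part II",
     "Harry Potter and the Deathly Hallows – Part 2"),
    ("Perfume: The Story of a Murderer", "Perfume"),
    ("Pirate Radio", "The Boat That Rocked"),
    ("Avatar", "Avatar"),
]

for a, b in pairs:
    print(f"{title_similarity(a, b):.2f}  {a!r} vs {b!r}")
```

The first pair scores high despite the punctuation differences, while the subtitle and alternative-title pairs score low even though they are true matches. The “Avatar” pair scores a perfect match, yet a score alone cannot tell which of the four films is meant.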
9. A Classical Approach with extra Rules
o Add release date matching
Loses 10% of the matches due to bad dates
o Ambiguity is greatly reduced, but many ambiguous candidates remain:
ID         Release date     Duration
tt0238520  16 October 1995  50 min
tt1125875  11 April 1995    48 min
tt0238520  23 June 1995     1h 21 min
11. What is Knowledge Graph Embedding?
o Predicts similar graph nodes or properties
o Requires no input training data
o Mathematical representation of graph nodes as vectors:
[Figure: The Godfather (2h 58m) vs. American Pie (1h 15 min) shown as vectors; axis labels: duration, drama, comedy]
12. Knowledge Graph Embedding Example
o For each film include all actors, director, country of origin
o Vast matrix with entities and literals
[Figure: term-document matrix. Rows (documents): wd:Q550232, imdb:tt0344854, ... Columns (terms): [Actor] “Adam LeFevre”, [Actor] “Anthony Anderson”, [Actor] “Mia Farrow”, [Country] “France”, [Country] “US”, [Country] “United States”, [Director] “Luc Besson”, … A cell holds 1 when the term applies to the movie.]
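As an illustration of the term-document matrix described above, here is a minimal sketch that builds one from triples. The actor cells follow the slides; the country values are invented for illustration only, and `build_term_document_matrix` and `dense_row` are hypothetical helper names.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object). The actor values follow the
# slides; the country values are ASSUMED for illustration only.
triples = [
    ("wd:Q550232", "actor", "Adam LeFevre"),
    ("wd:Q550232", "actor", "Anthony Anderson"),
    ("wd:Q550232", "actor", "Mia Farrow"),
    ("wd:Q550232", "country", "United States"),
    ("imdb:tt0344854", "actor", "Adam LeFevre"),
    ("imdb:tt0344854", "actor", "Mia Farrow"),
    ("imdb:tt0344854", "country", "US"),
]


def build_term_document_matrix(triples):
    """Return (documents, terms, cells); cells[doc] is the set of terms set to 1."""
    documents, terms = [], []
    cells = defaultdict(set)
    for subject, predicate, obj in triples:
        term = f"[{predicate}] {obj}"
        if subject not in cells:
            documents.append(subject)
        if term not in terms:
            terms.append(term)
        cells[subject].add(term)
    return documents, terms, cells


def dense_row(doc, terms, cells):
    """Expand the sparse row of one document into a list of 0/1 values."""
    return [1 if term in cells[doc] else 0 for term in terms]


documents, terms, cells = build_term_document_matrix(triples)
print(terms)
for doc in documents:
    print(doc, dense_row(doc, terms, cells))
```

Note that “US” and “United States” end up in different columns, so the two records share no country term. The matrix also gains a column for every distinct entity or literal, which is why it grows so wide on real data.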
13. Random Indexing (RI) Algorithm
o Reduces the matrix dimension with elemental vectors
For each term w, calculate a context vector S(w) by summing the elemental (index) vectors of all x appearing in the context of w
o Lightweight and fast (250K x 1.45M matrix in < 5 min)
o Sub-second searches with limited RAM
Movie           Adam LeFevre  Anthony Anderson  Mia Farrow
wd:Q550232      1             1                 1
imdb:tt0344854  1             0                 1
...             …             …                 …
(columns are actors; each is assigned an elemental vector)
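The summation step of Random Indexing can be sketched in a few lines of pure Python. This is a simplified illustration, not the plugin's implementation: it assumes sparse ternary elemental vectors, treats the actors of a movie as its context, and uses hypothetical parameter values (`dim=64`, `nonzero=6`, `seed=42`).

```python
import random


def elemental_vector(dim, nonzero, rng):
    """Sparse ternary vector: `nonzero` random positions hold +1 or -1, the rest 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(contexts, dim=64, nonzero=6, seed=42):
    """Map each item w to S(w), the sum of the elemental vectors of its context.

    `contexts` maps an item to the features x observed in its context.
    Returns (vectors, elemental), where elemental[x] is the vector assigned to x.
    """
    rng = random.Random(seed)
    elemental, vectors = {}, {}
    for item, features in contexts.items():
        total = [0] * dim
        for feature in features:
            if feature not in elemental:
                elemental[feature] = elemental_vector(dim, nonzero, rng)
            total = [a + b for a, b in zip(total, elemental[feature])]
        vectors[item] = total
    return vectors, elemental


# Rows of the actor matrix from the slide.
movies = {
    "wd:Q550232": ["Adam LeFevre", "Anthony Anderson", "Mia Farrow"],
    "imdb:tt0344854": ["Adam LeFevre", "Mia Farrow"],
}
movie_vectors, elemental = random_indexing(movies)
print(len(movie_vectors["wd:Q550232"]))  # 64, regardless of how many actors exist
```

The vector length stays fixed at `dim` no matter how many distinct actors, directors or countries appear, which is where the dimension reduction comes from.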
14. Random Indexing (RI) Algorithm #2
o Supports similarity searches for:
Document to Document: similar movies
Document to Term: specific actor/director
Term to Term: similar actors/directors
Term to Document: find movies specific to this actor/director
o Features all properties of a Vector Space model
o Partial matching, weights, ranking + context-sensitive semantic search
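A document-to-document and a term-to-document search can be sketched with cosine ranking in the shared vector space. The 4-dimensional vectors below are written by hand for readability (real elemental vectors are random and much longer), `ex:other` is a hypothetical third movie, and `most_similar` is a helper name introduced for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, candidates, top_k=3):
    """Rank candidate vectors by cosine similarity to the query vector."""
    scored = [(cosine(query, vec), name) for name, vec in candidates.items()]
    return sorted(scored, reverse=True)[:top_k]


def add(*vectors):
    """Element-wise sum of vectors."""
    return [sum(values) for values in zip(*vectors)]


# Hand-written stand-ins for elemental vectors (ASSUMED values).
terms = {
    "Adam LeFevre": [1, 0, -1, 0],
    "Anthony Anderson": [0, 1, 0, 1],
    "Mia Farrow": [0, -1, 0, 1],
}
# Each document vector is the sum of the vectors of its actors.
docs = {
    "wd:Q550232": add(terms["Adam LeFevre"], terms["Anthony Anderson"], terms["Mia Farrow"]),
    "imdb:tt0344854": add(terms["Adam LeFevre"], terms["Mia Farrow"]),
    "ex:other": add(terms["Anthony Anderson"]),  # hypothetical third movie
}

# Document to Document: movies similar to wd:Q550232.
others = {name: vec for name, vec in docs.items() if name != "wd:Q550232"}
doc_hits = most_similar(docs["wd:Q550232"], others)

# Term to Document: movies specific to one actor.
term_hits = most_similar(terms["Anthony Anderson"], docs)

print(doc_hits)
print(term_hits)
```

Because terms and documents live in the same space, the same ranking function serves every search direction listed above; only the choice of query vector and candidate set changes.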
16. Reference Software Architecture
o Easy consumption of data
o No backend development
o Flexible data processing tools
o Standard and open interfaces
[Architecture diagram; labels: KG Consumers, Ontotext Platform, GQL query, GQL mutation, GQL Federation, SPARQL, RDF / Structured data, GraphDB, Similarity Plugin]
17. Transform CSV to RDF
o Perform standard ETL tasks
o Trim spaces, parse numbers and dates
o Parse IMDB ids from links for testing
o Map table data to RDF
o SPARQL over tabular data
o Split multi-valued fields like “Action|Thriller”
o Schema-level alignment not yet applied
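The ETL steps listed above can be sketched in plain Python, swapped in here for the tooling used in the talk. The column names (`movie_title`, `genres`, `title_year`, `movie_imdb_link`), the `imdb:` predicate names and the sample row are all assumptions made for this illustration.

```python
import csv
import io
import re

# Made-up sample row; the column names are ASSUMED, not taken from the talk.
SAMPLE_CSV = """movie_title,genres,title_year,movie_imdb_link
  Example Movie  ,Action|Thriller,2009,http://www.imdb.com/title/tt0344854/?ref_=fn_tt_tt_1
"""

IMDB_ID = re.compile(r"/title/(tt\d+)")


def row_to_triples(row):
    """Turn one CSV row into (subject, predicate, object) tuples."""
    match = IMDB_ID.search(row["movie_imdb_link"])
    if match is None:
        return []  # no usable identifier, skip the row
    subject = f"imdb:{match.group(1)}"  # parse the IMDB id from the link
    triples = [(subject, "imdb:title", row["movie_title"].strip())]  # trim spaces
    year = row["title_year"].strip()
    if year.isdigit():
        triples.append((subject, "imdb:year", int(year)))  # parse numbers
    for genre in row["genres"].split("|"):  # split multi-valued fields
        if genre.strip():
            triples.append((subject, "imdb:genre", genre.strip()))
    return triples


triples = [
    triple
    for row in csv.DictReader(io.StringIO(SAMPLE_CSV))
    for triple in row_to_triples(row)
]
for triple in triples:
    print(triple)
```

No schema-level alignment happens here: the predicates remain source-specific, which matches the last bullet above.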
18. Similarity Plugin API
subject         predicate  object
wd:Q550232      :actor     “Adam LeFevre”
imdb:tt0344854  :actor     “Adam LeFevre”
…               …          …
o Accepts a graph described by <s, p, o>
o Indexes any RDF types
o Works with virtual overlays like:
[Diagram: wd:Q550232 -wdt:P161-> wd:Q2702964 -rdfs:label-> “Adam LeFevre”; imdb:tt0344854 -imdb:actor_2_name-> “Adam LeFevre”]
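The idea of a virtual overlay can be sketched as a function that flattens differently shaped source graphs into uniform <s, p, o> triples. This is an illustration in Python only; how the plugin itself defines overlays is not shown here, the `overlay` function is a hypothetical name, and the identifier `wd:Q2702964` is read off the diagram labels.

```python
# Source triples in their native shapes: Wikidata links a film to a person
# entity that carries a label, while the IMDB data stores the name directly.
graph = [
    ("wd:Q550232", "wdt:P161", "wd:Q2702964"),
    ("wd:Q2702964", "rdfs:label", "Adam LeFevre"),
    ("imdb:tt0344854", "imdb:actor_2_name", "Adam LeFevre"),
]


def overlay(graph):
    """Yield uniform (movie, ':actor', name) triples from both source shapes."""
    labels = {s: o for s, p, o in graph if p == "rdfs:label"}
    for s, p, o in graph:
        if p == "wdt:P161" and o in labels:
            # Follow the two-step path film -> person -> label.
            yield (s, ":actor", labels[o])
        elif p == "imdb:actor_2_name":
            # The literal is already attached to the film.
            yield (s, ":actor", o)


flat = list(overlay(graph))
for triple in flat:
    print(triple)
```

Both films now expose the same `:actor` term, so they share a column in the term-document matrix without the underlying data being rewritten.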
19. Specify KG Embeddings – Select Predicates
o Similarity plugin expects triples <s, p, o>
21. Results
o Find similar RDF resources to “Pirate Radio”
o Even a limited set of predicates returns acceptable results
o Important independent alternative for entity matching
22. Important Design Considerations
o Prefer RDF over Property Graph
   o Much richer technology ecosystem (schema, dataset, reasoning, strings vs things)
o Virtualization versus Consolidation
   o Virtualization works only for simple lookup queries, but not for real data integration
   o Push result federation to the GraphQL data consumption layer
o Integrating Random Indexing in the KG database
   o Push heavy computation as close as possible to the data
o Choose GraphQL over SPARQL for app developers: