The IP LodB project (for more details see iplod.io) capitalizes on LOD database thinking to build bridges between patent information and scientific knowledge, focusing on the individuals who codify new knowledge and their connected organizations, including those who apply patents in new products and services.
As its main outputs, the IP LodB produced an intellectual property rights (IPR) linked open data (LOD) map (the IP LOD map) and tested the linkability of the European patent (EP) LOD database, while increasing the uniqueness of the data using different harmonization techniques.
These slides were developed for a NIPO workshop.
1. Exploring
Opportunities of
Linked Open
Innovation Data:
Part 1
Presenters: Dolores Modic,
Alan Johnson, Miha Vučkovič
With: Ana Hafner, Borut Lužar,
Borut Rožac, Einar Rasmussen
NIPO workshop, February 2020
2. Linked open data

From linking documents to linking data, with the link as a bearer of meaning.

Four rules of linked data (Berners-Lee, 2006):
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.
4. Data in N-Triples: Subject - Predicate - Object

• Subject: specifies the entity under consideration, e.g., a publication (“Mapping the human brain”);
• Predicate: specifies a property type for the entity under consideration, e.g., “authored by”, “published in”, “has date”, “has impact factor 2018”;
• Object: specifies a value for the property type, e.g., “Dolores Modic”, “Science and Public Policy”, “03 Jun 2017”, “1.575”.

Simple example:
Subject → predicate → object
Mapping the Human Brain → published in → Science and Public Policy
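A triple like the one above can be written in N-Triples syntax and split mechanically into its three parts. A minimal stdlib-only sketch follows; the example.org URIs are hypothetical, and real code would use a proper RDF library such as rdflib rather than a regex:

```python
import re

# Matches one simple N-Triples line: <subject> <predicate> object .
# (Illustrative only: handles URI subjects/predicates and a plain
# literal or URI object, not the full N-Triples grammar.)
TRIPLE_RE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')

def parse_triple(line):
    """Split one N-Triples line into (subject, predicate, object)."""
    m = TRIPLE_RE.match(line.strip())
    if not m:
        raise ValueError("not a valid triple: " + line)
    return m.group(1), m.group(2), m.group(3)

line = ('<http://example.org/pub/mapping-the-human-brain> '
        '<http://example.org/prop/publishedIn> '
        '"Science and Public Policy" .')
s, p, o = parse_triple(line)
# s is the publication URI, p the property URI,
# o the literal '"Science and Public Policy"'
```

The same subject URI can then appear in further triples (“authored by”, “has date”, …), which is what makes the data linkable.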
5. The promise of Linked Open Data

The vision is that all data on the World Wide Web can be treated and researched as one database, using a machine-readable format to share and reuse existing data (Khusro, 2014).

Further, the availability of linked open data from large, credible institutions has grown dramatically in recent years, e.g., the European Patent Office, the Korean Patent Office, and national governments.
6. … but the problems with Linked Open Data

• Linked open data is oriented towards machine readability, so human-readable browsability can be a problem.
• The availability and accessibility of many LOD sources is a problem, with many unstable and difficult to discover.
• Most LOD sources are not adequately interlinked.
• It is difficult to identify objects referring to the same real-world entity across different LOD sources (or even within the same source).

“Users of linked data today are generally programmers and developers who are comfortable working directly with what is under the hood of this new technology. The rest of us are impatiently waiting for the user-friendly interface that will let us easily make use of linked data.” (Coyle, 2012)
7. Clouds and subclouds of the LOD universe

• LOD cloud: DBPedia as the HUB
• A limited number of sub-clouds (e.g., the Bio2RDF cloud), i.e., maps, or aggregators (e.g., LOD-a-LOT)
• IP LOD map: an innovation-oriented map with EP LOD as the HUB
• Why? To increase the discoverability and reusability of EP LOD data by integrating them into a sub-cloud (map) which showcases the linkability of this data with other LOD datasets.
• How? Not by relying purely on machine support, but by utilizing a diligent scientific approach.
8. THE LINKED OPEN DATA CLOUD

• The cloud currently contains 1,239 datasets, widely distributed across several categories, e.g., Government.
• It is evolving rapidly in terms of newly included datasets; the first version, from 2007, had 12 datasets.
• But only rudimentary information on the datasets is available, and many have dead links.

Source: https://lod-cloud.net/ (2017)
10. Descriptive Exploratory Plots: Anscombe's quartet

Data ≈ Experience

“Wise researchers conduct descriptive exploratory analyses of their data before fitting statistical models.”
- Judith D. Singer & John B. Willett

“Experience without theory is blind, but theory without experience is mere intellectual play.”
- Immanuel Kant

Each panel displays a scatter plot of 11 observations that share the same descriptive statistics:
1. Mean: x = 9; y = 7.5
2. Variance: x = 11; y = 4.125
3. Correlation: .816
4. Regression line: y = 3 + .5x
5. R-squared: .67

Johnson, Masyn, & McKelvie (2020); Anscombe (1973); Singer & Willett (2003);
Figure: https://en.wikipedia.org/wiki/File:Anscombe%27s_quartet_3.svg
[Figure: Anscombe's quartet, four scatter plots (1)-(4) of x vs. y with identical descriptive statistics]
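The shared statistics can be checked directly. A stdlib-only sketch for dataset I of the quartet (data values from Anscombe, 1973):

```python
from math import sqrt

# Anscombe's quartet, dataset I (Anscombe, 1973)
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
# Sample variance and covariance (n - 1 denominator)
var_x = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
var_y = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

r = cov_xy / sqrt(var_x * var_y)      # Pearson correlation
slope = cov_xy / var_x                # OLS regression slope
intercept = mean_y - slope * mean_x   # OLS intercept

print(round(mean_x, 1), round(mean_y, 1))    # 9.0 7.5
print(round(var_x, 2))                        # 11.0
print(round(r, 3))                            # 0.816
print(round(intercept, 2), round(slope, 2))   # 3.0 0.5
```

The other three datasets reproduce (almost exactly) the same summary numbers while looking entirely different when plotted, which is the point of the quartet.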
11. Name Disambiguation Steps
1. Cleaning and parsing
2. Blocking
3. Choose auxiliary variables
4. Compute potential matches using similarity scores
5. Create unique entities
Diligent alignment and meaningful connections between the LOD databases are key.
12. Training Data and Predictive Validity: Variance and Bias

[Figure: two scatter plots of x vs. y, an overfit model (high variance, low bias) and an underfit model (low variance, high bias)]

Johnson, Little, Masyn, McKelvie (2020); Hastie, Tibshirani, & Friedman (2009);
Figure: https://towardsdatascience.com/bias-variance-tradeoff-e8995c42b55b
“All models are wrong, but some are useful.”
- George E. P. Box

Machine learning models that overfit the training data produce excess variance when applied to new data. Similarly, underfit models produce excess bias with new data.
13. Machine Learning and Training Data: Data Split Ratio

More parsimonious models need less data to validate and tune.

Training data is a sample from the data, used to fit a parsimonious predictive model.

Validation data is another sample from the data, used to provide an unbiased evaluation of the predictive model from the previous step, followed by some tuning.

Test data is a third sample (drawn without replacement) from the data, used to provide an unbiased evaluation of a ‘final’ model, refined in the previous steps, without further adjustment.

Train → Validate → Test
Figure: https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7
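The three-way split described above can be sketched in a few lines. The 60/20/20 ratio and the function name below are illustrative choices, not a prescription:

```python
import random

def train_validate_test_split(records, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle records and split them (without replacement) into
    train, validation, and test subsets. Common ratios range from
    60/20/20 to 80/10/10; more parsimonious models need less data
    to validate and tune."""
    items = list(records)
    random.Random(seed).shuffle(items)   # fixed seed for reproducibility
    n = len(items)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = items[:n_train]
    validate = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, validate, test

train, validate, test = train_validate_test_split(range(100))
# 100 records -> 60 train, 20 validation, 20 test, no overlap
```

Because the three subsets are slices of one shuffled list, every record lands in exactly one subset, which is what makes the test-set evaluation unbiased.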
14. Name Disambiguation Steps
Alignment procedure: Literature Review (1)

Columns (✔ marks): IP LodB naïve | IP LodB intermediate | Torvik & Smalheiser (2009) | Li, …, Torvik, et al. (2014) | Pezzoni, Lissoni, et al. (2014)

1. Cleaning and parsing
a. Find relevant fields in (LOD) source and extract: ✔ ✔ ✔ ✔ ✔
b. Extract “family name” and “given name” from “author name” strings: ✔ ✔ ✔ ✔ ✔
c. Remove punctuation, accents, and double spaces (normalization): ✔ ✔ ✔ ✔ ✔
d. Convert to same format, e.g., ASCII: ✔ ✔ ✔ ✔ ✔
e. Remove redundant strings (and tokenize), e.g., c/o IBM: ✔ ✔
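Rows c and d above (normalization and conversion to ASCII) can be sketched with the Python standard library. The 'Family, Given' input format is an assumption for illustration:

```python
import string
import unicodedata

def normalize_name(raw):
    """Strip accents, fold to ASCII, remove punctuation and double
    spaces, and lowercase (rows c-d of the cleaning step)."""
    # Decompose accented characters and drop the combining marks
    ascii_name = (unicodedata.normalize("NFKD", raw)
                  .encode("ascii", "ignore").decode("ascii"))
    # Remove punctuation
    ascii_name = ascii_name.translate(
        str.maketrans("", "", string.punctuation))
    # Collapse whitespace and lowercase
    return " ".join(ascii_name.lower().split())

def parse_author(raw):
    """Extract (family name, given name) from a 'Family, Given' string
    (row b; the comma convention is a simplifying assumption)."""
    family, _, given = raw.partition(",")
    return normalize_name(family), normalize_name(given)

print(parse_author("Vučkovič,  Miha"))   # ('vuckovic', 'miha')
```

Row e (removing redundant strings such as “c/o IBM”) would sit on top of this, typically via a list of known noise tokens.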
15. Name Disambiguation Steps
Alignment procedure: Literature Review (2)

Columns (✔ marks): IP LodB naïve | IP LodB intermediate | Torvik & Smalheiser (2009) | Li, …, Torvik, et al. (2014) | Pezzoni, Lissoni, et al. (2014)

2. Blocking
a. Parse “author name” strings into a “family name” and “given name” and use alphabetic order: ✔ ✔ ✔
b. Parse all “author name” strings into “tokens” and use lexical distance: ✔ ✔
c. Block records using parsed “author name” strings, i.e., a prima facie list of author-document “match” candidates. In one block, similar records are collected, e.g., those having the same normalized family name. In subsequent steps, the algorithm only processes records within each block: ✔ ✔ ✔ ✔ ✔
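Row c above (collecting records with the same normalized family name into one block) can be sketched as a simple grouping. The record layout and document identifiers below are hypothetical:

```python
from collections import defaultdict

def block_records(records):
    """Group author-document records by normalized family name.
    Later steps only compare records within one block, which keeps
    the number of pairwise comparisons tractable."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["family_name"]].append(rec)
    return blocks

records = [
    {"family_name": "modic", "doc": "EP1097195"},
    {"family_name": "modic", "doc": "pub-123"},     # hypothetical id
    {"family_name": "johnson", "doc": "pub-456"},   # hypothetical id
]
blocks = block_records(records)
# two blocks: 'modic' with two records, 'johnson' with one
```

Without blocking, disambiguation would require comparing every record against every other record across the whole database.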
16. Name Disambiguation Steps
Alignment procedure: Literature Review (3)

Columns (✔ marks): IP LodB naïve | IP LodB intermediate | Torvik & Smalheiser (2009) | Li, …, Torvik, et al. (2014) | Pezzoni, Lissoni, et al. (2014)

3. Choose auxiliary variables
a. Select independent “auxiliary variables”, e.g., co-authors, organization affiliation: ✔ ✔ ✔ ✔ ✔
b. Extend “auxiliary variables” using parsing techniques described above, e.g., keywords, address: ✔ ✔
c. Create entity cards containing records that are a priori author-document “match” candidates, using similarity score values based on “auxiliary variables”: ✔ ✔
d. Refine entity card creation using a multi-dimensional vector based on “auxiliary variables”: ✔
17. Name Disambiguation Steps
Alignment procedure: Literature Review (4)

Columns (✔ marks): IP LodB naïve | IP LodB intermediate | Torvik & Smalheiser (2009) | Li, …, Torvik, et al. (2014) | Pezzoni, Lissoni, et al. (2014)

4. Compute potential matches using similarity scores
a. Refine “match” candidates using similarity scores based on parsed “family names” and “given names”, i.e., exceeding a user-defined threshold: ✔ ✔ ✔ ✔ ✔
b. Compute similarity scores using trial-and-error values from “auxiliary variables” for each pair of “match” candidates: ✔ ✔
c. Refine “match” candidates using estimated similarity scores from auxiliary variables looked up in a multi-dimensional vector: ✔ ✔ ✔
d. Create a joint entity for pairs of author-document records that exceed a user-defined similarity score threshold: ✔ ✔ ✔ ✔ ✔
e. Correct triplet violations using 3 degrees of separation: ✔ ✔ ✔ ✔
f. Repeat the process over the block as long as some change is made within each step: ✔ ✔ ✔ ✔
g. Adjust and apply the process on the joined datasets, i.e., EP LOD and SN SciGraph: ✔
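Rows a and d above can be sketched as pairwise similarity scoring with a user-defined threshold. difflib's ratio is used here as a stand-in for the various similarity measures in the cited papers, and the 0.85 threshold is an illustrative choice:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity in [0, 1] (difflib's ratio,
    used as a stand-in for the measures in the cited papers)."""
    return SequenceMatcher(None, a, b).ratio()

def match_candidates(block, threshold=0.85):
    """Within one block, keep record pairs whose given-name
    similarity meets or exceeds a user-defined threshold."""
    pairs = []
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if similarity(block[i]["given"], block[j]["given"]) >= threshold:
                pairs.append((i, j))
    return pairs

# Hypothetical block of records sharing one normalized family name
block = [{"given": "dolores"}, {"given": "dolores m"}, {"given": "dana"}]
pairs = match_candidates(block)
# only the 'dolores' / 'dolores m' pair clears the threshold
```

Row e (triplet violations) would then check pairs transitively: if records A-B and B-C match but A-C does not, the conflict must be resolved before entities are joined.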
18. Name Disambiguation Steps
Alignment procedure: Literature Review (5)

Columns (✔ marks): IP LodB naïve | IP LodB intermediate | Torvik & Smalheiser (2009) | Li, …, Torvik, et al. (2014) | Pezzoni, Lissoni, et al. (2014)

5. Create unique entities
a. Create a new entity from all remaining entities in a block (some joined through the process, some the same as before the process), with a union of the properties of all entities being joined into it: ✔ ✔ ✔ ✔
b. Create unique entities based on EP LOD and SN SciGraph alignment: ✔
c. Show its details on the webpage (www.iplod.io): ✔
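Row a above (joining matched record pairs transitively into unique entities) is naturally expressed with a union-find structure; a minimal sketch, with hypothetical record indices:

```python
def find(parent, i):
    """Find the root of record i, with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def unique_entities(n_records, matched_pairs):
    """Merge matched record pairs transitively (union-find), so each
    connected group of records becomes one unique entity carrying
    the union of its records' properties."""
    parent = list(range(n_records))
    for i, j in matched_pairs:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[rj] = ri   # union the two groups
    groups = {}
    for i in range(n_records):
        groups.setdefault(find(parent, i), []).append(i)
    return list(groups.values())

# Records 0-1 and 1-2 matched, so 0, 1, 2 form one entity; 3 stands alone
entities = unique_entities(4, [(0, 1), (1, 2)])
```

The union of properties per group is then straightforward: merge the attribute sets of all records in each returned list.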
19. Some use cases...

CASE 1: Additional information on the inventor. You found on EP LOD a patent you are interested in: EP1097195, entitled “Screening of neisserial vaccine candidates and vaccines against pathogenic Neisseria”, where the applicant was the University of Nottingham. You see that the first-named inventor is Ala Aldeen Dlawer. You wish to know whether you could cooperate with him; knowing that a common science communication channel is Twitter, you simply look up his Twitter account on Wikidata and, surprisingly, find out he went into politics. You will need to target some other expert for this.

CASE 2: Additional information on the technology: connecting EP LOD to UNIPROT.

CASE 3: Connecting to additional data on patents: the case of SEP (standard-essential) patents.
SEP figures: Lorenz Brachtendorf, Fabian Gaessler, Dietmar Harhoff (2019), Approximating the Standard Essentiality of Patents: A Semantics-based Analysis.
20. Thank you for your attention.

We gratefully acknowledge that this work has been co-sponsored by the Academic Research Programme of the European Patent Office.

The research results and views contained in these materials or presented during the workshop are those of the researchers only. They do not necessarily represent the views of the EPO.

We also thank NIPO and Nord University for their support in preparing and organizing this workshop event.