Scientific Knowledge Graphs: an Overview

Dr Angelo Salatino
Knowledge Media Institute
The Open University
United Kingdom
Université Libre de Bruxelles - 12th May 2021

About me – Angelo Salatino
Research Associate and Associate Lecturer at the Open University
Research Interests: i) new technologies for classifying scientific
papers according to their relevant research topics, and ii) how the
research output of academia fosters innovation in the industry
At the SKM3 team we produce innovative approaches leveraging
large-scale data mining, semantic technologies, machine learning,
and visual analytics to extract meaning from scholarly data and
shed light on research dynamics
angelo.salatino@open.ac.uk https://salatino.org @angelosalatino

This work is licensed under a Creative Commons Attribution 4.0
International License.

Agenda
• Scientific Knowledge Graphs
• AIDA
• Use cases of AIDA
• Practical tests

Why do we need Scientific Knowledge Graphs?

Science of Science
Picture from the cover of Science Vol 361, Issue 6408

Science of Science
• Science of Science is a multidisciplinary field which helps us to
understand in a quantitative fashion the evolution of science.
• This is possible by capitalising on large amounts of data scientists
produce nowadays:
• Research articles
• Pre-prints
• Grant proposals
• Patents
• Spoiler: SKGs come in quite handy for structuring and collecting such
data

Scholarly Data
Improving Editorial Workflow and
Metadata Quality at Springer Nature.
Identifying the research topics that best describe the scope of a scientific publication is a
crucial task for editors, in particular because the quality of these annotations determines how
effectively users are able to discover the right content in online libraries. For this reason,
Springer Nature, the world’s largest academic book publisher, has traditionally entrusted this
task to their most expert editors. These editors manually analyse all new books, possibly
including hundreds of chapters, and produce a list of the most relevant topics. Hence, this
process has traditionally been very expensive, time-consuming, and confined to a few senior
editors. For these reasons, back in 2016 we developed Smart Topic Miner (STM), an ontology-
driven application that assists the Springer Nature editorial team in annotating the volumes of
all books covering conference proceedings in Computer Science. Since then STM has been
regularly used by editors in Germany, China, Brazil, India, and Japan, …
Angelo Salatino
Francesco Osborne
Aliaksandr Birukou
Enrico Motta
The Open University
Springer Nature
The 18th International Semantic Web Conference (ISWC 2019)
Affiliations
Authors
Citations
References
Conference/Journal
Text: Title, Abstract
Keywords
Scholarly data, Bibliographic metadata, Topic classification, Topic detection, …

Scholarly Data
Angelo Salatino
Francesco Osborne
Aliaksandr Birukou
Enrico Motta
The 18th International Semantic Web Conference (ISWC 2019)
Authors
Conference/Journal
Text: Title, Abstract
Keywords
Scholarly data, Bibliographic metadata, Topic classification, Topic detection, …
Keywords
Topic Detection, Science Of Science, Topic classification, Semantic Web, …
23rd International Conference on Theory and Practice of Digital Libraries(TPDL 2019)
Conference/Journal
Conference/Journal

Scholarly Data
[Figure: researchers, institutions, funders, projects, and communities linked to each other and to other literature through relations such as <publish>, <co-op>, <cite>, and funding (<$$$>).]
Courtesy of Andrea Mannocci from “Big Scholarly Data and Applications”

Scientific Knowledge Graph
Scientific Knowledge Graphs (SKGs) are a way
of representing scholarly knowledge in a
structured, interlinked, and semantically rich
manner.

Scientific Knowledge Graph - Definition
Given a set of entities E, and a set of relations R, a Scientific Knowledge Graph is a
directed multi-relational graph G that comprises triples (subject, predicate, object)
and is a subset of the cross product G ⊆ E ⨉ R ⨉ E.
Nodes and edges have well-defined meanings
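The definition above can be sketched in a few lines of Python. This is only an illustration: the entity and relation names are hypothetical, and the graph is held as a plain set of (subject, predicate, object) tuples rather than in a real triple store.

```python
from itertools import product

# Illustrative entities and relations (hypothetical names, not a real SKG schema).
E = {"paper_1", "author_a", "author_b", "org_x"}
R = {"has_author", "has_affiliation"}

# A graph is a set of (subject, predicate, object) triples.
G = {
    ("paper_1", "has_author", "author_a"),
    ("paper_1", "has_author", "author_b"),
    ("author_a", "has_affiliation", "org_x"),
}

def is_valid_graph(graph, entities, relations):
    """Return True if every triple belongs to the cross product E x R x E."""
    return graph <= set(product(entities, relations, entities))

def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}
```

For example, `objects(G, "paper_1", "has_author")` collects the two authors of the paper, while a triple that uses a relation outside R makes `is_valid_graph` return False.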

Representation through Resource Description Framework
RDF is a standard for data interchange that is used for representing highly interconnected data.
Each RDF statement is a three-part structure consisting of resources where every resource is
identified by a URI. Representing data in RDF allows information to be easily identified,
disambiguated and interconnected by AI systems.
Previous graph in RDF (NT format):
<https://skg.org/paper_635219> <https://skg.org/sc#title> "Detection, Analysis, …"@en .
<https://skg.org/paper_635219> <https://skg.org/sc#abstract> "Analysing rese …"@en .
<https://skg.org/paper_635219> <https://skg.org/sc#has_keyword> "Scholarly Communication"@en .
<https://skg.org/paper_635219> <https://skg.org/sc#type> <https://skg.org/sc#paper> .
<https://skg.org/paper_635219> <https://skg.org/sc#has_author> <https://skg.org/angelo_salatino> .
<https://skg.org/paper_635219> <https://skg.org/sc#has_author> <https://skg.org/francesco_osborne> .
<https://skg.org/paper_635219> <https://skg.org/sc#has_author> <https://skg.org/andrea_mannocci> .
<https://skg.org/angelo_salatino> <https://skg.org/sc#has_affiliation> <https://skg.org/open_university> .
<https://skg.org/francesco_osborne> <https://skg.org/sc#has_affiliation> <https://skg.org/open_university> .
<https://skg.org/andrea_mannocci> <https://skg.org/sc#has_affiliation> <https://skg.org/italian_research_council> .
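Because N-Triples is line-based, simple statements like the ones above can be read with a small regular expression. The sketch below is a minimal illustration that assumes one statement per line and supports only URI objects and plain or language-tagged literals; the function name `parse_line` is hypothetical, and a dedicated RDF library would be the safer choice for real data.

```python
import re

# Matches one simplified N-Triples statement: <s> <p> (<o> | "literal"@lang) .
TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"((?:[^"\\]|\\.)*)"(?:@([A-Za-z-]+))?)\s*\.$'
)

def parse_line(line):
    """Parse one statement into (subject, predicate, object, is_literal)."""
    match = TRIPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a supported N-Triples line: {line!r}")
    subj, pred, uri_obj, literal, _lang = match.groups()
    if uri_obj is not None:
        return subj, pred, uri_obj, False
    return subj, pred, literal, True

lines = [
    '<https://skg.org/paper_635219> <https://skg.org/sc#has_keyword> "Scholarly Communication"@en .',
    '<https://skg.org/paper_635219> <https://skg.org/sc#has_author> <https://skg.org/angelo_salatino> .',
    '<https://skg.org/angelo_salatino> <https://skg.org/sc#has_affiliation> <https://skg.org/open_university> .',
]
triples = [parse_line(line) for line in lines]

# Objects of every has_author statement.
authors = [o for s, p, o, is_literal in triples if p.endswith("#has_author")]
```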

However …
Not all SKGs are published in RDF. For example, here is a sample record from Dimensions:
{
"format":3,
"status":"active",
"id":"pub.1009237776",
"publication_type":"article",
"doi":"10.1093/bybil/49.1.221",
"version_of_record":"https://doi.org/10.1093/bybil/49.1.221",
"pmid":"2650905",
"pmcid":"5381240",
"title":"Does Penicillin Kill Bacteria?",
"year":2017,
"concepts":{
"structure":0.3,
"fire":0.34,
"case study":0.01,
"serviceability":0.6,
"damage":0.12,
"residual mechanical properties":0.67
},
"publication_date":"2017-12-25",
"volume":"32",
"issue":"4",
"pages":"330-333",
….
"journal":{
"id":"jour.1138253",
"title":"The Journal of Clinical Evidence",
"issn":"0068-2691",
"eissn":"2044-9437"
},
"publisher":{
"id":"pblshr.1001577",
"name":"Radiological Society of North America (RSNA)"
},
"journal_lists":[
"ERA 2015",
"Norwegian register level 2",
"PubMed"
],
"clinical_trials":[
"NCT00605345"
],
"open_access_categories":[
"closed"
],
"author_affiliations":[{
"first_name":"Ian",
"last_name":"Bobbington",
"grid_ids":["grid.5335.0", "grid.1001.0"]},
….
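A JSON record like this can still be mapped onto triples. The sketch below is a rough illustration that uses a trimmed copy of the record (elided fields are left out) and hypothetical predicate names such as `has_concept`; it is not how Dimensions or any other provider models its data.

```python
import json

# A trimmed copy of the sample record; elided fields are simply left out.
record_text = """
{
  "id": "pub.1009237776",
  "title": "Does Penicillin Kill Bacteria?",
  "year": 2017,
  "concepts": {
    "structure": 0.3,
    "fire": 0.34,
    "case study": 0.01,
    "serviceability": 0.6,
    "damage": 0.12,
    "residual mechanical properties": 0.67
  },
  "journal": {"id": "jour.1138253", "title": "The Journal of Clinical Evidence"}
}
"""

def to_triples(record):
    """Flatten a few fields into (subject, predicate, object) tuples.
    The predicate names are hypothetical, chosen only for this sketch."""
    subject = record["id"]
    triples = [
        (subject, "title", record["title"]),
        (subject, "year", record["year"]),
        (subject, "published_in", record["journal"]["id"]),
    ]
    for concept in record.get("concepts", {}):
        triples.append((subject, "has_concept", concept))
    return triples

record = json.loads(record_text)
triples = to_triples(record)

# Concept with the highest score in the record.
top_concept = max(record["concepts"], key=record["concepts"].get)
```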

Big Scholarly Datasets
• Web of Science
• Scopus
• Google Scholar
• Microsoft Academic Graph
• MA-KG, ma-graph.org
• PubMed
• Dimensions
• Semantic Scholar
• DBLP
• Open Academic Graph
• ScholarlyData
• PID Graph
• Open Research Knowledge Graph
• OpenCitations
• OpenAIRE research graph
• Crossref
• Academy/Industry Dynamics KG
(AIDA)
Disclaimer: this is far from being exhaustive

Differences between datasets
All these datasets are different from
each other:
• size
• scope
• quality
• mistakes, author disambiguation
• index vs. scraping
• comprehensiveness
• integration with other sources
• format
• access to data: license
“The comparison considers all scientific
documents from the period 2008–2017
covered by these data sources.”
Picture from Martijn Visser, Nees Jan van Eck, and Ludo Waltman. "Large-scale comparison of bibliographic
data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic." (2021).

Differences between datasets: scope
• MAG, Dimensions, Scopus, WOS cover
all areas of Science
• DBLP covers Computer Science
• PubMed covers the field of Medicine
• Semantic Scholar covers Computer
Science and Medicine

Differences between datasets: quality
• Cleaning data.
• Removing duplicates (same document
can appear in multiple places on the
web)
• Disambiguating authors
• Disambiguating affiliations
• Disambiguating references
WoS > Scopus > MAG

Example of paper formats
Automatic tools such as GROBID keep getting better at
extracting metadata, but they still make
errors

Differences between datasets: index vs. scraping
• Web of Science, Scopus, and PubMed index papers:
• Papers are added to the collection in a
controlled way (metadata are curated)
• MAG, Dimensions, Google Scholar scrape
the web (journal websites) and PDFs.
• This also leads to other quality issues like
identifying the correct metadata

Differences between datasets: comprehensiveness
Columns (left to right): MAG, Dimensions, Crossref, Scopus, WOS, Semantic Scholar, OpenAIRE RG, OpenAG, PubMed
title x x x x x x x x x
id x x x x x x x x x
abstract x p x x x x x x x
year x x x x x x x x x
references x x x x x x x x
authors x x x x x x x x x
doi x x x x x x x x x
topics x x x x x
citationcount x x x x x x x
conferences x x x x p x x
journal x x x x x p x x x
authors keywords x x
Legend
x = ok
p = partial information

Scopus RIS tags
Description                             New Scopus RIS tag
Abbreviated source title                J2
Abstract                                AB
Affiliations                            AD
Article number                          C7
Article title                           TI
Authors                                 AU
Chemical name and CAS registry number   N1
Cited by count                          N1
Conference Code                         N1
CODEN                                   N1
Conference name                         T2
Correspondence name                     N1
Conference date                         Y2
DOI                                     DO
Editor                                  A2
End tag                                 ER
Export date                             N1
First page                              SP
Funding Details                         N1
ISSN/ISBN/EISSN                         SN
Issue                                   IS
Keywords                                KW
Language                                LA
Last page                               EP
Conference Location                     CY
Manufacturers                           N1
PMID/PMCID                              C2
Proceedings title                       C3
Publication year                        PY
Publisher                               PB
References                              N1
Scopus database                         DB
Scopus URL                              UR
Second article title                    ST
Sequence database accession number      N1
Source title                            T2
Source type                             TY
Document type                           M3
Conference sponsors                     A4

Differences between datasets: integration with other sources
Open Academic Graph integrates MAG
and AMiner
OpenAIRE research graph integrates
Crossref, Unpaywall, ORCID and MAG
AIDA integrates MAG, GRID, DBpedia,
CSO

Differences between datasets: format
• Dimensions, PubMed, Semantic
Scholar, Crossref are distributed in
JSON
• WOS, MAG in TSV files (MAKG as RDF)
• AIDA, ScholarlyData, OpenCitations in
RDF
• DBLP in XML

Differences between datasets: access to data and license
Available for free:
• Semantic Scholar [ODC-BY]
• Dimensions.ai (if you are a scholar)
Manageable fee:
• MAG ($50 to download) [ODC-BY]
Costly:
• Scopus
• Web of Science
Not available to buy:
• Google Scholar

Academia/Industry DynAmics (AIDA) Knowledge Graph
• 14M papers and 8M patents annotated with research topics from the Computer
Science Ontology (CSO)
• 4M papers and 5M patents classified according to the type of the author’s
affiliations (academia, industry, or collaborative) and 66 industrial sectors from
Industrial Sectors Ontology (INDUSO)
• Released as an RDF Graph and available via SPARQL or as a dump
http://w3id.org/aida

AIDA pipeline
[Figure: research papers and patents are each filtered, then processed by the CSO Classifier (backed by the Computer Science Ontology) and by the extraction of affiliation types and industry sectors (backed by INDUSO.ttl); an RDF generator, following the AIDA Schema, produces the Academia/Industry DynAmics (AIDA) Knowledge Graph.]

AIDA pipeline
pip install cso-classifier
Salatino, A.A., Osborne, F., Thanapalasingam, T., Motta, E.: The CSO Classifier: Ontology-Driven
Detection of Research Topics in Scholarly Articles. In: TPDL 2019: 23rd International Conference
on Theory and Practice of Digital Libraries. Springer.

CSO Classifier
Uses state-of-the-art technologies to parse documents and recognise
research concepts/topics. As input, it takes the metadata associated with a
research paper (title, abstract, keywords) and returns a selection of
research concepts drawn from the Computer Science Ontology

Syntactic Module
• We split the text into unigrams, bigrams and trigrams
• For each n-gram we measure the Levenshtein similarity with the topics in CSO
• We select CSO topics whose similarity with an n-gram is at least 0.94
• Helps handle plurals, hyphenated topics, and American vs. British spelling such as:
• “knowledge based systems” and “knowledge-based systems”
• “database” and “databases”
• “data visualisation” and “data visualization”
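The steps above can be sketched as follows. This is a minimal illustration, not the classifier's code: it assumes the similarity is the normalised ratio offered by common Levenshtein libraries (substitutions cost twice as much as insertions or deletions), which is computed here through the longest common subsequence. The function names and the small topic list are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Normalised similarity in [0, 1]: 2 * LCS / (len(a) + len(b))."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))

def ngrams(text, max_n=3):
    """All unigrams, bigrams and trigrams of a whitespace-tokenised text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def syntactic_match(text, topics, threshold=0.94):
    """Topics whose similarity with at least one n-gram reaches the threshold."""
    return {topic for gram in ngrams(text) for topic in topics
            if similarity(gram, topic) >= threshold}

# Hypothetical mini-vocabulary standing in for the CSO topics.
topics = ["knowledge based systems", "database", "data visualization", "semantic web"]
text = "We present knowledge-based systems and databases for data visualisation"
matched = syntactic_match(text, topics)
```

With this ratio, the three example pairs from the slide score roughly 0.957, 0.941 and 0.944, so all three pass the 0.94 threshold while unrelated topics do not.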

Semantic Module
• We used a Word Embedding model to capture semantics of words.
• We process the documents
• for each relevant word
• we retrieve from the model its related words
• then we check if those words are in the Computer Science Ontology.

Word Embedding model
“king” = [0.32, 0.76,…]
“queen” = [0.42, 0.76,…]
“woman” = [0.56, 0.43,…]
“man” = [0.59, 0.42,...]
king + (woman – man) = queen
It locates synonyms
(related topics) close to
each other in this vector
space: high cosine
similarity
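To make the notion of cosine similarity concrete, here is a small sketch. It treats the first two components shown on the slide as toy 2-D vectors, which is purely an assumption for illustration, since real embeddings have many more dimensions. The helper names are hypothetical.

```python
import math

# Toy 2-D vectors: only the first two components shown on the slide are used.
vectors = {
    "king": [0.32, 0.76],
    "queen": [0.42, 0.76],
    "woman": [0.56, 0.43],
    "man": [0.59, 0.42],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(target, exclude=()):
    """Word whose vector has the highest cosine similarity with the target."""
    candidates = {w: cosine(target, vec) for w, vec in vectors.items() if w not in exclude}
    return max(candidates, key=candidates.get)

# king + (woman - man), then look up the nearest word that is not an input.
analogy = [k + (w - m) for k, w, m in zip(vectors["king"], vectors["woman"], vectors["man"])]
nearest = most_similar(analogy, exclude={"king", "woman", "man"})
```

In this toy space "king" is closer to "queen" than to "man", which is the property the Semantic Module relies on when it looks up related words.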

Semantic Module
Word Embedding model
• We used titles and abstracts from 4.5M papers in Computer Science
• Pre-processed text:
• Topic replacement – “digital libraries” → “digital_libraries”
• Collocation analysis – “highest_accuracies”, “highly_cited_journals”
• Trained word embeddings model (word2vec)
method: skipgram
emb. size: 128
window size: 10
negative: 5
max iter.: 5
min-count cutoff: 10
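The topic replacement step can be sketched as below. It is a simplified illustration under stated assumptions: the text is lower-cased, topics are matched on word boundaries, and longer topics are replaced first. The function name `replace_topics` is hypothetical and the collocation analysis is not covered.

```python
import re

def replace_topics(text, topics):
    """Join each known multi-word topic with underscores so it becomes one token."""
    text = text.lower()
    # Longest topics first, so "semantic web services" wins over "semantic web".
    for topic in sorted(topics, key=len, reverse=True):
        pattern = r"\b" + re.escape(topic) + r"\b"
        text = re.sub(pattern, topic.replace(" ", "_"), text)
    return text

example = replace_topics(
    "Digital libraries and the semantic web services",
    ["semantic web", "digital libraries", "semantic web services"],
)
```

After this step a word embedding model sees "digital_libraries" as a single vocabulary entry, so it can learn one vector for the whole topic.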

Semantic Module
Entity Extraction
• POS tagger, and grammar-based chunk parser <JJ.*>*<NN.*>+
“digital libraries”
CSO concept identification
• Selects all CSO topics found in the top-10 similar words of the resulting n-grams
(with cosine similarity > 0.7)
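The chunk grammar <JJ.*>*<NN.*>+ (zero or more adjectives followed by one or more nouns) can be mimicked in plain Python. The sketch below assumes the tokens are already tagged; the tags are supplied by hand here, whereas in practice they would come from a POS tagger. The function name `noun_chunks` is hypothetical.

```python
def noun_chunks(tagged):
    """Extract maximal runs of adjectives (JJ*) followed by nouns (NN*)."""
    chunks, adjectives, nouns = [], [], []

    def flush():
        # A chunk is only valid if it contains at least one noun.
        if nouns:
            chunks.append(" ".join(adjectives + nouns))
        adjectives.clear()
        nouns.clear()

    for word, tag in tagged:
        if tag.startswith("NN"):
            nouns.append(word)
        elif tag.startswith("JJ"):
            if nouns:
                flush()  # an adjective after nouns starts a new chunk
            adjectives.append(word)
        else:
            flush()
    flush()
    return chunks

# Hand-tagged example sentence (Penn Treebank style tags).
tagged = [
    ("We", "PRP"), ("study", "VBP"), ("digital", "JJ"), ("libraries", "NNS"),
    ("and", "CC"), ("large", "JJ"), ("knowledge", "NN"), ("graphs", "NNS"),
]
chunks = noun_chunks(tagged)
```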

Semantic Module
Concept ranking
• We assign a score to each identified topic:
• Frequency – number of times it was inferred
• Diversity – number of unique text chunks from which it was inferred
Concept Selection
• Elbow method
CSO Topic score
domain ontologies 40
semantic web 40
ontology learning 40
data mining 40
heterogeneous resources 24
semantics 24
world wide web 10
network architecture 6
scholarly communication 6
ontology matching 6
… …
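One simple way to apply an elbow cut to a ranked list like this is to pick the point farthest from the straight line joining the first and last scores. The sketch below uses that heuristic as an assumption; the classifier may implement the elbow method differently. The function name `elbow_index` is hypothetical.

```python
def elbow_index(scores):
    """Index of the point farthest from the line joining the first and last
    scores. The scores must be sorted in descending order."""
    n = len(scores)
    if n < 3:
        return n - 1
    x1, y0, y1 = n - 1, scores[0], scores[-1]

    def distance(i):
        # Unnormalised distance of point (i, scores[i]) from the chord.
        return abs((y1 - y0) * i - x1 * scores[i] + x1 * y0)

    return max(range(n), key=distance)

# Ranked topics and scores as listed on the slide.
ranked = [
    ("domain ontologies", 40), ("semantic web", 40), ("ontology learning", 40),
    ("data mining", 40), ("heterogeneous resources", 24), ("semantics", 24),
    ("world wide web", 10), ("network architecture", 6),
    ("scholarly communication", 6), ("ontology matching", 6),
]
cut = elbow_index([score for _, score in ranked])
# Keep every topic up to and including the elbow.
selected = [topic for topic, _ in ranked[:cut + 1]]
```

On these scores the heuristic places the elbow at the last topic scoring 40, so the four top-scoring topics are kept.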

Post Processing
Combination of output
Semantic enhancement
• We use the superTopicOf relation to enhance the output set
• E.g., if “machine learning” then also “artificial intelligence”
• Provides wider context for the analysed paper
• Enables analytics on high-level abstract topics (e.g., digital libraries)
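The enhancement step can be sketched as a lookup of parent topics. The mapping below is a hypothetical fragment written for this example (only the "machine learning" to "artificial intelligence" link comes from the slide), and the `levels` parameter is an assumption about how far up the hierarchy one might climb.

```python
# Hypothetical fragment of the ontology: topic -> its direct super topics.
PARENTS = {
    "machine learning": {"artificial intelligence"},
    "artificial intelligence": {"computer science"},
    "digital libraries": {"computer science"},
}

def enhance(topics, levels=1):
    """Add the super topics found up to `levels` steps above the given topics."""
    enhanced = set(topics)
    frontier = set(topics)
    for _ in range(levels):
        # Parents of the current frontier that have not been added yet.
        frontier = {p for t in frontier for p in PARENTS.get(t, ())} - enhanced
        enhanced |= frontier
    return enhanced

one_level = enhance({"machine learning"})
two_levels = enhance({"machine learning"}, levels=2)
```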

Scholarly Data++
Angelo Salatino
Francesco Osborne
Aliaksandr Birukou
Enrico Motta
The Open University
Springer Nature
Affiliations
Authors
Citations
References
Conference/Journal
Text: Title, Abstract, Keywords
scholarly data, semantic web, data mining, ontology, digital libraries, …
Topics
Affiliation Types
Academia
Industry
Keywords
Scholarly data, Bibliographic metadata, Topic classification,
Industrial Sectors
Publishing

Research Flow: Understanding the Knowledge Flow between Academia and
Industry
Each research topic is represented through 4 signals:
Papers from Academia (RA)
Papers from Industry (RI)
Patents from Academia (PA)
Patents from Industry (PI)
A. Salatino, F. Osborne, E. Motta. ResearchFlow:
Understanding the Knowledge Flow between
Academia and Industry. In Knowledge Engineering and
Knowledge Management – 22nd International
Conference, EKAW 2020, Springer, 2020

Diachronic analysis of topics
• First, we normalized all signals with respect to those of the main topic,
Computer Science
• We devised two indices: RP and AI
$$RP_t = \frac{R_t - P_t}{R_t + P_t}; \quad AI_t = \frac{A_t - I_t}{A_t + I_t}$$
• We performed a global analysis over 2007-2018
• Topic evolution: we split the time period into 4 windows of 3 years each,
computed RP and AI, and used the slope α of the line f(x) = α·x + β to assess
each topic's evolution
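A small worked sketch of the two indices and the slope follows. It assumes that R and P stand for the (normalized) amounts of research papers and patents of a topic, and A and I for the output from academia and industry; the slide does not spell this out. The counts are hypothetical and the slope is a plain least-squares fit over window positions 0, 1, 2, 3.

```python
def balance_index(a, b):
    """Normalised difference (a - b) / (a + b), in [-1, 1]; 0.0 if both are zero."""
    total = a + b
    return (a - b) / total if total else 0.0

def slope(values):
    """Least-squares slope alpha of f(x) = alpha * x + beta, fitted to the
    values at x = 0, 1, 2, ... (needs at least two values)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator

# Hypothetical (papers, patents) amounts for one topic in four 3-year windows.
windows = [(90, 10), (80, 20), (70, 30), (60, 40)]
rp_per_window = [balance_index(r, p) for r, p in windows]
rp_trend = slope(rp_per_window)
```

Here RP falls from 0.8 to 0.2 across the windows, so the slope is negative: the topic is moving from papers towards patents. The same two functions apply to AI with academia and industry amounts.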

Diachronic analysis of topics
Distribution of topics according to RP and AI in 2007-18

Topic evolution in 2007-2018 - examples

Forecasting Topic Impact on Industry
• We created a new approach for predicting the impact of a topic on industry.
• It uses four time series: i) publications from academia, ii) publications from
industry, iii) patents from academia, and iv) patents from industry.
• We tested it on the task of predicting whether an emergent research topic will have a
significant impact on industry (> 50 patents) in the following 10 years.
• This evaluation substantiates the hypothesis that considering the four time series
separately is conducive to higher quality predictions and suggests that RI and RA
are good indicators for PI.

Machine Learning approach
We used:
• Logistic Regression (LR)
• Random Forest (RF)
• AdaBoost (AB)
• Convolutional Neural Network (CNN)
• Long Short-term Memory Neural Network (LSTM)
On several combinations of time-series: RA, RI, PA and PI

Forecasting Topic Impact on Industry

Conference Dashboard
Angioni, Simone, et al. "The AIDA Dashboard: Analysing Conferences with Semantic Technologies."

AIDA35K – A similar but not-so-similar version of AIDA
Download: http://aida.kmi.open.ac.uk/aida35k/downloads/aida35k.ttl.zip

AIDA35K – Stats
• Contains 35 thousand papers in the fields of Semantic Web and Neural Networks
• 249,969 facts (triples)
• 26 different relationships

Relationships from paper
• hasAuthor states the author of the paper
• hasConfName and hasConfSeries provide details
about the conference: “The 21st World Wide
Web Conference” and “webconf”
• hasCsoEnhancedTopic, topics extracted with the
CSO Classifier
• hasEntityType defines the type of document
“paper”
• hasJourName states the name of the journal
• hasReference points to all referenced papers
• hasType defines whether the paper is from
academia, industry, or collaborative
• hasIndustrialSector, if a paper is from industry, it
describes the industrial sector of the company
• hasYear states the publishing year

Additional relationships from paper with reification
• hasAffiliationDistribution describes the
affiliation of authors. The object of this
relationship is another statement: reified
object.
• This reified object then contains
hasAffiliation and hasAffiliation-weight
identifying the affiliation of the paper and
the percentage of authors belonging to
that affiliation.

To better understand reification
• Imagine three authors co-author a paper: Angelo and Francesco from The Open University,
and Dimitris from the Université Libre De Bruxelles.
• In simple RDF:
@prefix sc: <http://aida.kmi.open.ac.uk/aida35k/ontology#>.
<https://aida35k.org/p_654> sc:hasEntityType sc:paper .
<https://aida35k.org/p_654> sc:hasAuthor <https://aida35k.org/angelo_salatino> .
<https://aida35k.org/p_654> sc:hasAuthor <https://aida35k.org/francesco_osborne> .
<https://aida35k.org/p_654> sc:hasAuthor <https://aida35k.org/dimitris_sacharidis> .
<https://aida35k.org/p_654> sc:hasAffiliation "The Open University" .
<https://aida35k.org/p_654> sc:hasAffiliation "Université Libre De Bruxelles" .
<https://aida35k.org/p_654> sc:hasAffiliation-weight 0.66 .
<https://aida35k.org/p_654> sc:hasAffiliation-weight 0.33 .
Well. How do we tell which affiliation has weight 0.33?

A revised version with reification
• Imagine three authors co-author a paper: Angelo and Francesco from The Open University,
and Dimitris from the Université Libre De Bruxelles.
• With reification:
@prefix sc: <http://aida.kmi.open.ac.uk/aida35k/ontology#>.
@prefix re: <https://aida35k.org/> .
re:p_654 sc:hasEntityType sc:paper .
re:p_654 sc:hasAuthor re:angelo_salatino .
re:p_654 sc:hasAuthor re:francesco_osborne .
re:p_654 sc:hasAuthor re:dimitris_sacharidis .
re:p_654 sc:hasAffiliationDistribution re:AffiliationDistribution_p_654_open_university .
re:p_654 sc:hasAffiliationDistribution re:AffiliationDistribution_p_654_universite_libre_de_bruxelles .
re:AffiliationDistribution_p_654_open_university sc:hasAffiliation "The Open University" .
re:AffiliationDistribution_p_654_open_university sc:hasAffiliation-weight 0.66 .
re:AffiliationDistribution_p_654_universite_libre_de_bruxelles sc:hasAffiliation "Université Libre De Bruxelles" .
re:AffiliationDistribution_p_654_universite_libre_de_bruxelles sc:hasAffiliation-weight 0.33 .
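To see why the intermediate node resolves the ambiguity, the reified statements can be walked in plain Python. This is only a sketch: the triples are written as tuples, the distribution node names are shortened for readability, and the function name `affiliation_weights` is hypothetical.

```python
# Triples from the reified example, written as tuples with shortened node names.
triples = [
    ("p_654", "hasAffiliationDistribution", "dist_open_university"),
    ("p_654", "hasAffiliationDistribution", "dist_universite_libre_de_bruxelles"),
    ("dist_open_university", "hasAffiliation", "The Open University"),
    ("dist_open_university", "hasAffiliation-weight", 0.66),
    ("dist_universite_libre_de_bruxelles", "hasAffiliation", "Université Libre De Bruxelles"),
    ("dist_universite_libre_de_bruxelles", "hasAffiliation-weight", 0.33),
]

def affiliation_weights(triples, paper):
    """Map each affiliation of a paper to its weight via the reified nodes."""
    nodes = [o for s, p, o in triples
             if s == paper and p == "hasAffiliationDistribution"]
    weights = {}
    for node in nodes:
        # Each reified node carries exactly one affiliation and one weight.
        name = next(o for s, p, o in triples if s == node and p == "hasAffiliation")
        weight = next(o for s, p, o in triples if s == node and p == "hasAffiliation-weight")
        weights[name] = weight
    return weights

weights = affiliation_weights(triples, "p_654")
```

Each weight is now attached to the same node as its affiliation, so the pairing that was lost in the flat version can be recovered.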

Additional relationships from paper with reification
• hasCitationDistribution describes the
received citations. The reified object then
contains hasCitationYear and
hasCitationYear-weight identifying the
year and the percentage of total citations
received.
• hasCountryDistribution describes the
countries of the affiliations. Similar to
hasAffiliationDistribution
• hasGridTypeDistribution describes the
grid types of the paper. The reified object
contains hasGridType and hasGridType-
weight identifying the type and the
percentage of affiliations with such type.

Relationships from author
• hasPaper states the paper written by
the author
• hasNetworkInDistribution describes
the affiliation of authors. Similar to
hasAffiliationDistribution
• hasWorkedInDistribution describes
the countries of the affiliations. Similar
to hasCountryDistribution

How do we interact with such data?

GraphDB – Import (leave default values)

GraphDB – Write SPARQL query

Running SPARQL queries
• Describe
• Select papers by year
• Identify types

• Get all ‘industry’ papers and their affiliations
• Get 100 ‘academia’ papers and their affiliations

• Get papers written by Carnegie Mellon University
• Count papers written by United States researchers

• Count citation of a paper
• Count papers of a topic

• Get Journals containing the word ‘semantic’
• ASK
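A few of the queries listed above might look like the sketch below, kept as Python strings so they can be pasted into a SPARQL editor or sent to an endpoint. The predicate names come from the relationship slides and the prefix from the reification example. How values such as the year are stored (plain literal or typed) is an assumption, which is why the filters compare string forms. The dictionary name `QUERIES` and the year 2015 are arbitrary choices for this sketch.

```python
# Namespace taken from the Turtle examples; adjust if the dump uses another one.
PREFIX = "PREFIX sc: <http://aida.kmi.open.ac.uk/aida35k/ontology#>\n"

QUERIES = {
    # Select papers by year (string comparison avoids assuming a datatype).
    "papers_by_year": PREFIX + """
SELECT ?paper ?year WHERE {
  ?paper sc:hasYear ?year .
  FILTER(STR(?year) = "2015")
} LIMIT 100
""",
    # Get journals whose name contains the word 'semantic'.
    "semantic_journals": PREFIX + """
SELECT DISTINCT ?journal WHERE {
  ?paper sc:hasJourName ?journal .
  FILTER(CONTAINS(LCASE(STR(?journal)), "semantic"))
}
""",
    # ASK returns true or false: is there any paper with a publishing year?
    "any_paper_with_year": PREFIX + """
ASK { ?paper sc:hasYear ?year }
""",
}

for name, query in QUERIES.items():
    print(f"--- {name} ---{query}")
```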

References
• Simone Angioni, Angelo Salatino, Francesco Osborne, Diego Reforgiato Recupero, and
Enrico Motta. Integrating Knowledge Graphs for Analysing Academia and Industry
Dynamics. Scientific Knowledge Graph Workshop at TPDL 2020.
• Simone Angioni, Angelo Salatino, Francesco Osborne, Diego Reforgiato Recupero, and
Enrico Motta. Integrating Knowledge Graphs for Comparing the Scientific Output of
Academia and Industry. In ISWC 2019 Posters & Demonstrations and Industry Tracks @
The Semantic Web – ISWC 2019, 26-30 October 2019, Auckland, New Zealand, CEUR
Workshop, 2019.
Francesco Osborne, Angelo Salatino, Simone Angioni, Enrico Motta, Diego Ref. Recupero
