ISSA: AI Pipeline and Tools Help Scientists Search Scientific Archives
1. * Wimmics: AI in bridging social semantics and formal semantics on the Web
Franck MICHEL* - Université Côte d’Azur, CNRS, Inria, I3S, France
ISSA: Generic Pipeline, Knowledge Model and Visualization Tools to Help Scientists Search and Make Sense of a Scientific Archive
2. Issue: skyrocketing pace of publications
Bibliographic search difficult:
• Find and make sense of relevant articles
• Search across multiple disciplines
Central role of open scientific archives
But the provided services have limitations:
• String-based search fails to grasp semantic relationships
• Keywords often too general to be helpful
Need for smarter search services exploiting this knowledge
Open Science
3. Propose a generic, reusable, extensible solution to optimize bibliographic search in an open scientific archive.
4. How did we do that?
• Extract rich metadata from the publications in multiple languages
• Turn it into a semantic index published on the web as an RDF knowledge graph
• Link with general vocabularies as well as domain-specific vocabularies
• Provide flexible search/visualization tools able to exploit the index
5. The ISSA pipeline
6. Step 1. Retrieval of metadata records
[Pipeline diagram: open archive, ISSA pipeline, user communities (DEFINE)]
7. Step 1. Retrieval of metadata records
[Pipeline diagram, step 1 added: retrieval (OAI-PMH) of metadata records from the open archive; user communities: DEFINE]
What metadata?
• Title
• Authors (strings)
• Date
• Publication
• Languages
• Identifiers
• Abstract
• License
• URL of the PDF file
• …
OAI-PMH protocol:
• Supported by many open libraries & archives (70% [1])
• Harvested by aggregators, e.g. Google Scholar, OpenAIRE
[1] Ramírez-Montoya, María-Soledad & Ceballos, Hector (2017). Institutional Repositories. doi:10.1201/9781315155890-5.
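A minimal sketch of what step 1 involves, assuming the archive exposes Dublin Core records (the `oai_dc` metadata prefix defined by OAI-PMH). The endpoint `https://example.org/oai`, the function names and the sample response are placeholders for illustration, not the ISSA code. The parser works on a response string, so it runs without network access.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request URL (base_url is a placeholder endpoint)."""
    if resumption_token:
        # A resumption token is sent on its own, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        dc = rec.find(".//oai_dc:dc", NS)
        if dc is None:  # e.g. deleted records carry no metadata
            continue
        fields = {}
        for el in dc:
            name = el.tag.rsplit("}", 1)[-1]  # strip the "{namespace}" prefix
            fields.setdefault(name, []).append((el.text or "").strip())
        records.append(fields)
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, token.strip() or None


# Hand-written, simplified response used only to exercise the parser.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:creator>Roe, Richard</dc:creator>
          <dc:language>eng</dc:language>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
print(records[0]["title"], records[0]["creator"], token)
print(list_records_url("https://example.org/oai", token))
```

A real harvester would loop: fetch the URL, parse, and repeat with the returned token until no token is left.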
8. Step 2. Populate the knowledge graph with metadata
[Pipeline diagram, step 2 added: translation to RDF (Morph-xR2RML), loaded into a Virtuoso triple store; user communities: DEFINE, QUERY]
Metadata RDF representation with standard vocabularies: Dublin Core, BIBO, FABIO/FRBR, EPRINT, FOAF, PROV-O, Schema.org
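To illustrate the kind of output step 2 produces, here is a small sketch that serializes one metadata record as N-Triples. In the pipeline this translation is done by Morph-xR2RML; the Python below is only an illustration. The record keys (`title`, `lang`, `authors`, `pdf_url`), the article URI and the choice of properties are assumptions, not the ISSA knowledge model.

```python
# Vocabulary namespaces (Dublin Core, BIBO, Schema.org, RDF).
DCT = "http://purl.org/dc/terms/"
DCE = "http://purl.org/dc/elements/1.1/"
BIBO = "http://purl.org/ontology/bibo/"
SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(value, lang=None):
    """Serialize a string as an N-Triples literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"' + (f"@{lang}" if lang else "")


def record_to_ntriples(uri, record):
    """Map a flat metadata record (hypothetical keys) to N-Triples lines."""
    triples = [(uri, RDF_TYPE, f"<{BIBO}AcademicArticle>")]
    if "title" in record:
        triples.append((uri, DCT + "title", literal(record["title"], record.get("lang"))))
    for author in record.get("authors", []):
        # Authors are plain strings in the metadata, hence a literal value.
        triples.append((uri, DCE + "creator", literal(author)))
    if "pdf_url" in record:
        triples.append((uri, SCHEMA + "url", f"<{record['pdf_url']}>"))
    return [f"<{s}> <{p}> {o} ." for s, p, o in triples]


example = {"title": 'On "rice"', "lang": "en", "authors": ["Doe, Jane"]}
for line in record_to_ntriples("http://example.org/article/1", example):
    print(line)
```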
9. Step 3. Full text extraction
[Pipeline diagram, step 3 added: full text extraction (GROBID), producing structured text]
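GROBID returns the extracted full text as TEI XML. The sketch below shows one way such a document could be read into a title, an abstract and a list of sections. The sample TEI is hand-written and much simpler than real GROBID output, and the function names are illustrative.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI documents.
TEI = {"tei": "http://www.tei-c.org/ns/1.0"}


def text_of(element):
    """Concatenate all text under an element and normalize whitespace."""
    if element is None:
        return None
    return " ".join("".join(element.itertext()).split())


def parse_tei(tei_xml):
    """Extract title, abstract and body sections from a TEI document."""
    root = ET.fromstring(tei_xml)
    sections = []
    for div in root.iterfind(".//tei:text/tei:body/tei:div", TEI):
        sections.append({
            "heading": text_of(div.find("tei:head", TEI)),
            "paragraphs": [text_of(p) for p in div.iterfind("tei:p", TEI)],
        })
    return {
        "title": text_of(root.find(".//tei:titleStmt/tei:title", TEI)),
        "abstract": text_of(root.find(".//tei:profileDesc/tei:abstract", TEI)),
        "sections": sections,
    }


# Hand-written, simplified TEI used only to exercise the parser.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title level="a" type="main">Sample article</title></titleStmt></fileDesc>
    <profileDesc><abstract><div><p>Short abstract.</p></div></abstract></profileDesc>
  </teiHeader>
  <text><body>
    <div><head>Introduction</head><p>First <hi>paragraph</hi>.</p><p>Second one.</p></div>
  </body></text>
</TEI>"""

doc = parse_tei(SAMPLE_TEI)
print(doc["title"], "|", doc["abstract"], "|", doc["sections"][0]["heading"])
```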
10. Step 4. Indexing and NE extraction
[Pipeline diagram, step 4 added: the structured text is processed into linked descriptors and named entities; user communities: DEFINE, QUERY, ANNOTATE & VALIDATE]
• Thematic & geographic indexing (Annif)
• NE extraction & linking (Entity-fishing, Spotlight, Dictionary)
• Vocabularies & datasets: Wikidata, DBpedia, Geonames, domain thesauri
11. Thematic & geographic indexing
Find descriptors that characterize publications
Rely on the Annif open-source indexing platform
AGROVOC thesaurus
Training corpus: Agritrop subset + expert descriptors
Evaluation of different classification models
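One common way to compare classification models for this kind of indexing is to score the descriptors each model suggests against the expert descriptors. The slides do not say which metrics were used, so the precision, recall and F1@k below are an assumption, shown only to make the evaluation idea concrete. The descriptor names are placeholders.

```python
def precision_recall_f1(suggested, expected):
    """Compare a set of suggested descriptors with the expert descriptors."""
    suggested, expected = set(suggested), set(expected)
    hits = len(suggested & expected)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def mean_f1_at_k(ranked_suggestions, gold, k):
    """Average F1 over documents, keeping only the top-k suggestions of each."""
    scores = [
        precision_recall_f1(ranked[:k], expected)[2]
        for ranked, expected in zip(ranked_suggestions, gold)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Placeholder data: ranked model suggestions and expert descriptors per document.
model_output = [["rice", "irrigation", "soil"], ["maize", "yield", "pests"]]
expert = [["rice", "soil"], ["cotton"]]
print(mean_f1_at_k(model_output, expert, k=2))
```

Running the same function over the output of each candidate model gives one comparable number per model and per value of k.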
12. NE extraction and linking
Annotate parts of the text with references to concepts from controlled vocabularies:
Wikidata
Geonames (through Wikidata)
DBpedia
AGROVOC
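As a rough illustration of the dictionary-based approach, the sketch below matches vocabulary labels in a text and returns the character offsets and the linked concept. The labels and the `http://example.org/concept/...` URIs are placeholders, not real AGROVOC or Wikidata identifiers, and the matching strategy (case-insensitive, longest label first) is an assumption.

```python
import re


def annotate(text, dictionary):
    """Find dictionary labels in text; return offsets, surface form and URI."""
    if not dictionary:
        return []
    lookup = {label.lower(): uri for label, uri in dictionary.items()}
    # Longest labels first so that "rice blast" wins over "rice".
    labels = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(label) for label in labels) + r")\b",
        re.IGNORECASE,
    )
    return [
        {
            "start": m.start(),
            "end": m.end(),
            "surface": m.group(0),
            "uri": lookup[m.group(0).lower()],
        }
        for m in pattern.finditer(text)
    ]


# Placeholder vocabulary: label -> concept URI.
vocabulary = {
    "rice": "http://example.org/concept/rice",
    "rice blast": "http://example.org/concept/rice-blast",
}
for entity in annotate("Rice blast affects rice yields.", vocabulary):
    print(entity)
```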
13. Step 5. Populate the knowledge graph with descriptors and NEs
[Pipeline diagram, step 5 added: translation to RDF (Morph-xR2RML) of the linked descriptors and named entities, using the Web Annotation Vocabulary]
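A sketch of how one named entity could be expressed with the W3C Web Annotation model, here as JSON-LD built from a Python dictionary. The terms (`Annotation`, `body`, `target`, `source`, `selector`, `TextQuoteSelector`, `TextPositionSelector`) come from the Web Annotation Data Model; the exact shape used in the ISSA knowledge graph is not shown in the slides, and the URIs are placeholders.

```python
import json


def named_entity_annotation(article_uri, concept_uri, surface, start, end):
    """Describe one named entity as a Web Annotation (JSON-LD)."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        # The body is the concept the text span refers to.
        "body": concept_uri,
        # The target is the span inside the article, located two ways.
        "target": {
            "source": article_uri,
            "selector": [
                {"type": "TextQuoteSelector", "exact": surface},
                {"type": "TextPositionSelector", "start": start, "end": end},
            ],
        },
    }


annotation = named_entity_annotation(
    "http://example.org/article/1",
    "http://example.org/concept/rice-blast",
    "Rice blast",
    0,
    10,
)
print(json.dumps(annotation, indent=2))
```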
14. Step 6. Mining and Visualization
[Pipeline diagram, step 6 added: mining & visualization (association rules mining, augmented visualization); user communities: DEFINE, QUERY, ANNOTATE & VALIDATE, DEFINE & USE]
15. Mining & Visualization
16. Explore descriptor association rules
Extract and visualize association rules between articles’ descriptors with ARViz.
Suited for the discovery of (possibly unexpected) associations
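To make the notion of an association rule between descriptors concrete, here is a minimal sketch that derives single-descriptor rules (A implies B) with their support and confidence. The mining algorithm and thresholds actually used upstream of ARViz are not given in the slides; the function, thresholds and data below are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations


def pair_rules(articles, min_support=0.1, min_confidence=0.6):
    """Return rules (a, b, support, confidence) meaning 'a implies b'."""
    n = len(articles)
    item_counts = Counter()
    pair_counts = Counter()
    for descriptors in articles:
        items = set(descriptors)
        item_counts.update(items)
        # Ordered pairs: (a, b) and (b, a) are two distinct candidate rules.
        pair_counts.update(permutations(sorted(items), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                # share of articles with both a and b
        confidence = count / item_counts[a]  # share of a-articles that also have b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    # Most confident rules first, ties broken by support then by name.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Placeholder data: descriptors attached to four articles.
corpus = [
    ["rice", "irrigation"],
    ["rice", "irrigation", "yield"],
    ["rice", "yield"],
    ["maize"],
]
for a, b, support, confidence in pair_rules(corpus, 0.5, 0.9):
    print(f"{a} -> {b}  support={support:.2f} confidence={confidence:.2f}")
```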
17. Explore/navigate networks of entities
Solve complex competency questions by visually exploring networks of descriptors, authors, articles with LDViz.
21. Explore networks of articles, descriptors…
Same tools to explore:
• Network of articles with co-authors
• Network of authors with co-publications
• Networks of institutions with same research topics
• …
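The networks listed above share one underlying structure: entities connected by what they have in common. As a sketch, the code below builds a weighted co-publication network of authors from article author lists. In ISSA such data would come from the knowledge graph; the input format, function names and author names here are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def co_publication_network(articles):
    """Map each unordered author pair to its number of shared articles."""
    weights = defaultdict(int)
    for authors in articles.values():
        # Sorting makes (a, b) a canonical key for the unordered pair.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def neighbours(network, author):
    """Return the co-authors of one author with the shared article counts."""
    result = {}
    for (a, b), weight in network.items():
        if a == author:
            result[b] = weight
        elif b == author:
            result[a] = weight
    return result


# Placeholder data: article identifier -> author names.
sample_articles = {
    "art1": ["A. Author", "B. Author"],
    "art2": ["B. Author", "A. Author", "C. Author"],
    "art3": ["C. Author"],
}
network = co_publication_network(sample_articles)
print(network)
print(neighbours(network, "C. Author"))
```

Swapping authors for descriptors or institutions in the input gives the other networks with the same code.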
22. Quick summary
• Pipeline and visualization tools successfully deployed for Agritrop
• Metadata and abstracts of 100,000+ articles
• 12,000 open-access articles with full text
• Pipeline for Agritrop ready to transfer to other archives with limited work
• Only open licenses (code, documentation…)
• Based on robust open-source tools and technologies, Docker-based
• Extensible with new steps following simple guidelines
23. Perspectives
https://unsplash.com/photos/ROOrGTNurYI
CIRAD willing to deploy the ISSA pipeline and
visualization tools in production for all users of Agritrop.
25. ISSA 2 – CfP CollEx-Persée 2021-2022
Exploit & expand the results of ISSA:
◦ Extract new knowledge: relationships between NEs, author disambiguation, cross-references… Link to taxonomic registries?
◦ Broaden the service offering for researchers and documentalists: semantic search, geographical visualization, bibliometrics
◦ Unsupervised indexing + improve data quality metrics
Extend the PoC to the HAL instance of EuroMov Digital Health in Motion
26. Thank you
https://issa.cirad.fr/
https://github.com/issa-project
@ProjetISSA