Merging controlled vocabularies through semantic alignment based on linked data
Authors: Konstantinos Kyprianos, Ioannis Papadakis

IONIAN UNIVERSITY
DEPARTMENT OF ARCHIVES, LIBRARY SCIENCE AND MUSEOLOGY
Ioannou Theotoki 72, 49100, Corfu

Presentation outline

• Introduction
• Proposed approach
• Proof of concept
• Deployment of the proposed approach
• Deployment results
• Comparative evaluation
• Conclusions
• Future work

Introduction (1/2)


• Controlled vocabularies are predefined lists of words for knowledge organization and the description of libraries’ collections
• Creation of semantically similar yet syntactically and linguistically heterogeneous controlled vocabularies with overlapping parts
• Matching tools and techniques: Lexical similarity
  ◦ Compares terms according to the order of their characters
  ◦ Edit distance, prefix/suffix variations, n-grams, etc.
• Matching tools and techniques: Semantic alignment
  ◦ Based on semantic techniques to identify similar terms between two structured vocabularies
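The two lexical techniques named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the trigram size and the padding scheme are assumptions, not the matching rules used in the presented work.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_ngrams(term, n=3):
    """Lower-cased character n-grams of a term, padded with spaces."""
    padded = " " * (n - 1) + term.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap of the character n-grams of two terms, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

A matcher built on these functions would typically accept a pair when the distance falls below, or the similarity rises above, a chosen threshold; the threshold is a tuning decision that the slides do not specify.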

Introduction (2/2)


• Our approach: a methodology to bring together semantically similar yet different vocabularies through the semantic alignment of their underlying terms, employing Linked Open Data (LOD) technologies
  ◦ Semantic alignment is achieved through external linguistic datasets
  ◦ The compared datasets are not required to have any kind of structure (schema or ontology)

Proposed approach

• S is the set of terms in the Source dataset
• T is the set of terms in the Target dataset
• L is the set of terms in the Linguistic dataset
• L’ is the set of terms in L that are found to be linguistically associated with some terms of the Source dataset
• L’’ is the set of terms in L that are found to be semantically associated with some terms of L’
• T’ contains the terms in T that are linguistically associated with some terms of L’ or L’’

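The set definitions above can be read as a small pipeline. The sketch below is a minimal illustration under loud assumptions: "linguistically associated" is approximated by case-insensitive equality, and "semantically associated" by a hypothetical lookup table `related` that stands in for the links an external linguistic dataset would provide. The names `normalise`, `lexical_matches` and `align` are invented for this example.

```python
def normalise(term):
    """Lower-case a term and collapse its whitespace."""
    return " ".join(term.lower().split())


def lexical_matches(terms, candidates):
    """Members of `candidates` whose normalised form equals that of some member of `terms`."""
    wanted = {normalise(t) for t in terms}
    return {c for c in candidates if normalise(c) in wanted}


def align(S, T, L, related):
    """Return (L', L'', T') for source S, target T and linguistic dataset L.

    `related` maps a term of L to the terms of L it is semantically
    associated with (hypothetical stand-in for the dataset's own links).
    """
    L1 = lexical_matches(S, L)                       # L'
    L2 = set()
    for term in L1:
        L2 |= related.get(term, set())
    L2 &= L                                          # L'' stays inside L
    T1 = lexical_matches(L1 | L2, T)                 # T'
    return L1, L2, T1


# Toy data, not taken from Dione, NYT, DBpedia or WordNet.
S = {"Automobiles", "Marketing"}
T = {"Cars", "Advertising and Marketing", "marketing"}
L = {"automobiles", "cars", "marketing", "finance"}
related = {"automobiles": {"cars"}}

L1, L2, T1 = align(S, T, L, related)
```

In this toy run the target term "Cars" is reached only through the semantic step (Automobiles, then automobiles, then cars), which is the kind of pair a purely lexical comparison of S and T would miss.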
Proof of concept (1/2)


• University of Piraeus digital library (Dione)
  ◦ Theses and dissertations
  ◦ 3,323 bilingual subject headings
  ◦ DSpace installation
• New York Times (NYT)
  ◦ Approximately 10,000 subject headings
  ◦ Journal articles
• DBpedia
  ◦ Extracts structured information from Wikipedia
  ◦ 3.5 million entities
• WordNet
  ◦ Lexical database
  ◦ Consists of synsets (~117,659 distinct concepts containing terms interlinked through conceptual-semantic relations)

Proof of concept (2/2)
1. Let the source dataset S be D (i.e. Dione).
2. Let the target dataset T be N (i.e. NYT).
3. Let the first linguistic dataset L (dataset A) be DB (i.e. DBpedia), and
4. let the second linguistic dataset L (dataset B) be W (i.e. WordNet).
5. D1’ corresponds to S’ (the subset of S whose terms are linguistically matched against L), assuming that the linguistic dataset L is DB. Similarly, D2’ corresponds to S’, assuming that L is W.
6. DB’ and DB’’ correspond to L’ and L’’ respectively, assuming that the linguistic dataset L is DB. Similarly, W’ and W’’ correspond to L’ and L’’ respectively, assuming that L is W.
7. N1’ corresponds to T’, assuming that the linguistic dataset L is DB. Similarly, N2’ corresponds to T’, assuming that L is W.

Deployment of the proposed approach




• Google Refine
  ◦ Tool to manipulate tabular data
  ◦ Reconciliation of data with existing knowledge bases
  ◦ RDF extension
• Process
  1. Subject headings from Dione are imported to Google Refine
  2. DBpedia and WordNet endpoints are registered in Google Refine as SPARQL reconciliation services
  3. The subject headings of Dione are linguistically matched (i.e. lexical similarity) against DBpedia’s and WordNet’s reconciliation services, creating the corresponding subsets
  4. The terms in the subsets of step 3 are extended with semantically equivalent terms (i.e. semantic alignment) deriving from the rest of DBpedia and WordNet
  5. Subject headings from NYT are imported to Google Refine
  6. The subject headings of NYT are linguistically matched (i.e. lexical similarity) against the terms belonging to the subsets that are described in steps 3 and 4

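To give a feel for what a SPARQL-backed lookup in steps 3 and 4 might send to an endpoint, here is a hedged sketch that only builds query strings and performs no network access. It does not reproduce the queries that Google Refine's RDF extension generates. The exact-label match, the function names, and the use of `owl:sameAs` as the "semantically equivalent" link are all assumptions made for illustration.

```python
def escape_literal(term):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return term.replace("\\", "\\\\").replace('"', '\\"')


def label_lookup_query(term, lang="en", limit=5):
    """Step 3 (assumed form): find resources whose rdfs:label equals the term."""
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?resource WHERE {\n"
        f'  ?resource rdfs:label "{escape_literal(term)}"@{lang} .\n'
        "}\n"
        f"LIMIT {limit}"
    )


def related_labels_query(resource_uri,
                         property_uri="http://www.w3.org/2002/07/owl#sameAs",
                         lang="en"):
    """Step 4 (assumed form): labels of resources linked to a matched resource.

    The linking property is a parameter because the slides do not say which
    relations were followed; owl:sameAs is only a placeholder default.
    """
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?label WHERE {\n"
        f"  <{resource_uri}> <{property_uri}> ?other .\n"
        "  ?other rdfs:label ?label .\n"
        f'  FILTER (lang(?label) = "{lang}")\n'
        "}"
    )


query = label_lookup_query("Marketing")
```

A reconciliation service normally scores approximate matches rather than requiring an exact label, so the first query should be read as the strictest possible form of the lexical step.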
Deployment results (1/2)
Linguistically matched terms between Dione and DBpedia, and between Dione and WordNet, through lexical similarity techniques:

Dione subject headings                DBpedia      WordNet
One-word subject headings             331 (29%)    297 (65%)
Two-word subject headings             658 (59%)    128 (28%)
Subject headings with 3+ words        130 (12%)    30 (7%)
Subject headings with subdivisions    0            0
Sum (1,574)                           1,119        455

Deployment results (2/2)
[Figure: set diagram relating D, DB, W and N]
• D = 3,323 terms; D1’ = 1,119 terms; D2’ = 455 terms
• N = 10,000 terms; N1’ = 163 terms; N2’ = 117 terms
• Other values labelled in the figure: 986; 5,700; 72; 86; 77; 45

Comparative evaluation (1/4)


• The proposed methodology is compared against an algorithm (introduced in previous work*) that was applied to Dione and NYT and is based only on lexical similarity techniques
• Dione and NYT are not described by schemas. Thus, any attempt to merge their underlying terms cannot be based on traditional ontology-alignment techniques

*Papadakis, I., Kyprianos, K.: Merging Controlled Vocabularies for More Efficient Subject-based Search. International Journal of Knowledge Management. 7(3), 76-90, July-September (2011)

Comparative evaluation (2/4)
• List A. Previous work: only lexically matched pairs between Dione and NYT (207 pairs)
• List B. Proposed work: lexically AND semantically matched pairs between Dione and NYT (280 pairs)

Comparative evaluation (3/4)
[Figure: overlap of List A and List B]
• In List A only: 27
• List A ∩ List B = 180 terms
• In List B only: 100

Comparative evaluation (4/4)
Matched pairs              List A         List B          No. of pairs
D1-NYT1 … D158-NYT158      ✓ (lexical)    ✓ (lexical)     158
… D180-NYT180              ✓ (lexical)    ✓ (semantic)    22
(in both lists)                                           180
… D280-NYT280                             ✓ (semantic)    100
… D307-NYT307              ✓ (lexical)                    27
TOTAL                                                     307
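The counts in this table are consistent with the list sizes reported on the earlier evaluation slides, which a short arithmetic check makes explicit. The variable names are chosen for this example only; the numbers are the ones shown above.

```python
# Row counts from the table of matched pairs.
lexical_in_both = 158        # lexical match in List A and in List B
lexical_a_semantic_b = 22    # lexical match in List A, semantic match in List B
only_in_list_b = 100         # found only by the proposed approach (semantic)
only_in_list_a = 27          # found only by the previous, lexical-only work

overlap = lexical_in_both + lexical_a_semantic_b   # pairs present in both lists
list_a = overlap + only_in_list_a                  # size of List A
list_b = overlap + only_in_list_b                  # size of List B
union = list_a + list_b - overlap                  # distinct pairs overall
semantic_in_b = lexical_a_semantic_b + only_in_list_b

print(f"List A = {list_a}, List B = {list_b}, overlap = {overlap}, total = {union}")
print(f"Pairs in List B obtained semantically: {semantic_in_b}")
```

The union follows the inclusion-exclusion rule, so the 307 distinct pairs equal 207 + 280 - 180.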

Conclusions


• A methodology was presented that is capable of finding equivalent terms between semantically similar controlled vocabularies
• Lexical similarity discovery and semantic alignment through external LOD datasets
• Google Refine makes the deployment of the proposed methodology a straightforward process that can be applied to other cases that aim to discover equivalent terms in different yet semantically similar datasets
• The deployment of the proposed methodology is facilitated through the employment of linked data technologies

Future work


• Future work is targeted towards the reconciliation of Dione’s subject headings with linked data services such as those of the French National Library (RAMEAU), the German National Library (GND), the Biblioteca Nacional de España (BNE) and LIBRIS.

Thank you for your attention!
Questions?

17

Weitere ähnliche Inhalte

Was ist angesagt?

IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...IRJET Journal
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - IntroductionChristian Perone
 
A comparative analysis of particle swarm optimization and k means algorithm f...
A comparative analysis of particle swarm optimization and k means algorithm f...A comparative analysis of particle swarm optimization and k means algorithm f...
A comparative analysis of particle swarm optimization and k means algorithm f...ijnlc
 
Vectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for SearchVectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for SearchBhaskar Mitra
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...Sebastian Ruder
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationEugene Nho
 
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...TELKOMNIKA JOURNAL
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modellingcsandit
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Sebastian Ruder
 
AMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITYAMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITYijnlc
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for RetrievalBhaskar Mitra
 

Was ist angesagt? (20)

Tutorial on word2vec
Tutorial on word2vecTutorial on word2vec
Tutorial on word2vec
 
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
IRJET- Automatic Language Identification using Hybrid Approach and Classifica...
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
A comparative analysis of particle swarm optimization and k means algorithm f...
A comparative analysis of particle swarm optimization and k means algorithm f...A comparative analysis of particle swarm optimization and k means algorithm f...
A comparative analysis of particle swarm optimization and k means algorithm f...
 
Vectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for SearchVectorland: Brief Notes from Using Text Embeddings for Search
Vectorland: Brief Notes from Using Text Embeddings for Search
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
FaDA: Fast document aligner with word embedding - Pintu Lohar, Debasis Gangul...
 
NLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic ClassificationNLP Project: Paragraph Topic Classification
NLP Project: Paragraph Topic Classification
 
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...
WCLOUDVIZ: Word Cloud Visualization of Indonesian News Articles Classificatio...
 
A Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic ModellingA Text Mining Research Based on LDA Topic Modelling
A Text Mining Research Based on LDA Topic Modelling
 
Topics Modeling
Topics ModelingTopics Modeling
Topics Modeling
 
Does sizematter
Does sizematterDoes sizematter
Does sizematter
 
Ir models
Ir modelsIr models
Ir models
 
Canini09a
Canini09aCanini09a
Canini09a
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...Transformation Functions for Text Classification: A case study with StackOver...
Transformation Functions for Text Classification: A case study with StackOver...
 
AMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITYAMBIGUITY-AWARE DOCUMENT SIMILARITY
AMBIGUITY-AWARE DOCUMENT SIMILARITY
 
Deep Neural Methods for Retrieval
Deep Neural Methods for RetrievalDeep Neural Methods for Retrieval
Deep Neural Methods for Retrieval
 

Ähnlich wie Merging controlled vocabularies through semantic alignment based on linked data

Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksLeonardo Di Donato
 
The Essay Scoring Tool (TEST) for Hindi
The Essay Scoring Tool (TEST) for HindiThe Essay Scoring Tool (TEST) for Hindi
The Essay Scoring Tool (TEST) for Hindisinghg77
 
An Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense DisambiguationAn Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense DisambiguationSurabhi Verma
 
Automatize Document Topic And Subtopic Detection With Support Of A Corpus
Automatize Document Topic And Subtopic Detection With Support Of A CorpusAutomatize Document Topic And Subtopic Detection With Support Of A Corpus
Automatize Document Topic And Subtopic Detection With Support Of A CorpusRichard Hogue
 
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...Kim Daniels
 
Improving Text Categorization with Semantic Knowledge in Wikipedia
Improving Text Categorization with Semantic Knowledge in WikipediaImproving Text Categorization with Semantic Knowledge in Wikipedia
Improving Text Categorization with Semantic Knowledge in Wikipediachjshan
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Saurabh Kaushik
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextFulvio Rotella
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextUniversity of Bari (Italy)
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
[Paper Reading] Supervised Learning of Universal Sentence Representations fro...
[Paper Reading] Supervised Learning of Universal Sentence Representations fro...[Paper Reading] Supervised Learning of Universal Sentence Representations fro...
[Paper Reading] Supervised Learning of Universal Sentence Representations fro...Hiroki Shimanaka
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003Ajay Ohri
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)IJERD Editor
 

Ähnlich wie Merging controlled vocabularies through semantic alignment based on linked data (20)

1 l5eng
1 l5eng1 l5eng
1 l5eng
 
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasksTopic Modeling for Information Retrieval and Word Sense Disambiguation tasks
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks
 
The Essay Scoring Tool (TEST) for Hindi
The Essay Scoring Tool (TEST) for HindiThe Essay Scoring Tool (TEST) for Hindi
The Essay Scoring Tool (TEST) for Hindi
 
An Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense DisambiguationAn Improved Approach to Word Sense Disambiguation
An Improved Approach to Word Sense Disambiguation
 
Automatize Document Topic And Subtopic Detection With Support Of A Corpus
Automatize Document Topic And Subtopic Detection With Support Of A CorpusAutomatize Document Topic And Subtopic Detection With Support Of A Corpus
Automatize Document Topic And Subtopic Detection With Support Of A Corpus
 
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
A Comparative Analysis Of The Entropy And Transition Point Approach In Repres...
 
Improving Text Categorization with Semantic Knowledge in Wikipedia
Improving Text Categorization with Semantic Knowledge in WikipediaImproving Text Categorization with Semantic Knowledge in Wikipedia
Improving Text Categorization with Semantic Knowledge in Wikipedia
 
AINL 2016: Eyecioglu
AINL 2016: EyeciogluAINL 2016: Eyecioglu
AINL 2016: Eyecioglu
 
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1Engineering Intelligent NLP Applications Using Deep Learning – Part 1
Engineering Intelligent NLP Applications Using Deep Learning – Part 1
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
 
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from TextCooperating Techniques for Extracting Conceptual Taxonomies from Text
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
 
W17 5406
W17 5406W17 5406
W17 5406
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
Distributional semantics
Distributional semanticsDistributional semantics
Distributional semantics
 
Networks and Natural Language Processing
Networks and Natural Language ProcessingNetworks and Natural Language Processing
Networks and Natural Language Processing
 
[Paper Reading] Supervised Learning of Universal Sentence Representations fro...
[Paper Reading] Supervised Learning of Universal Sentence Representations fro...[Paper Reading] Supervised Learning of Universal Sentence Representations fro...
[Paper Reading] Supervised Learning of Universal Sentence Representations fro...
 
Blei ngjordan2003
Blei ngjordan2003Blei ngjordan2003
Blei ngjordan2003
 
E-text in EFL - Four flavours
E-text in EFL - Four flavoursE-text in EFL - Four flavours
E-text in EFL - Four flavours
 
International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)International Journal of Engineering Research and Development (IJERD)
International Journal of Engineering Research and Development (IJERD)
 

Kürzlich hochgeladen

Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024Janet Corral
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 

Kürzlich hochgeladen (20)

Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
General AI for Medical Educators April 2024
General AI for Medical Educators April 2024General AI for Medical Educators April 2024
General AI for Medical Educators April 2024
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 

Merging controlled vocabularies through semantic alignment based on linked data

  • 1. Merging controlled vocabularies through semantic alignment based on linked data Authors: Konstantinos Kyprianos, Ioannis Papadakis IONIAN UNIVERSITY DEPARTMENT OF ARCHIVES, LIBRARY SCIENCE AND MUSEOLOGY Ioannou Theotoki 72, 49100, Corfu 1
  • 2. Presentation outline         Introduction Proposed approach Proof of concept Deployment of the proposed approach Deployment results Comparative evaluation Conclusions Future work 2
  • 3. Introduction (1/2)  Controlled vocabularies are predefined lists of words for knowledge organization and the description of libraries’ collections  Creation of semantically similar yet syntactically and linguistically heterogeneous controlled vocabularies with overlapping parts  Matching tools and techniques: Lexical similarity  Matching tools and techniques: Semantic alignment ◦ Compares terms according to the order of their characters ◦ Edit – distance, prefix / suffix variations, n-grams etc. ◦ Based on semantic techniques to identify similar terms between two structured vocabularies 3
  • 4. Introduction (2/2)  Our approach: Methodology to bring together semantically similar yet different vocabularies through the semantic alignment of the underlying terms with the employment of LOD technologies ◦ Semantic alignment is achieved through external linguistic datasets ◦ There is no requirement of any kind of structure (schema or ontology) to the compared datasets 4
  • 5. Proposed approach
      • S is the set of terms in the Source dataset
      • T is the set of terms in the Target dataset
      • L is the set of terms in the Linguistic dataset
      • L’ is the set of terms that are found to be linguistically associated with some terms of the Source dataset
      • L’’ is the set of terms in L that are found to be semantically associated with some terms of L’
      • T’ contains the terms in T that are linguistically associated with some terms of L’ and L’’
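The set definitions above can be read as a small pipeline. The following Python sketch is one possible reading under loud assumptions: "linguistically associated" is reduced to an exact match after lower-casing, "semantically associated" is supplied as a hypothetical `related` dictionary standing in for whatever links the linguistic dataset offers, and all sample terms are invented for illustration.

```python
def normalize(term):
    """Crude stand-in for lexical matching: trim and lower-case."""
    return term.strip().lower()


def align(source, target, linguistic, related):
    """Return (L', L'', T') for the given S, T and L.

    related maps a term of L to terms of L that are semantically
    associated with it (hypothetical structure, not from the slides).
    """
    s_norm = {normalize(t) for t in source}
    l_by_norm = {normalize(t): t for t in linguistic}

    # L': terms of L linguistically associated with some term of S
    l1 = {l_by_norm[k] for k in s_norm if k in l_by_norm}

    # L'': terms of L semantically associated with some term of L'
    l2 = {
        l_by_norm[normalize(r)]
        for term in l1
        for r in related.get(term, ())
        if normalize(r) in l_by_norm
    } - l1

    # T': terms of T linguistically associated with terms of L' and L''
    bridge = {normalize(t) for t in l1 | l2}
    t1 = {t for t in target if normalize(t) in bridge}
    return l1, l2, t1


if __name__ == "__main__":
    # Invented toy data, not taken from Dione, NYT, DBpedia or WordNet.
    S = ["Economics", "Computer networks"]
    T = ["Economy", "Sports"]
    L = ["economics", "economy", "computer networks", "networking"]
    related = {"economics": ["economy"]}
    print(align(S, T, L, related))
```

In this toy run the target term "Economy" is reached only through L’’, which is the kind of pair a purely lexical comparison of S and T would miss.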
  • 6. Proof of concept (1/2)
      • University of Piraeus digital library (Dione)
          ◦ Theses and dissertations
          ◦ 3,323 bilingual subject headings
          ◦ DSpace installation
      • New York Times (NYT)
          ◦ Approximately 10,000 subject headings
          ◦ Journal articles
      • DBpedia
          ◦ Extracts structured information from Wikipedia
          ◦ 3.5 million entities
      • WordNet
          ◦ Lexical database
          ◦ Consists of synsets (~117,659 distinct concepts containing terms interlinked through conceptual-semantic relations)
  • 7. Proof of concept (2/2)
      1. Let the source dataset S be D (i.e. Dione)
      2. Let the target dataset T be N (i.e. NYT)
      3. Let linguistic dataset A (L) be DB (i.e. DBpedia), and
      4. let linguistic dataset B (L) be W (i.e. WordNet)
      5. D1’ corresponds to S’, assuming that the linguistic dataset L is DB. In a similar manner, D2’ corresponds to S’, assuming that the linguistic dataset L is W.
      6. DB’ and DB’’ correspond to L’ and L’’ respectively, assuming that the linguistic dataset L is DB. In a similar manner, W’ and W’’ correspond to L’ and L’’ respectively, assuming that the linguistic dataset L is W.
      7. N1’ corresponds to T’, assuming that the linguistic dataset L is DB. In a similar manner, N2’ corresponds to T’, assuming that the linguistic dataset L is W.
  • 8. Deployment of the proposed approach
      • Google Refine
          ◦ Tool to manipulate tabular data
          ◦ Reconciliation of data with existing knowledge bases
          ◦ RDF extension
      • Process
          1. Subject headings from Dione are imported to Google Refine
          2. DBpedia and WordNet endpoints are registered in Google Refine as SPARQL reconciliation services
          3. The subject headings of Dione are linguistically matched (i.e. lexical similarity) against DBpedia’s and WordNet’s reconciliation services, creating the corresponding subsets
          4. The terms in the subsets of step 3 are extended with semantically equivalent terms (i.e. semantic alignment) deriving from the rest of DBpedia and WordNet
          5. Subject headings from NYT are imported to Google Refine
          6. The subject headings of NYT are linguistically matched (i.e. lexical similarity) against the terms belonging to the subsets that are described in steps 3 and 4
  • 9. Deployment results (1/2)
      • Linguistically matched terms, through lexical similarity techniques, between
          ◦ Dione and DBpedia
          ◦ Dione and WordNet

      Dione                                  DBpedia        WordNet
      One-word subject headings              331 (29%)      297 (65%)
      Two-word subject headings              658 (59%)      128 (28%)
      Subject headings with 3+ words         130 (12%)      30 (7%)
      Subject headings with subdivisions     0              0
      Sum (1,574)                            1,119          455
  • 10. Deployment results (2/2)
      [Diagram of the matched subsets: D = 3,323 terms; D1’ = 1,119; D2’ = 455; N = 10,000 terms; N1’ = 163; N2’ = 117. Further values shown for the DB’, DB’’, W’ and W’’ regions: 986; 5,700; 72; 86; 77; 45.]
  • 11. Comparative evaluation (1/4)
      • The proposed methodology is compared against an algorithm (introduced in a previous work*) that was applied to Dione and NYT and is based only on lexical similarity techniques
      • Dione and NYT are not described by schemas. Thus, any attempt to merge their underlying terms cannot be based on traditional ontology-alignment techniques
      *Papadakis, I., Kyprianos, K.: Merging Controlled Vocabularies for More Efficient Subject-based Search. International Journal of Knowledge Management. 7(3), 76-90, July-September (2011)
  • 12. Comparative evaluation (2/4)
      • List A (207 pairs). Previous work: only lexically matched pairs between Dione and NYT
      • List B (280 pairs). Proposed work: lexically AND semantically matched pairs between Dione and NYT
  • 13. Comparative evaluation (3/4)
      • Pairs only in List A: 27
      • Pairs in both lists: List A ∩ List B = 180 terms
      • Pairs only in List B: 100
  • 14. Comparative evaluation (4/4)

      Matched pairs                   List A         List B         No. of pairs
      D1-NYT1 … D158-NYT158           lexical        lexical        158
      … D180-NYT180                   lexical        semantic       22
      (subtotal, in both lists)                                     180
      … D280-NYT280                   not matched    semantic       100
      … D307-NYT307                   lexical        not matched    27
      TOTAL                                                         307
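The pair counts on the comparative evaluation slides are mutually consistent, which a short Python check makes explicit. The variable names are chosen here for illustration; the four group sizes are the ones reported on the slide.

```python
# Group sizes as reported on the comparative evaluation slide.
both_lexical = 158          # lexical match in List A and in List B
lexical_a_semantic_b = 22   # lexical match in List A, semantic match in List B
semantic_b_only = 100       # found only by the proposed work (List B)
lexical_a_only = 27         # found only by the previous work (List A)

list_a = both_lexical + lexical_a_semantic_b + lexical_a_only
list_b = both_lexical + lexical_a_semantic_b + semantic_b_only
overlap = both_lexical + lexical_a_semantic_b
union = list_a + list_b - overlap  # inclusion-exclusion

print(list_a, list_b, overlap, union)
```

The derived sizes agree with the slides: 207 pairs in List A, 280 in List B, 180 shared, and 307 distinct pairs overall.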
  • 15. Conclusions
      • A methodology was presented that is capable of finding equivalent terms between semantically similar controlled vocabularies
      • It combines the discovery of lexical similarities with semantic alignment through external LOD datasets
      • Google Refine makes the deployment of the proposed methodology a straightforward process that can be applied to other cases aiming to discover equivalent terms in different yet semantically similar datasets
      • The deployment of the proposed methodology is facilitated by linked data technologies
  • 16. Future work
      • Future work will target the reconciliation of Dione’s subject headings with linked data services such as the French National Library (RAMEAU), the German National Library (GND), the Biblioteca Nacional de España (BNE) and LIBRIS.
  • 17. Thank you for your attention! Questions?