Instance-based Ontology Matching by Instance Enrichment


Instance-Based Ontology Matching
By Instance Enrichment

Balthasar A.C. Schopman
–
supervisors:
Antoine Isaac
Shenghui Wang
Stefan Schlobach

Vrije Universiteit Amsterdam

June 29, 2009


Outline

1 Ontology matching

2 Instance-based OM

3 IBOMbIE

4 Experiments

5 Comparison other OM

6 Conclusions


Research questions

General research questions:
How do different algorithm design options of
IBOMbIE influence the final result?
How does the performance of IBOMbIE relate to other OM
algorithms?


Questions from the audience

Crucial questions: please interrupt me.
Other questions: after presentation please.


Introduction

Ontology

Definition of an ontology (by Euzenat and Shvaiko):
An ontology typically (1) defines a vocabulary relevant in
a certain domain of interest, (2) specifies the meaning of
terms and (3) specifies relations between terms.

Ontologies:
controlled vocabulary
thesaurus
database schema
canonical semantic web ontology: a set of typed, interrelated
concepts defined in a formal language


Introduction

Ontology Matching (OM)

Ontologies ...
facilitate interoperability between parties
do not solve heterogeneity problem, but raise it to a higher
level: the OM level

Elementary OM techniques:
terminological
structure-based
semantic-based
instance-based


Introduction

Instance-based OM (IBOM)

Variants IBOM:
1 use dually annotated instances (DAI)
2 create DAI
3 use extension of concepts (DAI not required)

General pros and cons:
Con: does not deduce specific relations
Con: suitable instances rarely available
Pro: focus on active part of ontology
Pro: able to deal with ambiguous linguistic phenomena:
synonym, homonym


Intro

Definitions of ‘instance of’-relation

Example definitions:
Canonical semantic web definition
Library definition

[Figure, canonical semantic web definition, as an RDF graph:
someone:Peter foaf:name "Peter"
someone:Peter foaf:knows someone:Nate
someone:Peter rdf:type foaf:Person]


[Figure, library definition: an ontology/vocabulary lists concepts c1, c2, c3, ...; object o1 is annotated with c1, and object o2 with c1, c2 and c3.]


Intro

Application

Two library scenarios: KB and TEL
match controlled vocabularies
data-sets: book catalogs
multi-lingual


IBOM

IBOM: measuring similarity

[Figure: the extensions (instance sets) of concepts c1 and c2]


IBOM

Jaccard coefficient

Jaccard coefficient:
$$J(c_1, c_2) = \frac{|i_1 \cap i_2|}{|i_1 \cup i_2|}$$
where i_1 and i_2 are the sets of instances of c_1 and c_2

quantifies the overlap of the extension of concepts
→ relatedness between concepts

Con: no multi-sets
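
A minimal sketch of the Jaccard coefficient on concept extensions, to make the formula concrete. The function name and the book identifiers are made up for illustration; the extensions are modelled as plain Python sets, so duplicate instances collapse (the "no multi-sets" limitation).

```python
def jaccard(ext1: set, ext2: set) -> float:
    """Jaccard coefficient of two concept extensions (sets of instance ids)."""
    union = ext1 | ext2
    if not union:
        # Two empty extensions: define the overlap as 0 to avoid dividing by zero.
        return 0.0
    return len(ext1 & ext2) / len(union)


# Hypothetical extensions of concept c1 (ontology 1) and concept c2 (ontology 2).
i1 = {"book1", "book2", "book3"}
i2 = {"book2", "book3", "book4"}
print(jaccard(i1, i2))  # 2 shared instances out of 4 in total -> 0.5
```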


IBOM

Creating dually annotated instances (DAI)

Jaccard needs DAI
If DAI unavailable:
exact instance matching → merge annotations
approximate instance matching → enrich instances
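
A small sketch of the exact-matching case, assuming instances in both data-sets share an identifier (the `isbn:` keys below are a made-up example, not taken from the slides). Instances with the same key are merged into one dually annotated instance; all others are dropped.

```python
def merge_exact(d1: dict, d2: dict) -> dict:
    """Merge annotations of instances that share the same identifier.

    d1, d2: {instance id: set of concept annotations}.
    Returns {instance id: (annotations from d1, annotations from d2)}.
    """
    return {key: (set(d1[key]), set(d2[key])) for key in d1.keys() & d2.keys()}


# Hypothetical catalogs annotated with two different vocabularies.
d1 = {"isbn:1": {"a", "b"}, "isbn:2": {"c"}}
d2 = {"isbn:1": {"A"}, "isbn:3": {"B"}}
print(merge_exact(d1, d2))  # only isbn:1 occurs in both data-sets
```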


Instance matching

Approximate instance matching

Instance similarity measures:
Lucene
vector space model (VSM)


Enriching instances

Basic instance enrichment (IE)

[Figure, two steps: instance i1 in data-set D1 (annotations a, b) is matched to instance i2 in data-set D2 (annotations A, B); after enrichment, i1 carries a, b, A, B.]


Enriching instances

IE parameter: topN


[Figure, built up over four steps: instance i1 (annotations a, b) has three ranked matches in the other data-set: 1st i2 (A, B), 2nd i3 (D), 3rd i4 (A, C). i1 successively receives A, B from i2, then D from i3, then A, C from i4.]
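
The topN example above can be sketched as follows. This is an illustration under assumptions, not the implementation behind the slides: the function name is invented, and the similarity scores 0.8, 0.4 and 0.2 are only used to rank the three candidates.

```python
def enrich_top_n(annotations, matches, n):
    """Return an instance's own annotations plus those of its n best matches.

    matches: list of (similarity, annotations) pairs for candidate
    instances in the other data-set.
    """
    best = sorted(matches, key=lambda m: m[0], reverse=True)[:n]
    enriched = set(annotations)
    for _, other in best:
        enriched |= set(other)
    return enriched


# i1 carries {a, b}; candidates i2, i3, i4 as in the figure.
matches = [(0.8, {"A", "B"}), (0.4, {"D"}), (0.2, {"A", "C"})]
for n in (1, 2, 3):
    print(n, sorted(enrich_top_n({"a", "b"}, matches, n)))
```

Because the annotations are kept in a set, concept A is stored once even though both i2 and i4 carry it.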


Enriching instances

IE parameter: similarity threshold (ST)


[Figure, built up over four steps: instance i1 (annotations a, b) is compared with i2 (A, B), i3 (D) and i4 (A, C): sim(i1,i2) = 0.8, sim(i1,i3) = 0.4, sim(i1,i4) = 0.2. i1 successively receives A, B from i2, then D from i3, then A, C from i4.]
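
A sketch of threshold-based enrichment using the similarity values from the example above. It assumes that ST acts as a minimum similarity and that the comparison is inclusive; neither is stated on the slides, and the function name is invented.

```python
def enrich_by_threshold(annotations, matches, st):
    """Add the annotations of every candidate whose similarity is at least st.

    Assumption: the threshold is a lower bound and is inclusive (>=).
    """
    enriched = set(annotations)
    for sim, other in matches:
        if sim >= st:
            enriched |= set(other)
    return enriched


matches = [(0.8, {"A", "B"}), (0.4, {"D"}), (0.2, {"A", "C"})]
for st in (0.5, 0.3, 0.1):
    print(st, sorted(enrich_by_threshold({"a", "b"}, matches, st)))
```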


Experimental questions

Experimental questions

Instance similarity measure
topN parameter
ST parameter
combining topN + ST parameters
performance as compared to other OM algorithms


Evaluation

Alignment evaluation

Methods:
Gold standard := good alignment
Reindexing

Measures:
Precision
Recall
f-measure


Results of experiments

Results: instance similarity measure - quality

[Figure: precision (P), recall (R) and f-measure (F) of VSM and Lucene against mapping rank (log scale); y-axis: performance (0 to 1); panels (a) Gold standard, (b) Reindex]

Virtually equal



Results: instance similarity measure - quality

[Figure: panels (c) Overlap and (d) Manual Evaluation; precision of VSM and Lucene; y-axes: overlap and performance (0 to 1)]

Edge to VSM



Results: instance similarity measure - run-time

(e) Run-time statistics: time to enrich 100K instances (hrs:min)

indexed instances   Lucene   VSM
524K                1:04     0:17
1,457K              7:20     0:22
2,506K              26:15    0:32

(f) [Figure: increase in run-time for VSM and Lucene; x-axis: indexed documents * 100K (4 to 26); y-axis: increase run-time (0 to 1600)]

Optimizations VSM:
pre-calculate weights of indexed documents
purge insignificant weights (35% + 50%)
word-centered indexing approach






Results: topN parameter (TEL)

As N increases, the quality of the mappings decreases

[Figure: f-measure (log-scale x-axis, up to 1e+06) for top1 (baseline) through top6; panels (i) Gold standard, (j) Reindex]



Results: similarity threshold parameter (KB)

Best performance with ST: ST=µ
Best performance: baseline (topN=1, ST=∞)

[Figure: f-measure (log-scale x-axis, up to 1e+06) for the baseline and for thresholds T = mean - 1.5s, mean - s, mean - 0.5s, mean, mean + 0.5s, mean + s, mean + 1.5s; panels (k) Gold standard, (l) Reindex]



Results: combining parameters
Using both parameters performs well in TEL, but not in KB,
possibly because a more selective IBOMbIE pays off in TEL, where vocabularies
and instance annotations differ more than in the KB scenario.

[Figure: f-measure (log-scale x-axis, 100 to 1e+06) for the baseline and for combinations of topN = 1, 2, 3 with ST = mu - 0.5s, mu, mu + 0.5s; panels (m) KB, (n) TEL; evaluation method: reindexing]


OAEI

Ontology alignment evaluation initiative (OAEI)

terminological structure-based semantic-based instance-based
DSSim #
Lily #
TaxoMap #
IBOMbIE # # #

DSSim, Lily and TaxoMap:
consider KB ontologies “huge”
feature functionality to deal with large ontologies


OAEI

Performance comparison: quality

[Figure: precision (P) and recall (R) against mapping rank (0 to 10000) for IBOMbIE (topN=1), DSSim, Lily and TaxoMap; y-axis: performance (0 to 0.8)]


OAEI

Performance comparison: resources + coverage

matcher run-time amount mappings
DSSim 12:00 2930
Lily ? 2797
TaxoMap 2:40 1851
IBOMbIE 1:54 7000+

(Number of lexically equal concepts in the KB vocabularies = 2,895)


Conclusions + discussion

IBOMbIE algorithm is quite promising:
Relatively low run-time
Able to deal with large ontologies
Amount + quality of mappings
Pros of IBOM
Able to align ontologies using disjoint data-sets

Basic instance enrichment appears to be the best-performing method.
Possible cause: the Jaccard coefficient does not support multi-sets.


Fin

Thank you... any questions ?


Vocabularies

scenario   vocabulary   size
KB         GTT          35K
KB         Brinkman     5K
TEL        LCSH         340K
TEL        Rameau       155K
TEL        SWD          805K



scenario   D1 annotated with   D2 annotated with   µ       σ
KB         O1                  O2                  0.297   0.106
KB         O2                  O1                  0.279   0.101
TEL        O1                  O2                  0.260   0.097
TEL        O2                  O1                  0.232   0.084

standard ST: µ
step-size: ½ σ
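
A short sketch of how the threshold values could be derived from µ and σ. It assumes the range µ - 1.5σ to µ + 1.5σ seen in the legend of the similarity threshold results; the function name and the rounding to three decimals are choices made for this illustration.

```python
def threshold_grid(mu: float, sigma: float, steps: int = 3) -> list:
    """Thresholds mu + k * (sigma / 2) for k = -steps .. +steps."""
    return [round(mu + k * 0.5 * sigma, 3) for k in range(-steps, steps + 1)]


# KB scenario, D1 annotated with O1 and D2 with O2: mu = 0.297, sigma = 0.106.
print(threshold_grid(0.297, 0.106))
```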


VSM
Weights are components of vectors:
term frequency - inverse document frequency: TF-IDF
e.g. audiovisual features

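$$\mathrm{tfidf}_{w,d} = \mathrm{tf}_{w,d} \cdot \mathrm{idf}_w$$
$$\mathrm{tf}_{w,d} = \frac{\sqrt{n_{w,d}}}{|d|}$$
$$\mathrm{idf}_w = \log \frac{|D|}{|\{d \in D : w \in d\}|}$$
VSM cosine similarity
$$\text{cosine sim}(d_1, d_2) = \frac{d_1 \cdot d_2}{|d_1||d_2|} = \frac{\sum_{i=1}^{n} w_{i,d_1} w_{i,d_2}}{\sqrt{\sum_i w_{i,d_1}^2}\,\sqrt{\sum_i w_{i,d_2}^2}}$$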
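
A compact sketch of TF-IDF weighting and cosine similarity. It assumes the term frequency is the square root of the word count divided by the document length, which is one reading of the formula above; the function names and toy documents are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for d in docs for w in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append({
            w: (math.sqrt(n) / len(d)) * math.log(n_docs / df[w])
            for w, n in counts.items()
        })
    return vectors


def cosine_sim(v1, v2):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


docs = [
    "history of amsterdam".split(),
    "amsterdam city history".split(),
    "introduction to organic chemistry".split(),
]
vecs = tfidf_vectors(docs)
# The two Amsterdam titles share weighted words; the chemistry title shares none.
print(cosine_sim(vecs[0], vecs[1]) > cosine_sim(vecs[0], vecs[2]))  # True
```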


Evaluation method: gold standard

Gold standard := good alignment

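$$P = \text{precision} = \frac{|\{\text{reference}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|}$$
$$R = \text{recall} = \frac{|\{\text{reference}\} \cap \{\text{retrieved}\}|}{|\{\text{reference}\}|}$$
$$F = \text{f-measure} = 2 \cdot \frac{P \cdot R}{P + R}$$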
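
The three measures can be sketched on sets of mappings as follows. The mappings below are invented pairs of concept names, used only to show the arithmetic.

```python
def evaluate(reference: set, retrieved: set):
    """Precision, recall and f-measure of a retrieved alignment."""
    correct = len(reference & retrieved)
    p = correct / len(retrieved) if retrieved else 0.0
    r = correct / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical gold standard and matcher output.
reference = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
retrieved = {("a", "x"), ("b", "y"), ("c", "q")}
p, r, f = evaluate(reference, retrieved)
print(p, r, f)  # P = 2/3, R = 2/4, F = 4/7
```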


Evaluation method: reindexing
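[Figure: ontology o_1 has concepts a, b, c; ontology o_2 has concepts x, y, z. A dually annotated instance i_dual carries {a, b} from o_1 and {x, z} from o_2. Reindexing translates each annotation set through the alignment, here yielding {x} and {a, b}.]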

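$$P = \frac{\sum_{\text{dually annotated instances}} \frac{|\{\text{reference}\} \cap \{\text{retrieved}\}|}{|\{\text{retrieved}\}|}}{|\{\text{reindexed instances}\}|}$$
$$R = \frac{\sum_{\text{dually annotated instances}} \frac{|\{\text{reference}\} \cap \{\text{retrieved}\}|}{|\{\text{reference}\}|}}{|\{\text{dually annotated instances}\}|}$$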


IbOM by IM algorithm overview

Whole algorithm
Start: two data-sets Dx and Dy
1 Enrich instances of Dx with annotations of instances of Dy
  For every instance a:
    1 Find N best matching instances {b} in Dy
    2 Add annotations of {b} to a
2 Enrich vice versa
3 Merge data-sets into one dually annotated data-set
4 Apply Jaccard measure
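
The steps above can be sketched end to end as follows. This is a toy illustration under assumptions: word overlap between titles stands in for the Lucene and VSM similarity measures, instances are (text, annotations) pairs, and all names and data are invented.

```python
from collections import defaultdict


def token_overlap(text1, text2):
    """Toy instance similarity (stand-in for Lucene or VSM)."""
    t1, t2 = set(text1.lower().split()), set(text2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0


def enrich(source, target, n=1):
    """Steps 1-2: pair each source instance's own annotations with the
    annotations borrowed from its n best matching target instances."""
    enriched = []
    for text, own in source:
        ranked = sorted(target, key=lambda inst: token_overlap(text, inst[0]),
                        reverse=True)
        borrowed = set()
        for _, other in ranked[:n]:
            borrowed |= other
        enriched.append((set(own), borrowed))
    return enriched


def match(dx, dy, n=1):
    # Step 3: merge both directions into (O_x annotations, O_y annotations) pairs.
    dual = enrich(dx, dy, n) + [(b, a) for a, b in enrich(dy, dx, n)]
    # Step 4: Jaccard coefficient over the concept extensions.
    ext_x, ext_y = defaultdict(set), defaultdict(set)
    for idx, (ann_x, ann_y) in enumerate(dual):
        for c in ann_x:
            ext_x[c].add(idx)
        for c in ann_y:
            ext_y[c].add(idx)
    scores = {}
    for cx, ix in ext_x.items():
        for cy, iy in ext_y.items():
            shared = len(ix & iy)
            if shared:
                scores[(cx, cy)] = shared / len(ix | iy)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


dx = [("History of Amsterdam", {"history"}),
      ("Organic chemistry basics", {"chemistry"})]
dy = [("Amsterdam history guide", {"Geschiedenis"}),
      ("Basics of organic chemistry", {"Scheikunde"})]
for mapping, score in match(dx, dy):
    print(mapping, score)
```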
