3. 2nd SEALS Yardsticks for Ontology Management
• Conformance and interoperability results
• Scalability results
• Conclusions
4. Conformance evaluation
• Ontology language conformance
– The ability to adhere to existing ontology language specifications
• Goal: to evaluate the conformance of semantic technologies with regard to ontology representation languages
[Diagram: Tool X performs Step 1 (Import + Export), taking O1 to O1' and then O1''; O1 = O1'' + α - α']
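For concreteness, the import + export step can be sketched with the OWL API 3 (itself one of the evaluated tools): load O1, export it, re-parse the export as O1'', and compute the added and lost information as axiom-set differences. This is a minimal illustration under assumed file names, with the OWL API standing in for Tool X; it is not the SEALS test harness itself.

import java.io.File;
import java.util.HashSet;
import java.util.Set;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class ConformanceStep {
    public static void main(String[] args) throws Exception {
        // O1: the original test ontology (file name assumed for illustration)
        OWLOntologyManager m1 = OWLManager.createOWLOntologyManager();
        OWLOntology o1 = m1.loadOntologyFromOntologyDocument(new File("O1.owl"));

        // Import + Export: here the OWL API itself plays the role of Tool X
        File exported = new File("O1-exported.owl");
        m1.saveOntology(o1, IRI.create(exported.toURI()));

        // O1'': the re-parsed result of the export
        OWLOntology o1pp = OWLManager.createOWLOntologyManager()
                .loadOntologyFromOntologyDocument(exported);

        // α = information added, α' = information lost (axiom-level diff)
        Set<OWLAxiom> added = new HashSet<OWLAxiom>(o1pp.getAxioms());
        added.removeAll(o1.getAxioms());
        Set<OWLAxiom> lost = new HashSet<OWLAxiom>(o1.getAxioms());
        lost.removeAll(o1pp.getAxioms());
        System.out.println("added: " + added.size() + ", lost: " + lost.size());
    }
}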
5. Metrics
• Execution informs about the correct execution:
– OK. No execution problem
– FAIL. Some execution problem
– Platform Error (P.E.). Platform exception
• Information added or lost, in terms of triples, axioms, etc.: Oi = Oi' + α - α'
• Conformance informs whether the ontology has been processed correctly, with no addition or loss of information (Oi = Oi'?):
– SAME if Execution is OK and Information added and Information lost are void
– DIFFERENT if Execution is OK but Information added or Information lost are not void
– NO if Execution is FAIL or P.E.
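A minimal sketch of how these metrics combine into the conformance verdict; the enum and method names are illustrative, not SEALS APIs.

import java.util.Set;
import org.semanticweb.owlapi.model.OWLAxiom;

final class ConformanceVerdict {
    enum Execution { OK, FAIL, PLATFORM_ERROR }
    enum Conformance { SAME, DIFFERENT, NO }

    // Maps one test's execution result and axiom diffs to the verdict above.
    static Conformance verdict(Execution exec, Set<OWLAxiom> added, Set<OWLAxiom> lost) {
        if (exec != Execution.OK) {
            return Conformance.NO;              // FAIL or Platform Error
        }
        if (added.isEmpty() && lost.isEmpty()) {
            return Conformance.SAME;            // Oi = Oi'
        }
        return Conformance.DIFFERENT;           // α or α' is not void
    }
}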
6. Interoperability evaluation
• Ontology language interoperability
– The ability to interchange ontologies and use them
• Goal: to evaluate the interoperability of semantic technologies in terms of their ability to interchange ontologies and use them
[Diagram of an interchange: Step 1 (Import + Export) in Tool X takes O1 to O1' and O1'', with O1 = O1'' + α - α'; Step 2 (Import + Export) in Tool Y takes O1'' to O1''' and O1'''', with O1'' = O1'''' + β - β'. Overall: O1 = O1'''' + α - α' + β - β']
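An interchange is just two such steps composed, so the diffs of both steps accumulate in the final comparison of O1 against O1''''. A sketch, assuming a hypothetical ToolWrapper around each tool's import + export cycle (not a SEALS interface):

import java.io.File;

// Hypothetical wrapper around one tool's import + export cycle.
interface ToolWrapper {
    // Imports the ontology at 'input' and exports it again to 'output'.
    void importExport(File input, File output) throws Exception;
}

final class InterchangeStep {
    // Runs O1 through Tool X, then the result through Tool Y, returning O1''''.
    static File interchange(File o1, ToolWrapper toolX, ToolWrapper toolY) throws Exception {
        File o1pp = new File("O1-after-X.owl");    // O1'': Tool X's export
        toolX.importExport(o1, o1pp);
        File o1pppp = new File("O1-after-Y.owl");  // O1'''': Tool Y's export of O1''
        toolY.importExport(o1pp, o1pppp);
        return o1pppp;  // diff against O1 as in the conformance sketch above
    }
}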
7. Metrics
• Execution informs about the correct execution:
– OK. No execution problem
– FAIL. Some execution problem
– Platform Error (P.E.). Platform exception
– Not Executed (N.E.). Second step not executed
• Information added or lost, in terms of triples, axioms, etc.: Oi = Oi' + α - α'
• Interchange informs whether the ontology has been interchanged correctly, with no addition or loss of information (Oi = Oi'?):
– SAME if Execution is OK and Information added and Information lost are void
– DIFFERENT if Execution is OK but Information added or Information lost are not void
– NO if Execution is FAIL, N.E., or P.E.
8. Test suites used
Name | Definition | Nº Tests
RDF(S) Import Test Suite | Manual | 82
OWL Lite Import Test Suite | Manual | 82
OWL DL Import Test Suite | Keyword-driven generator | 561
OWL Full Import Test Suite | Manual | 90
OWL Content Pattern | Expressive generator | 81
OWL Content Pattern Expressive | Expressive generator | 81
OWL Content Pattern Full Expressive | Expressive generator | 81
10. Evaluation Execution
• Evaluations automatically performed with the SEALS Platform
– http://www.seals-project.eu/
• Evaluation materials available
– Test Data
– Results
– Metadata
[Diagram: test suites feed the SEALS Platform, which stores raw results and their interpretation for the Conformance, Interoperability, and Scalability evaluations]
12. RDF(S) conformance results
• Jena and Sesame behave identically (no problems)
• The behaviour of the OWL API-based tools (NeOn Toolkit, OWL API and Protégé 4) has significantly changed
– They transform ontologies to OWL 2
– Some problems remain, fewer in newer versions
• Protégé OWL improves
13. OWL Lite conformance results
• Jena and Sesame behave identically (no problems)
• The OWL API-based tools (NeOn Toolkit, OWL API and Protégé 4) improve
– They transform ontologies to OWL 2
• Protégé OWL improves
14. OWL DL conformance results
• Jena and Sesame behave identically (no problems)
• OWL API and Protégé 4 improve
• NeOn Toolkit worsens
• Protégé OWL behaves identically
• Robustness increases
15. Content pattern conformance results
• New issues identified in the OWL API-based tools (NeOn Toolkit, OWL API and Protégé 4)
• New issue identified in Protégé 4
• No new issues
16. Interoperability results
[Table comparing results of the 1st and 2nd Evaluation Campaigns]
• Same analysis as in conformance
• OWL DL: new issue found in interchanges from Protégé 4 to Protégé OWL
• Conclusions:
– RDF-based tools have no interoperability problems
– OWL-based tools have no interoperability problems with OWL Lite but have some with OWL DL
– Tools based on the OWL API cannot interoperate using RDF(S) (they convert ontologies into OWL 2)
17. 2nd SEALS Yardsticks for Ontology Management
• Conformance and interoperability results
• Scalability results
• Conclusions
19. Execution settings
Test suites:
• Real World. Complex ontologies from biological and medical domains
• Real World NCI. Thesaurus subsets (1.5-2 times bigger)
• LUBM. Synthetic ontologies
Execution environment:
• Win7 64-bit, Intel Core 2 Duo CPU, 2.40 GHz, 4.00 GB RAM (Real World Ontologies Test Collections)
• WinServer 64-bit, AMD Dual Core, 2.60 GHz (4 processors), 8.00 GB RAM (LUBM Ontologies Test Collection)
Constraint:
• 30 min threshold per test case
24. 2nd SEALS Yardsticks for Ontology Management
• Conformance and interoperability results
• Scalability results
• Conclusions
25. Conclusions – Test data
• Test suites are not exhaustive
– The new test suites helped detect new issues
• A more expressive test suite does not imply detecting more issues
• We used existing ontologies as input for the test data generator
– This requires a prior analysis of the ontologies to detect defects
– We found ontologies with issues that we had to correct
26. Conclusions – Results
• Tools have improved their conformance, interoperability, and robustness
• High influence of development decisions
– The OWL API radically changed the way of dealing with RDF ontologies
• We need tools for easy evaluation
• We need stronger regression testing
• The automated generator defined test cases that a person would never have thought of, but which identified new tool issues
• Using bigger ontologies for conformance and interoperability testing makes it much more difficult to find problems in the tools
28. Index
• Evaluation scenarios
• Evaluation descriptions
• Test data
• Tools
• Results
• Conclusion
29. Advanced reasoning system
• Description logic based system (DLBS)
• Standard reasoning services
– Classification
– Class satisfiability
– Ontology satisfiability
– Logical entailment
31. Evaluation criteria
• Interoperability
– the capability of the software product to interact with one or more specified systems
– a system must
• conform to the standard input formats
• be able to perform standard inference services
• Performance
– the capability of the software to provide appropriate performance, relative to the amount of resources used, under stated conditions
32. Evaluation metrics
• Interoperability
– Number of tests passed without parsing errors
– Number of inference tests passed
• Performance
– Loading time
– Inference time
33. Class satisfiability evaluation
• Standard inference service that is widely used in ontology engineering
• The goal: to assess both a DLBS's interoperability and performance
• Input
– OWL ontology
– One or several class IRIs
• Output
– TRUE: the evaluation outcome coincides with the expected result
– FALSE: the evaluation outcome differs from the expected outcome
– ERROR: indicates an I/O error
– UNKNOWN: indicates that the system is unable to compute the inference in the given timeframe
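A minimal sketch of such a test against any OWL API 3 OWLReasoner; the executor-based timeout is one way to realize the "given timeframe" and is an illustration, not the SEALS implementation.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

final class ClassSatisfiabilityTest {
    enum Outcome { TRUE, FALSE, ERROR, UNKNOWN }

    // Checks whether cls is satisfiable and compares against the expected answer.
    static Outcome run(final OWLReasoner reasoner, final OWLClass cls,
                       boolean expected, long timeoutSeconds) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> f = exec.submit(new Callable<Boolean>() {
                public Boolean call() { return reasoner.isSatisfiable(cls); }
            });
            boolean actual = f.get(timeoutSeconds, TimeUnit.SECONDS);
            return actual == expected ? Outcome.TRUE : Outcome.FALSE;
        } catch (TimeoutException t) {
            return Outcome.UNKNOWN;  // unable to compute in the given timeframe
        } catch (Exception e) {
            return Outcome.ERROR;    // e.g. an I/O or reasoner error
        } finally {
            exec.shutdownNow();
        }
    }
}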
35. Ontology satisfiability evaluation
• Standard inference service typically carried out before performing any other reasoning task
• The goal: to assess both a DLBS's interoperability and performance
• Input
– OWL ontology
• Output
– TRUE: the evaluation outcome coincides with the expected result
– FALSE: the evaluation outcome differs from the expected outcome
– ERROR: indicates an I/O error
– UNKNOWN: indicates that the system is unable to compute the inference in the given timeframe
37. Classification evaluation
• Inference service that is typically carried out after testing ontology satisfiability and prior to performing any other reasoning task
• The goal: to assess both a DLBS's interoperability and performance
• Input
– OWL ontology
• Output
– OWL ontology
– ERROR: indicates an I/O error
– UNKNOWN: indicates that the system is unable to compute the inference in the given timeframe
39. Logical entailment evaluation
• Standard inference service that is the basis for query answering
• The goal: to assess both a DLBS's interoperability and performance
• Input
– 2 OWL ontologies
• Output
– TRUE: the evaluation outcome coincides with the expected result
– FALSE: the evaluation outcome differs from the expected outcome
– ERROR: indicates an I/O error
– UNKNOWN: indicates that the system is unable to compute the inference in the given timeframe
41. Storage and reasoning systems evaluation component
• The SRS component is intended to evaluate description logic based systems (DLBSs)
– implementing OWL API 3, the de facto standard for DLBSs
– implementing the SRS SEALS DLBS interface
• SRS supports test data in all syntactic formats supported by OWL API 3
• SRS saves the evaluation results and interpretations in MathML 3 format
42. DLBS interface
• Java methods to be implemented by system developers
– OWLOntology loadOntology(IRI iri)
– boolean isSatisfiable(OWLOntology onto, OWLClass class)
– boolean isSatisfiable(OWLOntology onto)
– OWLOntology classifyOntology(OWLOntology onto)
– URI saveOntology(OWLOntology onto, IRI iri)
– boolean entails(OWLOntology onto1, OWLOntology onto2)
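Rendered as a compilable Java interface (a skeletal transcription of the listed signatures; the parameter "class" is renamed "cls" because class is a reserved word in Java):

import java.net.URI;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLOntology;

// The DLBS interface as listed above, to be implemented by system developers.
public interface DLBS {
    OWLOntology loadOntology(IRI iri);                          // load an ontology
    boolean isSatisfiable(OWLOntology onto, OWLClass cls);      // class satisfiability
    boolean isSatisfiable(OWLOntology onto);                    // ontology satisfiability
    OWLOntology classifyOntology(OWLOntology onto);             // classification
    URI saveOntology(OWLOntology onto, IRI iri);                // persist a result ontology
    boolean entails(OWLOntology onto1, OWLOntology onto2);      // logical entailment
}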
43. Testing Data
• The ontologies from the Gardiner evaluation suite
– Over 300 ontologies of varying expressivity and size
• Various versions of the GALEN ontology
• Various ontologies that have been created in EU funded projects, such as SEMINTEC, VICODI and AEO
• 155 entailment tests from the OWL 2 test cases repository
44. Evaluation setup
• 3 DLBSs
– FaCT++: C++ implementation of the FaCT OWL DL reasoner
– HermiT: Java-based OWL DL reasoner utilizing novel hypertableau algorithms
– Jcel: Java-based OWL 2 EL reasoner
– FaCT++C: evaluated without the OWL prepareReasoner() call
– HermiTC: evaluated without the OWL prepareReasoner() call
• 2 AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ machines with 2 GB of main memory
– DLBSs were allowed to allocate up to 1 GB
62. Conclusion
• Errors:
– datatypes not supported in the systems
– syntax-related: a system was unable to register a role or a concept
– expressivity errors
• Execution time is dominated by a small number of hard problems
65. OAEI & SEALS
• OAEI: Ontology Alignment Evaluation Initiative
– Organized as an annual campaign from 2005 to 2012
– Included in the Ontology Matching workshop at ISWC
– Different tracks (evaluation scenarios) organized by different researchers
• Starting in 2010: support from SEALS
– OAEI 2010, OAEI 2011, and OAEI 2011.5
73. Questions?
Write a mail to Christian Meilicke: christian@informatik.uni-mannheim.de
74. IWEST 2012 workshop, located at ESWC 2012
Semantic Search Systems Evaluation Campaign
75. Two phase approach
• Semantic search tools evaluation demands a user-in-the-loop phase
– usability criterion
• Two phases:
– User-in-the-loop
– Automated
76. Evaluation criteria by phase
Each phase will address a different subset of criteria.
• Automated phase: query expressiveness, scalability, performance
• User-in-the-loop phase: usability, query expressiveness
77. Participants
Tool | Description | UITL | Auto
K-Search | Form-based | x | x
Ginseng | Natural language with constrained vocabulary and grammar | x |
NLP-Reduce | Natural language for full English questions, sentence fragments, and keywords | x |
Jena Arq | SPARQL query engine. Automated phase baseline | | x
RDF.Net Query | SPARQL-based | | x
Semantic Crystal | Graph-based | x |
Affective Graphs | Graph-based | x |
78. Usability Evaluation Setup
• Data: Mooney Natural Language Learning Data
• Subjects: 20 (10 expert users; 10 casual users)
– Each subject evaluated the 5 participating tools
• Task: formulate 5 questions in each tool's interface
• Data collected: success rate, input time, number of attempts, response time, user satisfaction questionnaires, demographics
79. Questions
• 1 concept, 1 relation: 1) Give me all the capitals of the USA?
• 2 concepts, 2 relations: 2) What are the cities in states through which the Mississippi runs?
• comparative: 3) Which states have a city named Columbia with a city population over 50,000?
• superlative: 4) Which lakes are in the state with the highest point?
• negation: 5) Tell me which rivers do not traverse the state with the capital Nashville?
80. Automated Evaluation Setup
• Data: EvoOnt dataset
– Five sizes: 1K, 10K, 100K, 1M, 10M triples
• Task: answer 10 questions per dataset size
• Data collected: ontology load time, query time, number of results, result list
• Analyses: precision, recall, f-measure, mean query time, mean time per result, etc.
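A minimal sketch of the per-query analysis, assuming each returned result list and its gold standard are reduced to sets of result identifiers (illustrative names, not the SEALS measurement code):

import java.util.HashSet;
import java.util.Set;

final class SearchMetrics {
    // Precision, recall, and F-measure for one query's results vs. the gold standard.
    static double[] prf(Set<String> returned, Set<String> expected) {
        Set<String> hits = new HashSet<String>(returned);
        hits.retainAll(expected);  // true positives
        double p = returned.isEmpty() ? 0.0 : (double) hits.size() / returned.size();
        double r = expected.isEmpty() ? 0.0 : (double) hits.size() / expected.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] { p, r, f };
    }
}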
81. Configuration
• All tools executed on the SEALS Platform
• Each tool executed within a Virtual Machine

            | Linux                     | Windows
OS          | Ubuntu 10.10 (64-bit)     | Windows 7 (64-bit)
Num CPUs    | 2                         | 4
Memory (GB) | 4                         | 4
Tools       | Arq v2.8.2 and Arq v2.9.0 | RDF Query v0.5.1-beta
83. Graph-based tools most liked (highest ranks and average SUS scores)
• Perceived by expert users as intuitive, allowing them to easily formulate more complex queries.
• Casual users enjoyed the fun and visually-appealing interfaces, which created a pleasant search experience.
[Chart: System Usability Scale (SUS) questionnaire scores (0-100) by user type (casual vs. expert) for Semantic Crystal, Affective Graphs, K-Search, Ginseng, and NLP-Reduce]
84. Form-based approach most liked by casual users
• Perceived by casual users as a midpoint between NL and graph-based.
• Allows more complex queries than NL does.
• Less complicated and lower query input time than the graph-based approach.
• Together with graph-based: most liked by expert users.
[Chart: scores (1-5) by user type per tool for the extended questionnaire item "The system's query language was easy to understand and use"]
85. Casual users liked the controlled-NL approach
• Casual users:
– liked guidance through suggestions
– prefer to be 'controlled' by the language model, allowing only valid queries
• Expert users:
– found it restrictive and frustrating
– prefer more flexibility and expressiveness rather than support and restriction
[Chart: SUS questionnaire scores (0-100) by user type per tool]
86. Free-NL challenge: the habitability problem
• Free NL was liked for its simplicity, familiarity, naturalness, and the low query input time required.
• It faces the habitability problem: a mismatch between the users' query terms and the tools' ones.
• This led to the lowest success rate, the highest number of trials to get a satisfying answer, and in turn very low user satisfaction.
[Chart: answer-found rate (0-1) by user type per tool]
88. Overview
• K-Search couldn't load the ontologies
– external ontology import not supported
– cyclic relations with concepts in remote ontologies not supported
• Non-NL tools transform queries a priori
• Native SPARQL tools exhibit differences in query approach (see load and query times)
89. Ontology load time
• RDF Query loads the ontology on-the-fly; load times are therefore independent of dataset size.
• Arq loads the ontology into memory.
[Chart: ontology load time (ms) vs. dataset size (thousands of triples) for Arq v2.8.2, Arq v2.9.0, and RDF Query v0.5.1-beta]
90. Query time
• RDF Query loads the ontology on-the-fly; query times therefore incorporate load time.
– Expensive for more than one query in a session.
• Arq loads the ontology into memory.
– Query times largely independent of dataset size.
[Chart: mean query time (ms) vs. dataset size (thousands of triples) for Arq v2.8.2, Arq v2.9.0, and RDF Query v0.5.1-beta]
91. SEALS Semantic Web Service Tools Evaluation Campaign 2011
Semantic Web Service Discovery Evaluation Results
92. Evaluation of SWS Discovery
• Finding Web Services based on their semantic descriptions
• For a given goal and a given set of service descriptions, the tool returns the match degree between the goal and each service
• Measurement services are provided via the SEALS Platform to measure the rate of matching correctness
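A minimal sketch of such a correctness measurement, assuming both the tool output and the gold standard map each service to a match degree; all names, including the degree labels, are illustrative rather than the SEALS measurement service:

import java.util.Map;

final class DiscoveryScoring {
    // Rate of matching correctness: fraction of services for which the tool's
    // match degree equals the gold-standard degree for the given goal.
    // Degree labels (e.g. "exact", "plugin", "fail") are assumptions.
    static double correctnessRate(Map<String, String> returned, Map<String, String> gold) {
        int correct = 0;
        for (Map.Entry<String, String> e : gold.entrySet()) {
            if (e.getValue().equals(returned.get(e.getKey()))) {
                correct++;
            }
        }
        return gold.isEmpty() ? 0.0 : (double) correct / gold.size();
    }
}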
93. Campaign Overview
http://www.seals-project.eu/seals-evaluation-campaigns/2nd-seals-evaluation-campaigns/semantic-web-service-tools-evaluation-campaign-2011
• Goal
– Which ontology/annotation is the best: WSMO-Lite, OWL-S or SAWSDL?
• Assumptions:
– Same corresponding Test Collections (TCs)
– Same corresponding matchmaking algorithms (tools)
– The corresponding tools will belong to the same provider
– The level of performance of a tool for a specific TC is of secondary importance
104. Tools
WSMO-LITE-TC: WSMO-LITE-OU [1]
SAWSDL-TC: SAWSDL-OU [1], SAWSDL-URJC [2], SAWSDL-M0 [3]
OWLS-TC: OWLS-URJC [2], OWLS-M0 [3]
1. Ning Li, The Open University
2. Ziji Cong et al., Rey Juan Carlos University
3. Matthias Klusch et al., German Research Center for Artificial Intelligence
106. Evaluation Execution
• The evaluation workflow was executed on the SEALS Platform
• All tools were executed within a Virtual Machine

            | Windows
OS          | Windows 7 (64-bit)
Num CPUs    | 4
Memory (GB) | 4
Tools       | WSMO-LITE-OU, SAWSDL-OU
107. Partial Evaluation Results
[Diagram comparing WSMO-LITE vs. SAWSDL: WSMO-LITE-OU evaluated against WSMO-LITE-TC, SAWSDL-OU against SAWSDL-TC]
108. [Results table not captured in this transcript]
* This table only shows the results that are different
109. Analysis
• Out of 42 goals, only 19 have different results in terms of precision and recall
• On 17 out of 19 occasions, WSMO-Lite improves discovery precision over SAWSDL by specializing service semantics
• WSMO-Lite performs worse than SAWSDL on discovery recall in 6 of 19 occasions, while performing the same on the other 13
111. Lessons Learned
• WSMO-LITE-OU tends to perform better than SAWSDL-OU in terms of precision, but slightly worse in recall.
• The only feature of WSMO-Lite used against SAWSDL was the service category (based on TC domains).
– Services were filtered by service category in WSMO-LITE-OU and not in SAWSDL-OU
• Further tests with additional tools and measures are needed for any conclusive results about WSMO-Lite vs. SAWSDL (many tools are not available yet)
112. Conclusions
• This has been the first SWS evaluation campaign in the community focusing on the impact of the service ontology/annotation on performance
• This comparison has been facilitated by the generation of WSMO-LITE-TC as a counterpart of SAWSDL-TC and OWLS-TC in the SEALS repository
• The current comparison only involves 2 ontologies/annotations (WSMO-Lite and SAWSDL)
• Raw and interpretation results are available in RDF via the SEALS repository (public access)