3. 2nd SEALS Yardsticks for Ontology Management
• Conformance and interoperability results
• Scalability results
• Conclusions
4. Conformance evaluation
• Ontology language conformance
– The ability to adhere to existing ontology language specifications
• Goal: to evaluate the conformance of semantic technologies with regard to ontology representation languages
[Diagram: Tool X performs Step 1 (Import + Export), taking O1 to O1' and then O1''; O1 = O1'' + α - α']
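For concreteness, the import + export step can be sketched with the OWL API 3 (itself one of the evaluated tools): load O1, export it, re-parse the export as O1'', and compute the added and lost information as axiom-set differences. This is a minimal illustration under assumed file names, with the OWL API standing in for Tool X; it is not the SEALS test harness itself.

import java.io.File;
import java.util.HashSet;
import java.util.Set;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class ConformanceStep {
    public static void main(String[] args) throws Exception {
        // O1: the original test ontology (file name assumed for illustration)
        OWLOntologyManager m1 = OWLManager.createOWLOntologyManager();
        OWLOntology o1 = m1.loadOntologyFromOntologyDocument(new File("O1.owl"));

        // Import + Export: here the OWL API itself plays the role of Tool X
        File exported = new File("O1-exported.owl");
        m1.saveOntology(o1, IRI.create(exported.toURI()));

        // O1'': the re-parsed result of the export
        OWLOntology o1pp = OWLManager.createOWLOntologyManager()
                .loadOntologyFromOntologyDocument(exported);

        // α = information added, α' = information lost (axiom-level diff)
        Set<OWLAxiom> added = new HashSet<OWLAxiom>(o1pp.getAxioms());
        added.removeAll(o1.getAxioms());
        Set<OWLAxiom> lost = new HashSet<OWLAxiom>(o1.getAxioms());
        lost.removeAll(o1pp.getAxioms());
        System.out.println("added: " + added.size() + ", lost: " + lost.size());
    }
}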
5. Metrics
• Execution informs about the correct execution:
– OK. No execution problem
– FAIL. Some execution problem
– Platform Error (P.E.). Platform exception
• Information added or lost, in terms of triples, axioms, etc.: Oi = Oi' + α - α'
• Conformance informs whether the ontology has been processed correctly, with no addition or loss of information (Oi = Oi'?):
– SAME if Execution is OK and Information added and Information lost are void
– DIFFERENT if Execution is OK but Information added or Information lost are not void
– NO if Execution is FAIL or P.E.
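A minimal sketch of how these metrics combine into the conformance verdict; the enum and method names are illustrative, not SEALS APIs.

import java.util.Set;
import org.semanticweb.owlapi.model.OWLAxiom;

final class ConformanceVerdict {
    enum Execution { OK, FAIL, PLATFORM_ERROR }
    enum Conformance { SAME, DIFFERENT, NO }

    // Maps one test's execution result and axiom diffs to the verdict above.
    static Conformance verdict(Execution exec, Set<OWLAxiom> added, Set<OWLAxiom> lost) {
        if (exec != Execution.OK) {
            return Conformance.NO;              // FAIL or Platform Error
        }
        if (added.isEmpty() && lost.isEmpty()) {
            return Conformance.SAME;            // Oi = Oi'
        }
        return Conformance.DIFFERENT;           // α or α' is not void
    }
}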
6. Interoperability evaluation
• Ontology language interoperability
– The ability to interchange ontologies and use them
• Goal: to evaluate the interoperability of semantic technologies in terms of their ability to interchange ontologies and use them
[Diagram of an interchange: Step 1 (Import + Export) in Tool X takes O1 to O1' and O1'', with O1 = O1'' + α - α'; Step 2 (Import + Export) in Tool Y takes O1'' to O1''' and O1'''', with O1'' = O1'''' + β - β'. Overall: O1 = O1'''' + α - α' + β - β']
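An interchange is just two such steps composed, so the diffs of both steps accumulate in the final comparison of O1 against O1''''. A sketch, assuming a hypothetical ToolWrapper around each tool's import + export cycle (not a SEALS interface):

import java.io.File;

// Hypothetical wrapper around one tool's import + export cycle.
interface ToolWrapper {
    // Imports the ontology at 'input' and exports it again to 'output'.
    void importExport(File input, File output) throws Exception;
}

final class InterchangeStep {
    // Runs O1 through Tool X, then the result through Tool Y, returning O1''''.
    static File interchange(File o1, ToolWrapper toolX, ToolWrapper toolY) throws Exception {
        File o1pp = new File("O1-after-X.owl");    // O1'': Tool X's export
        toolX.importExport(o1, o1pp);
        File o1pppp = new File("O1-after-Y.owl");  // O1'''': Tool Y's export of O1''
        toolY.importExport(o1pp, o1pppp);
        return o1pppp;  // diff against O1 as in the conformance sketch above
    }
}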
7. Metrics
• Execution informs about the correct execution:
– OK. No execution problem
– FAIL. Some execution problem
– Platform Error (P.E.). Platform exception
– Not Executed (N.E.). Second step not executed
• Information added or lost, in terms of triples, axioms, etc.: Oi = Oi' + α - α'
• Interchange informs whether the ontology has been interchanged correctly, with no addition or loss of information (Oi = Oi'?):
– SAME if Execution is OK and Information added and Information lost are void
– DIFFERENT if Execution is OK but Information added or Information lost are not void
– NO if Execution is FAIL, N.E., or P.E.
8. Test suites used
Name | Definition | Nº Tests
RDF(S) Import Test Suite | Manual | 82
OWL Lite Import Test Suite | Manual | 82
OWL DL Import Test Suite | Keyword-driven generator | 561
OWL Full Import Test Suite | Manual | 90
OWL Content Pattern | Expressive generator | 81
OWL Content Pattern Expressive | Expressive generator | 81
OWL Content Pattern Full Expressive | Expressive generator | 81
10. Evaluation Execution
• Evaluations automatically performed with the SEALS Platform
– http://www.seals-project.eu/
• Evaluation materials available
– Test Data
– Results
– Metadata
[Diagram: test suites feed the SEALS Platform, which stores raw results and their interpretation for the Conformance, Interoperability, and Scalability evaluations]
12. RDF(S) conformance results
• Jena and Sesame behave identically (no problems)
• The behaviour of the OWL API-based tools (NeOn Toolkit, OWL API and Protégé 4) has significantly changed
– They transform ontologies to OWL 2
– Some problems remain, fewer in newer versions
• Protégé OWL improves
13. OWL Lite conformance results
• Jena and Sesame behave identically (no problems)
• The OWL API-based tools (NeOn Toolkit, OWL API and Protégé 4) improve
– They transform ontologies to OWL 2
• Protégé OWL improves
14. OWL DL conformance results
• Jena and Sesame behave identically (no problems)
• OWL API and Protégé 4 improve
• NeOn Toolkit worsens
• Protégé OWL behaves identically
• Robustness increases
15. Content pattern conformance results
• New issues identified in the OWL API-based tools (NeOn Toolkit, OWL API and Protégé 4)
• New issue identified in Protégé 4
• No new issues
16. Interoperability results
[Table comparing results of the 1st and 2nd Evaluation Campaigns]
• Same analysis as in conformance
• OWL DL: new issue found in interchanges from Protégé 4 to Protégé OWL
• Conclusions:
– RDF-based tools have no interoperability problems
– OWL-based tools have no interoperability problems with OWL Lite but have some with OWL DL
– Tools based on the OWL API cannot interoperate using RDF(S) (they convert ontologies into OWL 2)
17. 2nd SEALS Yardsticks for Ontology Management
• Conformance and interoperability results
• Scalability results
• Conclusions
19. Execution settings
Test suites:
• Real World. Complex ontologies from biological and medical domains
• Real World NCI. Thesaurus subsets (1.5-2 times bigger)
• LUBM. Synthetic ontologies
Execution environment:
• Win7 64-bit, Intel Core 2 Duo CPU, 2.40 GHz, 4.00 GB RAM (Real World Ontologies Test Collections)
• WinServer 64-bit, AMD Dual Core, 2.60 GHz (4 processors), 8.00 GB RAM (LUBM Ontologies Test Collection)
Constraint:
• 30 min threshold per test case
24. 2nd SEALS Yardsticks for Ontology Management
• Conformance and interoperability results
• Scalability results
• Conclusions
25. Conclusions – Test data
• Test suites are not exhaustive
– The new test suites helped detect new issues
• A more expressive test suite does not imply detecting more issues
• We used existing ontologies as input for the test data generator
– This requires a prior analysis of the ontologies to detect defects
– We found ontologies with issues that we had to correct
26. Conclusions – Results
• Tools have improved their conformance, interoperability, and robustness
• High influence of development decisions
– The OWL API radically changed the way of dealing with RDF ontologies
• We need tools for easy evaluation
• We need stronger regression testing
• The automated generator defined test cases that a person would never have thought of, but which identified new tool issues
• Using bigger ontologies for conformance and interoperability testing makes it much more difficult to find problems in the tools
28. Index
• Evaluation scenarios
• Evaluation descriptions
• Test data
• Tools
• Results
• Conclusion
29. Advanced reasoning system
• Description logic based system (DLBS)
• Standard reasoning services
– Classification
– Class satisfiability
– Ontology satisfiability
– Logical entailment
31. Evaluation criteria
• Interoperability
– the capability of the software product to interact with one or more specified systems
– a system must
• conform to the standard input formats
• be able to perform standard inference services
• Performance
– the capability of the software to provide appropriate performance, relative to the amount of resources used, under stated conditions
32. Evaluation metrics
• Interoperability
– Number of tests passed without parsing errors
– Number of inference tests passed
• Performance
– Loading time
– Inference time
33. Class satisfiability evaluation
• Standard inference service that is widely used in ontology engineering
• The goal: to assess both a DLBS's interoperability and performance
• Input
– OWL ontology
– One or several class IRIs
• Output
– TRUE: the evaluation outcome coincides with the expected result
– FALSE: the evaluation outcome differs from the expected outcome
– ERROR: indicates an I/O error
– UNKNOWN: indicates that the system is unable to compute the inference in the given timeframe
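A minimal sketch of such a test against any OWL API 3 OWLReasoner; the executor-based timeout is one way to realize the "given timeframe" and is an illustration, not the SEALS implementation.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

final class ClassSatisfiabilityTest {
    enum Outcome { TRUE, FALSE, ERROR, UNKNOWN }

    // Checks whether cls is satisfiable and compares against the expected answer.
    static Outcome run(final OWLReasoner reasoner, final OWLClass cls,
                       boolean expected, long timeoutSeconds) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> f = exec.submit(new Callable<Boolean>() {
                public Boolean call() { return reasoner.isSatisfiable(cls); }
            });
            boolean actual = f.get(timeoutSeconds, TimeUnit.SECONDS);
            return actual == expected ? Outcome.TRUE : Outcome.FALSE;
        } catch (TimeoutException t) {
            return Outcome.UNKNOWN;  // unable to compute in the given timeframe
        } catch (Exception e) {
            return Outcome.ERROR;    // e.g. an I/O or reasoner error
        } finally {
            exec.shutdownNow();
        }
    }
}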
35. Ontology satisfiability evaluation
• Standard inference service typically carried out before performing any other reasoning task
• The goal: to assess both a DLBS's interoperability and performance
• Input
– OWL ontology
• Output
– TRUE: the evaluation outcome coincides with the expected result
– FALSE: the evaluation outcome differs from the expected outcome
– ERROR: indicates an I/O error
– UNKNOWN: indicates that the system is unable to compute the inference in the given timeframe
37. Classification evaluation
• Inference service that is typically carried out after testing ontology satisfiability and prior to performing any other reasoning task
• The goal: to assess both a DLBS's interoperability and performance
• Input
– OWL ontology
• Output
– OWL ontology
– ERROR: indicates an I/O error
– UNKNOWN: indicates that the system is unable to compute the inference in the given timeframe
39. Logical entailment evaluation
• Standard inference service that is the basis for query answering
• The goal: to assess both a DLBS's interoperability and performance
• Input
– 2 OWL ontologies
• Output
– TRUE: the evaluation outcome coincides with the expected result
– FALSE: the evaluation outcome differs from the expected outcome
– ERROR: indicates an I/O error
– UNKNOWN: indicates that the system is unable to compute the inference in the given timeframe
41. Storage and reasoning systems evaluation component
• The SRS component is intended to evaluate description logic based systems (DLBSs)
– implementing OWL API 3, the de facto standard for DLBSs
– implementing the SRS SEALS DLBS interface
• SRS supports test data in all syntactic formats supported by OWL API 3
• SRS saves the evaluation results and interpretations in MathML 3 format
42. DLBS interface
• Java methods to be implemented by system developers
– OWLOntology loadOntology(IRI iri)
– boolean isSatisfiable(OWLOntology onto, OWLClass class)
– boolean isSatisfiable(OWLOntology onto)
– OWLOntology classifyOntology(OWLOntology onto)
– URI saveOntology(OWLOntology onto, IRI iri)
– boolean entails(OWLOntology onto1, OWLOntology onto2)
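Rendered as a compilable Java interface (a skeletal transcription of the listed signatures; the parameter "class" is renamed "cls" because class is a reserved word in Java):

import java.net.URI;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLOntology;

// The DLBS interface as listed above, to be implemented by system developers.
public interface DLBS {
    OWLOntology loadOntology(IRI iri);                          // load an ontology
    boolean isSatisfiable(OWLOntology onto, OWLClass cls);      // class satisfiability
    boolean isSatisfiable(OWLOntology onto);                    // ontology satisfiability
    OWLOntology classifyOntology(OWLOntology onto);             // classification
    URI saveOntology(OWLOntology onto, IRI iri);                // persist a result ontology
    boolean entails(OWLOntology onto1, OWLOntology onto2);      // logical entailment
}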
43. Testing Data
• The ontologies from the Gardiner evaluation suite
– Over 300 ontologies of varying expressivity and size
• Various versions of the GALEN ontology
• Various ontologies that have been created in EU funded projects, such as SEMINTEC, VICODI and AEO
• 155 entailment tests from the OWL 2 test cases repository
44. Evaluation setup
• 3 DLBSs
– FaCT++: C++ implementation of the FaCT OWL DL reasoner
– HermiT: Java-based OWL DL reasoner utilizing novel hypertableau algorithms
– Jcel: Java-based OWL 2 EL reasoner
– FaCT++C: evaluated without the OWL prepareReasoner() call
– HermiTC: evaluated without the OWL prepareReasoner() call
• 2 AMD Athlon(tm) 64 X2 Dual Core Processor 4600+ machines with 2 GB of main memory
– DLBSs were allowed to allocate up to 1 GB
62. Conclusion
• Errors:
– datatypes not supported in the systems
– syntax-related: a system was unable to register a role or a concept
– expressivity errors
• Execution time is dominated by a small number of hard problems
65. OAEI & SEALS
• OAEI: Ontology Alignment Evaluation Initiative
– Organized as an annual campaign from 2005 to 2012
– Included in the Ontology Matching workshop at ISWC
– Different tracks (evaluation scenarios) organized by different researchers
• Starting in 2010: support from SEALS
– OAEI 2010, OAEI 2011, and OAEI 2011.5
73. Questions?
Write a mail to Christian Meilicke: christian@informatik.uni-mannheim.de
74. IWEST 2012 workshop, located at ESWC 2012
Semantic Search Systems Evaluation Campaign
75. Two phase approach
• Semantic search tools evaluation demands a user-in-the-loop phase
– usability criterion
• Two phases:
– User-in-the-loop
– Automated
76. Evaluation criteria by phase
Each phase will address a different subset of criteria.
• Automated phase: query expressiveness, scalability, performance
• User-in-the-loop phase: usability, query expressiveness
77. Participants
Tool | Description | UITL | Auto
K-Search | Form-based | x | x
Ginseng | Natural language with constrained vocabulary and grammar | x |
NLP-Reduce | Natural language for full English questions, sentence fragments, and keywords | x |
Jena Arq | SPARQL query engine. Automated phase baseline | | x
RDF.Net Query | SPARQL-based | | x
Semantic Crystal | Graph-based | x |
Affective Graphs | Graph-based | x |
78. Usability Evaluation Setup
• Data: Mooney Natural Language Learning Data
• Subjects: 20 (10 expert users; 10 casual users)
– Each subject evaluated the 5 participating tools
• Task: formulate 5 questions in each tool's interface
• Data collected: success rate, input time, number of attempts, response time, user satisfaction questionnaires, demographics
79. Questions
• 1 concept, 1 relation: 1) Give me all the capitals of the USA?
• 2 concepts, 2 relations: 2) What are the cities in states through which the Mississippi runs?
• comparative: 3) Which states have a city named Columbia with a city population over 50,000?
• superlative: 4) Which lakes are in the state with the highest point?
• negation: 5) Tell me which rivers do not traverse the state with the capital Nashville?
80. Automated Evaluation Setup
• Data: EvoOnt dataset
– Five sizes: 1K, 10K, 100K, 1M, 10M triples
• Task: answer 10 questions per dataset size
• Data collected: ontology load time, query time, number of results, result list
• Analyses: precision, recall, f-measure, mean query time, mean time per result, etc.
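A minimal sketch of the per-query analysis, assuming each returned result list and its gold standard are reduced to sets of result identifiers (illustrative names, not the SEALS measurement code):

import java.util.HashSet;
import java.util.Set;

final class SearchMetrics {
    // Precision, recall, and F-measure for one query's results vs. the gold standard.
    static double[] prf(Set<String> returned, Set<String> expected) {
        Set<String> hits = new HashSet<String>(returned);
        hits.retainAll(expected);  // true positives
        double p = returned.isEmpty() ? 0.0 : (double) hits.size() / returned.size();
        double r = expected.isEmpty() ? 0.0 : (double) hits.size() / expected.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] { p, r, f };
    }
}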
81. Configuration
• All tools executed on the SEALS Platform
• Each tool executed within a Virtual Machine

            | Linux                     | Windows
OS          | Ubuntu 10.10 (64-bit)     | Windows 7 (64-bit)
Num CPUs    | 2                         | 4
Memory (GB) | 4                         | 4
Tools       | Arq v2.8.2 and Arq v2.9.0 | RDF Query v0.5.1-beta
83. Graph-based tools most liked (highest ranks and average SUS scores)
• Perceived by expert users as intuitive, allowing them to easily formulate more complex queries.
• Casual users enjoyed the fun and visually-appealing interfaces, which created a pleasant search experience.
[Chart: System Usability Scale (SUS) questionnaire scores (0-100) by user type (casual vs. expert) for Semantic Crystal, Affective Graphs, K-Search, Ginseng, and NLP-Reduce]
84. Form-based approach most liked by casual users
• Perceived by casual users as a midpoint between NL and graph-based.
• Allows more complex queries than NL does.
• Less complicated and lower query input time than the graph-based approach.
• Together with graph-based: most liked by expert users.
[Chart: scores (1-5) by user type per tool for the extended questionnaire item "The system's query language was easy to understand and use"]
85. Casual users liked the controlled-NL approach
• Casual users:
– liked guidance through suggestions
– prefer to be 'controlled' by the language model, allowing only valid queries
• Expert users:
– found it restrictive and frustrating
– prefer more flexibility and expressiveness rather than support and restriction
[Chart: SUS questionnaire scores (0-100) by user type per tool]
86. Free-NL challenge: the habitability problem
• Free NL was liked for its simplicity, familiarity, naturalness, and the low query input time required.
• It faces the habitability problem: a mismatch between the users' query terms and the tools' ones.
• This led to the lowest success rate, the highest number of trials to get a satisfying answer, and in turn very low user satisfaction.
[Chart: answer-found rate (0-1) by user type per tool]
88. Overview
• K-Search couldn't load the ontologies
– external ontology import not supported
– cyclic relations with concepts in remote ontologies not supported
• Non-NL tools transform queries a priori
• Native SPARQL tools exhibit differences in query approach (see load and query times)
89. Ontology load time
• RDF Query loads the ontology on-the-fly; load times are therefore independent of dataset size.
• Arq loads the ontology into memory.
[Chart: ontology load time (ms) vs. dataset size (thousands of triples) for Arq v2.8.2, Arq v2.9.0, and RDF Query v0.5.1-beta]
90. Query time
• RDF Query loads the ontology on-the-fly; query times therefore incorporate load time.
– Expensive for more than one query in a session.
• Arq loads the ontology into memory.
– Query times largely independent of dataset size.
[Chart: mean query time (ms) vs. dataset size (thousands of triples) for Arq v2.8.2, Arq v2.9.0, and RDF Query v0.5.1-beta]
91. SEALS Semantic Web Service Tools Evaluation Campaign 2011
Semantic Web Service Discovery Evaluation Results
92. Evaluation of SWS Discovery
• Finding Web Services based on their semantic descriptions
• For a given goal and a given set of service descriptions, the tool returns the match degree between the goal and each service
• Measurement services are provided via the SEALS Platform to measure the rate of matching correctness
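A minimal sketch of such a correctness measurement, assuming both the tool output and the gold standard map each service to a match degree; all names, including the degree labels, are illustrative rather than the SEALS measurement service:

import java.util.Map;

final class DiscoveryScoring {
    // Rate of matching correctness: fraction of services for which the tool's
    // match degree equals the gold-standard degree for the given goal.
    // Degree labels (e.g. "exact", "plugin", "fail") are assumptions.
    static double correctnessRate(Map<String, String> returned, Map<String, String> gold) {
        int correct = 0;
        for (Map.Entry<String, String> e : gold.entrySet()) {
            if (e.getValue().equals(returned.get(e.getKey()))) {
                correct++;
            }
        }
        return gold.isEmpty() ? 0.0 : (double) correct / gold.size();
    }
}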
93. Campaign Overview
http://www.seals-project.eu/seals-evaluation-campaigns/2nd-seals-evaluation-campaigns/semantic-web-service-tools-evaluation-campaign-2011
• Goal
– Which ontology/annotation is the best: WSMO-Lite, OWL-S or SAWSDL?
• Assumptions:
– Same corresponding Test Collections (TCs)
– Same corresponding matchmaking algorithms (tools)
– The corresponding tools will belong to the same provider
– The level of performance of a tool for a specific TC is of secondary importance
104. Tools
WSMO-LITE-TC: WSMO-LITE-OU [1]
SAWSDL-TC: SAWSDL-OU [1], SAWSDL-URJC [2], SAWSDL-M0 [3]
OWLS-TC: OWLS-URJC [2], OWLS-M0 [3]
1. Ning Li, The Open University
2. Ziji Cong et al., Rey Juan Carlos University
3. Matthias Klusch et al., German Research Center for Artificial Intelligence
106. Evaluation Execution
• The evaluation workflow was executed on the SEALS Platform
• All tools were executed within a Virtual Machine

            | Windows
OS          | Windows 7 (64-bit)
Num CPUs    | 4
Memory (GB) | 4
Tools       | WSMO-LITE-OU, SAWSDL-OU
107. Partial Evaluation Results
[Diagram comparing WSMO-LITE vs. SAWSDL: WSMO-LITE-OU evaluated against WSMO-LITE-TC, SAWSDL-OU against SAWSDL-TC]
108. [Results table not captured in this transcript]
* This table only shows the results that are different
109. Analysis
• Out of 42 goals, only 19 have different results in terms of precision and recall
• On 17 out of 19 occasions, WSMO-Lite improves discovery precision over SAWSDL by specializing service semantics
• WSMO-Lite performs worse than SAWSDL on discovery recall in 6 of 19 occasions, while performing the same on the other 13
111. Lessons Learned
• WSMO-LITE-OU tends to perform better than SAWSDL-OU in terms of precision, but slightly worse in recall.
• The only feature of WSMO-Lite used against SAWSDL was the service category (based on TC domains).
– Services were filtered by service category in WSMO-LITE-OU and not in SAWSDL-OU
• Further tests with additional tools and measures are needed for any conclusive results about WSMO-Lite vs. SAWSDL (many tools are not available yet)
112. Conclusions
• This has been the first SWS evaluation campaign in the community focusing on the impact of the service ontology/annotation on performance
• This comparison has been facilitated by the generation of WSMO-LITE-TC as a counterpart of SAWSDL-TC and OWLS-TC in the SEALS repository
• The current comparison only involves 2 ontologies/annotations (WSMO-Lite and SAWSDL)
• Raw and interpretation results are available in RDF via the SEALS repository (public access)