3. Some of us work on the data supply side
• We collect and generate data
• We represent, integrate and store data,
and make it accessible through data
models (and relevant technology)
• We get it ready for use and
exploitation
4. Some others work on the data exploitation side
• We use data to build predictive,
descriptive or other types of analytics
solutions
• We use data to build and power AI
applications
6. But there is a gap between the two sides that we
very often don’t see
7. And that’s the semantic gap
• The situation when the data models of
the supply side are misunderstood
and misused by the exploitation side.
• The situation when the data
requirements of the exploitation side
are misunderstood by the supply side.
• Typically, the more distant supply is
from usage, the greater the
semantic gap.
8. Data meaning is communicated through (semantic)
data models
• Conceptual descriptions and representations of data that convey the
latter’s meaning in an explicit and commonly understood and accepted
way among humans and systems.
9. The semantic gap is caused by bad semantic models
• We model data meaning in a
wrong way.
• We model data meaning in a
non-explicit way
• We model data meaning in a
not commonly accepted way
13. What do we do wrong?
• We often give inaccurate, misleading
or ambiguous names to data modeling
elements:
• If I name a table “Car” then its rows
should represent concrete cars (e.g.,
the car with registration number XYZ)
• But if my rows represent car models
(e.g., BMW 3.16 or AUDI A4), then the
table should be named “CarModel”, not
“Car”.
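The naming point above can be sketched in code. This is a minimal illustration, not a prescribed schema: the class and field names are assumptions, and the registration number "ABC" is made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarModel:
    """A kind of car, e.g. the AUDI A4 (not any one vehicle)."""
    manufacturer: str
    name: str


@dataclass(frozen=True)
class Car:
    """One concrete vehicle, identified by its registration number."""
    registration_number: str
    model: CarModel


# Illustrative data only.
a4 = CarModel(manufacturer="AUDI", name="A4")
my_car = Car(registration_number="XYZ", model=a4)
your_car = Car(registration_number="ABC", model=a4)

# Two distinct cars can share one car model.
print(my_car != your_car, my_car.model == your_car.model)
```

Keeping the two concepts in separate, accurately named types makes it harder to count car models when the question was about cars.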
14. Why do we do it?
• Not realizing there are other interpretations of
the name we use
• Assuming other interpretations are irrelevant
and that people will know what we mean
• Assuming that the correct meaning will be
inferred from the context.
15. How to narrow the gap
• Always contemplate an element’s name in
relative isolation and try to think of all the possible
and legitimate ways it can be interpreted by a
human.
• If an element’s name has more than one
interpretation, make it unambiguous, even if
the other interpretations are not within the
domain or not very likely to occur.
• Observe how the element is used in practice by
your modelers, annotators, developers and users.
17. At Textkernel we do Labour Market Analytics
• Supply-Demand Analysis
• Top Skills per Job
• Career Paths
18. For that we need synonyms!
• Two terms are synonymous when they mean the same thing in (almost)
all contexts.
• We need synonyms to get statistics on the actual professions and skills,
no matter the form or language in which they are expressed in text.
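A rough sketch of why synonyms matter for such statistics: mentions are mapped to a canonical profession before counting. The synonym table, the function name and the sample mentions are all assumptions made for illustration, not Textkernel's actual data or pipeline.

```python
from collections import Counter

# Hypothetical synonym table: lowercased surface form -> canonical profession.
SYNONYMS = {
    "chief executive officer": "Chief Executive Officer",
    "ceo": "Chief Executive Officer",
    "software developer": "Software Developer",
    "softwareentwickler": "Software Developer",  # German surface form
}


def count_professions(mentions):
    """Count mentions per canonical profession; unknown terms are kept as-is."""
    counts = Counter()
    for mention in mentions:
        cleaned = mention.strip()
        counts[SYNONYMS.get(cleaned.lower(), cleaned)] += 1
    return counts


# Illustrative mentions, as they might appear in job ads or CVs.
mentions = ["CEO", "Chief Executive Officer", "Softwareentwickler",
            "software developer", "ceo"]
counts = count_professions(mentions)
print(counts)
```

Without the table, the five mentions would be counted as five different professions instead of two.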
19. Can we use any data model for synonymy? Not really!
Term | Synonyms | Model
Profession | Occupation, Vocation, Work, Living | KBPedia
Chief Executive Officer | CEO, chief operating officer | Wordnet
Chief Executive Officer | Senior executive officer, chairman, CEO, managing director, president | ESCO
Economist | economics science researcher, macro analyst, economics analyst, interest analyst, ... | ESCO
Data Scientist | data engineer, research data scientist, data expert, data research scientist | ESCO
20. Why this gap?
• We forget or ignore that synonymy is a vague
and context-dependent relation.
• We mix synonymy with hyponymy and
semantic relatedness and similarity
• We are unaware of subtle but important
differences in meaning for our particular
domain or context
• We don’t document biases, assumptions and
choices
21. How to narrow the gap
• Insist on meaning equivalence over mere
relatedness
• Get multiple opinions (from people and data)
• If you can’t be sure that your synonyms are
indeed synonyms, then don’t call them
synonyms.
• Always document the criteria, assumptions
and biases of your synonymy.
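One possible way to put these points into practice is to store each candidate pair together with its documented criteria and the opinions collected for it, and to use the label "synonym" only when the evidence supports it. The class, its fields, the threshold and the votes below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TermPair:
    term: str
    other: str
    criteria: str  # documented rule used to judge meaning equivalence
    assumptions: List[str] = field(default_factory=list)
    votes: List[bool] = field(default_factory=list)  # True = "means the same"

    def label(self, min_votes: int = 2) -> str:
        """Return "synonym" only if enough judges voted and all of them agree."""
        if len(self.votes) >= min_votes and all(self.votes):
            return "synonym"
        return "related"


criteria = "Interchangeable in job ads without changing the role"
# Made-up votes, for illustration only.
ceo = TermPair("Chief Executive Officer", "CEO", criteria,
               votes=[True, True, True])
coo = TermPair("Chief Executive Officer", "chief operating officer", criteria,
               votes=[False, False, True])
print(ceo.label(), coo.label())
```

Defaulting to the weaker label "related" means a pair is never called a synonym by accident.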
24. For that we need semantically related terms!
• The meaning of an ambiguous term in a
text is most likely the one that is related to
the meanings of the other terms in the
same text.
• Therefore, knowing which terms are
semantically related helps in performing
disambiguation.
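The idea can be sketched as picking the candidate meaning whose related terms overlap most with the other terms in the text. This is a deliberately simple overlap count under assumed data; the relatedness table and the function name are hypothetical.

```python
# Hypothetical relatedness data: candidate meaning -> terms related to it.
RELATED_TERMS = {
    "Java (programming language)": {"JVM", "Spring", "Maven"},
    "Java (island)": {"Indonesia", "Surabaya"},
}


def disambiguate(candidates, context_terms):
    """Pick the candidate whose related terms overlap most with the context.

    Returns None when no candidate shares a related term with the context.
    """
    context = set(context_terms)
    best, best_overlap = None, 0
    for candidate in candidates:
        overlap = len(RELATED_TERMS.get(candidate, set()) & context)
        if overlap > best_overlap:
            best, best_overlap = candidate, overlap
    return best


# Other terms found in the same text as the ambiguous term "Java".
context = ["Maven", "JVM", "Amsterdam"]
meaning = disambiguate(list(RELATED_TERMS), context)
print(meaning)
```

Returning None when nothing overlaps keeps the sketch from guessing when the text offers no evidence.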
25. Can we use any related terms for disambiguation? Not really!
• We need related terms that are not very
ambiguous themselves
• We need related terms that are highly specific
to our target term.
• We need related terms that are prevalent in
the data we process.
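The three criteria above can be read as a filter over candidate related terms. The sketch below assumes per-term statistics are already available; the field names, thresholds and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RelatedTerm:
    term: str
    num_meanings: int      # how many meanings the term itself has
    num_targets: int       # how many target terms it is related to
    corpus_frequency: int  # how often it occurs in the data we process


def useful_for_disambiguation(rt, max_meanings=1, max_targets=3,
                              min_frequency=10):
    """Keep terms that are unambiguous, specific and prevalent (assumed thresholds)."""
    return (rt.num_meanings <= max_meanings
            and rt.num_targets <= max_targets
            and rt.corpus_frequency >= min_frequency)


# Made-up statistics for terms related to "Java (programming language)".
candidates = [
    RelatedTerm("JVM", 1, 2, 120),
    RelatedTerm("Spring", 4, 2, 300),      # ambiguous itself
    RelatedTerm("software", 1, 500, 900),  # not specific to the target
    RelatedTerm("javac", 1, 1, 2),         # too rare in the data
]
kept = [rt.term for rt in candidates if useful_for_disambiguation(rt)]
print(kept)
```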
26. A soccer experiment
Back in 2015, my old team had to detect and
disambiguate mentions of soccer players and teams in
short textual extracts from video scenes from football
matches:
“It's the 70th minute of the game and after a magnificent
pass by Casemiro, Ronaldo managed to beat Claudio Bravo.
Real now leads 1-0.”
For that we used an in-house system, called Knowledge
Tagger, and DBpedia as domain knowledge about soccer
teams and players.
27. A soccer experiment
Initially, we ran the system using all the entities
related to each player in DBpedia as disambiguation
evidence.
Precision was 60% and recall 55%.
Then we pruned DBpedia and kept only three
relations:
• Players and their current teams
• Players and their current co-players
• Players and their current managers
Precision increased to 82% and recall to 80%
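The pruning step can be pictured as a whitelist filter over (subject, relation, object) triples. This is only a sketch of the idea, not the Knowledge Tagger implementation: the relation names are hypothetical labels rather than actual DBpedia properties, and the triples are illustrative.

```python
# Hypothetical names for the three relations that were kept.
KEPT_RELATIONS = {"currentTeam", "currentCoPlayer", "currentManager"}


def prune(triples):
    """Keep only triples whose relation is on the whitelist."""
    return [(s, p, o) for (s, p, o) in triples if p in KEPT_RELATIONS]


# Illustrative triples about one player, as of the time of the experiment.
triples = [
    ("Casemiro", "currentTeam", "Real Madrid"),
    ("Casemiro", "birthPlace", "São José dos Campos"),
    ("Casemiro", "currentCoPlayer", "Ronaldo"),
    ("Casemiro", "formerTeam", "São Paulo FC"),
]
evidence = prune(triples)
print(evidence)
```

Birthplaces and former teams are related to the player, but they rarely appear in match commentary, so they add noise rather than evidence for this task.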
28. Why this gap?
• We usually don’t want just any relatedness but
a relatedness that actually helps our goal.
• Our task’s required relatedness seems to be
compatible with the one provided by the data,
yet there are subtle differences that make the
latter non-useful or even harmful.
• Semantic relatedness is a vague relation for
which it’s relatively easy to get agreement
outside of any context, but hard within one.
29. How to narrow the gap
• Uncover the hidden assumptions and expectations
behind the “should be related” requirement.
• Give people examples of terms that you think
may be related
• Ask them to judge them as related or not in
context.
• Challenge them to justify their decisions.
• Identify patterns and rules that characterize
these decisions.
• Use this information to derive the “relatedness”
you need.
31. Takeaways
• The Semantic Gap in Data Science is real
➔ We often model data meaning badly
➔ We often understand the data meaning wrongly
➔ We often produce the wrong results
• Closing it is hard
➔ Ambiguity
➔ Vagueness
➔ Variety and diversity
➔ Context-dependence
• We can avoid and/or narrow it, though, by paying more attention
➔ Understand basic semantic phenomena
➔ Understand how data can be misunderstood
➔ Be aware of and document assumptions, choices and biases
32. Thank you!
Panos Alexopoulos
Head of Ontology @ Textkernel
Writing a book on semantic data modeling @ O’Reilly
E-mail: alexopoulos@textkernel.nl
Web: http://www.panosalexopoulos.com
LinkedIn: www.linkedin.com/in/panosalexopoulos
Twitter: @PAlexop