Mind the Semantic Gap
How "talking semantics" can help you perform better data science
Panos Alexopoulos
Head of Ontology
We are all here for the same purpose
Some of us work on the data supply side
• We collect and generate data
• We represent, integrate, store and
make them accessible through data
models (and relevant technology)
• We get them ready for usage and
exploitation
Some others work on the data exploitation side
• We use data to build predictive,
descriptive or other types of analytics
solutions
• We use data to build and power AI
applications
And many of us do both
But there is a gap between the two sides that very
often we don’t see
And that’s the semantic gap
• The situation when the data models of
the supply side are misunderstood
and misused by the exploitation side.
• The situation when the data
requirements of the exploitation side
are misunderstood by the supply side.
• Typically, the more distant supply is
from usage, the greater the
semantic gap.
Data meaning is communicated through (semantic)
data models
• Conceptual descriptions and representations of data that convey the
latter’s meaning in an explicit and commonly understood and accepted
way among humans and systems.
The semantic gap is caused by bad semantic models
• We model data meaning in a
wrong way.
• We model data meaning in a
non-explicit way
• We model data meaning in a
not commonly accepted way
Let’s talk about
names
Which data model is correct?
Well, none!
What do we do wrong?
• We often give inaccurate and misleading
or ambiguous names to data modeling
elements:
• If I name a table “Car” then its rows
should represent concrete cars (e.g.,
the car with registration number XYZ)
• But if my rows represent car models
(e.g., BMW 3.16 or AUDI A4), then the
table should be named “CarModel”, not
“Car”.
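The naming distinction above can be sketched in code. This is only an illustration, assuming Python dataclasses as the modeling tool; the class and field names (`CarModel`, `Car`, `registration_number`) are hypothetical and not taken from any real schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarModel:
    """A kind of car that a manufacturer sells, e.g. the Audi A4."""
    manufacturer: str
    model_name: str


@dataclass(frozen=True)
class Car:
    """One concrete vehicle, identified by its registration number."""
    registration_number: str
    model: CarModel


audi_a4 = CarModel(manufacturer="Audi", model_name="A4")

# Two distinct physical cars that share the same model.
first_car = Car(registration_number="XYZ", model=audi_a4)
second_car = Car(registration_number="ABC", model=audi_a4)

# Counting rows answers a different question depending on the entity.
cars = [first_car, second_car]
number_of_cars = len(cars)
number_of_models = len({car.model for car in cars})
```

With the two entities kept apart, "how many cars do we have?" and "how many car models do we have?" can no longer be confused by a reader of the schema.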
Why do we do it?
• Not realizing there are any other interpretations of
the name we use
• Assuming other interpretations are irrelevant
and that people will know what we mean
• Assuming that the correct meaning will be
inferred by the context.
How to narrow the gap
• Always contemplate an element’s name in
relative isolation and try to think of all the possible
and legitimate ways it can be interpreted by a
human.
• If an element’s name has more than one
interpretation, make it unambiguous, even if
the other interpretations are not within the
domain or not very likely to occur
• Observe how the element is used in practice by
your modelers, annotators, developers and users.
Let’s talk about
synonymy
At Textkernel we do Labour Market Analytics
• Supply-Demand Analysis
• Top Skills per Job
• Career Paths
For that we need synonyms!
• Two terms are synonymous when they mean the same thing in (almost)
all contexts.
• We need synonyms to get statistics on the actual professions and skills,
no matter the form or language they are expressed in text
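As a rough sketch of why synonyms matter for such statistics, the snippet below maps different surface forms of a job title to one canonical profession before counting. The synonym table, the `UNMAPPED` bucket and the function name are assumptions made for illustration; they do not describe Textkernel's actual pipeline.

```python
from collections import Counter

# Hypothetical synonym table: normalized surface form -> canonical profession.
CANONICAL_PROFESSION = {
    "ceo": "Chief Executive Officer",
    "chief executive officer": "Chief Executive Officer",
    "data scientist": "Data Scientist",
}


def count_professions(job_titles):
    """Count job titles per canonical profession; unknown titles are kept apart."""
    counts = Counter()
    for title in job_titles:
        # Lowercase and collapse whitespace so trivial variants match.
        key = " ".join(title.lower().split())
        counts[CANONICAL_PROFESSION.get(key, "UNMAPPED")] += 1
    return counts


titles = ["CEO", "Chief Executive Officer", "Data  Scientist", "Plumber"]
profession_counts = count_professions(titles)
```

Any wrong entry in the synonym table silently merges two different professions into one count, which is why the quality of the synonymy model matters.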
Can we use any data model for synonymy? Not really!
Term | Synonyms | Model
Profession | Occupation, Vocation, Work, Living | KBPedia
Chief Executive Officer | CEO, chief operating officer | Wordnet
Chief Executive Officer | Senior executive officer, chairman, CEO, managing director, president | ESCO
Economist | economics science researcher, macro analyst, economics analyst, interest analyst, ... | ESCO
Data Scientist | data engineer, research data scientist, data expert, data research scientist | ESCO
Why this gap?
• We forget or ignore that synonymy is a vague
and context-dependent relation.
• We mix synonymy with hyponymy and
semantic relatedness and similarity
• We are unaware of subtle but important
differences in meaning for our particular
domain or context
• We don’t document biases, assumptions and
choices
How to narrow the gap
• Insist on meaning equivalence over mere
relatedness
• Get multiple opinions (from people and data)
• If you can’t be sure that your synonyms are
indeed synonyms, then don’t call them
that
• Always document the criteria, assumptions
and biases of your synonymy.
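One possible way to follow these guidelines is to store every term pair with an explicit relation type and a note on why it was chosen. The sketch below assumes a small in-memory structure; the type names, the example pairs and the `criteria` texts are all illustrative and not drawn from ESCO, WordNet or any other model.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    SYNONYM = "synonym"  # same meaning in (almost) all contexts
    HYPONYM = "hyponym"  # source is a narrower kind of target
    RELATED = "related"  # associated, but not equivalent


@dataclass(frozen=True)
class TermRelation:
    source: str
    target: str
    relation: RelationType
    criteria: str  # documented reason for choosing this relation type


# Illustrative entries only; the criteria texts are made up for the example.
relations = [
    TermRelation("CEO", "Chief Executive Officer", RelationType.SYNONYM,
                 "Abbreviation of the same job title."),
    TermRelation("chief operating officer", "Chief Executive Officer",
                 RelationType.RELATED,
                 "Different role within the same executive team."),
    TermRelation("data engineer", "Data Scientist", RelationType.RELATED,
                 "Overlapping skills but a distinct profession."),
]


def synonyms_of(term, relations):
    """Return only the terms explicitly marked as synonyms of `term`."""
    return sorted(r.source for r in relations
                  if r.target == term and r.relation is RelationType.SYNONYM)
```

Consumers that need strict equivalence can call `synonyms_of`, while looser use cases can still read the related terms, and both can inspect the recorded criteria.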
Let’s talk about
semantic relatedness
Another critical capability for good analytics is entity
disambiguation
For that we need semantically related terms!
• The meaning of an ambiguous term in a
text is most likely the one that is related to
the meanings of the other terms in the
same text.
• Therefore, knowing which terms are
semantically related helps in performing
disambiguation.
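The idea can be sketched as a simple overlap count between the terms found in a text and the terms related to each candidate meaning. This is a minimal illustration, not a description of any particular system; the candidate names and their evidence sets are assumptions chosen for the example.

```python
# Hypothetical candidate meanings for the ambiguous mention "Ronaldo",
# each with a set of related terms used as disambiguation evidence.
CANDIDATES = {
    "Cristiano Ronaldo": {"Real Madrid", "Casemiro"},
    "Ronaldo Nazário": {"Brazil national team", "Corinthians"},
}


def disambiguate(context_terms, candidates):
    """Pick the candidate whose related terms overlap most with the context.

    Returns None when there are no candidates, when no candidate shares
    a term with the context, or when the best score is tied.
    """
    scores = {name: len(related & context_terms)
              for name, related in candidates.items()}
    if not scores:
        return None
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


# Other terms detected in the same text as the ambiguous mention.
context = {"Casemiro", "Claudio Bravo"}
result = disambiguate(context, CANDIDATES)
```

Returning `None` on ties or zero overlap keeps the sketch from guessing when the evidence does not separate the candidates.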
Can we use any related terms for disambiguation? Not really!
• We need related terms that are not very
ambiguous themselves
• We need related terms that are highly specific
to our target term.
• We need related terms that are prevalent in
the data we process.
A soccer experiment
Back in 2015, my old team had to detect and
disambiguate mentions of soccer players and teams in
short textual extracts from video scenes from football
matches:
“It's the 70th minute of the game and after a magnificent
pass by Casemiro, Ronaldo managed to beat Claudio Bravo.
Real now leads 1-0."
For that we used an in-house system, called Knowledge
Tagger, and DBpedia as domain knowledge about soccer
teams and players.
A soccer experiment
Initially, we ran the system with all the DBpedia
related entities for each player as disambiguation
evidence.
Precision was 60% and recall 55%
Then we pruned DBpedia and kept only three
relations:
• Players and their current teams
• Players and their current co-players
• Players and their current managers
Precision increased to 82% and recall to 80%
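The pruning step can be pictured as filtering a set of triples by an allow-list of relations. The sketch below is an assumption about how such a filter might look, not the Knowledge Tagger implementation; the relation names and entities are placeholders, not actual DBpedia properties or resources.

```python
# Hypothetical (subject, relation, object) triples with placeholder names.
triples = [
    ("Player A", "currentTeam", "Team X"),
    ("Player A", "birthPlace", "City Y"),
    ("Player A", "currentCoPlayer", "Player B"),
    ("Player A", "formerTeam", "Team Z"),
    ("Player A", "currentManager", "Manager M"),
]

# Only the three relations kept after pruning.
KEPT_RELATIONS = {"currentTeam", "currentCoPlayer", "currentManager"}


def prune(triples, kept_relations):
    """Keep only the triples whose relation is in the allow-list."""
    return [t for t in triples if t[1] in kept_relations]


def evidence_for(entity, triples):
    """Collect the related entities that serve as disambiguation evidence."""
    return {obj for subj, _, obj in triples if subj == entity}


pruned = prune(triples, KEPT_RELATIONS)
evidence = evidence_for("Player A", pruned)
```

After pruning, the evidence set for a player holds only entities that are likely to be mentioned alongside that player in a match description.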
Why this gap?
• We usually don’t want just any relatedness but
a relatedness that actually helps our goal.
• Our task’s required relatedness seems to be
compatible with the one provided by the data,
yet there are subtle differences that make the
latter non-useful or even harmful.
• Semantic relatedness is a vague relation for
which it’s relatively easy to get agreement
outside of any context, but hard within one.
How to narrow the gap
• Uncover the hidden assumptions and expectations
behind the “should be related” requirement.
• Give people examples of terms that you think
can be related
• Ask them to judge them as related or not in
context.
• Challenge them to justify their decisions.
• Identify patterns and rules that characterize
these decisions.
• Use this information to derive the “relatedness”
you need.
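One way to record the outcome of such a judging exercise is to keep every vote per term pair and context, and then look at the majority label and how strongly people agree. The data and function below are a hypothetical sketch; the term pairs, the context label and the votes are invented for illustration.

```python
from collections import Counter

# Hypothetical judgments: (term_a, term_b, context) -> votes from annotators,
# where True means "related in this context".
judgments = {
    ("Player A", "Team X", "match report"): [True, True, True],
    ("Player A", "City Y", "match report"): [False, True, False],
}


def summarize(votes):
    """Return the majority label and the share of annotators who chose it."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label, n / len(votes)


summary = {pair: summarize(votes) for pair, votes in judgments.items()}
```

Pairs with low agreement are the ones worth discussing with the judges, since their justifications tend to reveal the hidden assumptions behind the requirement.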
Let’s summarize
Takeaways
The Semantic Gap in Data Science is real
➔ We often model data meaning badly
➔ We often understand the data meaning wrongly
➔ We often produce the wrong results
Closing it is hard
➔ Ambiguity
➔ Vagueness
➔ Variety and diversity
➔ Context-dependence
We can avoid and/or narrow it, though, by paying more attention
➔ Understand basic semantic phenomena
➔ Understand how data can be misunderstood
➔ Be aware of and document assumptions, choices and biases
Thank you!
Panos Alexopoulos
Head of Ontology @ Textkernel
Writing a book on semantic data modeling @ O’Reilly
E-mail: alexopoulos@textkernel.nl
Web: http://www.panosalexopoulos.com
LinkedIn: www.linkedin.com/in/panosalexopoulos
Twitter: @PAlexop

Mind the Semantic Gap in Data Science

  • 1. Mind the Semantic Gap: How "talking semantics" can help you perform better data science. Panos Alexopoulos, Head of Ontology
  • 2. We are all here for the same purpose
  • 3. Some of us work on the data supply side • We collect and generate data • We represent, integrate, store and make them accessible through data models (and relevant technology) • We get them ready for usage and exploitation
  • 4. Some others work on the data exploitation side • We use data to build predictive, descriptive or other types of analytics solutions • We use data to build and power AI applications
  • 5. And many of us do both
  • 6. But there is a gap between the two sides that very often we don’t see
  • 7. And that’s the semantic gap • The situation when the data models of the supply side are misunderstood and misused by the exploitation side. • The situation when the data requirements of the exploitation side are misunderstood by the supply side. • Typically, the more distant the supply is from the usage, the greater the semantic gap.
  • 8. Data meaning is communicated through (semantic) data models • Conceptual descriptions and representations of data that convey the latter’s meaning in an explicit and commonly understood and accepted way among humans and systems.
  • 9. The semantic gap is caused by bad semantic models • We model data meaning in a wrong way. • We model data meaning in a non-explicit way • We model data meaning in a not commonly accepted way
  • 11. Which data model is correct?
  • 13. What do we do wrong? • We often give inaccurate and misleading or ambiguous names to data modeling elements: • If I name a table “Car” then its rows should represent concrete cars (e.g., the car with registration number XYZ) • But if my rows represent car models (e.g., BMW 3.16 or AUDI A4), then the table should be named “CarModel”, not “Car”.
  • 14. Why do we do it? • Not realizing there are other interpretations of the name we use • Assuming other interpretations are irrelevant and that people will know what we mean • Assuming that the correct meaning will be inferred from the context.
  • 15. How to narrow the gap • Always contemplate an element’s name in relative isolation and try to think of all the possible and legitimate ways it can be interpreted by a human. • If an element’s name has more than one interpretation, make it unambiguous, even if the other interpretations are not within the domain or not very likely to occur. • Observe how the element is used in practice by your modelers, annotators, developers and users.
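
The "Car" versus "CarModel" naming point from slide 13 can be sketched in code. This is a minimal illustration, not something taken from the slides: the class and field names, and the second registration number "ABC", are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarModel:
    """A kind of car (e.g. AUDI A4); one instance per model, not per vehicle."""
    manufacturer: str
    name: str


@dataclass(frozen=True)
class Car:
    """A concrete vehicle, identified by its registration number."""
    registration_number: str
    model: CarModel


# One model, two concrete cars of that model (registration numbers are made up).
a4 = CarModel(manufacturer="AUDI", name="A4")
fleet = [Car("XYZ", a4), Car("ABC", a4)]

# Counting cars and counting car models are different questions.
models_in_fleet = {car.model for car in fleet}
print(len(fleet), len(models_in_fleet))
```

Giving each concept its own unambiguous name keeps "how many cars?" and "how many car models?" from silently returning each other's answer.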
  • 17. At Textkernel we do Labour Market Analytics • Supply-Demand Analysis • Top Skills per Job • Career Paths
  • 18. For that we need synonyms! • Two terms are synonymous when they mean the same thing in (almost) all contexts. • We need synonyms to get statistics on the actual professions and skills, no matter the form or language they are expressed in within the text.
  • 19. Can we use any data model for synonymy? Not really!
      Term | Synonyms | Model
      Profession | Occupation, Vocation, Work, Living | KBpedia
      Chief Executive Officer | CEO, chief operating officer | WordNet
      Chief Executive Officer | Senior executive officer, chairman, CEO, managing director, president | ESCO
      Economist | economics science researcher, macro analyst, economics analyst, interest analyst, ... | ESCO
      Data Scientist | data engineer, research data scientist, data expert, data research scientist | ESCO
  • 20. Why this gap? • We forget or ignore that synonymy is a vague and context-dependent relation. • We mix up synonymy with hyponymy, semantic relatedness and similarity. • We are unaware of subtle but important differences in meaning for our particular domain or context. • We don’t document biases, assumptions and choices.
  • 21. How to narrow the gap • Insist on meaning equivalence over mere relatedness. • Get multiple opinions (from people and data). • If you can’t be sure that your synonyms are indeed synonyms, then don’t call them that. • Always document the criteria, assumptions and biases of your synonymy.
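
One way to act on the advice above is to label each term link with the relation it actually expresses and to record why. The sketch below is hypothetical: the `TermLink` structure, the relation labels ("exact", "narrower", "related") and the specific judgments about each pair are assumptions for illustration, not part of the talk or of any of the models named on slide 19.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermLink:
    source: str
    target: str
    relation: str   # "exact", "narrower" or "related" (illustrative labels)
    rationale: str  # documented criterion or assumption behind the label


# Illustrative judgments only; a real model would need domain review.
LINKS = [
    TermLink("CEO", "Chief Executive Officer", "exact",
             "abbreviation of the same title"),
    TermLink("chief operating officer", "Chief Executive Officer", "related",
             "a different executive role"),
    TermLink("research data scientist", "Data Scientist", "narrower",
             "a specialisation of the profession"),
    TermLink("data engineer", "Data Scientist", "related",
             "an adjacent but distinct profession"),
]


def canonical(term, links=LINKS):
    """Map a term to its canonical form using exact synonyms only."""
    for link in links:
        if link.relation == "exact" and link.source.lower() == term.lower():
            return link.target
    return term  # no exact synonym known: leave the term untouched


print(canonical("ceo"))
print(canonical("data engineer"))
```

Because only "exact" links are followed, statistics on "Data Scientist" are not inflated by merely related terms, and the `rationale` field keeps the assumptions visible to whoever uses the data later.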
  • 23. Another critical capability for good analytics is entity disambiguation
  • 24. For that we need semantically related terms! • The meaning of an ambiguous term in a text is most likely the one that is related to the meanings of the other terms in the same text. • Therefore, knowing which terms are semantically related helps in performing disambiguation.
  • 25. Can we use any related terms for disambiguation? Not really! • We need related terms that are not very ambiguous themselves • We need related terms that are highly specific to our target term. • We need related terms that are prevalent in the data we process.
  • 26. A soccer experiment Back in 2015, my old team had to detect and disambiguate mentions of soccer players and teams in short textual extracts from video scenes of football matches: “It's the 70th minute of the game and after a magnificent pass by Casemiro, Ronaldo managed to beat Claudio Bravo. Real now leads 1-0.” For that we used an in-house system, called Knowledge Tagger, and DBpedia as domain knowledge about soccer teams and players.
  • 27. A soccer experiment Initially, we ran the system with all the DBpedia related entities for each player as disambiguation evidence. Precision was 60% and recall 55%. Then we pruned DBpedia and kept only three relations: • Players and their current teams • Players and their current co-players • Players and their current managers Precision increased to 82% and recall to 80%.
  • 28. Why this gap? • We usually don’t want just any relatedness but a relatedness that actually helps our goal. • Our task’s required relatedness seems to be compatible with the one provided by the data, yet there are subtle differences that make the latter non-useful or even harmful. • Semantic relatedness is a vague relation for which it’s relatively easy to get agreement outside of any context, but hard within one.
  • 29. How to narrow the gap • Uncover the hidden assumptions and expectations behind the “should be related” requirement. • Give people examples of terms that you think may be related. • Ask them to judge them as related or not in context. • Challenge them to justify their decisions. • Identify patterns and rules that characterize these decisions. • Use this information to derive the “relatedness” you need.
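
The pruning idea from the soccer experiment (slides 26 and 27) can be sketched as a simple overlap score between the context terms and each candidate's related terms. This is a toy illustration, not the Knowledge Tagger system: the `KNOWLEDGE` entries are hand-written for the example (not actual DBpedia data), and the relation names and scoring function are assumptions.

```python
# Toy knowledge, hand-written for illustration (not actual DBpedia data).
# Each candidate entity maps to (relation, related term) pairs.
KNOWLEDGE = {
    "Cristiano Ronaldo": [
        ("current_team", "Real"),
        ("current_co_player", "Casemiro"),
        ("birth_place", "Funchal"),
    ],
    "Ronaldo (Brazilian footballer)": [
        ("former_team", "Real"),
        ("former_team", "Barcelona"),
        ("birth_place", "Rio de Janeiro"),
    ],
}

# The three kinds of relations kept after pruning (names are hypothetical).
ALLOWED = {"current_team", "current_co_player", "current_manager"}


def rank(context_terms, allowed=None):
    """Score each candidate by how many of its related terms occur in context.

    If `allowed` is given, only relations in that set count as evidence.
    """
    scores = {}
    for candidate, links in KNOWLEDGE.items():
        scores[candidate] = sum(
            1
            for relation, term in links
            if (allowed is None or relation in allowed) and term in context_terms
        )
    return scores


# Other terms mentioned alongside the ambiguous "Ronaldo" in the example text.
context = {"Casemiro", "Claudio Bravo", "Real"}

print(rank(context))           # all relations used as evidence
print(rank(context, ALLOWED))  # only the pruned, task-relevant relations
```

With every relation counted, the former team of the wrong candidate still contributes evidence; restricting the evidence to current teams, co-players and managers removes that noise, which is the kind of task-specific relatedness the slides argue for.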
  • 31. Takeaways • The Semantic Gap in Data Science is real ➔ We often model data meaning badly ➔ We often understand the data meaning wrongly ➔ We often produce the wrong results • Closing it is hard ➔ Ambiguity ➔ Vagueness ➔ Variety and diversity ➔ Context-dependence • We can avoid and/or narrow it, though, by paying more attention ➔ Understand basic semantic phenomena ➔ Understand how data can be misunderstood ➔ Be aware of and document assumptions, choices and biases
  • 32. Thank you! Panos Alexopoulos Head of Ontology @ Textkernel Writing a book on semantic data modeling @ O’Reilly E-mail: alexopoulos@textkernel.nl Web: http://www.panosalexopoulos.com LinkedIn: www.linkedin.com/in/panosalexopoulos Twitter: @PAlexop