This presentation describes the participation of the UNIBA team in the Named Entity rEcognition and Linking (NEEL) Challenge. We propose a knowledge-based algorithm able to recognize and link named entities in English tweets. The approach combines the simple Lesk algorithm with information coming from both a distributional semantic model and the usage frequency of Wikipedia concepts. The algorithm performs poorly in entity recognition, while it achieves good results in the disambiguation step.
1. UNIBA: Exploiting a
Distributional Semantic Model for
Disambiguating and
Linking Entities in Tweets
Pierpaolo Basile, Annalina Caputo, Giovanni
Semeraro, Fedelucio Narducci
{fedelucio.narducci, pierpaolo.basile}@uniba.it
#Microposts2015, NEEL Challenge, Florence 18th May 2015
2. The Challenge
Just watched Frozen for the first time ever and knew the
words to all the songs... How?! #productplacement
Problem: Find and link entities in tweets
Entity type of "Frozen": Product
3. Our Approach
• Entity Recognition
• using PoS-tag
• relying on n-grams
• Disambiguation
• knowledge-based method that combines a
Distributional Semantic Model (DSM) with the prior
probability assigned to each DBpedia concept
• Type
• manual mapping of all types defined in the dbpedia-owl
ontology to the respective types in the task
9. Disambiguation:
Building the context
Just watched Frozen for the first time ever and knew the
words to all the songs... How?! #productplacement
<just, watched, first, time, knew, words, all, songs,
how, product, placement>
Context
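The context-building step above can be sketched as a small Java routine. This is a simplified stand-in: the actual system uses TweetNLP for tokenization, while here a plain whitespace split does the job; the stopword list and the hashtag-segmentation dictionary are toy assumptions made only for this example. As in the slide, the entity mention itself ("frozen") is excluded from its own context.

```java
import java.util.*;

// Simplified sketch of context building: tokenize, lowercase, drop
// stopwords and punctuation, split hashtags into words, and exclude the
// entity mention being disambiguated. The real system uses TweetNLP;
// STOPWORDS and DICTIONARY below are toy resources for illustration.
public class ContextBuilder {

    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "for", "the", "and", "to", "ever", "a", "an", "of"));

    // Toy dictionary used only by this example's hashtag segmenter.
    static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList(
            "product", "placement"));

    public static List<String> buildContext(String tweet, String mention) {
        List<String> context = new ArrayList<>();
        for (String raw : tweet.toLowerCase().split("\\s+")) {
            if (raw.startsWith("#")) {               // split hashtags into words
                context.addAll(segmentHashtag(raw.substring(1)));
                continue;
            }
            String token = raw.replaceAll("[^a-z]", ""); // strip punctuation
            if (!token.isEmpty() && !STOPWORDS.contains(token)
                    && !token.equals(mention)) {
                context.add(token);
            }
        }
        return context;
    }

    // Greedy left-to-right dictionary segmentation of the hashtag body,
    // e.g. "productplacement" -> ["product", "placement"].
    static List<String> segmentHashtag(String body) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        while (start < body.length()) {
            int end = body.length();
            while (end > start && !DICTIONARY.contains(body.substring(start, end)))
                end--;
            if (end == start) {          // no dictionary word: keep the remainder
                parts.add(body.substring(start));
                break;
            }
            parts.add(body.substring(start, end));
            start = end;
        }
        return parts;
    }
}
```

Running it on the example tweet reproduces the context shown on the slide.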
10. Disambiguation:
Semantic Ranking 1/3
• Words as points in a
mathematical space
• Close words are similar
• Word space is built analyzing
word co-occurrences in a
large corpus
• Vector composition using
superposition (+)
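The two word-space operations the slide names can be sketched directly: a text is represented by the superposition (element-wise sum) of its word vectors, and closeness between points in the space is measured by cosine similarity. The vectors here are tiny made-up examples; the real model has hundreds of dimensions learned from corpus co-occurrences.

```java
// Sketch of the word-space operations: superposition (+) composes word
// vectors into a single text vector; cosine similarity measures how close
// two points in the space are (1.0 = same direction, 0.0 = orthogonal).
public class WordSpace {

    // Superposition: element-wise sum of the given word vectors.
    public static double[] superpose(double[][] wordVectors) {
        double[] sum = new double[wordVectors[0].length];
        for (double[] v : wordVectors)
            for (int i = 0; i < v.length; i++)
                sum[i] += v[i];
        return sum;
    }

    // Cosine similarity between two vectors.
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```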
11. Disambiguation:
Semantic Ranking 2/3
word2vec: https://code.google.com/p/word2vec/
Distributional Semantic Model built on Wikipedia
Context
• Cosine similarity
between the gloss and
the context
• Linear combination
with a function which
takes into account the
usage of concepts in
Wikipedia
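The ranking step described above can be sketched as follows: each candidate concept gets a score that linearly combines the cosine similarity between its gloss vector and the context vector with a Wikipedia usage term (here a precomputed prior probability). The weight `alpha` and the method names are assumptions for illustration, not the system's actual parameters.

```java
// Sketch of semantic ranking: pick the candidate concept maximizing
//   score(c) = alpha * cos(gloss(c), context) + (1 - alpha) * prior(c)
// where cosineSims[i] and priors[i] are the two components for the i-th
// candidate (both assumed precomputed). Returns the index of the winner.
public class SemanticRanker {

    public static int bestCandidate(double[] cosineSims, double[] priors,
                                    double alpha) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < cosineSims.length; i++) {
            double score = alpha * cosineSims[i] + (1 - alpha) * priors[i];
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }
}
```

With a high `alpha` the gloss/context similarity dominates; with a low `alpha` the Wikipedia usage prior does.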
13. Disambiguation:
Semantic Ranking 3/3
Statistics about the usage of concepts in Wikipedia:

p(c_ij | e_i) = (t(e_i, c_ij) + 1) / (#e_i + |C_i|)

• t(e_i, c_ij): number of times e_i is linked as c_ij
• #e_i: total number of times e_i occurs as a link in Wikipedia
• |C_i|: number of concepts assigned to e_i
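The smoothed prior above transcribes directly into code. In this sketch, `linkCounts` maps each candidate concept of a surface form e_i to the number of times e_i is linked to it in Wikipedia (assuming the map covers all of e_i's links, so its values sum to #e_i and its size is |C_i|).

```java
import java.util.*;

// p(c_ij | e_i) = (t(e_i, c_ij) + 1) / (#e_i + |C_i|)
// The add-one term is Laplace smoothing, so rarely linked candidates
// still receive a non-zero prior probability.
public class ConceptPrior {

    public static double prior(Map<String, Integer> linkCounts, String concept) {
        int total = 0;                           // #e_i: total links for e_i
        for (int n : linkCounts.values())
            total += n;
        int t = linkCounts.getOrDefault(concept, 0); // t(e_i, c_ij)
        return (t + 1.0) / (total + linkCounts.size()); // |C_i| = map size
    }
}
```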
14. Evaluation
• Development set
• 500 manually annotated tweets
• Metrics
• SLM: Strong Link Match
• STMM: Strong Typed Mention Match
• MC: Mention Ceaf
• System setup
• TweetNLP for tokenization and PoS-tagging
• word2vec for DSM building: 400 vector dimensions,
considering only terms that occur at least 25 times
• Developed in Java
15. Results
• Low performance in entity recognition
• Good results in disambiguation: F = 0.825 when
considering only correctly recognized, non-NIL
instances
Entity Recognition F-SLM F-STMM F-MC
PoS-tag 0.362 0.267 0.389
N-grams 0.258 0.191 0.306