Talk "Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning", given at #MSM'13 (WWW'13), Rio de Janeiro, Brazil
Microposts shared on social platforms instantaneously report facts, opinions or emotions. In these posts, entities are often used but they are continuously changing depending on what is currently trending. In such a scenario, recognising these named entities is a challenging task, for which off-the-shelf approaches are not well equipped. We propose NERD-ML, an approach that unifies the benefits of a crowd entity recognizer through Web entity extractors combined with the linguistic strengths of a machine learning classifier.
Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning
1. Learning with the Web: Spotting Named Entities on the intersection of NERD and Machine Learning
Marieke van Erp, Giuseppe Rizzo, Raphaël Troncy
@giusepperizzo
2. May 13, 2013 2/13 Making Sense of Microposts (#MSM2013)
NERD-ML @ MSM'13
3. Preprocessing
➢ Dataset is converted to CoNLL IOB format
➢ Applied 10-fold cross-validation
➢ Chunked the set of tweets into 50KB parts in order to comply with NERD file size limitations
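The three preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' scripts: the function names, the token-span input layout, the use of `B-` on every entity start (the IOB2 variant) and the round-robin fold assignment are all assumptions.

```python
import random

def to_iob(tokens, spans):
    """Tag tokens with IOB labels.

    spans: list of (start, end, label) token-index spans, end exclusive
    (a hypothetical input layout chosen for this sketch).
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return list(zip(tokens, tags))

def k_folds(items, k=10, seed=0):
    """Shuffle items and deal them round-robin into k folds."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def chunk_by_size(texts, max_bytes=50 * 1024):
    """Group texts into chunks whose UTF-8 size stays under max_bytes.

    A single text larger than max_bytes still ends up alone in its own chunk.
    """
    chunks, current, size = [], [], 0
    for text in texts:
        n = len(text.encode("utf-8")) + 1  # +1 for a newline separator
        if current and size + n > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(text)
        size += n
    if current:
        chunks.append(current)
    return chunks
```

Each fold in turn would serve as the held-out set while the remaining nine are used for training.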
4. NERD extractors
➢ Retrieves named entities from 10 extractors (Web APIs)
➢ Harmonizes the classification according to the NERD Ontology v0.5
http://nerd.eurecom.fr/ontology
➢ 75 entity classes mapped to 4 MSM'13 classes
http://nerd.eurecom.fr
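The mapping from harmonized classes to the four challenge classes can be pictured as a lookup table. The sketch below is illustrative only: the three entries shown are a hypothetical fragment, the real table covers 75 classes, and falling back to MISC for unlisted classes is an assumption rather than the documented behaviour.

```python
# Hypothetical fragment of a class mapping; the real table has 75 entries.
NERD_TO_MSM = {
    "Person": "PER",
    "Organization": "ORG",
    "Location": "LOC",
}

def map_class(nerd_class, default="MISC"):
    """Return the challenge class for a harmonized class name.

    Unlisted classes fall back to `default` (an assumption of this sketch).
    """
    return NERD_TO_MSM.get(nerd_class, default)
```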
5. Ritter et al. (2011)
➢ Off-the-shelf tool tailored to a Twitter stream, based on:
– LabelledLDA (+CRF)
– Textual features (POS, capitalization, suffix, etc.)
– Freebase gazetteers (names of PER, ORG, LOC)
➢ 10 entity classes mapped to 4 classes
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP'11) (2011)
6. Stanford CRF
➢ Re-trained on the MSM'13 corpora
➢ Parameters based on the english.conll.4class.distsim.crf.ser.gz properties file provided with the Stanford distribution
➢ Baseline of our approach
Finkel, J.R., Grenager, T., Manning, C.: Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In: 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05) (2005)
7. Textual features
➢ POS
➢ Capitalisation information
– initial capital
– all capitalized
– proportion of token capitals
➢ Prefix (first three letters of the token)
➢ Suffix (last three letters of the token)
➢ Whether token is at the beginning or at the end of the micropost
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named Entity Recognition in Tweets: An Experimental Study. In: Empirical Methods in Natural Language Processing (EMNLP'11) (2011)
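The surface features listed above are simple to compute per token. The sketch below is one possible reading of them, with hypothetical feature names; the POS feature is left out because it needs an external tagger, and taking the capital proportion over all characters of the token (rather than letters only) is an assumption.

```python
def textual_features(tokens, i):
    """Surface features for tokens[i] within one micropost."""
    token = tokens[i]
    letters = [c for c in token if c.isalpha()]
    n_upper = sum(1 for c in token if c.isupper())
    return {
        "initial_capital": token[:1].isupper(),
        "all_capitalized": bool(letters) and all(c.isupper() for c in letters),
        "capital_proportion": n_upper / len(token) if token else 0.0,
        "prefix": token[:3],   # first three characters
        "suffix": token[-3:],  # last three characters
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
```

For example, `textual_features(["NERD", "meets", "Rio"], 2)` describes the token "Rio": it starts with a capital, is not all capitals, and sits at the end of the post.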
8. ML settings
Run01: 7 textual features (POS, initial capital, proportion of capitals, prefix, suffix, start/end token); 0 extractors; ML = k-NN, k = 1, Euclidean distance
Run02: 0 textual features; 12 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, Yahoo, Textrazor, Wikimeta, Zemanta, Stanford NER, Ritter et al.); ML = SVM, polynomial kernel, SMO
Run03: 4 textual features (POS, initial capital, suffix, proportion of capitals); 8 extractors (AlchemyAPI, DBpedia Spotlight, Extractiv, OpenCalais, Textrazor, Wikimeta, Stanford NER, Ritter et al.); ML = SVM, polynomial kernel, SMO
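To make the Run01 setting concrete, here is a minimal 1-nearest-neighbour classifier with Euclidean distance. It is a sketch under assumptions: the slides do not name the toolkit that was used, the function names are hypothetical, and the feature vectors are assumed to be already numeric (nominal features such as prefix or POS would first need an encoding).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_1nn(train, query):
    """Return the label of the training vector closest to `query`.

    train: list of (vector, label) pairs.
    """
    _, label = min(train, key=lambda pair: euclidean(pair[0], query))
    return label
```

With k = 1 there is no voting: each token simply receives the label of its single closest training token.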
9. Precision – MSM'13 training, 10-fold cross-validation
10. Recall – MSM'13 training, 10-fold cross-validation
11. F1 – MSM'13 training, 10-fold cross-validation
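For reference, the three measures reported on these slides can be computed from sets of gold and predicted entity annotations. The sketch below assumes exact match on span and class; the matching criterion of the challenge scorer is not stated on the slides, and the tuple layout is hypothetical.

```python
def prf1(gold, predicted):
    """Precision, recall and F1 over sets of entity annotations.

    gold, predicted: sets of (doc_id, start, end, label) tuples;
    an annotation counts as correct only on an exact match.
    """
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

In a 10-fold setting the scores would be computed per fold on the held-out part and then averaged.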
12. Lessons learned
➢ MISC class is ambiguously defined
➢ 8.1% of the named entities from the training data occur in the test data
➢ Best run is Run03, which uses a subset of the extractors and only some of the textual features
➢ For the next challenge, what about entity linking?
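An overlap figure such as the 8.1% above can be obtained by comparing the entity mentions of the two sets. The sketch below is one way to do it, not necessarily how the reported number was derived: counting distinct surface forms and matching them case-insensitively are both assumptions.

```python
def overlap_ratio(train_entities, test_entities):
    """Share of distinct training entity mentions that also occur in the test set."""
    train = {e.lower() for e in train_entities}
    test = {e.lower() for e in test_entities}
    return len(train & test) / len(train) if train else 0.0
```

A low ratio means that most test entities were never seen during training, so memorising training mentions helps little.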
13. Thanks for your time and attention
http://www.slideshare.net/giusepperizzo
NERD-ML
http://github.com/giusepperizzo/nerdml