Presentation at NLDB 2012
1. Two-stage Named Entity Recognition using averaged perceptrons
Lars Buitinck, Maarten Marx
Information and Language Processing Systems, Informatics Institute, University of Amsterdam
17th Int'l Conf. on Applications of NLP to Information Systems
2. Named Entity Recognition
Find names in text and classify them as belonging to persons, locations, organizations, events, products or "miscellaneous"
Use machine learning
3. Named Entity Recognition for Dutch
State of the art for Dutch: Desmet and Hoste (2011), voting classifiers with a genetic algorithm (GA) to train the weights
Good training sets are only just becoming available
Many practitioners retrain the Stanford CRF-NER tagger
4. Overview
Realize that NER is two problems in one: recognition and classification
Pipeline solution with two classifiers (see the sketch after this slide)
Use custom feature sets for each
Do not use a precompiled list of names ("gazetteer")
Work at the sentence level (because of how the training sets are set up)
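To make the pipeline concrete, here is a minimal sketch in Python. The slides contain no code (the actual system was built in Java with the LBJ toolkit), so the function names and the B/I/O-to-span conversion below are illustrative assumptions, not the authors' implementation.

```python
def bio_to_spans(tags):
    """Convert a B/I/O tag sequence from the recognition stage into
    (start, end) entity spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                  # a new entity begins here
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))    # the current entity just ended
            start = None
    if start is not None:               # entity runs to end of sentence
        spans.append((start, len(tags)))
    return spans

def two_stage_ner(tokens, recognize, classify):
    """Stage 1 (recognize) tags tokens with B/I/O; stage 2 (classify)
    assigns each recovered span an entity type such as PER or LOC."""
    spans = bio_to_spans(recognize(tokens))
    return [(s, e, classify(tokens, s, e)) for s, e in spans]
```

The two classifiers plugged in here are the ones described on the next slides.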
5. Recognition stage
Token-level task: is a token the Beginning of, Inside, or Outside any entity name?
Features:
Word window w_{i-2}, ..., w_{i+2}
POS tags for words in the window
Conjunction of words and POS tags in the window, e.g. (w_{i-1}, p_{i-1})
Capitalization of tokens in the window
(Character) prefixes and suffixes of w_i and w_{i-1}
Regular expressions (REs) for digits, Roman numerals and punctuation
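A hedged sketch of this feature extractor in Python. The feature name strings, the <PAD> token, the affix length of four (borrowed from the classification stage) and the exact regular expressions are my assumptions; the original system generated its features with LBJ.

```python
import re

DIGIT = re.compile(r"\d")
ROMAN = re.compile(r"[IVXLCDM]+$")
PUNCT = re.compile(r"\W+$")

def token_features(words, pos, i):
    """B/I/O recognition features for token i (illustrative sketch,
    not the authors' LBJ feature generators)."""
    feats = []
    n = len(words)
    for d in range(-2, 3):                        # window w_{i-2} .. w_{i+2}
        j = i + d
        w = words[j] if 0 <= j < n else "<PAD>"
        p = pos[j] if 0 <= j < n else "<PAD>"
        feats.append(f"w[{d}]={w.lower()}")       # word in window
        feats.append(f"p[{d}]={p}")               # its POS tag
        feats.append(f"wp[{d}]={w.lower()}|{p}")  # conjunction (w_j, p_j)
        feats.append(f"cap[{d}]={w[:1].isupper()}")
    targets = {"cur": words[i]}                   # affixes of w_i and w_{i-1}
    if i > 0:
        targets["prev"] = words[i - 1]
    for name, w in targets.items():
        for k in range(1, 5):                     # length up to 4 (assumed)
            feats.append(f"{name}_pre{k}={w[:k]}")
            feats.append(f"{name}_suf{k}={w[-k:]}")
    w = words[i]                                  # RE features for w_i
    feats.append(f"digit={bool(DIGIT.search(w))}")
    feats.append(f"roman={bool(ROMAN.match(w))}")
    feats.append(f"punct={bool(PUNCT.match(w))}")
    return feats
```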
6. Classification stage
Don't do this at the token level; we know the entity spans!
Input is the list of tokens considered an entity by the recognition stage
Features:
The tokens we got from recognition
The four surrounding tokens
Their prefixes and suffixes up to length four
Capitalization pattern, as a string over the alphabet (L|U|O)*
The occurrence of capitalized tokens, digits and dashes in the entire sentence
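A sketch of the span-level extractor, again with assumed names and details. In particular, the per-token encoding of the capitalization pattern (U = starts uppercase, L = starts lowercase, O = other) is my reading of the (L|U|O)* alphabet; under it, "Koninklijke PTT Nederland" would map to "UUU".

```python
def cap_pattern(tokens):
    """Capitalization pattern over {L, U, O}; the per-token encoding
    is an assumed reading of (L|U|O)*."""
    out = []
    for t in tokens:
        if t[:1].isupper():
            out.append("U")
        elif t[:1].islower():
            out.append("L")
        else:
            out.append("O")               # digits, punctuation, ...
    return "".join(out)

def span_features(words, start, end):
    """Span-level features for the classification stage (illustrative;
    feature name strings are assumptions)."""
    span = words[start:end]               # tokens from recognition
    left = words[max(start - 2, 0):start] # two tokens on each side:
    right = words[end:end + 2]            # "the four surrounding tokens"
    feats = [f"tok={t.lower()}" for t in span]
    feats += [f"ctx={t.lower()}" for t in left + right]
    for t in span + left + right:         # affixes up to length four
        for k in range(1, 5):
            feats += [f"pre{k}={t[:k]}", f"suf{k}={t[-k:]}"]
    feats.append("cap=" + cap_pattern(span))           # e.g. "UUU"
    has_cap = any(w[:1].isupper() for w in words)      # whole-sentence cues
    has_digit = any(c.isdigit() for w in words for c in w)
    has_dash = any("-" in w for w in words)
    feats.append(f"sent_cap={has_cap}")
    feats.append(f"sent_digit={has_digit}")
    feats.append(f"sent_dash={has_dash}")
    return feats
```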
7. Learning algorithm
Use an averaged perceptron for both stages
Learns an approximation of the max-margin solution (linear SVM)
40 training iterations
Used the LBJ machine learning toolkit
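For reference, a compact multiclass averaged perceptron in Python. This is the textbook algorithm with lazily computed averages, not the authors' LBJ implementation; the sparse (feature, class) weight representation is a choice of this sketch.

```python
from collections import defaultdict

def train_averaged_perceptron(data, classes, epochs=40):
    """Train a multiclass averaged perceptron.

    `data` is a list of (features, label) pairs, where `features` is a
    list of feature strings such as those produced by the extractors
    above. 40 epochs matches the slide; averaging is done lazily with
    per-weight timestamps to stay O(mistakes) instead of O(weights).
    """
    w = defaultdict(float)       # current weights, keyed by (feature, class)
    total = defaultdict(float)   # running sums for the average
    last = defaultdict(int)      # step at which each weight last changed
    t = 0

    def score(feats, c):
        return sum(w[(f, c)] for f in feats)

    def bump(key, delta):
        total[key] += (t - last[key]) * w[key]   # flush pending contribution
        last[key] = t
        w[key] += delta

    for _ in range(epochs):
        for feats, y in data:
            t += 1
            pred = max(classes, key=lambda c: score(feats, c))
            if pred != y:                        # mistake-driven update
                for f in feats:
                    bump((f, y), +1.0)
                    bump((f, pred), -1.0)
    for key in w:                                # final flush, then average
        total[key] += (t - last[key]) * w[key]
    return {key: total[key] / t for key in w}
```

The averaging step is what the slide's "approximation of the max-margin solution" refers to: averaged weights behave like the voted perceptron, which generalizes far better than the final weight vector alone.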
8. Evaluation
Aim for the F1 score as defined in the CoNLL 2002 shared task on NER
Two corpora: CoNLL 2002 and a subset of SoNaR (courtesy of Desmet and Hoste)
Compare against Stanford and Desmet and Hoste's algorithm
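For reference, the CoNLL metric counts an entity as correct only if both its span and its type match the gold annotation exactly; F1 is then the harmonic mean of entity-level precision and recall. A minimal sketch (the tuple representation is an assumption):

```python
def conll_f1(gold, pred):
    """Entity-level F1 as in the CoNLL 2002 shared task. `gold` and
    `pred` are sets of (sentence_id, start, end, entity_type) tuples;
    an entity is correct only if span AND type match exactly."""
    correct = len(gold & pred)
    if not gold or not pred or not correct:
        return 0.0
    p = correct / len(pred)      # precision
    r = correct / len(gold)      # recall
    return 2 * p * r / (p + r)
```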
9. Results on CoNLL 2002
309,686 tokens containing 19,901 names, in four categories
65% training, 22% validation and 12% test sets
Stanford achieves F1 = 74.72; the "miscellaneous" category is hard (F1 < 70)
We achieve F1 = 75.14; the "organization" category is hard
10. Results on SoNaR
New, large corpus with manual annotations
Used a 200k-token subset of a preliminary version, with three-fold cross-validation
State of the art is Desmet and Hoste (2011) with F1 = 84.44
The best individual classifier from that paper (a CRF) gets 83.77
Our system: 83.56
Here, the "product" and "miscellaneous" categories are hard
11. Conclusion
Near-state-of-the-art performance from simple learners with good feature sets
No gazetteers, so the system should be fairly reusable
(Side conclusion: SoNaR is more easily learnable than CoNLL)
12. Future work
Being integrated into UvA's xTAS text analysis pipeline
Will be used to find entities in the Dutch Hansard corpus (forthcoming) and to link entities to Wikipedia
The full SoNaR corpus is now available; a new evaluation is needed