Presentation at NLDB 2012
1. Two-stage Named Entity Recognition using averaged perceptrons
Lars Buitinck, Maarten Marx
Information and Language Processing Systems, Informatics Institute, University of Amsterdam
17th Int'l Conf. on Applications of NLP to Information Systems
2. Named Entity Recognition
Find names in text and classify them as belonging to persons, locations, organizations, events, products or "miscellaneous"
Use machine learning
3. Named Entity Recognition for Dutch
State of the art for Dutch: Desmet and Hoste (2011), voting classifiers with a genetic algorithm (GA) to train the weights
Good training sets are only just becoming available
Many practitioners retrain the Stanford CRF-NER tagger
4. Overview
Realize that NER is two problems in one: recognition and classification
Pipeline solution with two classifiers (see the sketch after this slide)
Use custom feature sets for each
Do not use a precompiled list of names ("gazetteer")
Work at the sentence level (because of how the training sets are set up)
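To make the pipeline concrete, here is a minimal sketch in Python. The slides contain no code (the actual system was built in Java with the LBJ toolkit), so the function names and the B/I/O-to-span conversion below are illustrative assumptions, not the authors' implementation.

```python
def bio_to_spans(tags):
    """Convert a B/I/O tag sequence from the recognition stage into
    (start, end) entity spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":                  # a new entity begins here
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))    # the current entity just ended
            start = None
    if start is not None:               # entity runs to end of sentence
        spans.append((start, len(tags)))
    return spans

def two_stage_ner(tokens, recognize, classify):
    """Stage 1 (recognize) tags tokens with B/I/O; stage 2 (classify)
    assigns each recovered span an entity type such as PER or LOC."""
    spans = bio_to_spans(recognize(tokens))
    return [(s, e, classify(tokens, s, e)) for s, e in spans]
```

The two classifiers plugged in here are the ones described on the next slides.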
5. Recognition stage
Token-level task: is a token the Beginning of, Inside, or Outside any entity name?
Features:
Word window w_{i-2}, ..., w_{i+2}
POS tags for words in the window
Conjunction of words and POS tags in the window, e.g. (w_{i-1}, p_{i-1})
Capitalization of tokens in the window
(Character) prefixes and suffixes of w_i and w_{i-1}
Regular expressions (REs) for digits, Roman numerals and punctuation
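A hedged sketch of this feature extractor in Python. The feature name strings, the <PAD> token, the affix length of four (borrowed from the classification stage) and the exact regular expressions are my assumptions; the original system generated its features with LBJ.

```python
import re

DIGIT = re.compile(r"\d")
ROMAN = re.compile(r"[IVXLCDM]+$")
PUNCT = re.compile(r"\W+$")

def token_features(words, pos, i):
    """B/I/O recognition features for token i (illustrative sketch,
    not the authors' LBJ feature generators)."""
    feats = []
    n = len(words)
    for d in range(-2, 3):                        # window w_{i-2} .. w_{i+2}
        j = i + d
        w = words[j] if 0 <= j < n else "<PAD>"
        p = pos[j] if 0 <= j < n else "<PAD>"
        feats.append(f"w[{d}]={w.lower()}")       # word in window
        feats.append(f"p[{d}]={p}")               # its POS tag
        feats.append(f"wp[{d}]={w.lower()}|{p}")  # conjunction (w_j, p_j)
        feats.append(f"cap[{d}]={w[:1].isupper()}")
    targets = {"cur": words[i]}                   # affixes of w_i and w_{i-1}
    if i > 0:
        targets["prev"] = words[i - 1]
    for name, w in targets.items():
        for k in range(1, 5):                     # length up to 4 (assumed)
            feats.append(f"{name}_pre{k}={w[:k]}")
            feats.append(f"{name}_suf{k}={w[-k:]}")
    w = words[i]                                  # RE features for w_i
    feats.append(f"digit={bool(DIGIT.search(w))}")
    feats.append(f"roman={bool(ROMAN.match(w))}")
    feats.append(f"punct={bool(PUNCT.match(w))}")
    return feats
```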
6. Classification stage
Don't do this at the token level; we know the entity spans!
Input is the list of tokens considered an entity by the recognition stage
Features:
The tokens we got from recognition
The four surrounding tokens
Their prefixes and suffixes up to length four
Capitalization pattern, as a string over the alphabet (L|U|O)*
The occurrence of capitalized tokens, digits and dashes in the entire sentence
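A sketch of the span-level extractor, again with assumed names and details. In particular, the per-token encoding of the capitalization pattern (U = starts uppercase, L = starts lowercase, O = other) is my reading of the (L|U|O)* alphabet; under it, "Koninklijke PTT Nederland" would map to "UUU".

```python
def cap_pattern(tokens):
    """Capitalization pattern over {L, U, O}; the per-token encoding
    is an assumed reading of (L|U|O)*."""
    out = []
    for t in tokens:
        if t[:1].isupper():
            out.append("U")
        elif t[:1].islower():
            out.append("L")
        else:
            out.append("O")               # digits, punctuation, ...
    return "".join(out)

def span_features(words, start, end):
    """Span-level features for the classification stage (illustrative;
    feature name strings are assumptions)."""
    span = words[start:end]               # tokens from recognition
    left = words[max(start - 2, 0):start] # two tokens on each side:
    right = words[end:end + 2]            # "the four surrounding tokens"
    feats = [f"tok={t.lower()}" for t in span]
    feats += [f"ctx={t.lower()}" for t in left + right]
    for t in span + left + right:         # affixes up to length four
        for k in range(1, 5):
            feats += [f"pre{k}={t[:k]}", f"suf{k}={t[-k:]}"]
    feats.append("cap=" + cap_pattern(span))           # e.g. "UUU"
    has_cap = any(w[:1].isupper() for w in words)      # whole-sentence cues
    has_digit = any(c.isdigit() for w in words for c in w)
    has_dash = any("-" in w for w in words)
    feats.append(f"sent_cap={has_cap}")
    feats.append(f"sent_digit={has_digit}")
    feats.append(f"sent_dash={has_dash}")
    return feats
```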
7. Learning algorithm
Use an averaged perceptron for both stages
Learns an approximation of the max-margin solution (linear SVM)
40 training iterations
Used the LBJ machine learning toolkit
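For reference, a compact multiclass averaged perceptron in Python. This is the textbook algorithm with lazily computed averages, not the authors' LBJ implementation; the sparse (feature, class) weight representation is a choice of this sketch.

```python
from collections import defaultdict

def train_averaged_perceptron(data, classes, epochs=40):
    """Train a multiclass averaged perceptron.

    `data` is a list of (features, label) pairs, where `features` is a
    list of feature strings such as those produced by the extractors
    above. 40 epochs matches the slide; averaging is done lazily with
    per-weight timestamps to stay O(mistakes) instead of O(weights).
    """
    w = defaultdict(float)       # current weights, keyed by (feature, class)
    total = defaultdict(float)   # running sums for the average
    last = defaultdict(int)      # step at which each weight last changed
    t = 0

    def score(feats, c):
        return sum(w[(f, c)] for f in feats)

    def bump(key, delta):
        total[key] += (t - last[key]) * w[key]   # flush pending contribution
        last[key] = t
        w[key] += delta

    for _ in range(epochs):
        for feats, y in data:
            t += 1
            pred = max(classes, key=lambda c: score(feats, c))
            if pred != y:                        # mistake-driven update
                for f in feats:
                    bump((f, y), +1.0)
                    bump((f, pred), -1.0)
    for key in w:                                # final flush, then average
        total[key] += (t - last[key]) * w[key]
    return {key: total[key] / t for key in w}
```

The averaging step is what the slide's "approximation of the max-margin solution" refers to: averaged weights behave like the voted perceptron, which generalizes far better than the final weight vector alone.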
8. Evaluation
Aim for the F1 score as defined in the CoNLL 2002 shared task on NER
Two corpora: CoNLL 2002 and a subset of SoNaR (courtesy of Desmet and Hoste)
Compare against Stanford and Desmet and Hoste's algorithm
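For reference, the CoNLL metric counts an entity as correct only if both its span and its type match the gold annotation exactly; F1 is then the harmonic mean of entity-level precision and recall. A minimal sketch (the tuple representation is an assumption):

```python
def conll_f1(gold, pred):
    """Entity-level F1 as in the CoNLL 2002 shared task. `gold` and
    `pred` are sets of (sentence_id, start, end, entity_type) tuples;
    an entity is correct only if span AND type match exactly."""
    correct = len(gold & pred)
    if not gold or not pred or not correct:
        return 0.0
    p = correct / len(pred)      # precision
    r = correct / len(gold)      # recall
    return 2 * p * r / (p + r)
```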
9. Results on CoNLL 2002
309,686 tokens containing 19,901 names, in four categories
65% training, 22% validation and 12% test sets
Stanford achieves F1 = 74.72; the "miscellaneous" category is hard (F1 < 70)
We achieve F1 = 75.14; the "organization" category is hard
10. Results on SoNaR
New, large corpus with manual annotations
Used a 200k-token subset of a preliminary version, with three-fold cross-validation
State of the art is Desmet and Hoste (2011) with F1 = 84.44
The best individual classifier from that paper (a CRF) gets 83.77
Our system: 83.56
Here, the "product" and "miscellaneous" categories are hard
11. Conclusion
Near-state-of-the-art performance from simple learners with good feature sets
No gazetteers, so the system should be fairly reusable
(Side conclusion: SoNaR is more easily learnable than CoNLL)
12. Future work
Being integrated into UvA's xTAS text analysis pipeline
Will be used to find entities in the Dutch Hansard corpus (forthcoming) and to link entities to Wikipedia
The full SoNaR corpus is now available; a new evaluation is needed