Two-stage Named Entity Recognition using
          averaged perceptrons

            Lars Buitinck           Maarten Marx

          Information and Language Processing Systems
                       Informatics Institute
                     University of Amsterdam


 17th Int’l Conf. on Applications of NLP to Information
                        Systems




                   Buitinck, Marx   Two-stage NER
Named Entity Recognition




     Find names in text and classify them as belonging to
     persons, locations, organizations, events, products or
     “miscellaneous”
     Use machine learning




Named Entity Recognition for Dutch




     State-of-the-art algorithm for Dutch by Desmet and Hoste
     (2011): voting classifiers with a genetic algorithm (GA) to train weights
     Good training sets are just becoming available
     Many practitioners retrain the Stanford CRF-NER tagger




Overview




     Realize that NER is two problems in one: recognition and
     classification
     Pipeline solution with two classifiers
     Use custom feature sets for each
     Do not use a precompiled list of names (“gazetteer”)
     Work at the sentence level (because of how training sets
     are set up)
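
A minimal sketch of how the two stages could be chained, assuming B/I/O tags are the interface between them. The names `bio_to_spans` and `two_stage_ner` and the two callables are hypothetical stand-ins for the trained classifiers, not the original system's API.

```python
from typing import Callable, List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def bio_to_spans(tags: Sequence[str]) -> List[Span]:
    """Convert a B/I/O tag sequence into a list of entity spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new entity starts here. Treating a stray "I" as "B"
            # is an assumption of this sketch.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def two_stage_ner(
    tokens: List[str],
    recognize: Callable[[List[str]], List[str]],
    classify: Callable[[List[str], Span], str],
) -> List[Tuple[int, int, str]]:
    """Stage 1 finds entity spans; stage 2 labels each span as a whole."""
    tags = recognize(tokens)
    return [(s, e, classify(tokens, (s, e))) for s, e in bio_to_spans(tags)]
```

Because the second stage receives whole spans, it can use span-level features that a token-level tagger cannot see.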




Recognition stage



     Token-level task: is a token the Beginning of, Inside, or
     Outside any entity name?
     Features:
         Word window $w_{i-2}, \ldots, w_{i+2}$
         POS tags for words in window
         Conjunction of words and POS tags in window, e.g.
         $(w_{i-1}, p_{i-1})$
         Capitalization of tokens in window
         (Character) prefixes and suffixes of $w_i$ and $w_{i-1}$
         REs for digits, Roman numerals and punctuation
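
The feature list above might be extracted roughly as follows. This is a sketch under assumptions: the feature names, the maximum affix length of 3 (`max_affix`), the padding features and the exact regular expressions are illustrative choices, not taken from the original system.

```python
import re
from typing import Dict, List

DIGITS = re.compile(r"^\d+$")
ROMAN = re.compile(
    r"^(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
PUNCT = re.compile(r"^[^\w\s]+$")


def token_features(
    words: List[str], pos: List[str], i: int, max_affix: int = 3
) -> Dict[str, float]:
    """Sparse binary features for deciding the B/I/O tag of token i."""
    feats: Dict[str, float] = {}
    n = len(words)
    # Word, POS tag, word/POS conjunction and capitalization in a window
    # of two tokens on either side.
    for offset in range(-2, 3):
        j = i + offset
        if 0 <= j < n:
            feats[f"w[{offset}]={words[j]}"] = 1.0
            feats[f"p[{offset}]={pos[j]}"] = 1.0
            feats[f"wp[{offset}]={words[j]}|{pos[j]}"] = 1.0
            if words[j][:1].isupper():
                feats[f"cap[{offset}]"] = 1.0
        else:
            feats[f"pad[{offset}]"] = 1.0  # window runs past the sentence
    # Character prefixes and suffixes of the current and previous word.
    for offset in (0, -1):
        j = i + offset
        if 0 <= j < n:
            w = words[j]
            for k in range(1, min(max_affix, len(w)) + 1):
                feats[f"pre[{offset}]={w[:k]}"] = 1.0
                feats[f"suf[{offset}]={w[-k:]}"] = 1.0
    # Regular-expression features on the current token.
    if DIGITS.match(words[i]):
        feats["is_digits"] = 1.0
    if ROMAN.match(words[i]):
        feats["is_roman"] = 1.0
    if PUNCT.match(words[i]):
        feats["is_punct"] = 1.0
    return feats
```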




Classification stage



     Don’t do this at token-level; we know the entity spans!
     Input is a list of tokens considered an entity by the
     recognition stage
     Features:
         The tokens we got from recognition
         The four surrounding tokens
         Their prefixes and suffixes up to length four
         Capitalization pattern, as a string over the alphabet (L|U|O)*
         The occurrence of capitalized tokens, digits and dashes in
         the entire sentence
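
A sketch of span-level features along the lines listed above. Several details are assumptions: one pattern symbol per token (based on its first character), two context tokens on each side, affixes taken from the entity tokens only, and presence flags rather than counts for the sentence-wide features. All function and feature names are hypothetical.

```python
from typing import Dict, List


def cap_pattern(tokens: List[str]) -> str:
    """Map each token to U (uppercase initial), L (lowercase initial) or O (other)."""
    out = []
    for t in tokens:
        c = t[:1]
        if c.isupper():
            out.append("U")
        elif c.islower():
            out.append("L")
        else:
            out.append("O")
    return "".join(out)


def span_features(tokens: List[str], start: int, end: int) -> Dict[str, float]:
    """Sparse features for classifying the entity tokens[start:end]."""
    feats: Dict[str, float] = {}
    entity = tokens[start:end]
    # The entity tokens and their prefixes/suffixes up to length four.
    for t in entity:
        feats[f"tok={t}"] = 1.0
        for k in range(1, min(4, len(t)) + 1):
            feats[f"pre={t[:k]}"] = 1.0
            feats[f"suf={t[-k:]}"] = 1.0
    # Up to two tokens before and two tokens after the span.
    for offset in (-2, -1):
        j = start + offset
        if j >= 0:
            feats[f"ctx[{offset}]={tokens[j]}"] = 1.0
    for offset in (1, 2):
        j = end + offset - 1
        if j < len(tokens):
            feats[f"ctx[+{offset}]={tokens[j]}"] = 1.0
    feats[f"caps={cap_pattern(entity)}"] = 1.0
    # Sentence-wide indicators.
    if any(t[:1].isupper() for t in tokens):
        feats["sent_has_cap"] = 1.0
    if any(ch.isdigit() for t in tokens for ch in t):
        feats["sent_has_digit"] = 1.0
    if any("-" in t for t in tokens):
        feats["sent_has_dash"] = 1.0
    return feats
```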




Learning algorithm




     Use averaged perceptron for both stages
     Learns an approximation of the max-margin solution (linear
     SVM)
     40 iterations
     Used the LBJ machine learning toolkit
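
For readers unfamiliar with the learner, here is a minimal multiclass averaged perceptron over sparse feature dictionaries. It is an illustrative Python sketch, not the LBJ implementation used in the talk; the class name and its methods are assumptions. The default of 40 passes mirrors the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Features = Dict[str, float]


class AveragedPerceptron:
    def __init__(self, classes: Iterable[str]):
        self.classes: List[str] = sorted(classes)
        self.weights = defaultdict(float)   # (feature, label) -> weight
        self._totals = defaultdict(float)   # weight summed over time steps
        self._stamps = defaultdict(int)     # step at which a weight last changed
        self._t = 0                         # number of examples seen

    def score(self, feats: Features, label: str) -> float:
        return sum(self.weights.get((f, label), 0.0) * v for f, v in feats.items())

    def predict(self, feats: Features) -> str:
        return max(self.classes, key=lambda c: self.score(feats, c))

    def _update(self, key: Tuple[str, str], delta: float) -> None:
        # Lazily account for the steps during which this weight was unchanged.
        self._totals[key] += (self._t - self._stamps[key]) * self.weights[key]
        self._stamps[key] = self._t
        self.weights[key] += delta

    def fit(self, data: List[Tuple[Features, str]], iterations: int = 40):
        for _ in range(iterations):
            for feats, gold in data:
                self._t += 1
                guess = self.predict(feats)
                if guess != gold:
                    # Standard perceptron update: reward gold, penalize guess.
                    for f, v in feats.items():
                        self._update((f, gold), v)
                        self._update((f, guess), -v)
        if self._t:
            # Replace each weight by its average over all prediction steps.
            for key, w in self.weights.items():
                total = self._totals[key] + (self._t - self._stamps[key]) * w
                self.weights[key] = total / self._t
        return self
```

Averaging damps the influence of late updates, which tends to make the final weights less sensitive to the order of the training examples. In this sketch `fit` is meant to be called once, since the weights are overwritten by their averages at the end.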




Evaluation




     Aim for F1 score, as defined in the CoNLL 2002 shared
     task on NER
     Two corpora: CoNLL 2002 and a subset of SoNaR
     (courtesy Desmet and Hoste)
     Compare against the Stanford tagger and Desmet and Hoste’s
     algorithm
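
In the CoNLL 2002 setting an entity counts as correct only if both its span and its category match the gold annotation exactly. A small sketch of that computation follows; the tuple layout `(sentence_id, start, end, label)` and the function name `prf1` are assumptions made for illustration.

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, int, str]  # (sentence_id, start, end, label)


def prf1(gold: Iterable[Entity], predicted: Iterable[Entity]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 with exact span and label match."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Note that a span with the right boundaries but the wrong category lowers both precision and recall, so errors in either stage of a pipeline show up in the score.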




Results on CoNLL 2002




     309,686 tokens containing 19,901 names, four categories
     65% training, 22% validation and 12% test sets
     Stanford achieves F1 = 74.72; the "miscellaneous" category is
     hard (< 70)
     We achieve F1 = 75.14; the "organization" category is hard




Results on SoNaR



     New, large corpus with manual annotations
     Used a 200k-token subset of a preliminary version,
     three-fold cross validation
     State of the art is Desmet and Hoste (2011) with
     F1 = 84.44
     Best individual classifier from that paper (CRF) gets 83.77
     Our system: 83.56
     Here, “product” and “miscellaneous” categories are hard




Conclusion




     Near-state-of-the-art performance from simple learners
     with good feature sets
     No gazetteers, so should be fairly reusable
     (Side conclusion: SoNaR is more easily learnable than
     CoNLL)




Future work




     Being integrated in UvA’s xTAS text analysis pipeline
     Used to find entities in the Dutch Hansard corpus
     (forthcoming) and link entities to Wikipedia
     Full SoNaR is now available; new evaluation needed




Presentation at NLDB 2012

  • 1. Two-stage Named Entity Recognition using averaged perceptrons Lars Buitinck Maarten Marx Information and Language Processing Systems Informatics Institute University of Amsterdam 17th Int’l Conf. on Applications of NLP to Information Systems Buitinck, Marx Two-stage NER
  • 2. Outline
  • 3. Named Entity Recognition: Find names in text and classify them as belonging to persons, locations, organizations, events, products or “miscellaneous”. Use machine learning.
  • 5. Named Entity Recognition for Dutch: The state-of-the-art algorithm for Dutch is by Desmet and Hoste (2011): voting classifiers, with a genetic algorithm (GA) to train the weights. Good training sets are just becoming available. Many practitioners retrain the Stanford CRF-NER tagger.
  • 8. Overview: Realize that NER is two problems in one: recognition and classification. Pipeline solution with two classifiers. Use custom feature sets for each. Do not use a precompiled list of names (“gazetteer”). Work at the sentence level (because of how training sets are set up).
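The pipeline in the overview can be sketched as follows. This is a minimal illustration, not the authors' code: `bio_to_spans`, `two_stage_ner` and the `recognize`/`classify` callables are hypothetical names, and treating a stray "I" tag as the start of an entity is an assumption.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Turn a B/I/O tag sequence into entity spans."""
    spans: List[Span] = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new entity starts here; a stray "I" is treated like "B" (assumption).
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def two_stage_ner(
    tokens: List[str],
    recognize: Callable[[List[str]], List[str]],
    classify: Callable[[List[str], int, int], str],
) -> List[Tuple[int, int, str]]:
    """Stage 1 tags tokens with B/I/O; stage 2 labels each recognized span once."""
    tags = recognize(tokens)
    return [(s, e, classify(tokens, s, e)) for s, e in bio_to_spans(tags)]
```

Because the second stage sees whole spans, it can use features of the complete name rather than of single tokens.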
  • 13. Recognition stage: Token-level task: is a token the Beginning of, Inside, or Outside any entity name? Features: word window $w_{i-2}, \ldots, w_{i+2}$; POS tags for words in window; conjunction of words and POS tags in window, e.g. $(w_{i-1}, p_{i-1})$; capitalization of tokens in window; (character) prefixes and suffixes of $w_i$ and $w_{i-1}$; regular expressions for digits, Roman numerals and punctuation.
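A rough sketch of how the recognition features listed above might be extracted for one token. The feature names, the padding symbol, the maximum affix length of 3 and the three regular expressions are all assumptions for illustration; the slides do not specify them.

```python
import re
from typing import List, Set

# Rough patterns; the slides do not give the actual expressions.
DIGITS = re.compile(r"^\d+$")
ROMAN = re.compile(r"^[IVXLCDM]+$")
PUNCT = re.compile(r"^\W+$")
PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def token_features(words: List[str], pos: List[str], i: int) -> Set[str]:
    """Binary features for token i, following the slide's bullet points."""
    feats: Set[str] = set()
    for d in range(-2, 3):  # window w_{i-2} .. w_{i+2}
        j = i + d
        inside = 0 <= j < len(words)
        w = words[j] if inside else PAD
        p = pos[j] if inside else PAD
        feats.add(f"w[{d}]={w}")        # word
        feats.add(f"p[{d}]={p}")        # POS tag
        feats.add(f"wp[{d}]={w}|{p}")   # word/POS conjunction
        if inside and w[:1].isupper():
            feats.add(f"cap[{d}]")      # capitalization
    for d in (-1, 0):  # affixes of w_{i-1} and w_i
        j = i + d
        if 0 <= j < len(words):
            for n in range(1, 4):  # maximum affix length 3 is an assumption
                feats.add(f"pre[{d}]={words[j][:n]}")
                feats.add(f"suf[{d}]={words[j][-n:]}")
    w = words[i]
    if DIGITS.match(w):
        feats.add("digits")
    if ROMAN.match(w):
        feats.add("roman")
    if PUNCT.match(w):
        feats.add("punct")
    return feats
```

Each token then becomes a sparse set of feature strings that a linear classifier can weight per B/I/O label.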
  • 21. Classification stage: Don’t do this at token level; we know the entity spans! Input is a list of tokens considered an entity by the recognition stage. Features: the tokens we got from recognition; the four surrounding tokens; their prefixes and suffixes up to length four; capitalization pattern, as a string over the alphabet (L|U|O)*; the occurrence of capitalized tokens, digits and dashes in the entire sentence.
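The span-level features above could look roughly like this. Several readings here are assumptions, not confirmed by the slides: one L/U/O symbol per token (based on its first character), "four surrounding tokens" taken as two on each side, and affixes computed for both the entity tokens and the context tokens. All function and feature names are hypothetical.

```python
from typing import List, Set


def cap_pattern(tokens: List[str]) -> str:
    """One symbol per token: U(ppercase), L(owercase) or O(ther) first character."""
    def code(tok: str) -> str:
        first = tok[:1]
        if first.isupper():
            return "U"
        if first.islower():
            return "L"
        return "O"
    return "".join(code(t) for t in tokens)


def span_features(words: List[str], start: int, end: int) -> Set[str]:
    """Features for the candidate entity words[start:end] within its sentence."""
    entity = words[start:end]
    feats: Set[str] = {f"tok={t}" for t in entity}
    context: List[str] = []
    # Assumption: two context tokens before and two after the span.
    for d, j in ((-2, start - 2), (-1, start - 1), (1, end), (2, end + 1)):
        if 0 <= j < len(words):
            feats.add(f"ctx[{d}]={words[j]}")
            context.append(words[j])
    for t in entity + context:
        for n in range(1, 5):  # affixes up to length four, as on the slide
            feats.add(f"pre={t[:n]}")
            feats.add(f"suf={t[-n:]}")
    feats.add(f"caps={cap_pattern(entity)}")
    # Sentence-level indicators.
    if any(w[:1].isupper() for w in words):
        feats.add("sent_has_cap")
    if any(c.isdigit() for w in words for c in w):
        feats.add("sent_has_digit")
    if any("-" in w for w in words):
        feats.add("sent_has_dash")
    return feats
```

For the span "Universiteit van Amsterdam" the capitalization pattern under these assumptions is the string ULU.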
  • 29. Learning algorithm: Use an averaged perceptron for both stages. It learns an approximation of the max-margin solution (linear SVM). 40 iterations. Used the LBJ machine learning toolkit.
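For readers unfamiliar with the learner, here is a minimal multiclass averaged perceptron over sparse binary features. It is a from-scratch sketch, not LBJ's implementation; the class name, the unit update size and the naive per-step averaging (real implementations usually average lazily) are choices made for illustration. Only the default of 40 iterations comes from the slide.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Example = Tuple[Set[str], str]  # (feature set, gold label)


class AveragedPerceptron:
    def __init__(self, classes: Iterable[str]):
        self.classes: List[str] = sorted(classes)
        self.w: Dict[Tuple[str, str], float] = defaultdict(float)      # current weights
        self.total: Dict[Tuple[str, str], float] = defaultdict(float)  # sum over all steps
        self.avg: Dict[Tuple[str, str], float] = {}                    # averaged weights

    def _best(self, feats: Set[str], weights: Dict[Tuple[str, str], float]) -> str:
        # Highest-scoring class; ties go to the alphabetically first class.
        return max(self.classes,
                   key=lambda c: sum(weights.get((f, c), 0.0) for f in feats))

    def fit(self, data: Iterable[Example], iterations: int = 40, seed: int = 0):
        rng = random.Random(seed)
        data = list(data)
        steps = 0
        for _ in range(iterations):
            rng.shuffle(data)
            for feats, gold in data:
                guess = self._best(feats, self.w)
                if guess != gold:
                    # Standard perceptron update: promote gold, demote the guess.
                    for f in feats:
                        self.w[(f, gold)] += 1.0
                        self.w[(f, guess)] -= 1.0
                # Naive averaging: accumulate the weight vector after every example.
                for key, value in self.w.items():
                    self.total[key] += value
                steps += 1
        self.avg = {k: v / steps for k, v in self.total.items()}
        return self

    def predict(self, feats: Set[str]) -> str:
        return self._best(feats, self.avg)
```

Predicting with the averaged weights rather than the final ones damps the effect of late updates, which is what makes the plain perceptron usable as a large-margin approximation.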
  • 33. Evaluation: Aim for F1 score, as defined in the CoNLL 2002 shared task on NER. Two corpora: CoNLL 2002 and a subset of SoNaR (courtesy of Desmet and Hoste). Compare against Stanford and Desmet and Hoste’s algorithm.
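The CoNLL-style F1 counts an entity as correct only when both its span and its type match the gold annotation exactly. A small sketch of that computation follows; the function name and the (start, end, type) tuple layout are assumptions, and a real evaluation over a corpus would also need a sentence identifier in each tuple.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, type); hypothetical layout


def entity_prf1(gold: Set[Entity], pred: Set[Entity]) -> Tuple[float, float, float]:
    """Entity-level precision, recall and F1 with exact span-and-type matching."""
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {(0, 2, "PER"), (5, 6, "LOC"), (8, 10, "ORG")}
pred = {(0, 2, "PER"), (5, 6, "ORG"), (8, 10, "ORG"), (12, 13, "MISC")}
p, r, f1 = entity_prf1(gold, pred)  # 2 of 4 predictions correct, 2 of 3 gold found
```

Note that a span found correctly but given the wrong type, like (5, 6) here, counts against both precision and recall, so errors in either pipeline stage lower the score.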
  • 36. Results on CoNLL 2002: 309,686 tokens containing 19,901 names, in four categories. 65% training, 22% validation and 12% test sets. Stanford achieves F1 = 74.72; the "miscellaneous" category is hard (< 0.7). We achieve F1 = 75.14; the "organization" category is hard.
  • 40. Results on SoNaR: New, large corpus with manual annotations. Used a 200k-token subset of a preliminary version, with three-fold cross-validation. State of the art is Desmet and Hoste (2011) with F1 = 84.44. The best individual classifier from that paper (CRF) gets 83.77. Our system: 83.56. Here, the “product” and “miscellaneous” categories are hard.
  • 46. Conclusion: Near-state-of-the-art performance from simple learners with good feature sets. No gazetteers, so should be fairly reusable. (Side conclusion: SoNaR is more easily learnable than CoNLL.)
  • 49. Future work: Being integrated in UvA’s xTAS text analysis pipeline. Used to find entities in the Dutch Hansard corpus (forthcoming) and link entities to Wikipedia. Full SoNaR is now available; new evaluation needed.