SlideShare ist ein Scribd-Unternehmen logo
1 von 2
Identification of Entities in Swedish
                  Andreas Salomonsson⇤                  Svetoslav Marinov⇤              Pierre Nugues†
                                           Findwise AB, Drottninggatan 5
                                               ⇤

                                          SE-411 14 Gothenburg, Sweden
                       {andreas.salomonsson,svetoslav.marinov}@findwise.com
                                †
                                  Department of Computer Science, Lund University
                                              S-221 00 Lund, Sweden
                                         pierre.nugues@cs.lth.se

1. Introduction                                                  entities: person, organization, location, work, event, and
A crucial aspect of search applications is the possibility to    other. It reports an F-score of 0.92 measured on test cor-
identify named entities in free-form text and provide func-      pus of 1800 words. On a test set of 40 000 words without
tionality for entity-based, complex queries towards the in-      gazetteers, the recall of the system drops from 0.91 to 0.53.
dexed data. By enriching each entity with semantically rel-      No information of precision is given by the authors.
evant information acquired from outside the text, one can
create the foundation for an advanced search application.        3. System Description
Thus, given a document about Denmark, where neither of           3.1   Named Entity Recognition
the words Copenhagen, country, nor capital are mentioned,
                                                                 We have tackled the NER problem by relying entirely on
it should be possible to retrieve the document by querying
                                                                 machine–learning techniques. We use a linear classifier,
for Copenhagen or European country.
                                                                 LIBLINEAR (Fan et al., 2008), and train and test the sys-
    In this paper, we report how we have tackled this prob-
                                                                 tem on the Stockholm-Ume˚ Corpus 2.0 (SUC) (Ejerhed et
                                                                                              a
lem. We will, however, concentrate only on the two tasks
                                                                 al., 2006).
which are central to the solution, namely named entity
                                                                    In addition to the POS tags of the tokens, SUC con-
recognition (NER) and enrichment of the discovered en-
                                                                 tains information about nine different categories of named
tities by relying on linked data from knowledge bases such
                                                                 entities: person, place, institution, work, animal, product,
as YAGO2 and DBpedia. We remain agnostic to all other
                                                                 myth, event, and other. Following the standards in CoNLL
details of the search application, which can be implemented
                                                                 2003, we chose to identify four categories: person, organi-
in a relatively straight-forward way by using, e.g. Apache
                                                                 zation, location, and miscellaneous, thus merging the prod-
Solr1 .
                                                                 uct, myth, event, animal, work and other classes in a miscel-
    The work deals only with Swedish and is restricted to
                                                                 laneous category and mapping institution to organization,
two domains: news articles2 and medical texts3 . As a
                                                                 and place to location.
byproduct, our method achieves state-of-the-art results for
                                                                    The most important features of the classifier are: the POS
Swedish NER and to our knowledge there are no previously
                                                                 tags of the surrounding and the current token, the word to-
published works on employing linked data for Swedish for
                                                                 kens themselves, and the previous two named entity tags.
the two domains at hand.
                                                                 Other features include Booleans to describe the initial capi-
2. Named Entity Recognition for Swedish                          talization, if the word contains periods, and contains digits.
                                                                 As advocated by Ratinov and Roth (2009), we employ the
Named entity recognition is an already well-established re-
                                                                 BILOU (Begin Inside Last Outside Unique) annotation for
search area with a number of conferences and evaluations
                                                                 the entities.
dedicated to the task (e.g. CoNLL 2003 (Tjong et al.,
2003)). While many systems have been created for English         3.2   Linked Data
and other languages, much fewer works have been reported
on Swedish.                                                      Linked Data is a part of the semantic web initiative. In
                   ˚ o
   Dalianis and Astr¨ m (2001) and Johannessen et al.            its core, it consists of concepts interlinked with each other
(2005) are examples of NER systems for Swedish. Dalianis         by RDF triples. This allows us to augment the discov-
      ˚ o
and Astr¨ m (2001) employ rules, lexicons, and machine–          ered entities with additional information related to them.
learning techniques in order to recognize four types of enti-    We have used the semantic network YAGO2 (Hoffart et
ties: person, location, organization, and time. It achieves an   al., 2011) and the DBpedia knowledge base. Each named
overall F-score of 0.49 on 100 manually tagged news texts.       entity we extract from the NER module is mapped to an
   The Swedish system described in Johannessen et al.            identifier in YAGO2 if the entity exists in the semantic
(2005) is rule-based, but relies on shallow parsing and em-      network. We can then use information about the entity
ploys gazetteers as well. It can recognize six types of          and its relations to other entities. YAGO2 is stored and
                                                                 queried using Sesame RDF repository from openRDF.org,
   1
     http://lucene/apache.org/solr/                              and one can use SPARQL as the query language. One
   2
     Articles crawled from dn.se                                 of the most important predicates in YAGO2 for our work
   3
     Articles from 1177.se                                       is the isCalled predicate. Given an entity E we can
Category           Exist   Found     Correct     Precision (%)    Recall (%)     F1 (%)
                   Persons           15128    17198      13626          78.17          90.47        83.87
                   Organizations      6332     4540       3089          68.33          47.49        56.03
                   Locations          8773     8974       6926          76.97          78.32        77.64
                   Miscellaneous      3956     2051       1249          64.73          29.60        40.62
                   Total             34189    32763      24890          75.77          72.35        74.02
                   Unlabelled        34189    32922      30655          93.11          89.66        91.36

                                                 Table 1: NER evaluation



use SELECT ?id WHERE {?id isCalled "E"}?                          their default values. While our system achieves very good
to get one or several unique identifiers. If there is only         results for Swedish (see Table 1), it is difficult to compare
one, we map the entity with its identifier. When multiple          it to previous systems due to the differences in test data
identifiers are found, we use a very simple method for dis-        and number of categories. Yet, we see more room for im-
ambiguation: We take the identifier with most information          provement by adding gazetteers and using word clusters as
related to it.                                                    features. In addition, we have noticed inconsistencies in the
   While YAGO2 was used primarily with the news articles,         annotation of entities in SUC. Titles are sometimes part of
DBPedia was employed for the medical texts. First each            the entities and sometimes not.
medical term in the SweMESH4 taxonomy was mapped to                  Finally, by enriching entities with related information al-
an unique identifier from DBPedia. We then searched in             lows us to retrieve more documents, cluster them in a better
the document only for those medical terms which are in            way or populate ontologies by using such data. As the ex-
SweMESH and use the unique identifier to extract more              tracted information resides in a META field of an indexed
information about them.                                           document and the field often gets a higher score, the docu-
                                                                  ment will get an overall higher rank in the result list.
4. Architecture Overview
The core of the system which identifies and augments               6. References
named entities is implemented as a pipeline. The text of                                            ˚ o
                                                                  Hercules Dalianis and Erik Astr¨ m. 2001. SweNam–
each document passes through the following stages: tok-              A Swedish named entity recognizer. Technical report,
enization, sentence detection, POS tagging, NER, extrac-             KTH.
tion of semantic information. At the end of the pipeline,         Eva Ejerhed, Gunnel K¨ llgren, and Benny Brodda. 2006.
                                                                                             a
the augmented document is sent for indexing.                         Stockholm-ume˚ corpus version 2.0.
                                                                                       a
   The tokenizer was implemented by modifying the                 Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui
Lucene tokenizer used by Apache Solr. It uses regular ex-            Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library
pressions to define what a token is. The sentence detector is         for large linear classification. Journal of Machine Learn-
rule-based and uses information about the types of the to-           ing Research, 9:1871–1874.
kens, and the tokens themselves as identified in the previous      P´ ter Hal´ csy, Andr´ s Kornai, and Csaba Oravecz. 2007.
                                                                   e         a           a
step. The POS tagging stage employs HunPos (Hal´ csy et
                                                     a               HunPos: an open source trigram tagger. In Proceedings
al., 2007) which has been trained on SUC. The NER stage              of the ACL 2007 Demo and Poster Sessions.
is the system which was described in Section 3.1 above.           Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich,
Finally, all named entities are enriched with semantic in-           Edwin Lewis-Kelham, Gerard de Melo, and Gerhard
formation if such is available from the knowledge bases.             Weikum. 2011. Yago2: Exploring and querying world
   Given the sentence Danmark ligger i Nordeuropa., the              knowledge in time, space, context, and many languages.
two locations, Danmark and Nordeuropa, are identified.                In Proceedings of the 20th International Conference
We then proceed by mapping the two entities to their                 Companion on World Wide Web (WWW 2011). ACM.
corresponding identifiers in the semantic network. The                                                            ˚
                                                                  Janne Bondi Johannessen, Kristin Hagen, Asne Haaland,
YAGO2 identifier for Danmark is http://www.mpii.                      Andra Bj¨ rk J´ nsdottir, Anders Nøklestad, Dimitris
                                                                                 o     o
de/yago/resource/Denmark and we use it to extract                    Kokkinakis, Paul Meurer, Eckhard Bick, and Dorte Hal-
information, e.g. Denmark is a European country; its capi-           trup. 2005. Named Entity Recognition for the Mainland
tal is Copenhagen, with which we then augment the META-              Scandinavian Languages. Literary and Linguistic Com-
data of the document.                                                puting, 20(1).
                                                                  Lev Ratinov and Dan Roth. 2009. Design challenges and
5. Results and Discussion                                            misconceptions in named entity recognition. In CoNLL
We have evaluated the performance of the NER system with             ’09: Proceedings of the Thirteenth Conference on Com-
a 10-fold cross-validation on SUC. We used L2-regularized            putational Natural Language Learning. ACL.
L2-loss support vector classification (primal) as the solver       Erik F. Tjong, Kim Sang, and Fien De Meulder. 2003. In-
type in LIBLINEAR and all other parameters were set to               troduction to the CoNLL-2003 Shared Task: Language-
                                                                     Independent Named Entity Recognition. In Proceedings
   4
   http://mesh.kib.ki.se/swemesh/swemesh_                            of CoNLL-2003.
se.cfm

Weitere ähnliche Inhalte

Ähnlich wie Identification of Entities in Swedish

Adaptive named entity recognition for social network analysis and domain onto...
Adaptive named entity recognition for social network analysis and domain onto...Adaptive named entity recognition for social network analysis and domain onto...
Adaptive named entity recognition for social network analysis and domain onto...Cuong Tran Van
 
Question Answering with Lydia
Question Answering with LydiaQuestion Answering with Lydia
Question Answering with LydiaJae Hong Kil
 
Chapter 1 - Concepts for Object Databases.ppt
Chapter 1 - Concepts for Object Databases.pptChapter 1 - Concepts for Object Databases.ppt
Chapter 1 - Concepts for Object Databases.pptShemse Shukre
 
Domain Specific Named Entity Recognition Using Supervised Approach
Domain Specific Named Entity Recognition Using Supervised ApproachDomain Specific Named Entity Recognition Using Supervised Approach
Domain Specific Named Entity Recognition Using Supervised ApproachWaqas Tariq
 
20140327 - Hashing Object Embedding
20140327 - Hashing Object Embedding20140327 - Hashing Object Embedding
20140327 - Hashing Object EmbeddingJacob Xu
 
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...ijnlc
 
Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Editor IJARCET
 
Semantic IoT Semantic Inter-Operability Practices - Part 1
Semantic IoT Semantic Inter-Operability Practices - Part 1Semantic IoT Semantic Inter-Operability Practices - Part 1
Semantic IoT Semantic Inter-Operability Practices - Part 1iotest
 
2D Features-based Detector and Descriptor Selection System for Hierarchical R...
2D Features-based Detector and Descriptor Selection System for Hierarchical R...2D Features-based Detector and Descriptor Selection System for Hierarchical R...
2D Features-based Detector and Descriptor Selection System for Hierarchical R...gerogepatton
 
6. kr paper journal nov 11, 2017 (edit a)
6. kr paper journal nov 11, 2017 (edit a)6. kr paper journal nov 11, 2017 (edit a)
6. kr paper journal nov 11, 2017 (edit a)IAESIJEECS
 
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...gerogepatton
 
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...ijaia
 
NERD: an open source platform for extracting and disambiguating named entitie...
NERD: an open source platform for extracting and disambiguating named entitie...NERD: an open source platform for extracting and disambiguating named entitie...
NERD: an open source platform for extracting and disambiguating named entitie...Raphael Troncy
 
IRJET - Object Detection using Deep Learning with OpenCV and Python
IRJET - Object Detection using Deep Learning with OpenCV and PythonIRJET - Object Detection using Deep Learning with OpenCV and Python
IRJET - Object Detection using Deep Learning with OpenCV and PythonIRJET Journal
 
Integrated approach for domain dimensional information retrieval system by us...
Integrated approach for domain dimensional information retrieval system by us...Integrated approach for domain dimensional information retrieval system by us...
Integrated approach for domain dimensional information retrieval system by us...Alexander Decker
 

Ähnlich wie Identification of Entities in Swedish (20)

Adaptive named entity recognition for social network analysis and domain onto...
Adaptive named entity recognition for social network analysis and domain onto...Adaptive named entity recognition for social network analysis and domain onto...
Adaptive named entity recognition for social network analysis and domain onto...
 
Question Answering with Lydia
Question Answering with LydiaQuestion Answering with Lydia
Question Answering with Lydia
 
D017422528
D017422528D017422528
D017422528
 
Chapter 1 - Concepts for Object Databases.ppt
Chapter 1 - Concepts for Object Databases.pptChapter 1 - Concepts for Object Databases.ppt
Chapter 1 - Concepts for Object Databases.ppt
 
Domain Specific Named Entity Recognition Using Supervised Approach
Domain Specific Named Entity Recognition Using Supervised ApproachDomain Specific Named Entity Recognition Using Supervised Approach
Domain Specific Named Entity Recognition Using Supervised Approach
 
20140327 - Hashing Object Embedding
20140327 - Hashing Object Embedding20140327 - Hashing Object Embedding
20140327 - Hashing Object Embedding
 
B017441015
B017441015B017441015
B017441015
 
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...
DUTCH NAMED ENTITY RECOGNITION AND DEIDENTIFICATION METHODS FOR THE HUMAN RES...
 
PggLas12
PggLas12PggLas12
PggLas12
 
Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678Ijarcet vol-2-issue-2-676-678
Ijarcet vol-2-issue-2-676-678
 
Semantic IoT Semantic Inter-Operability Practices - Part 1
Semantic IoT Semantic Inter-Operability Practices - Part 1Semantic IoT Semantic Inter-Operability Practices - Part 1
Semantic IoT Semantic Inter-Operability Practices - Part 1
 
PowerMagpie
PowerMagpiePowerMagpie
PowerMagpie
 
SpectralClassificationOfStars
SpectralClassificationOfStarsSpectralClassificationOfStars
SpectralClassificationOfStars
 
2D Features-based Detector and Descriptor Selection System for Hierarchical R...
2D Features-based Detector and Descriptor Selection System for Hierarchical R...2D Features-based Detector and Descriptor Selection System for Hierarchical R...
2D Features-based Detector and Descriptor Selection System for Hierarchical R...
 
6. kr paper journal nov 11, 2017 (edit a)
6. kr paper journal nov 11, 2017 (edit a)6. kr paper journal nov 11, 2017 (edit a)
6. kr paper journal nov 11, 2017 (edit a)
 
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...
 
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...
2D FEATURES-BASED DETECTOR AND DESCRIPTOR SELECTION SYSTEM FOR HIERARCHICAL R...
 
NERD: an open source platform for extracting and disambiguating named entitie...
NERD: an open source platform for extracting and disambiguating named entitie...NERD: an open source platform for extracting and disambiguating named entitie...
NERD: an open source platform for extracting and disambiguating named entitie...
 
IRJET - Object Detection using Deep Learning with OpenCV and Python
IRJET - Object Detection using Deep Learning with OpenCV and PythonIRJET - Object Detection using Deep Learning with OpenCV and Python
IRJET - Object Detection using Deep Learning with OpenCV and Python
 
Integrated approach for domain dimensional information retrieval system by us...
Integrated approach for domain dimensional information retrieval system by us...Integrated approach for domain dimensional information retrieval system by us...
Integrated approach for domain dimensional information retrieval system by us...
 

Mehr von Findwise

White Arkitekter - Findability Day Roadshow 2017
White Arkitekter - Findability Day Roadshow 2017White Arkitekter - Findability Day Roadshow 2017
White Arkitekter - Findability Day Roadshow 2017Findwise
 
AI och maskininlärning - Findability Day Roadshow 2017
AI och maskininlärning - Findability Day Roadshow 2017AI och maskininlärning - Findability Day Roadshow 2017
AI och maskininlärning - Findability Day Roadshow 2017Findwise
 
De kognitiva eran med IBM Watson - Findability Day Roadshow 2017
De kognitiva eran med IBM Watson - Findability Day Roadshow 2017De kognitiva eran med IBM Watson - Findability Day Roadshow 2017
De kognitiva eran med IBM Watson - Findability Day Roadshow 2017Findwise
 
Findwise and IBM Watson
Findwise and IBM WatsonFindwise and IBM Watson
Findwise and IBM WatsonFindwise
 
Findability Day 2016 - Enterprise Search and Findability Survey 2016
Findability Day 2016 - Enterprise Search and Findability Survey 2016Findability Day 2016 - Enterprise Search and Findability Survey 2016
Findability Day 2016 - Enterprise Search and Findability Survey 2016Findwise
 
Findability Day 2016 - Enterprise Search and Findability Survey 2016
Findability Day 2016 - Enterprise Search and Findability Survey 2016Findability Day 2016 - Enterprise Search and Findability Survey 2016
Findability Day 2016 - Enterprise Search and Findability Survey 2016Findwise
 
Findability Day 2016 - Big data analytics and machine learning
Findability Day 2016 - Big data analytics and machine learningFindability Day 2016 - Big data analytics and machine learning
Findability Day 2016 - Big data analytics and machine learningFindwise
 
Findability Day 2016 - Enterprise social collaboration
Findability Day 2016 - Enterprise social collaborationFindability Day 2016 - Enterprise social collaboration
Findability Day 2016 - Enterprise social collaborationFindwise
 
Findability Day 2016 - SKF case study
Findability Day 2016 - SKF case studyFindability Day 2016 - SKF case study
Findability Day 2016 - SKF case studyFindwise
 
Findability Day 2016 - Structuring content for user experience
Findability Day 2016 - Structuring content for user experienceFindability Day 2016 - Structuring content for user experience
Findability Day 2016 - Structuring content for user experienceFindwise
 
Findability Day 2016 - Augmented intelligence
Findability Day 2016 - Augmented intelligenceFindability Day 2016 - Augmented intelligence
Findability Day 2016 - Augmented intelligenceFindwise
 
Findability Day 2016 - What is GDPR?
Findability Day 2016 - What is GDPR?Findability Day 2016 - What is GDPR?
Findability Day 2016 - What is GDPR?Findwise
 
Findability Day 2016 - Get started with GDPR
Findability Day 2016 - Get started with GDPRFindability Day 2016 - Get started with GDPR
Findability Day 2016 - Get started with GDPRFindwise
 
Digital workplace och informationshantering i office 365
Digital workplace och informationshantering i office 365Digital workplace och informationshantering i office 365
Digital workplace och informationshantering i office 365Findwise
 
Findability Day 2015 - Mickel Grönroos - Findwise - How to increase safety on...
Findability Day 2015 - Mickel Grönroos - Findwise - How to increase safety on...Findability Day 2015 - Mickel Grönroos - Findwise - How to increase safety on...
Findability Day 2015 - Mickel Grönroos - Findwise - How to increase safety on...Findwise
 
Findability Day 2015 - Abby Covert - Keynote - How to make sense of any mess
Findability Day 2015 - Abby Covert - Keynote - How to make sense of any messFindability Day 2015 - Abby Covert - Keynote - How to make sense of any mess
Findability Day 2015 - Abby Covert - Keynote - How to make sense of any messFindwise
 
Findability Day 2015 - Noel Garry - IBM - Information governance and a 360 de...
Findability Day 2015 - Noel Garry - IBM - Information governance and a 360 de...Findability Day 2015 - Noel Garry - IBM - Information governance and a 360 de...
Findability Day 2015 - Noel Garry - IBM - Information governance and a 360 de...Findwise
 
Findability Day 2015 Mattias Ellison - Findwise - Enterprise Search and fin...
Findability Day 2015   Mattias Ellison - Findwise - Enterprise Search and fin...Findability Day 2015   Mattias Ellison - Findwise - Enterprise Search and fin...
Findability Day 2015 Mattias Ellison - Findwise - Enterprise Search and fin...Findwise
 
Findability Day 2015 - Martin White - The future is search!
Findability Day 2015 - Martin White - The future is search!Findability Day 2015 - Martin White - The future is search!
Findability Day 2015 - Martin White - The future is search!Findwise
 
Findability Day 2015 Liam Holley - Dassault systems - Insight and discovery...
Findability Day 2015   Liam Holley - Dassault systems - Insight and discovery...Findability Day 2015   Liam Holley - Dassault systems - Insight and discovery...
Findability Day 2015 Liam Holley - Dassault systems - Insight and discovery...Findwise
 

Mehr von Findwise (20)

White Arkitekter - Findability Day Roadshow 2017
White Arkitekter - Findability Day Roadshow 2017White Arkitekter - Findability Day Roadshow 2017
White Arkitekter - Findability Day Roadshow 2017
 
AI och maskininlärning - Findability Day Roadshow 2017
AI och maskininlärning - Findability Day Roadshow 2017AI och maskininlärning - Findability Day Roadshow 2017
AI och maskininlärning - Findability Day Roadshow 2017
 
De kognitiva eran med IBM Watson - Findability Day Roadshow 2017
De kognitiva eran med IBM Watson - Findability Day Roadshow 2017De kognitiva eran med IBM Watson - Findability Day Roadshow 2017
De kognitiva eran med IBM Watson - Findability Day Roadshow 2017
 
Findwise and IBM Watson
Findwise and IBM WatsonFindwise and IBM Watson
Findwise and IBM Watson
 
Findability Day 2016 - Enterprise Search and Findability Survey 2016
Findability Day 2016 - Enterprise Search and Findability Survey 2016Findability Day 2016 - Enterprise Search and Findability Survey 2016
Findability Day 2016 - Enterprise Search and Findability Survey 2016
 
Findability Day 2016 - Enterprise Search and Findability Survey 2016
Findability Day 2016 - Enterprise Search and Findability Survey 2016Findability Day 2016 - Enterprise Search and Findability Survey 2016
Findability Day 2016 - Enterprise Search and Findability Survey 2016
 
Findability Day 2016 - Big data analytics and machine learning
Findability Day 2016 - Big data analytics and machine learningFindability Day 2016 - Big data analytics and machine learning
Findability Day 2016 - Big data analytics and machine learning
 
Findability Day 2016 - Enterprise social collaboration
Findability Day 2016 - Enterprise social collaborationFindability Day 2016 - Enterprise social collaboration
Findability Day 2016 - Enterprise social collaboration
 
Findability Day 2016 - SKF case study
Findability Day 2016 - SKF case studyFindability Day 2016 - SKF case study
Findability Day 2016 - SKF case study
 
Findability Day 2016 - Structuring content for user experience
Findability Day 2016 - Structuring content for user experienceFindability Day 2016 - Structuring content for user experience
Findability Day 2016 - Structuring content for user experience
 
Findability Day 2016 - Augmented intelligence
Findability Day 2016 - Augmented intelligenceFindability Day 2016 - Augmented intelligence
Findability Day 2016 - Augmented intelligence
 
Findability Day 2016 - What is GDPR?
Findability Day 2016 - What is GDPR?Findability Day 2016 - What is GDPR?
Findability Day 2016 - What is GDPR?
 
Findability Day 2016 - Get started with GDPR
Findability Day 2016 - Get started with GDPRFindability Day 2016 - Get started with GDPR
Findability Day 2016 - Get started with GDPR
 
Digital workplace och informationshantering i office 365
Digital workplace och informationshantering i office 365Digital workplace och informationshantering i office 365
Digital workplace och informationshantering i office 365
 
Findability Day 2015 - Mickel Grönroos - Findwise - How to increase safety on...
Findability Day 2015 - Mickel Grönroos - Findwise - How to increase safety on...Findability Day 2015 - Mickel Grönroos - Findwise - How to increase safety on...
Findability Day 2015 - Mickel Grönroos - Findwise - How to increase safety on...
 
Findability Day 2015 - Abby Covert - Keynote - How to make sense of any mess
Findability Day 2015 - Abby Covert - Keynote - How to make sense of any messFindability Day 2015 - Abby Covert - Keynote - How to make sense of any mess
Findability Day 2015 - Abby Covert - Keynote - How to make sense of any mess
 
Findability Day 2015 - Noel Garry - IBM - Information governance and a 360 de...
Findability Day 2015 - Noel Garry - IBM - Information governance and a 360 de...Findability Day 2015 - Noel Garry - IBM - Information governance and a 360 de...
Findability Day 2015 - Noel Garry - IBM - Information governance and a 360 de...
 
Findability Day 2015 Mattias Ellison - Findwise - Enterprise Search and fin...
Findability Day 2015   Mattias Ellison - Findwise - Enterprise Search and fin...Findability Day 2015   Mattias Ellison - Findwise - Enterprise Search and fin...
Findability Day 2015 Mattias Ellison - Findwise - Enterprise Search and fin...
 
Findability Day 2015 - Martin White - The future is search!
Findability Day 2015 - Martin White - The future is search!Findability Day 2015 - Martin White - The future is search!
Findability Day 2015 - Martin White - The future is search!
 
Findability Day 2015 Liam Holley - Dassault systems - Insight and discovery...
Findability Day 2015   Liam Holley - Dassault systems - Insight and discovery...Findability Day 2015   Liam Holley - Dassault systems - Insight and discovery...
Findability Day 2015 Liam Holley - Dassault systems - Insight and discovery...
 

Kürzlich hochgeladen

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 

Kürzlich hochgeladen (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 

Identification of Entities in Swedish

  • 1. Identification of Entities in Swedish Andreas Salomonsson⇤ Svetoslav Marinov⇤ Pierre Nugues† Findwise AB, Drottninggatan 5 ⇤ SE-411 14 Gothenburg, Sweden {andreas.salomonsson,svetoslav.marinov}@findwise.com † Department of Computer Science, Lund University S-221 00 Lund, Sweden pierre.nugues@cs.lth.se 1. Introduction entities: person, organization, location, work, event, and A crucial aspect of search applications is the possibility to other. It reports an F-score of 0.92 measured on test cor- identify named entities in free-form text and provide func- pus of 1800 words. On a test set of 40 000 words without tionality for entity-based, complex queries towards the in- gazetteers, the recall of the system drops from 0.91 to 0.53. dexed data. By enriching each entity with semantically rel- No information of precision is given by the authors. evant information acquired from outside the text, one can create the foundation for an advanced search application. 3. System Description Thus, given a document about Denmark, where neither of 3.1 Named Entity Recognition the words Copenhagen, country, nor capital are mentioned, We have tackled the NER problem by relying entirely on it should be possible to retrieve the document by querying machine–learning techniques. We use a linear classifier, for Copenhagen or European country. LIBLINEAR (Fan et al., 2008), and train and test the sys- In this paper, we report how we have tackled this prob- tem on the Stockholm-Ume˚ Corpus 2.0 (SUC) (Ejerhed et a lem. We will, however, concentrate only on the two tasks al., 2006). which are central to the solution, namely named entity In addition to the POS tags of the tokens, SUC con- recognition (NER) and enrichment of the discovered en- tains information about nine different categories of named tities by relying on linked data from knowledge bases such entities: person, place, institution, work, animal, product, as YAGO2 and DBpedia. We remain agnostic to all other myth, event, and other. Following the standards in CoNLL details of the search application, which can be implemented 2003, we chose to identify four categories: person, organi- in a relatively straight-forward way by using, e.g. Apache zation, location, and miscellaneous, thus merging the prod- Solr1 . uct, myth, event, animal, work and other classes in a miscel- The work deals only with Swedish and is restricted to laneous category and mapping institution to organization, two domains: news articles2 and medical texts3 . As a and place to location. byproduct, our method achieves state-of-the-art results for The most important features of the classifier are: the POS Swedish NER and to our knowledge there are no previously tags of the surrounding and the current token, the word to- published works on employing linked data for Swedish for kens themselves, and the previous two named entity tags. the two domains at hand. Other features include Booleans to describe the initial capi- 2. Named Entity Recognition for Swedish talization, if the word contains periods, and contains digits. As advocated by Ratinov and Roth (2009), we employ the Named entity recognition is an already well-established re- BILOU (Begin Inside Last Outside Unique) annotation for search area with a number of conferences and evaluations the entities. dedicated to the task (e.g. CoNLL 2003 (Tjong et al., 2003)). While many systems have been created for English 3.2 Linked Data and other languages, much fewer works have been reported on Swedish. Linked Data is a part of the semantic web initiative. In ˚ o Dalianis and Astr¨ m (2001) and Johannessen et al. its core, it consists of concepts interlinked with each other (2005) are examples of NER systems for Swedish. Dalianis by RDF triples. This allows us to augment the discov- ˚ o and Astr¨ m (2001) employ rules, lexicons, and machine– ered entities with additional information related to them. learning techniques in order to recognize four types of enti- We have used the semantic network YAGO2 (Hoffart et ties: person, location, organization, and time. It achieves an al., 2011) and the DBpedia knowledge base. Each named overall F-score of 0.49 on 100 manually tagged news texts. entity we extract from the NER module is mapped to an The Swedish system described in Johannessen et al. identifier in YAGO2 if the entity exists in the semantic (2005) is rule-based, but relies on shallow parsing and em- network. We can then use information about the entity ploys gazetteers as well. It can recognize six types of and its relations to other entities. YAGO2 is stored and queried using Sesame RDF repository from openRDF.org, 1 http://lucene/apache.org/solr/ and one can use SPARQL as the query language. One 2 Articles crawled from dn.se of the most important predicates in YAGO2 for our work 3 Articles from 1177.se is the isCalled predicate. Given an entity E we can
  • 2. Category Exist Found Correct Precision (%) Recall (%) F1 (%) Persons 15128 17198 13626 78.17 90.47 83.87 Organizations 6332 4540 3089 68.33 47.49 56.03 Locations 8773 8974 6926 76.97 78.32 77.64 Miscellaneous 3956 2051 1249 64.73 29.60 40.62 Total 34189 32763 24890 75.77 72.35 74.02 Unlabelled 34189 32922 30655 93.11 89.66 91.36 Table 1: NER evaluation use SELECT ?id WHERE {?id isCalled "E"}? their default values. While our system achieves very good to get one or several unique identifiers. If there is only results for Swedish (see Table 1), it is difficult to compare one, we map the entity with its identifier. When multiple it to previous systems due to the differences in test data identifiers are found, we use a very simple method for dis- and number of categories. Yet, we see more room for im- ambiguation: We take the identifier with most information provement by adding gazetteers and using word clusters as related to it. features. In addition, we have noticed inconsistencies in the While YAGO2 was used primarily with the news articles, annotation of entities in SUC. Titles are sometimes part of DBPedia was employed for the medical texts. First each the entities and sometimes not. medical term in the SweMESH4 taxonomy was mapped to Finally, by enriching entities with related information al- an unique identifier from DBPedia. We then searched in lows us to retrieve more documents, cluster them in a better the document only for those medical terms which are in way or populate ontologies by using such data. As the ex- SweMESH and use the unique identifier to extract more tracted information resides in a META field of an indexed information about them. document and the field often gets a higher score, the docu- ment will get an overall higher rank in the result list. 4. Architecture Overview The core of the system which identifies and augments 6. References named entities is implemented as a pipeline. The text of ˚ o Hercules Dalianis and Erik Astr¨ m. 2001. SweNam– each document passes through the following stages: tok- A Swedish named entity recognizer. Technical report, enization, sentence detection, POS tagging, NER, extrac- KTH. tion of semantic information. At the end of the pipeline, Eva Ejerhed, Gunnel K¨ llgren, and Benny Brodda. 2006. a the augmented document is sent for indexing. Stockholm-ume˚ corpus version 2.0. a The tokenizer was implemented by modifying the Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Lucene tokenizer used by Apache Solr. It uses regular ex- Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library pressions to define what a token is. The sentence detector is for large linear classification. Journal of Machine Learn- rule-based and uses information about the types of the to- ing Research, 9:1871–1874. kens, and the tokens themselves as identified in the previous P´ ter Hal´ csy, Andr´ s Kornai, and Csaba Oravecz. 2007. e a a step. The POS tagging stage employs HunPos (Hal´ csy et a HunPos: an open source trigram tagger. In Proceedings al., 2007) which has been trained on SUC. The NER stage of the ACL 2007 Demo and Poster Sessions. is the system which was described in Section 3.1 above. Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Finally, all named entities are enriched with semantic in- Edwin Lewis-Kelham, Gerard de Melo, and Gerhard formation if such is available from the knowledge bases. Weikum. 2011. Yago2: Exploring and querying world Given the sentence Danmark ligger i Nordeuropa., the knowledge in time, space, context, and many languages. two locations, Danmark and Nordeuropa, are identified. In Proceedings of the 20th International Conference We then proceed by mapping the two entities to their Companion on World Wide Web (WWW 2011). ACM. corresponding identifiers in the semantic network. The ˚ Janne Bondi Johannessen, Kristin Hagen, Asne Haaland, YAGO2 identifier for Danmark is http://www.mpii. Andra Bj¨ rk J´ nsdottir, Anders Nøklestad, Dimitris o o de/yago/resource/Denmark and we use it to extract Kokkinakis, Paul Meurer, Eckhard Bick, and Dorte Hal- information, e.g. Denmark is a European country; its capi- trup. 2005. Named Entity Recognition for the Mainland tal is Copenhagen, with which we then augment the META- Scandinavian Languages. Literary and Linguistic Com- data of the document. puting, 20(1). Lev Ratinov and Dan Roth. 2009. Design challenges and 5. Results and Discussion misconceptions in named entity recognition. In CoNLL We have evaluated the performance of the NER system with ’09: Proceedings of the Thirteenth Conference on Com- a 10-fold cross-validation on SUC. We used L2-regularized putational Natural Language Learning. ACL. L2-loss support vector classification (primal) as the solver Erik F. Tjong, Kim Sang, and Fien De Meulder. 2003. In- type in LIBLINEAR and all other parameters were set to troduction to the CoNLL-2003 Shared Task: Language- Independent Named Entity Recognition. In Proceedings 4 http://mesh.kib.ki.se/swemesh/swemesh_ of CoNLL-2003. se.cfm