Sotiria bampatzani wi_mlds_presentation_20200203

Named Entity Recognition (NER) from a business point of view:
coupling a rule-based approach with Machine Learning algorithms
Sotiria Bampatzani
NLP Data Engineer - QWAM Content Intelligence
Paris, France

CONTENTS
• Presentation
• Named Entity Recognition (NER)
• The rule-based approach
• Adding Machine Learning to the mix
• Use case example
• Not stopping there…
• Conclusion

PRESENTATION
QWAM Content Intelligence
• QWAM Content Intelligence is a software vendor that provides innovative solutions
for analyzing textual data and extracting insights with its AI and semantic
technologies:
o Search engine to manage textual and press/media content
o SaaS solution for real-time web information monitoring
o Analytics platform for extracting key information from textual data

NAMED ENTITY RECOGNITION
Brief introduction
• 1987: first studies on information extraction (IE)
• 1991: first study on Named Entity Recognition (NER)
• 1995: NER becomes one of the basic Natural Language Processing (NLP) tasks
Named Entity Recognition = Entity Identification + Entity Classification
How…
• Rule-based approach: annotation rules
• Learning approach: word embeddings, statistical models, neural networks, etc.
• Hybrid approach: combination of the rule-based and learning approaches

THE RULE-BASED APPROACH
Advantages of this approach
• Robust
• Accurate results
• Adaptable to new types of entities
Drawbacks of this approach
• Based on context-free grammars and lexicon (gazetteer) lists, which are costly to
maintain and update
• Impossible to handle all spelling variants and the resulting ambiguity
• Discovering new entities is very difficult, if not impossible
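
As a rough illustration of the approach described above, the following Python sketch combines a gazetteer lookup with one hand-written pattern rule. The gazetteer entries, the `ORG_SUFFIX_RULE` pattern and the function name `annotate` are hypothetical and are not QWAM's actual rules.

```python
import re

# Hypothetical gazetteer: surface form -> entity type (illustrative entries only).
GAZETTEER = {
    "Paris": "LOCATION",
    "Atos": "ORGANIZATION",
}

# Hypothetical annotation rule: capitalized words followed by a legal suffix.
ORG_SUFFIX_RULE = re.compile(r"\b((?:[A-Z][\w&-]*\s)+(?:SA|SAS|Inc|Ltd))\b")


def annotate(text):
    """Return a sorted list of (start, end, surface, entity_type) annotations."""
    annotations = []
    # 1. Gazetteer lookup: exact, case-sensitive, whole-word matches.
    for surface, entity_type in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            annotations.append((match.start(), match.end(), surface, entity_type))
    # 2. Pattern rule: finds organizations that are absent from the gazetteer.
    for match in ORG_SUFFIX_RULE.finditer(text):
        annotations.append(
            (match.start(1), match.end(1), match.group(1), "ORGANIZATION")
        )
    return sorted(annotations)


if __name__ == "__main__":
    print(annotate("Atos opened an office in Paris with Nehs Digital SAS."))
```

The same function returns nothing for "atos is hiring": the lowercase variant is neither in the gazetteer nor covered by the pattern, which is the spelling-variant drawback listed above.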

ADDING ML TO THE MIX…
Hybrid approach
• Creation of a dataset containing over 40M news articles with the use of one of
QWAM’s solutions, Ask’n’Read
• Annotation of the aforementioned dataset with the annotation rules developed by our
Text Analytics team
• Use of this annotated dataset in order to train ML models
o RNN/LSTM, word embeddings (word2vec), BERT…
However…
Data preprocessing and filtering do not result in a 100% “clean” dataset.
The training set also contains errors or missing annotations!
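
One common way to turn rule-based annotations into training data for a sequence model is to project them onto tokens as BIO tags. The sketch below assumes whitespace tokenization and character-offset annotations; the function name `to_bio` and the sample spans are hypothetical. Tokens that the rules missed simply receive `O`, which is one way missing annotations end up in the training set.

```python
import re


def to_bio(text, annotations):
    """Project (start, end, entity_type) character spans onto whitespace tokens.

    Returns a list of (token, tag) pairs using the BIO scheme.
    """
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, entity_type in annotations:
            # A token belongs to an entity if it lies inside the annotated span.
            if match.start() >= start and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{entity_type}"
                break
        tagged.append((match.group(), tag))
    return tagged


if __name__ == "__main__":
    sentence = "Atos acquires In Fidem"
    # Hypothetical rule output: two organization spans given as character offsets.
    spans = [(0, 4, "ORGANIZATION"), (14, 22, "ORGANIZATION")]
    for token, tag in to_bio(sentence, spans):
        print(token, tag)
```

Whitespace tokenization is a simplification: a token with attached punctuation (for example "Fidem.") would fall outside the span and be tagged `O`, so a real pipeline would need a proper tokenizer.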

DATASET EXAMPLE
Source: https://www.phonandroid.com/samsung-annonce-arrivee-smartphones-ecran-enroulable-coulissant.html

ADDING ML TO THE MIX…
Evaluation
• The ML model correctly identified and classified new entities, which were then
added to our gazetteer lists.
• Statistical evaluation showed that a number of errors resulted from specific
annotation rules. These rules were subsequently improved.
• The dataset is then reannotated with the improved annotation rules, and the cycle
starts anew…
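
The feedback loop above can be sketched roughly as follows: model predictions that are absent from the gazetteer become candidates for addition, and errors are counted per annotation rule to show which rules deserve a review. All names (`new_gazetteer_candidates`, `errors_per_rule`, the rule identifiers) are hypothetical, and the minimum-frequency threshold is an assumption, not a value from the talk.

```python
from collections import Counter


def new_gazetteer_candidates(predictions, gazetteer, min_count=2):
    """Return predicted (surface, type) pairs unknown to the gazetteer.

    predictions: iterable of (surface, entity_type) pairs produced by the model.
    gazetteer: dict mapping known surface forms to their entity type.
    min_count: assumed frequency threshold that filters one-off predictions.
    """
    counts = Counter(p for p in predictions if p[0] not in gazetteer)
    return sorted(pair for pair, n in counts.items() if n >= min_count)


def errors_per_rule(error_rule_ids):
    """Count errors per rule identifier, most frequent first.

    error_rule_ids: iterable with one rule identifier per erroneous annotation.
    """
    return Counter(error_rule_ids).most_common()
```

In this sketch a human would still review the candidates before they reach the gazetteer; the threshold only reduces the number of one-off predictions to look at.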
…And what about client data?

USE CASE - EXAMPLE
Client data
• The need to identify new types of entities arises. Extracting key information that is
not limited to predefined categories (person names, locations, organizations, etc.) is
crucial for a thorough analysis of the data.
• The size and often sensitive nature of the dataset, as well as the time allocated to
the project, may not allow for machine learning.
QWAM’s solution…
• Preprocessing and annotation of the data with the “standard” application.
• Identification of a priori “interesting” entities in the data, using an annotation rule
designed to “discover” potentially interesting information.
• Use of these annotations to build a dedicated ontology.
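
A “discovery” rule of this kind could look roughly like the sketch below: it collects capitalized multi-word expressions that the standard gazetteer does not know and ranks them by frequency, so that an engineer can review the most promising ones when building the ontology. The pattern and the names `DISCOVERY_RULE` and `discover_candidates` are assumptions for illustration; the actual QWAM rule is not described in the talk.

```python
import re
from collections import Counter

# Hypothetical discovery rule: two or more consecutive capitalized words.
DISCOVERY_RULE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def discover_candidates(documents, known_entities):
    """Rank unknown capitalized expressions by how often they occur."""
    counts = Counter()
    for document in documents:
        for match in DISCOVERY_RULE.finditer(document):
            # Skip expressions the standard gazetteer already covers.
            if match.group() not in known_entities:
                counts[match.group()] += 1
    return counts.most_common()
```

Such a pattern over-generates (a capitalized sentence-initial word followed by a capitalized term would be swallowed into the match), which is why the output is treated as a list of suggestions to review rather than as final annotations.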

USE CASE - EXAMPLE
QWAM Ontology Manager

USE CASE - EXAMPLE
How ML is part of Ontology Manager
• Suggestions from a machine learning algorithm are incorporated into the platform and
proposed to users in order to promote and facilitate ontology evolution.

NOT STOPPING THERE…
Establishing relations between entities
• Once named entity recognition and concept recognition are in place, the next step is
to establish a link between them.
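o “Atos finalise le rachat de la société canadienne In Fidem”
(“Atos finalizes the acquisition of the Canadian company In Fidem”)
Company-buys-Company
o “TESSI signe un partenariat stratégique avec NEHS DIGITAL”
(“TESSI signs a strategic partnership with NEHS DIGITAL”)
Company-partners with-Company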
• The same methods are used as for named entity and concept recognition.
o A “standard” gazetteer with these expressions already exists, which allows for an
initial annotation and recognition.
o Another annotation rule is used in order to discover new expressions and
relations between different types of entities.
o An ML model is trained for further exploration.
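
A minimal sketch of the first step, assuming entities are already annotated with character offsets: a small gazetteer of trigger expressions is looked up in the text between two consecutive company mentions. The trigger entries, the relation labels and the function name `extract_relations` are hypothetical.

```python
# Hypothetical relation gazetteer: trigger expression -> relation label.
RELATION_GAZETTEER = {
    "finalise le rachat de": "buys",
    "signe un partenariat stratégique avec": "partners with",
}


def extract_relations(text, entities):
    """Link consecutive entities when a known trigger appears between them.

    entities: list of (start, end, entity_type) spans sorted by start offset.
    Returns (subject, subject_type, relation, object, object_type) tuples.
    """
    relations = []
    for (s1, e1, t1), (s2, e2, t2) in zip(entities, entities[1:]):
        between = text[e1:s2]
        for trigger, label in RELATION_GAZETTEER.items():
            if trigger in between:
                relations.append((text[s1:e1], t1, label, text[s2:e2], t2))
    return relations


if __name__ == "__main__":
    sentence = "Atos finalise le rachat de la société canadienne In Fidem"
    start = sentence.index("In Fidem")
    spans = [(0, 4, "Company"), (start, start + len("In Fidem"), "Company")]
    print(extract_relations(sentence, spans))
```

Trigger expressions missing from the gazetteer yield no relation here; finding those is the job of the discovery rule and the ML model mentioned in the list above.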

USE CASE - EXAMPLE
• Unlike for entities or concepts, relation suggestions are calculated and proposed
based on the content of the whole category.

CONCLUSION
• A rule-based approach is not sufficient if we wish to discover new entities “on the fly”.
• The size and often sensitive nature of a client’s dataset, as well as the time allocated
to the project, may not allow for machine learning in a business setting.
• A hybrid approach seems to be the most efficient method.
o Adaptable to different clients’ data
• In an NLP task such as named entity recognition, errors more often than not
accumulate at every step.
• The biggest advantage of the method we use at QWAM is an NLP engineer’s rapid
and active involvement at every step of the way.
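
To make the point above about errors accumulating at every step more concrete: if each pipeline step were correct independently with some probability, the share of fully correct outputs would be the product of those probabilities. The sketch below uses made-up accuracies purely for illustration; neither the figures nor the independence assumption come from the talk.

```python
import math


def pipeline_accuracy(step_accuracies):
    """Accuracy of a pipeline whose steps fail independently (an assumption)."""
    return math.prod(step_accuracies)


# Made-up example: three steps (e.g. preprocessing, annotation, classification),
# each 95% accurate on its own.
print(round(pipeline_accuracy([0.95, 0.95, 0.95]), 3))  # prints 0.857
```

Under these assumptions three individually good steps already lose about 14% of outputs, which is one argument for having an engineer check intermediate results rather than only the final output.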
