Veronika Snizhko: Evaluating the Quality of an NLP Project: Why Automatic Metrics May Not Be Enough

Evaluation of NLP projects
Why automatic metrics are not always enough
Veronika Snizhko

Automatic Evaluation Metrics
● Objective and consistent
● Time-saving
● Cost-effective
● Repeatable and scalable
● Benchmarking
● Standardization
● Transparency

Categories of automatic evaluation metrics
Generic
can be applied to a variety of situations and datasets, such as precision, accuracy, perplexity
Task-specific
are limited to a given task, such as Machine Translation (often evaluated using BLEU or, less commonly,
ROUGE, which is more typical for summarization) or Named Entity Recognition (often evaluated with seqeval).
Dataset-specific
aim to measure model performance on specific benchmarks: for instance, the GLUE and SQuAD benchmarks
each have a dedicated evaluation metric.
● Linguistic
● Semantic
● Diversity
● Factual correctness
● Engagement
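To make the generic metrics above concrete, here is a minimal pure-Python sketch of accuracy, precision, and perplexity. The labels and token probabilities are made-up toy values, and the function names are illustrative rather than taken from any particular library.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def precision(y_true, y_pred, positive=1):
    """Of the items predicted as `positive`, the fraction that really are."""
    gold_of_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not gold_of_predicted_pos:
        return 0.0
    hits = sum(t == positive for t in gold_of_predicted_pos)
    return hits / len(gold_of_predicted_pos)

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Toy data (assumed for illustration only).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

acc = accuracy(y_true, y_pred)               # 4 of 6 labels match
prec = precision(y_true, y_pred)             # 2 of 3 predicted positives are correct
ppl = perplexity([0.25, 0.25, 0.25, 0.25])   # uniform over 4 choices, so about 4.0

print(f"accuracy={acc:.3f} precision={prec:.3f} perplexity={ppl:.3f}")
```

A model that assigns probability 0.25 to every token is as uncertain as a uniform choice among four options, which is why its perplexity comes out at about 4.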

Limitations of Automatic Evaluation Metrics
● Lack of Coverage: Automatic evaluation metrics may not capture the full range of nuances and complexities in
natural language.
● Divergence from Human Quality: Automatic evaluation metrics are often designed to approximate human judgments of
quality, but the approximation is imperfect, and the judgments themselves are not always consistent or reliable. There
may be cases where the metric gives a high score, but humans perceive the output as poor.
● Lack of Domain Specificity: Automatic evaluation metrics may not capture the domain-specific nuances or
knowledge required for certain NLP tasks, such as medical or legal text, where the language is highly technical and
specialized.
● Lack of Context: Automatic evaluation metrics may not be able to capture the context in which the generated text
is used, which can be critical for assessing the quality of the output. For example, a generated sentence may be
grammatically correct and semantically coherent, but it may not be appropriate in the context of a larger document or
conversation.
● Lack of Creativity: Automatic evaluation metrics may not capture the creativity and novelty of the generated text,
which can be important for certain applications such as creative writing or advertising.
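The divergence between a metric and human judgment can be illustrated with a toy word-overlap score. The sketch below uses clipped unigram precision, a simplified ingredient of BLEU-style metrics and not the full BLEU formula; the sentences are invented for illustration.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: share of candidate words also found in the reference."""
    cand_tokens = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate word is credited at most as often as it occurs in the reference.
    overlap = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return overlap / len(cand_tokens)

reference = "the cat is on the mat"
negated = "the cat is not on the mat"       # opposite meaning, high word overlap
paraphrase = "a feline sits upon the rug"   # roughly the same meaning, low word overlap

score_negated = unigram_precision(negated, reference)        # 6/7, about 0.86
score_paraphrase = unigram_precision(paraphrase, reference)  # 1/6, about 0.17

print(f"negated={score_negated:.2f} paraphrase={score_paraphrase:.2f}")
```

The sentence that reverses the meaning scores far higher than the acceptable paraphrase, which is the opposite of how a human reader would likely rank them.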

Human Evaluation
● Human evaluation provides a more comprehensive understanding of the quality of NLP models. By
having humans assess the output, we can capture aspects of language that are difficult for
machines to measure, such as humor, sarcasm, and irony.
● Human evaluation can identify errors or mistakes in the output that may be missed by automatic
metrics. Humans are better able to understand the context and can pick up on nuances that may not
be captured by machines.
● Human evaluation can help to improve the usability and acceptability of NLP models. By assessing
the naturalness and coherence of the generated text, we can ensure that NLP models produce
outputs that meet the needs of users.
● Human evaluation can help to address issues of bias and fairness in NLP models. By having diverse
groups of humans evaluate the output, we can identify biases that may be present in the model and
work to address them.

A language model may have a good perplexity score, but if the generated text is
repetitive or fails to provide accurate or relevant information, it may be
considered poor by human evaluators.
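Repetition of this kind is something perplexity does not directly penalize. One simple complementary check is a diversity ratio commonly called distinct-n: the number of unique n-grams divided by the total number of n-grams. The sketch below is a minimal whitespace-tokenized version with invented example sentences; it is not tied to any particular library.

```python
def distinct_n(text, n=2):
    """Ratio of unique n-grams to total n-grams; lower values mean more repetition."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

repetitive = "i do not know i do not know i do not know"
varied = "the museum opens at nine and closes at six on weekdays"

d2_repetitive = distinct_n(repetitive)  # 4 unique bigrams out of 11
d2_varied = distinct_n(varied)          # all 10 bigrams are unique

print(f"repetitive={d2_repetitive:.2f} varied={d2_varied:.2f}")
```

A low distinct-n flags repetitive output, but it says nothing about accuracy or relevance, so it narrows the gap with human judgment without closing it.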

https://www.semanticscholar.org/paper/Human-Evaluation-of-Conversations-is-an-Open-the-of-Smith-Hsu/94f02394a8f019d7ece7eb9612e96253ba97f30c

Limitations of Human Evaluation
● Cost and time: Human evaluation can be expensive and time-consuming, especially when large
amounts of data need to be evaluated.
● Subjectivity: Human evaluators may have different interpretations and opinions, which can lead to
inconsistencies in the evaluation process.
● Biases: Human evaluators may have biases based on factors like cultural background, gender, or
personal preferences. These biases can influence their evaluations and lead to unfair assessments
of NLP models.
● Scale: It can be difficult to scale human evaluation to large datasets or applications, which can limit
its usefulness in some contexts.
● Reproducibility: Human evaluation can be difficult to reproduce, as different evaluators may have
different interpretations and opinions. This can make it difficult to compare results across different
studies or evaluations.
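Subjectivity and reproducibility can at least be quantified by measuring how often evaluators agree beyond chance. The sketch below computes Cohen's kappa for two raters in plain Python; the ratings are invented, and a real study would typically use more items and possibly more than two raters.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of eight model outputs by two evaluators.
rater_a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
rater_b = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]

kappa = cohens_kappa(rater_a, rater_b)  # observed 0.75, expected 0.5, so kappa is 0.5
print(f"kappa={kappa:.2f}")
```

Here the raters agree on 75% of items, but half of that agreement would be expected by chance, so kappa is only 0.5. Reporting such a score alongside human evaluation results makes them easier to compare across studies.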

Conclusion
To overcome the limitations of automatic evaluation metrics, it is essential to combine them
with human evaluation. Human evaluation can provide valuable insights into the NLP
system's performance, such as the naturalness and fluency of its output. Additionally, human
evaluation can help to identify the areas where the NLP system needs improvement.

Thank you!
https://www.linkedin.com/in/veronika-snizhko/
