Evaluation of NLP projects
Why automatic metrics are not always enough
Veronika Snizhko
Automatic Evaluation Metrics
● Objective and consistent
● Time-saving
● Cost-effective
● Repeatable and scalable
● Benchmarking
● Standardization
● Transparency
Categories of automatic evaluation metrics
Generic metrics
can be applied to a variety of situations and datasets; examples are precision, accuracy, and perplexity.
Task-specific metrics
are limited to a given task, such as machine translation (often evaluated with BLEU or ROUGE) or named entity recognition (often evaluated with seqeval).
Dataset-specific metrics
aim to measure model performance on specific benchmarks; for instance, GLUE and SQuAD each have a dedicated evaluation metric.
● Linguistic
● Semantic
● Diversity
● Factual correctness
● Engagement
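As a minimal sketch of the generic metrics named above, accuracy and precision can be computed directly from gold and predicted labels. The label lists below are made-up illustrative data, and returning 0.0 when nothing is predicted positive is a convention chosen for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Of the items predicted as `positive`, the fraction that truly are."""
    gold_for_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not gold_for_predicted_pos:
        return 0.0  # assumption: no positive predictions counts as 0.0
    hits = sum(t == positive for t in gold_for_predicted_pos)
    return hits / len(gold_for_predicted_pos)


# Hypothetical binary labels, for illustration only.
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 1, 1, 0, 0, 0]

acc = accuracy(gold, pred)    # 4 of 6 predictions are correct
prec = precision(gold, pred)  # 2 of 3 positive predictions are correct
print(f"accuracy={acc:.3f} precision={prec:.3f}")
```

Both numbers come from the same predictions but answer different questions, which is one reason a single generic metric rarely tells the whole story.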
Limitations of Automatic Evaluation Metrics
● Lack of Coverage: Automatic evaluation metrics may not capture the full range of nuances and complexities in
natural language.
● Divergence from Human Quality: Automatic evaluation metrics are often designed to approximate human judgments of quality,
but those judgments are not always consistent or reliable. As a result, a metric may give a high score to output
that humans perceive as low quality.
● Lack of Domain Specificity: Automatic evaluation metrics may not capture the domain-specific nuances or
knowledge required for certain NLP tasks, such as medical or legal text, where the language is highly technical and
specialized.
● Lack of Context: Automatic evaluation metrics may not be able to capture the context in which the generated text
is used, which can be critical for assessing the quality of the output. For example, a generated sentence may be
grammatically correct and semantically coherent, but it may not be appropriate in the context of a larger document or
conversation.
● Lack of Creativity: Automatic evaluation metrics may not capture the creativity and novelty of the generated text,
which can be important for certain applications such as creative writing or advertising.
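The divergence described above can be illustrated with a deliberately simplified word-overlap score. This is a sketch, not BLEU or ROUGE: it computes clipped unigram precision only, and the sentences are invented for the example. An output that reverses the meaning of the reference scores high, while a correct paraphrase scores zero.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Share of candidate words that also occur in the reference (clipped counts)."""
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    # Each candidate word is credited at most as often as it appears in the reference.
    overlap = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return overlap / len(cand_tokens)


# Hypothetical sentences, for illustration only.
reference = "the patient should take the medication daily"
wrong = "the patient should not take the medication daily"  # meaning reversed
paraphrase = "this medicine must be taken every day"        # meaning preserved

score_wrong = unigram_precision(wrong, reference)
score_paraphrase = unigram_precision(paraphrase, reference)
print(score_wrong, score_paraphrase)
```

A human reader would rank these two outputs the other way round, which is the kind of case where human evaluation catches what the metric misses.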
Human Evaluation
● Human evaluation provides a more comprehensive understanding of the quality of NLP models. By
having humans assess the output, we can capture aspects of language that are difficult for
machines to measure, such as humor, sarcasm, and irony.
● Human evaluation can identify errors in the output that automatic metrics may miss.
Humans are better able to understand the context and can pick up on nuances that
machines may not capture.
● Human evaluation can help to improve the usability and acceptability of NLP models. By assessing
the naturalness and coherence of the generated text, we can ensure that NLP models produce
outputs that meet the needs of users.
● Human evaluation can help to address issues of bias and fairness in NLP models. By having diverse
groups of humans evaluate the output, we can identify biases that may be present in the model and
work to address them.
Human Evaluation
A language model may have a good perplexity score, but if the generated text is
repetitive or fails to provide accurate or relevant information, it may be
considered poor by human evaluators.
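A small sketch of why a good perplexity score can coexist with poor text: perplexity is the exponential of the average negative log-probability per token, so a model that is very confident about a repetitive sequence scores well. The per-token probabilities and texts below are invented for illustration, and distinct-n (the share of unique n-grams) is used here as one simple, assumed way to flag repetition.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


def distinct_n(tokens, n=2):
    """Share of n-grams that are unique; low values signal repetition."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Hypothetical outputs and model probabilities, for illustration only.
repetitive_text = "i am fine i am fine i am fine".split()
varied_text = "the clinic opens at nine on weekdays".split()
repetitive_probs = [0.9] * len(repetitive_text)  # model is very confident
varied_probs = [0.3] * len(varied_text)          # model is less confident

ppl_repetitive = perplexity(repetitive_probs)
ppl_varied = perplexity(varied_probs)
div_repetitive = distinct_n(repetitive_text)
div_varied = distinct_n(varied_text)
print(ppl_repetitive, ppl_varied, div_repetitive, div_varied)
```

Under these assumed numbers the repetitive output has the lower (better) perplexity but the lower bigram diversity, matching the situation described above.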
https://www.semanticscholar.org/paper/Human-Evaluation-of-Conversations-is-an-Open-the-of-Smith-Hsu/94f02394a8f019d7ece7eb9612e96253ba97f30c
Limitations of Human Evaluation
● Cost and time: Human evaluation can be expensive and time-consuming, especially when large
amounts of data need to be evaluated.
● Subjectivity: Human evaluators may have different interpretations and opinions, which can lead to
inconsistencies in the evaluation process.
● Biases: Human evaluators may have biases based on factors like cultural background, gender, or
personal preferences. These biases can influence their evaluations and lead to unfair assessments
of NLP models.
● Scale: It can be difficult to scale human evaluation to large datasets or applications, which can limit
its usefulness in some contexts.
● Reproducibility: Human evaluation can be difficult to reproduce, as different evaluators may have
different interpretations and opinions. This can make it difficult to compare results across different
studies or evaluations.
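One common way to quantify the subjectivity described above is an inter-annotator agreement statistic. The sketch below computes Cohen's kappa for two annotators, which corrects raw agreement for the agreement expected by chance. The ratings are made-up illustrative data, and the slides themselves do not prescribe this statistic.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(ratings_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of eight model outputs, for illustration only.
annotator_a = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
annotator_b = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa={kappa:.2f}")
```

Here the annotators agree on 6 of 8 items, but half of that agreement is expected by chance, so kappa is noticeably lower than the raw agreement rate. Reporting such a figure alongside human scores makes results easier to compare across studies.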
Conclusion
To overcome the limitations of automatic evaluation metrics, it is essential to combine them
with human evaluation. Human evaluation can provide valuable insights into the NLP
system's performance, such as the naturalness and fluency of its output. Additionally, human
evaluation can help to identify the areas where the NLP system needs improvement.
Thank you!
https://www.linkedin.com/in/veronika-snizhko/
