John Blake
University of Aizu, Japan
Inter-annotator agreement:
By hook or by crook
www.orau.org
Overview
• Background
• Case study
– Annotation of scientific research abstracts
– Strategic decision points
• Findings
– Methodological improvements
– Statistical smoke and rhetorical mirrors
• Conclusions
Subjectivity in annotation
• POS tagging, phonetic transcription, etc.
• Annotation guidelines with discussion of boundary cases
• Basic annotation guidelines
• Speaker intuition, e.g. discourse annotation, pragmatics, etc.
Problem:
Vagueness and ambiguity in natural languages
Manning (2011) 97.321 / 10021 = 56.28 %
Automated and manual annotation compared

| | Automated annotation | Manual annotation |
|---|---|---|
| Subjective agent | Software developer | Annotator |
| Subjective stage | Prior to annotation | During annotation |
| Replicability | (near) Perfect | Variable |
| Initial set-up cost | High (if new software) | Low |
| On-going cost | (near) Zero | High |
| Scalable | Yes | No |
| Dependent condition | Availability of training set | Availability of annotators (contingent on time/money) |
| Speed | (near) Instantaneous | Variable |
| Factors considered | Endogeneric | Endo- and exogeneric |
| Strength | Grammatical parsing | Semantic parsing |
Inter-annotator agreement
Crucial issue: Are the annotations correct?
We are interested in validity
• Ability to discriminate without error by placing item into appropriate category
But there is no “Ground truth”
• Linguistic categories are determined by human judgement
→ Implication: We cannot measure correctness directly
So we measure reliability, e.g. reproducibility.
• Intra-annotator reliability
• Inter-annotator reliability
i.e. whether human coders/annotators consistently make the same decisions
→ Assumption 1: lack of reliability rules out validity (text/training issues)
→ Assumption 2: high reliability implies validity
Terminology credit (Artstein & Poesio, 2008)
Idea adapted from Boleda & Evert (2009): https://clseslli09.files.wordpress.com/2009/07/02_iaa-slides2.pdf/
Simple example 1
(abbreviated for length to increase readability)

| Sentence | Coder 1 | Coder 2 | Agreement |
|---|---|---|---|
| We address the problem of … recognition | I | P | No |
| Our aim is to … recognize [x] from [y]. | P | P | Yes |
| [A] is set up as prior information, and its pose is determined by three parameters, which are [j, k and l]. | M | M | Yes |
| An efficient local gradient-based method is proposed to …, which is combined into … framework to estimate [V and W] by iterative evolution | P | R | No |
| It is shown that the local gradient-based method can evaluate accurately and efficiently [V and W]. | R | R | Yes |

Observed agreement between Coder 1 and Coder 2 is 3 out of 5, i.e. 60%
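The 60% figure can be checked with a few lines of Python. This is only a minimal sketch: the function name `observed_agreement` is made up for illustration, and the two label lists are copied from the Coder 1 and Coder 2 columns of the example.

```python
# Move labels copied from the example (I, P, M, R), one per sentence
coder_1 = ["I", "P", "M", "P", "R"]
coder_2 = ["P", "P", "M", "R", "R"]


def observed_agreement(labels_a, labels_b):
    """Proportion of items on which two coders assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both coders must label the same items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


print(observed_agreement(coder_1, coder_2))  # 3 matches out of 5 items
```

Raw agreement of this kind takes no account of agreement that would occur by chance.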
IAA measures: Kappa coefficient
Observed inter-annotator agreement in the previous example is 60%, but the
chance agreement figure is 20%. Agreement measures must
be corrected for chance agreement (Carletta, 1996).
Kappa coefficient (Cohen, 1960, for two annotators; Fleiss for more than two)
Corrected measure: $K = \frac{P(A) - P(E)}{1 - P(E)}$
where P(A) is the observed agreement and P(E) is the agreement expected by chance.
K = 1 (agreement); K = 0 (no correlation); K = -1 (disagreement)
Interpretation of Kappa
• Landis and Koch (1977): 0.61-0.80 substantial; 0.81+ almost perfect
• Krippendorff (1980): 0.67-0.79 tentative; 0.8+ good
• Green (1997): 0.4-0.74 fair/good; 0.75+ high
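As a rough illustration, the correction can be applied to the five-sentence example. The sketch below makes two assumptions that are not stated on the slide: that the 20% chance figure corresponds to five equally likely move tags, and that Cohen's version instead estimates chance agreement from each coder's own label distribution. The function names are hypothetical.

```python
from collections import Counter

coder_1 = ["I", "P", "M", "P", "R"]
coder_2 = ["P", "P", "M", "R", "R"]


def kappa(p_observed, p_expected):
    """Chance-corrected agreement: (P(A) - P(E)) / (1 - P(E))."""
    return (p_observed - p_expected) / (1 - p_expected)


def cohen_expected_agreement(labels_a, labels_b):
    """Chance agreement estimated from each coder's label proportions."""
    n = len(labels_a)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    tags = counts_a.keys() | counts_b.keys()
    return sum((counts_a[t] / n) * (counts_b[t] / n) for t in tags)


p_a = sum(a == b for a, b in zip(coder_1, coder_2)) / len(coder_1)

# Assumption: 20% chance agreement, i.e. five equally likely tags
print(round(kappa(p_a, 0.20), 2))

# Cohen-style estimate from the coders' own label distributions
p_e = cohen_expected_agreement(coder_1, coder_2)
print(round(p_e, 2), round(kappa(p_a, p_e), 2))
```

Under the 20% assumption kappa is about 0.5; with the marginal-based estimate, chance agreement is about 0.28 and kappa drops to about 0.44. Either way, the value falls below every "substantial" or "good" threshold listed above.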
IAA measures: Sophisticated
e.g. Typical measures used in computational linguistics built
into NLP pipelines, such as NLTK and GATE
Rather than measuring agreement alone, we can measure
both agreement and disagreement, e.g. using Measuring
agreement on set-valued items (MASI) and/or Jaccard
distance. Both MASI (Passonneau, 2006) and Jaccard distance
make use of the union and intersection between sets.
Jaccard formula (Jaccard, 1908 cited in Dunn & Everitt, 2004)
is: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$, and the Jaccard distance is $1 - J(A, B)$.
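Both distances are short enough to sketch directly. The version below is a standalone illustration rather than the NLTK or GATE implementation: the MASI weights (1, 2/3, 1/3, 0) follow the monotonicity scheme described by Passonneau (2006), and the example tag sets are invented.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def masi_distance(a, b):
    """Jaccard similarity weighted by how the two sets overlap."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    overlap = a & b
    if a == b:
        monotonicity = 1.0      # identical sets
    elif a <= b or b <= a:
        monotonicity = 2 / 3    # one set contains the other
    elif overlap:
        monotonicity = 1 / 3    # partial overlap, neither contains the other
    else:
        monotonicity = 0.0      # disjoint sets
    return 1 - (len(overlap) / len(union)) * monotonicity


# Hypothetical case: one annotator assigns two tags, the other assigns one
tags_1 = {"Method", "Result"}
tags_2 = {"Method"}
print(jaccard_distance(tags_1, tags_2))
print(round(masi_distance(tags_1, tags_2), 3))
```

Because MASI penalises subset and partial-overlap cases more heavily than Jaccard does, it gives a larger distance for the same pair of sets unless the sets are identical or disjoint.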
Case study overview
• Moves in scientific research abstracts
• Scientific disciplines
• Core corpus specifications
• Example abstract
• Tagset
• Strategic decision points (tag #IAA extraction)
NB: By convention this far-from-linear study is
presented in a linear fashion when in fact there
were numerous forks, dead-ends and iterations.
Moves in scientific research abstracts
Move definition
“a discoursal or rhetorical unit that performs a coherent
communicative function in a written or spoken discourse”.
(Swales, 2004, p.228)
Move sequences
Example (very short) abstract
5-move code Introduction Purpose Method Results Discussion
Scientific disciplines
Science
  Fundamental
    Empirical
      Natural
        Physical: Materials science
        Life: Botany
      Social: Linguistics
    Theoretical
      Formal: Information theory
  Applied
    Engineering: Evolutionary computation; Knowledge & data engineering; Image processing; Wireless computing; Electronic engineering
    Healthcare: Medical
Core 1000 corpus specifications

| No. | Code | Journal name | # abstracts | # words |
|---|---|---|---|---|
| 1 | EC | Transactions on Evolutionary Computation | 100 | 17,433 |
| 2 | KDE | Transactions on Knowledge and Data Engineering | 100 | 18,407 |
| 3 | IP | Transactions on Image Processing | 100 | 16,859 |
| 4 | IT | Transactions on Information Theory | 100 | 15,982 |
| 5 | WC | Transactions on Wireless Communications | 100 | 15,971 |
| 6 | Mat | Advanced Materials | 100 | 6,078 |
| 7 | Bot | The Plant Cell | 100 | 19,981 |
| 8 | Ling | Applied Linguistics; Journal of Communication; Journal of Cognitive Neuroscience | 100 | 13,587 |
| 9 | Eng | Transactions on Industrial Electronics | 100 | 14,569 |
| 10 | Med | British Medical Journal | 100 | 29,437 |
| | | Total | 1000 | 168,304 |

First 100 abstracts of research articles from top-tier journals published
from Jan 2012.
We study the detection error probability associated with a balanced
binary relay tree, where the leaves of the tree correspond to N
identical and independent sensors. The root of the tree represents a
fusion center that makes the overall detection decision. Each of the
other nodes in the tree is a relay node that combines two binary
messages to form a single output binary message. Only the leaves are
sensors. In this way, the information from the sensors is aggregated
into the fusion center via the relay nodes. In this context, we describe
the evolution of the Type I and Type II error probabilities of the binary
data as it propagates from the leaves toward the root. Tight upper and
lower bounds for the total error probability at the fusion center as
functions of N are derived. These characterize how fast the total error
probability converges to 0 with respect to N , even if the individual
sensors have error probabilities that converge to 1/2.
[IT 120616]
Standard abstract (IT)
Tagset
Manual annotation using UAM Corpus Tool 2.X and 3.X (O'Donnell, 2015)
This layer of annotation is for rhetorical moves.
There are 5 choices of moves and 6 choices of submoves.
In short, each ontological unit is assigned to one of 9 choices.
The “uncertain” tag is designed as a temporary label.
#IAA theme extraction
Strategic decision points
• Research log was kept using themes, e.g. #meth,
#stats, #IAA
• 142 notes relating to #IAA written between 2012 and
2017 were identified.
• The findings presented are the notes that are the
most important and generalizable to other
projects.
Findings overview:
Three types of strategic decisions affecting IAA
1. Methodological decisions
2. Statistical decisions
3. Rhetorical decisions
Findings (1)
Methodological choices to enhance IAA
A. Ontological unit
B. Tagset size
C. Tag clarity of demarcation
D. Catch-all tags
E. Detailed coding booklet
F. Pre-selection, training and testing
G. Easy-to-use tools
H. Monitoring, feedback and regular meetings
I. Pilot studies and small trials
Finding 1a: Ontological unit
Fixed ontological units (i.e. what you code), e.g. each
word, each sentence, simplify calculation of IAA and
increase the IAA since boundaries of each unit are
identical.
Variable ontological units provide researchers with
additional choices on how to calculate (manipulate?)
IAA: boundaries may be identical, subsumed or cross-over.
Which unit do you calculate by: character (including
white space?), letter or word?
Example: "I love you." vs "I love him." (each 8 letters, 3 words, 11 characters)
Agreement ratio: by letter 5/8 ≈ 0.63; by word 2/3 ≈ 0.67; by character 8/11 ≈ 0.73
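The three ratios for this pair of sentences can be reproduced with the sketch below. The helper name `position_agreement` is hypothetical, and it assumes both strings split into the same number of units, which holds for this pair but not for text in general.

```python
def position_agreement(units_a, units_b):
    """Share of aligned positions at which the two sequences match."""
    if len(units_a) != len(units_b):
        raise ValueError("Sequences must be the same length")
    return sum(x == y for x, y in zip(units_a, units_b)) / len(units_a)


s1 = "I love you."
s2 = "I love him."


def letters(text):
    """Alphabetic characters only (no spaces or punctuation)."""
    return [ch for ch in text if ch.isalpha()]


by_letter = position_agreement(letters(s1), letters(s2))    # 5 of 8
by_word = position_agreement(s1.split(), s2.split())        # 2 of 3
by_character = position_agreement(list(s1), list(s2))       # 8 of 11

print(round(by_letter, 3), round(by_word, 3), round(by_character, 3))
```

The same pair of annotations yields three different figures, so a reported ratio means little unless the counting unit is stated.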
Finding 1b: Tagset size
The more tags, the less agreement
Rissanen (1989, as cited in Archer, 2012, n.p.) points out the
“mystery of vanishing reliability”
i.e. the statistical unreliability of annotation that is too detailed.
Obvious with hindsight, but researchers tend to develop tags
that will inform their research rather than result in higher IAA.
1 tag = total agreement (but probably no reason to code)
10 tags = less agreement
100 tags = much less agreement
1000 tags = almost no chance of high IAA
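One part of this effect is easy to quantify. The sketch below assumes, purely for illustration, that two coders choose independently and uniformly at random among the available tags; real annotators do not behave this way, so it shows only how the chance floor falls as the tagset grows.

```python
from fractions import Fraction


def chance_agreement(n_tags):
    """Probability that two coders pick the same tag when each chooses
    uniformly at random from n_tags tags (an illustrative assumption)."""
    if n_tags < 1:
        raise ValueError("n_tags must be at least 1")
    return Fraction(1, n_tags)


for n_tags in (1, 10, 100, 1000):
    print(n_tags, float(chance_agreement(n_tags)))
```

With a single tag agreement is guaranteed; with 1000 tags, agreement by chance is one in a thousand, so any agreement has to come from genuinely shared judgements on many fine distinctions.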
Finding 1c:
Tagset clarity of demarcation
Pilot studies of possible tags and tagsets
Pilot study:
Tagged 100 abstracts using IMRD move and CARS move tags
Difficulty:
1. prevalence of method in IMRD positions
2. demarcation of boundary cases → created SOP, codified in
coding booklet
Final selection:
Dropped both sets of tags and selected Hyland (2004, p.67)
IPMPC tagset
Finding 1d: Catch-all tags
| Tag | Description |
|---|---|
| Fuzzy | Used when difficult to assign to tag in existing tagset |
| Multiple | Used when more than one tag applies |
| Portmanteau | Used when item transcends two tag domains |
| Problematic | Used when impossible to assign tag |
Archer (2012, n.p.) describes four tag types, all of which
increase IAA by providing easy-to-code options for
boundary cases.
My “uncertain” tag is a catch-all. Calculating IAA
including “uncertain” results in higher IAA.
Finding 1e:
Annotation (coding) booklet
Standard operating procedure
• Guidelines, Rules, Examples, Borderline cases disambiguated
Finding 1f:
Training course and test
Course based on annotation booklet
• Face-to-face and/or online
Test based on annotation booklet
• Serialist tests
• Holistic tests
Qualification cut-off points
• e.g. 90% or above: can start annotating
• e.g. 61% to 89%: needs additional training
• e.g. 60% or below: discontinue training
Finding 1g:
Easy-to-use annotation tools
• Tool and instructions!
• UAM Corpus Tool – help forum in Spanish
• Wrote project-specific instruction booklet for annotators
Finding 1h: monitoring,
feedback and regular meetings
I believe these three aspects led to greater retention of
annotators and higher accuracy.
• More monitoring in initial stages (real-time is possible in GATE)
– to identify problems early
• Constructive actionable feedback
– to retain annotator and increase accuracy
• Regular meetings
– annotators who cancelled meetings tended to have a
problem (either with annotation or in their life).
I helped with annotation issues.
Finding 1i: Pilot studies
Various pilot studies and small-scale trials.
Enables researcher to discover issues and proactively avert potential problems
• 136 abstracts SFL annotation of process, participant and circumstance
• 136 abstracts SFL annotation of sub-categories of circumstance
• 10 abstracts Multimethod
• 500 abstracts Lexicogrammatical
• 40 abstracts Specialist vs linguist IMRaD annotation
• 100 abstracts Tagset selection (CARS vs IMRaD)
• 3 people Development of Coding booklet
• 10 abstracts Examples vs. Coding booklet
• 2 people Development of training course
• 500 abstracts Rhetorical moves using coding booklet by self
• 1000 abstracts Rhetorical moves using coding booklet by self & annotators
• 2500 abstracts Rhetorical moves using coding booklet by annotators
Findings (2)
Statistical choices to enhance IAA
A. Cherry-picking population-sample size ratio
B. Random vs systematic
C. Dealing with outliers (annotators)
• Omit [+justify?]; replace with mean [?]
D. Sample selection:
• early vs later coding
• pre-discussion vs. post-discussion
E. Granularity (see next slide)
• Reducing granularity by merging units; fewer
categories, higher agreement
Finding 2e: Granularity
Measures of IAA increase greatly as granularity decreases
[Figure: scale running from lower IAA to higher IAA as granularity decreases]
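The effect can be illustrated with the five-sentence example from earlier in the talk. In the sketch below the coarse two-way grouping ("Opening" vs "Body") is invented for illustration and is not part of the study's tagset.

```python
coder_1 = ["I", "P", "M", "P", "R"]
coder_2 = ["P", "P", "M", "R", "R"]

# Hypothetical coarser tagset: merge four move tags into two groups
COARSE = {"I": "Opening", "P": "Opening", "M": "Body", "R": "Body"}


def agreement(labels_a, labels_b):
    """Proportion of items given the same label by both coders."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def merge(labels, mapping):
    """Replace each fine-grained tag with its coarse group."""
    return [mapping.get(label, label) for label in labels]


fine = agreement(coder_1, coder_2)
coarse = agreement(merge(coder_1, COARSE), merge(coder_2, COARSE))
print(fine, coarse)
```

Merging removes the I/P disagreement on the first sentence, so raw agreement rises from 3 out of 5 to 4 out of 5 without any change in what the coders actually did.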
Findings (3)
Rhetorical choices to enhance IAA
Claim high IAA with no further details
+ gold standard with no further details and/or
+ provide a simple ratio or percentage and/or
+ provide details of sample size
Rely on vagueness and ambiguity to allow reader to
infer higher IAA than found or actual high IAA.
Conclusion
High IAA may be due to
• sound or cogent methodological choices;
but it could also be due to manipulating the
• statistical smoke
(i.e. selecting parameters leading to higher IAA)
and
• rhetorical mirrors.
(i.e. using vagueness/ambiguity to imply that IAA is high)
In most publications in applied linguistics, sufficient
detail is not provided.
Best practice suggestions
• Annotate using tags one level finer than required.
• Create annotation booklet with clear rules,
examples and discussion of boundary cases.
• Develop, trial and require all annotators to
complete a training course.
• Set a benchmark standard.
• Monitor and provide constructive actionable
feedback to annotators.
• Report IAA in sufficient detail to convince
skeptical readers.
Beware of the
skeleton in the cupboard
• Researchers aim to
portray their work as
sound or cogent.
• Actual IAA may differ
from reported IAA
• Be wary of statistical
smoke and
rhetorical mirrors
Any questions, suggestions or
comments?
John Blake
jblake@u-aizu.ac.jp

Interannotator Agreement

  • 1. John Blake University of Aizu, Japan Inter-annotator agreement: By hook or by crook www.orau.org
  • 2. Overview • Background • Case study – Annotation of scientific research abstracts – Strategic decision points • Findings – Methodological improvements – Statistical smoke and rhetorical mirrors • Conclusions 2
  • 3. Subjectivity in annotation 3 POS tagging, Phonetic transcription etc. Annotation guidelines with discussion of boundary cases Basic annotation guidelines Speaker intuition, e.g. discourse annotation, pragmatics, etc. Problem: Vagueness and ambiguity in natural languages Manning (2011) 97.321 / 10021 = 56.28 %
  • 4. Automated and manual annotation compared 4 Automated annotation Manual annotation Subjective agent Software developer Annotator Subjective stage Prior to annotation During annotation Replicability (near) Perfect Variable Initial set up cost High (if new software) Low On-going cost (near) Zero High Scalable Yes No Dependent condition Availability of training set Availability of annotators (contingent on time/money) Speed (near) Instantaneous Variable Factors considered Endogeneric Endo- and exogeneric Strength Grammatical parsing Semantic parsing
  • 5. Inter-annotator agreement 5 Crucial issue: Are the annotations correct? We are interested in validity • Ability to discriminate without error by placing item into appropriate category But there is no “Ground truth” • Linguistic categories are determined by human judgement  Implication: We cannot measure correctness directly So we measure reliability , e.g. reproducibility. • Intra-annotator reliability • Inter-annotator reliability i.e. whether human coders/annotators consistently make same decisions  Assumption 1: lack of reliability rules out validity (text/training issues)  Assumption 2: high reliability implies validity Terminology credit (Artsein & Poesio, 2008) Idea adapted from Boldea & Evert (2009) : https://clseslli09.files.wordpress.com/2009/07/02_iaa- slides2.pdf/
  • 6. Simple example 1 6 (abbreviated for length to increase readability) Sentence Coder 1 Coder 2 Agreement We address the problem of …… recognition I P  Our aim is to …recognize [x] from [y]. P P  [A] is set up as prior information, and its pose is determined by three parameters, which are [j,k and l]. M M  An efficient local gradient-based method is proposed to …, which is combined into … framework to estimate [V and W] by iterative evolution P R  It is shown that the local gradient-based method can evaluate accurately and efficiently [V and W] . R R  Observed agreement between 1 and 2 is 60%
  • 7. IAA measures: Kappa coefficient 7 Inter-annotator agreement of 60% in previous example, but chance agreement figure is 20%. Agreement measures must be corrected for chance agreement (Carletta, 1996). Kappa coefficient (Cohen 1960 for 2, Fleiss for 2+) e.g. Corrected measure: K = P A −P E 1−𝑃(𝐸) 1 (agreement) 0 (no correlation) -1(disagreement) Interpretation of Kappa • Landis and Koch (1977) 0.6-0.79 substantial; 0.8+ perfect • Krippendorff (1980) 0.67-0.79 tentative; 0.8+ good • Green (1997) 0.4-0.74 fair/good; 0.75 high
  • 8. IAA measures: Sophisticated 8 e.g. Typical measures used in computational linguistics built into NLP pipelines, such as NLTK and GATE Rather than measuring agreement alone, we can measure both agreement and disagreement, e.g. using Measuring agreement on set-valued items (MASI) and/or Jaccard distance. Both MASI (Passonneau, 2006) and Jaccard distance make use of the union and intersection between sets. Jaccard formula (Jaccard, 1908 cited in Dunn & Everitt, 2004) is:
  • 9. Case study overview • Moves in scientific research abstracts • Scientific disciplines • Core corpus specifications • Example abstract • Tagset • Strategic decision points (tag #IAA extraction) NB: By convention this far-from-linear study is presented in a linear fashion when in fact there were numerous forks, dead-ends and iterations. 9
  • 10. Moves in scientific research abstracts
    Move definition: “a discoursal or rhetorical unit that performs a coherent communicative function in a written or spoken discourse” (Swales, 2004, p.228)
    Move sequences: example (very short) abstract
    5-move code: Introduction, Purpose, Method, Results, Discussion
  • 11. Scientific disciplines
    Science
    • Fundamental
      – Empirical
        – Natural: Physical (Materials science); Life (Botany)
        – Social: Linguistics
      – Theoretical
        – Formal: Information theory
    • Applied
      – Engineering: Evolutionary computation; Knowledge & data engineering; Image processing; Wireless computing; Electronic engineering
      – Healthcare: Medical
  • 12. Core 1000 corpus specifications
    No. | Code | Journal name | # abstracts | # words
    1 | EC | Transactions on Evolutionary Computation | 100 | 17,433
    2 | KDE | Transactions on Knowledge and Data Engineering | 100 | 18,407
    3 | IP | Transactions on Image Processing | 100 | 16,859
    4 | IT | Transactions on Information Theory | 100 | 15,982
    5 | WC | Transactions on Wireless Communications | 100 | 15,971
    6 | Mat | Advanced Materials | 100 | 6,078
    7 | Bot | The Plant Cell | 100 | 19,981
    8 | Ling | App. Ling; Journal of Comm; J of Cog. Neurosc. | 100 | 13,587
    9 | Eng | Transactions on Industrial Electronics | 100 | 14,569
    10 | Med | British Medical Journal | 100 | 29,437
    Total | | | 1000 | 162,232
    First 100 abstracts of research articles from top-tier journals published from Jan 2012.
  • 13. Standard abstract (IT)
    We study the detection error probability associated with a balanced binary relay tree, where the leaves of the tree correspond to N identical and independent sensors. The root of the tree represents a fusion center that makes the overall detection decision. Each of the other nodes in the tree is a relay node that combines two binary messages to form a single output binary message. Only the leaves are sensors. In this way, the information from the sensors is aggregated into the fusion center via the relay nodes. In this context, we describe the evolution of the Type I and Type II error probabilities of the binary data as it propagates from the leaves toward the root. Tight upper and lower bounds for the total error probability at the fusion center as functions of N are derived. These characterize how fast the total error probability converges to 0 with respect to N, even if the individual sensors have error probabilities that converge to 1/2. [IT 120616]
  • 14. Tagset
    Manual annotation using UAM Corpus Tool 2.X and 3.X (O'Donnell, 2015)
    This layer of annotation is for rhetorical moves. There are 5 choices of moves and 6 choices of submoves. In short, each ontological unit is assigned to one of 9 choices. The “uncertain” tag is designed as a temporary label.
  • 15. #IAA theme extraction: Strategic decision points
    • A research log was kept using themes, e.g. #meth, #stats, #IAA
    • 142 notes relating to #IAA written between 2012 and 2017 were identified.
    • The findings presented are the notes that are the most important and generalizable to other projects.
  • 16. Findings overview: Three types of strategic decisions affecting IAA
    1. Methodological decisions
    2. Statistical decisions
    3. Rhetorical decisions
  • 17. Findings (1) Methodological choices to enhance IAA
    A. Ontological unit
    B. Tagset size
    C. Tag clarity of demarcation
    D. Catch-all tags
    E. Detailed coding booklet
    F. Pre-selection, training and testing
    G. Easy-to-use tools
    H. Monitoring, feedback and regular meetings
    I. Pilot studies and small trials
  • 18. Finding 1a: Ontological unit
    Fixed ontological units (i.e. what you code), e.g. each word or each sentence, simplify the calculation of IAA and increase IAA, since the boundaries of each unit are identical.
    Variable ontological units give researchers additional choices on how to calculate (manipulate?) IAA: identical, subsumed or cross-over spans.
    What unit do you calculate by: character (including white space?), letter or word?
    Example: “I love you.” vs “I love him.” (8 letters, 3 words, 11 characters)
    Agreement ratio: 0.62 by letter, 0.67 by word, 0.72 by character
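The effect of the counting unit can be reproduced with a short sketch. The helper name is illustrative, and the comparison is a simple position-by-position match, which is one plausible reading of how the ratios on the slide were obtained. The exact fractions are 5/8, 2/3 and 8/11, which agree with the figures on the slide up to rounding.

```python
def positional_agreement(units_a, units_b):
    """Share of aligned positions at which the two sequences hold the same unit."""
    if len(units_a) != len(units_b) or not units_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(units_a, units_b)) / len(units_a)


first = "I love you."
second = "I love him."

# Letters only (no spaces or punctuation): 8 units
by_letter = positional_agreement(
    [c for c in first if c.isalpha()],
    [c for c in second if c.isalpha()],
)

# Whitespace-separated words: 3 units
by_word = positional_agreement(first.split(), second.split())

# Every character, including spaces and the full stop: 11 units
by_character = positional_agreement(list(first), list(second))

print(f"by letter: {by_letter:.3f}")
print(f"by word: {by_word:.3f}")
print(f"by character: {by_character:.3f}")
```

The same pair of strings yields three different agreement ratios, so the unit of calculation needs to be reported alongside the figure.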
  • 19. Finding 1b: Tagset size
    The more tags, the less agreement.
    Rissanen (1989, as cited in Archer, 2012, n.p.) points out the “mystery of vanishing reliability”, i.e. the statistical unreliability of annotation that is too detailed.
    Obvious with hindsight, but researchers tend to develop tags that will inform their research rather than result in higher IAA.
    • 1 tag = total agreement (but probably no reason to code)
    • 10 tags = less agreement
    • 100 tags = much less agreement
    • 1000 tags = almost no chance of high IAA
  • 20. Finding 1c: Tagset clarity of demarcation
    Pilot studies of possible tags and tagsets
    Pilot study: Tagged 100 abstracts using IMRD move and CARS move tags
    Difficulties:
    1. prevalence of method in IMRD positions
    2. demarcation of boundary cases
    Created SOP, codified in coding booklet
    Final selection: Dropped both sets of tags and selected the Hyland (2004, p.67) IPMPC tagset
  • 21. Finding 1d: Catch-all tags
    Tag | Description
    Fuzzy | Used when difficult to assign to tag in existing tagset
    Multiple | Used when more than one tag applies
    Portmanteau | Used when item transcends two tag domains
    Problematic | Used when impossible to assign tag
    Archer (2012, n.p.) describes four tag types, all of which increase IAA by providing easy-to-code options for boundary cases.
    My “uncertain” tag is a catch-all. Calculating IAA including “uncertain” results in higher IAA.
  • 22. Finding 1e: Annotation (coding) booklet
    Standard operating procedure
    • Guidelines, rules, examples, borderline cases disambiguated
  • 23. Finding 1f: Training course and test
    Course based on annotation booklet
    • Face-to-face and/or online
    Test based on annotation booklet
    • Serialist tests
    • Holistic tests
    Qualification cut-off points
    • e.g. 90% or above: can start annotating
    • e.g. 61% to 89%: needs additional training
    • e.g. 60% or below: discontinue training
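The qualification rule on this slide can be written as a small decision function. This is only a sketch: the function name is hypothetical, and treating the three example percentages as band boundaries (90 and above, above 60 but below 90, 60 and below) is an assumption about how the cut-offs were applied.

```python
def training_outcome(score):
    """Map a qualification test score (0-100) to the next step for a trainee.

    Band boundaries are an assumed reading of the example cut-offs
    on the slide (90%, 61%, 60%).
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be a percentage between 0 and 100")
    if score >= 90:
        return "start annotating"
    if score > 60:
        return "additional training"
    return "discontinue training"


for score in (95, 75, 60):
    print(score, training_outcome(score))
```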
  • 24. Finding 1g: Easy-to-use annotation tools
    • Tool and instructions!
    • UAM Corpus Tool: help forum in Spanish
    • Wrote project-specific instruction booklet for annotators
  • 25. Finding 1h: Monitoring, feedback and regular meetings
    I believe these three aspects led to greater retention of annotators and higher accuracy.
    • More monitoring in the initial stages (real-time monitoring is possible in GATE), to identify problems early
    • Constructive, actionable feedback, to retain annotators and increase accuracy
    • Regular meetings: annotators who cancelled meetings tended to have a problem (either with annotation or in their life). I helped with annotation issues.
  • 26. Finding 1i: Pilot studies
    Various pilot studies and small-scale trials enable the researcher to discover issues and proactively avert potential problems.
    • 136 abstracts: SFL annotation of process, participant and circumstance
    • 136 abstracts: SFL annotation of sub-categories of circumstance
    • 10 abstracts: Multimethod
    • 500 abstracts: Lexicogrammatical
    • 40 abstracts: Specialist vs linguist IMRaD annotation
    • 100 abstracts: Tagset selection (CARS vs IMRaD)
    • 3 people: Development of coding booklet
    • 10 abstracts: Examples vs. coding booklet
    • 2 people: Development of training course
    • 500 abstracts: Rhetorical moves using coding booklet by self
    • 1000 abstracts: Rhetorical moves using coding booklet by self & annotators
    • 2500 abstracts: Rhetorical moves using coding booklet by annotators
  • 27. Findings (2) Statistical choices to enhance IAA
    A. Cherry-picking population-sample size ratio
    B. Random vs systematic
    C. Dealing with outliers (annotators)
       • Omit [+justify?]; replace with mean [?]
    D. Sample selection:
       • early vs later coding
       • pre-discussion vs. post-discussion
    E. Granularity (see next slide)
       • Reducing granularity by merging units; fewer categories, higher agreement
  • 28. Finding 2e: Granularity
    Measures of IAA increase greatly as granularity decreases.
    [Figure: lower IAA at finer granularity, higher IAA at coarser granularity]
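The granularity effect can be illustrated with the five codes from Simple example 1. In this sketch the merge itself is hypothetical: the Introduction (I) and Purpose (P) tags are collapsed into a single coarser category, labelled "IP" here purely for illustration, and observed agreement is recomputed.

```python
def observed_agreement(a, b):
    """Proportion of items to which both coders assign the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def merge_labels(labels, mapping):
    """Replace each label by its coarser category; unmapped labels are kept."""
    return [mapping.get(label, label) for label in labels]


coder1 = ["I", "P", "M", "P", "R"]
coder2 = ["P", "P", "M", "R", "R"]

# Hypothetical merge: Introduction and Purpose become one category
merge = {"I": "IP", "P": "IP"}

fine = observed_agreement(coder1, coder2)
coarse = observed_agreement(merge_labels(coder1, merge), merge_labels(coder2, merge))

print(f"fine-grained agreement: {fine:.2f}")
print(f"after merging I and P: {coarse:.2f}")
```

Nothing about the annotators' decisions has changed between the two figures; only the tagset has been coarsened after the fact.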
  • 29. Findings (3) Rhetorical choices to enhance IAA
    Claim high IAA with no further details
    + gold standard with no further details and/or
    + provide a simple ratio or percentage and/or
    + provide details of sample size
    Rely on vagueness and ambiguity to allow the reader to infer that IAA is higher than was found, or that it is genuinely high.
  • 30. Conclusion
    High IAA may be due to
    • sound or cogent methodological choices;
    but it could also be due to
    • statistical smoke (i.e. selecting parameters that lead to higher IAA) and
    • rhetorical mirrors (i.e. using vagueness/ambiguity to imply that IAA is high).
    In most publications in applied linguistics, sufficient detail is not provided.
  • 31. Best practice suggestions
    • Annotate using tags one level finer than required.
    • Create an annotation booklet with clear rules, examples and discussion of boundary cases.
    • Develop and trial a training course, and require all annotators to complete it.
    • Set a benchmark standard.
    • Monitor annotators and provide constructive, actionable feedback.
    • Report IAA in sufficient detail to convince skeptical readers.
  • 32. Beware of the skeleton in the cupboard
    • Researchers aim to portray their work as sound or cogent.
    • Actual IAA may differ from reported IAA.
    • Be wary of statistical smoke and rhetorical mirrors.
  • 33. Any questions, suggestions or comments? John Blake jblake@u-aizu.ac.jp