Performance Impact Caused by
Hidden Bias of Training Data for RTE
Masatoshi Tsuchiya (Toyohashi University of Technology)
Brief summary of my presentation
The SNLI corpus, which is widely used for the English RTE (recognizing textual entailment) task, has a HIDDEN BIAS.
It enables us to estimate TE (textual entailment) labels of hypothesis sentences even if no context information is given by premise sentences.
Definition of RTE task for SemEval/SNLI
 A task to partition relationships between a premise sentence and a hypothesis sentence into three categories
   E: Entailment
   N: Neutral
   C: Contradiction

      Sentence                                  Label
𝑠1    Two boys are swimming in the pool.        E
𝑠2    Two girls are playing the basketball.     N
𝑠3    Two women are swimming in the pool.       C
ℎ     Two children are swimming in the pool.
Suppose the same hypothesis sentence is paired with three different premise sentences
 When 𝑠1 is given as a premise sentence, the relationship between 𝑠1 and ℎ is labeled as E.
 When 𝑠2 is given as a
premise sentence, the
relationship between 𝑠2
and ℎ is labeled as N.
 When 𝑠3 is given as a
premise sentence, the
relationship between 𝑠3
and ℎ is labeled as C.
 These examples indicate
that the TE label is
determinable
if and only if context
information is given by
a premise sentence.
(Extremely unacceptable) Null hypothesis
TE labels of hypothesis sentences are determinable even if no context information is given by premise sentences.
If this hypothesis is not rejected for a certain corpus, the corpus has a hidden bias.
TE label prediction model
 It is designed to check the null hypothesis.
 It estimates TE labels for hypothesis sentences without the context information given by premise sentences.
 A Naive Bayes model is employed.
\[ \hat{y} = \operatorname*{argmax}_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y) \]
 𝑦 is a TE label.
 𝑥𝑖 is a feature. All word unigrams in a hypothesis sentence are used as features.
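A minimal sketch of such a hypothesis-only Naive Bayes classifier may help make the formula concrete. It is not the author's implementation: the whitespace tokenizer, the add-one (Laplace) smoothing parameter `alpha`, the choice to skip unseen words, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples, alpha=1.0):
    """Train on (hypothesis_text, label) pairs and return a predict function.

    alpha is an assumed add-alpha smoothing constant; the slides do not say
    which smoothing, if any, was used.
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    total = sum(label_counts.values())
    vocab_size = len(vocab)

    def predict(text):
        best_label, best_score = None, -math.inf
        for label, count in label_counts.items():
            denom = sum(word_counts[label].values()) + alpha * vocab_size
            score = math.log(count / total)  # log P(y)
            for word in text.lower().split():
                if word in vocab:  # assumption: unseen words are skipped
                    score += math.log((word_counts[label][word] + alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return predict


# Toy training data built from hypothesis sentences quoted in the slides.
toy = [
    ("Nobody is standing.", "C"),
    ("Nobody has food.", "C"),
    ("The teams are in a championship match.", "N"),
    ("Several people are in close proximity to each other.", "E"),
]
predict = train_nb(toy)
print(predict("Nobody is playing jump rope."))
```

Note that `predict` never receives a premise sentence, which is the point of the model: any accuracy above the baseline must come from the hypothesis sentences alone.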
Baseline model
 Suppose that no information is given by either premise sentences or hypothesis sentences.
 The baseline model assigns TE labels to hypothesis sentences based only on the explicit label bias.
\[ \hat{y} = \operatorname*{argmax}_{y} P(y) \]
Explicit bias of TE labels
                 SNLI                       SICK
                 Train   Devel.  Test       Train   Devel.  Test
Entailment       33.4%   33.8%   34.3%      28.9%   28.8%   28.7%
Neutral          33.3%   32.9%   32.8%      56.4%   56.4%   56.7%
Contradiction    33.4%   33.3%   33.0%      14.8%   14.8%   14.6%
 The SNLI corpus is balanced for TE labels.
 The SICK corpus is not balanced, and has an explicit
label bias for TE labels.
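As a rough illustration of the baseline, the sketch below picks the most frequent training label for each corpus and reports that label's share of the test set, using the percentages from the table. One caveat: at the displayed precision, entailment and contradiction tie in the SNLI training set (33.4% each). Resolving that tie to entailment is an assumption of this sketch, although it is consistent with the 34.3% baseline accuracy reported for SNLI.

```python
# Label shares (%) copied from the table of explicit label bias.
TRAIN = {
    "SNLI": {"E": 33.4, "N": 33.3, "C": 33.4},
    "SICK": {"E": 28.9, "N": 56.4, "C": 14.8},
}
TEST = {
    "SNLI": {"E": 34.3, "N": 32.8, "C": 33.0},
    "SICK": {"E": 28.7, "N": 56.7, "C": 14.6},
}


def baseline(corpus):
    """Return (majority training label, that label's test share in %)."""
    priors = TRAIN[corpus]
    # max() returns the first maximal key, so the E/C tie in SNLI goes to E
    # (an assumption; the rounded percentages cannot settle the tie).
    label = max(priors, key=priors.get)
    return label, TEST[corpus][label]


for name in TRAIN:
    print(name, baseline(name))
```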
Statistical test of the null hypothesis
If the TE label prediction model achieves statistically significantly better performance than the baseline model, the null hypothesis is not rejected.
Experimental result of statistical test
 The statistical test of the SNLI corpus does not reject the null hypothesis.
 The statistical test of the SICK corpus rejects the null hypothesis.

Corpus   TE label prediction model   Baseline model
SNLI     63.3%                       34.3%
SICK     56.7%                       56.7%
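The slides do not say which statistical test was used. The sketch below is therefore only one possible check, not the author's procedure: a one-sample z-test (normal approximation) of the model's accuracy against the baseline accuracy treated as a fixed rate. The test-set sizes (9,824 SNLI pairs and 4,927 SICK pairs) are taken from counts that appear elsewhere in the deck, and the function names are hypothetical.

```python
import math


def z_against_baseline(acc, baseline_acc, n):
    """z statistic for H0: the true accuracy equals baseline_acc."""
    standard_error = math.sqrt(baseline_acc * (1.0 - baseline_acc) / n)
    return (acc - baseline_acc) / standard_error


def one_sided_p(z):
    """Upper-tail p-value under the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Accuracies from the table; n values are assumptions taken from the deck's
# own test-set counts.
for corpus, acc, base, n in [("SNLI", 0.633, 0.343, 9824),
                             ("SICK", 0.567, 0.567, 4927)]:
    z = z_against_baseline(acc, base, n)
    print(f"{corpus}: z = {z:.2f}, one-sided p = {one_sided_p(z):.3g}")
```

Under these assumptions the SNLI gap is far outside what chance would produce, while the SICK model is indistinguishable from the baseline, which matches the two bullet points above.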
Confusion matrices

SNLI (rows: predicted labels, columns: corpus labels)
        E      N      C
E    2275    644    706
N     508   1976    563
C     585    599   1968

SICK (rows: predicted labels, columns: corpus labels)
        E      N      C
E       3      3      2
N    1411   2790    718
C       0      0      0

 The TE label prediction model trained and tested on the SNLI corpus tries to predict an appropriate TE label for each individual hypothesis sentence.
 The TE label prediction model trained and tested on the SICK corpus simply outputs the majority TE label, “neutral”.
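The reported accuracies can be recomputed from the two confusion matrices. The short sketch below does so, reading rows as predicted labels and columns as corpus labels; the function names are illustrative only.

```python
LABELS = ["E", "N", "C"]
# Rows are predicted labels, columns are corpus labels.
SNLI = [[2275, 644, 706], [508, 1976, 563], [585, 599, 1968]]
SICK = [[3, 3, 2], [1411, 2790, 718], [0, 0, 0]]


def accuracy(matrix):
    """Fraction of test pairs on the diagonal (prediction == corpus label)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def predicted_share(matrix):
    """Fraction of test pairs assigned to each predicted label."""
    total = sum(sum(row) for row in matrix)
    return {LABELS[i]: sum(row) / total for i, row in enumerate(matrix)}


for name, matrix in [("SNLI", SNLI), ("SICK", SICK)]:
    print(name, f"accuracy = {accuracy(matrix):.1%}", predicted_share(matrix))
```

The SICK row sums show that almost every test pair is predicted as neutral, which is why its accuracy coincides with the majority-label baseline.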
Intermediate conclusion of the first half
 The null hypothesis is that TE labels of hypothesis sentences are determinable without context information given by premise sentences.
 It is not rejected for the SNLI corpus! Why?
 It is rejected for the SICK corpus! Good!
Topics of the latter half
 Source of the hidden bias of the SNLI
corpus
 Performance impact caused by the hidden
bias
Difference between
the SNLI corpus and the SICK corpus
Construction procedure
Both of them use Amazon Mechanical Turk.
Comparison of construction procedure
The SNLI corpus
① Sentences of the Flickr corpus are provided to human workers as premise sentences.
② Human workers are asked to compose three hypothesis sentences for each premise sentence.
The SICK corpus
① Sentences of the Flickr corpus are simplified using hand-crafted rules.
② Sentence pairs selected based on similarity are provided to human workers.
③ Human workers are asked to classify the sentence pairs into three categories.
Difference between the SNLI corpus and the SICK corpus
 They differ in the work asked of human workers.
   For the SNLI corpus, human workers were asked to compose hypothesis sentences.
   For the SICK corpus, human workers were asked to annotate sentence pairs.
 Thus, might hypothesis sentences of the SNLI corpus contain human bias?
Prominent words to estimate TE labels
 The TE label prediction model is defined as follows:
\[ \hat{y} = \operatorname*{argmax}_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y) \]
 𝑦 is a TE label.
 𝑥𝑖 is a feature. All word unigrams of a hypothesis sentence are employed as features.
 The TE label probability 𝑃(𝑦|𝑥) conditioned on a word 𝑥 can be computed from this model using Bayes' rule.
 Words with extreme values of 𝑃(𝑦|𝑥) are prominent words, and they are useful for estimating TE labels.
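The Bayes' rule step can be sketched as follows: P(y|x) is P(x|y)P(y) divided by the sum of the same product over all labels. The prior and likelihood values below are hypothetical toy numbers chosen for illustration; they are not estimates from the SNLI corpus.

```python
def posterior(word, prior, likelihood):
    """Return P(y | word) for every label y, computed via Bayes' rule."""
    joint = {y: prior[y] * likelihood[y].get(word, 0.0) for y in prior}
    evidence = sum(joint.values())  # P(word)
    if evidence == 0.0:
        raise ValueError(f"word {word!r} has zero probability under every label")
    return {y: joint[y] / evidence for y in joint}


# Hypothetical toy parameters (not taken from the slides).
prior = {"E": 1 / 3, "N": 1 / 3, "C": 1 / 3}
likelihood = {
    "E": {"nobody": 0.0001, "outside": 0.010},
    "N": {"nobody": 0.0004, "outside": 0.008},
    "C": {"nobody": 0.0095, "outside": 0.007},
}

post = posterior("nobody", prior, likelihood)
print({y: round(p, 4) for y, p in post.items()})
```

With these toy values, "nobody" is concentrated on contradiction, whereas a word such as "outside" would be spread across the three labels and would not count as prominent.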
Prominent words on 𝑃(𝑦|𝑥)
                 Entailment             Neutral                  Contradiction
Top-5 words      proximity     0.9570   joyously       0.9871    nobody      0.9949
                 least         0.9318   impress        0.9563    alll        0.9718
                 bvoy          0.8848   championship   0.9398    mars        0.9630
                 interacting   0.8760   playoff        0.9371    mashed      0.9433
                 mammals       0.8712   siblings       0.9160    frowning    0.9388
Bottom-5 words   funeral       0.0081   empty-handed   0.0267    mammals     0.0277
                 mars          0.0071   frowning       0.0242    impress     0.0241
                 joyously      0.0067   mute           0.0228    proximity   0.0152
                 championship  0.0032   alll           0.0129    least       0.0119
                 nobody        0.0009   nobody         0.0042    joyously    0.0062
Examples using “nobody”
C
  Premise: A man and a woman are standing next to sculptures, talking while another man looks at other sculptures.
  Hypothesis: Nobody is standing.

  Premise: A woman is walking across the street eating a banana, while a man is following with his briefcase.
  Hypothesis: Nobody has food.

  Premise: A group of young girls playing jump rope in the street.
  Hypothesis: Nobody is playing jump rope.
N
  Premise: Three young girls posing for a picture in an outdoor amphitheater, surrounded by adults watching a conference.
  Hypothesis: Nobody is wearing a hat.
E
  Premise: Lacrosse players struggling for control of the ball.
  Hypothesis: Nobody is in control of the ball.

The manual of the SNLI corpus prohibits human workers from composing contradiction sentences by inserting “not”. These examples suggest that this prohibition is insufficient.
Examples using “championship”
N
  Premise: A soccer match between a team with white jerseys, and a team with yellow jerseys.
  Hypothesis: The teams are in a championship match.

  Premise: Two soccer teams are competing on a soccer field.
  Hypothesis: Two skilled soccer teams are competing against one another for the championship.

  Premise: There is a baseball player standing at home plate, the catcher behind him has his hand up in the air with his glove, and the umpire is standing behind him, and many people in the stands.
  Hypothesis: The final game of the championship is being played while many fans are in the stands.

“Championship” is used to add an entity that is unrelated to the sports-game entities referred to in the premise sentences.
Several words provide negative clues
Word         E        N        C
funeral      0.0081   0.3804   0.6115
stole        0.0106   0.5607   0.4287
stationary   0.2668   0.0421   0.6911
soaring      0.4340   0.0522   0.5139
human        0.6296   0.3372   0.0332
higher       0.4123   0.5477   0.0400
Examples using “higher”
E
  Premise: A speed boat pulling a waterskier along a jump.
  Hypothesis: The skier is going higher in the water.

  Premise: Top of the stands looking down at the baseball stadium.
  Hypothesis: The baseball stadium seats are higher than the field.
N
  Premise: Two men, one in a circuit city t-shirt, the other in an M&Ms t-shirt, operate video game guns.
  Hypothesis: One man has a higher score than the other.

  Premise: A young smiling woman is having fun on a rustic looking swing.
  Hypothesis: A woman is trying to swing higher than her friend.
C
  Premise: Red objects fall on men standing behind a red wall.
  Hypothesis: The men are higher than the wall.

Human workers can create entailment sentences and neutral sentences by using “higher” between two entities that are referred to in premise sentences.
The source of hidden bias
Human workers’ bias in their word selection
when composing hypothesis sentences.
Performance impact caused by
the hidden bias of the SNLI corpus
How much performance impact is caused by
the hidden bias of the SNLI corpus?
Empirical classification of the SNLI corpus using the TE label prediction model
 𝐸𝑒: Empirical easy test set
   The subset of all test pairs whose TE labels are predicted correctly by the TE label prediction model.
 𝐻𝑒: Empirical hard test set
   The complement of 𝐸𝑒 in the test set.

                𝐸𝑒               𝐻𝑒
Entailment      2,275 (36.6%)    1,093 (30.3%)
Neutral         1,976 (31.8%)    1,243 (34.5%)
Contradiction   1,968 (31.6%)    1,269 (35.2%)
Total           6,219            3,605
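The split into the two subsets can be sketched as below. The predictor `toy_predict` is a hypothetical stand-in for the TE label prediction model, and the three test pairs are example sentences quoted in the slides; neither reflects the actual experimental setup.

```python
def split_easy_hard(test_pairs, predict_label):
    """Partition (premise, hypothesis, gold_label) triples into easy and hard.

    A pair is "easy" when the hypothesis-only predictor recovers the gold label.
    """
    easy, hard = [], []
    for premise, hypothesis, gold in test_pairs:
        # The predictor sees only the hypothesis, never the premise.
        target = easy if predict_label(hypothesis) == gold else hard
        target.append((premise, hypothesis, gold))
    return easy, hard


def toy_predict(hypothesis):
    """Hypothetical stand-in for the TE label prediction model."""
    return "C" if "nobody" in hypothesis.lower() else "E"


pairs = [
    ("A group of young girls playing jump rope in the street.",
     "Nobody is playing jump rope.", "C"),
    ("Lacrosse players struggling for control of the ball.",
     "Nobody is in control of the ball.", "E"),
    ("Two boys are swimming in the pool.",
     "Two children are swimming in the pool.", "E"),
]
easy, hard = split_easy_hard(pairs, toy_predict)
print(len(easy), "easy;", len(hard), "hard")
```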
NN models for RTE task
 Encoder-decoder model (Tim Rocktäschel et al., ICLR 2016)
   An encoder using an LSTM converts a premise sentence into a vector representation.
   A decoder using an LSTM performs inference based on this vector representation and a hypothesis sentence.
 Attention-Based Convolutional NN (Wenpeng Yin et al., TACL 2016)
 Tree-based convolution model (Lili Mou et al., ACL 2016)
Parallel LSTM Model (Bowman et al., 2015)
[Figure: the words of the premise sentence (𝑝1 ... 𝑝4) and of the hypothesis sentence (ℎ1 ... ℎ4) are each passed through the embedding 𝑊𝑒 into LSTM units; the output is the label.]
Sequential LSTM Model (Rocktäschel et al., 2015)
[Figure: LSTM units read the words of the premise sentence (𝑝1 ... 𝑝4) followed by the words of the hypothesis sentence (ℎ1 ... ℎ3); the output is the label.]
Performance drop caused by hidden bias
 Both NN models achieve high accuracy on the empirical easy test set 𝐸𝑒.
 However, they achieve drastically lower accuracy on the empirical hard test set 𝐻𝑒.
 These results suggest that a large portion of the high accuracy on the whole test set 𝐸𝑒 ∪ 𝐻𝑒 comes from the empirical easy test set 𝐸𝑒.

                         𝐸𝑒 ∪ 𝐻𝑒    𝐸𝑒       𝐻𝑒
Parallel LSTM model      76.8%       87.8%    57.8%
Sequential LSTM model    81.4%       90.1%    65.6%
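To see how much of the overall accuracy each subset contributes, the per-subset accuracies can be weighted by the subset sizes (6,219 pairs in 𝐸𝑒 and 3,605 pairs in 𝐻𝑒, as given in the empirical classification table). This is a back-of-the-envelope sketch rather than part of the original analysis. The recomputed totals land within a few tenths of a point of the reported overall accuracies, and the residual presumably comes from rounding in the table.

```python
N_EASY, N_HARD = 6219, 3605  # sizes of E_e and H_e reported in the deck


def contributions(acc_easy, acc_hard):
    """Percentage points of overall accuracy contributed by E_e and H_e."""
    n = N_EASY + N_HARD
    easy_points = 100.0 * acc_easy * N_EASY / n
    hard_points = 100.0 * acc_hard * N_HARD / n
    return easy_points, hard_points


for model, acc_easy, acc_hard in [("Parallel LSTM", 0.878, 0.578),
                                  ("Sequential LSTM", 0.901, 0.656)]:
    easy_points, hard_points = contributions(acc_easy, acc_hard)
    print(f"{model}: {easy_points:.1f} points from E_e, "
          f"{hard_points:.1f} points from H_e, "
          f"{easy_points + hard_points:.1f} overall")
```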
Replace all premise words with UNK symbols to remove context information
[Figure: every word of the premise sentence is replaced by UNK before the embedding 𝑊𝑒 and the LSTM units, while the hypothesis words ℎ1 ... ℎ4 are left unchanged; the output is the label.]
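The masking step itself is simple and can be sketched as follows. The token spelling `<unk>` and the function name are assumptions; the slides only say that all premise words are replaced with UNK symbols.

```python
UNK = "<unk>"  # assumed spelling of the unknown-word symbol


def mask_premise(premise_tokens, hypothesis_tokens):
    """Replace every premise token with UNK and keep the hypothesis as-is.

    The premise length is preserved, so the model input has the same shape
    as before but carries no context information.
    """
    return [UNK] * len(premise_tokens), list(hypothesis_tokens)


premise = "Two boys are swimming in the pool .".split()
hypothesis = "Two children are swimming in the pool .".split()
masked_premise, kept_hypothesis = mask_premise(premise, hypothesis)
print(masked_premise)
print(kept_hypothesis)
```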
Performance of NN models for RTE when context information is removed

                         𝐸𝑒 ∪ 𝐻𝑒    𝐸𝑒       𝐻𝑒
Parallel LSTM model      54.1%       66.0%    33.7%
Sequential LSTM model    48.6%       56.7%    34.7%

Statistics of empirical classification
                𝐸𝑒               𝐻𝑒
Entailment      2,275 (36.6%)    1,093 (30.3%)
Neutral         1,976 (31.8%)    1,243 (34.5%)
Contradiction   1,968 (31.6%)    1,269 (35.2%)
Total           6,219            3,605

 The accuracies on 𝐻𝑒 are close to the chance ratios.
 The accuracies on 𝐸𝑒 are different from the chance ratios.
Conclusion
 The SNLI corpus has a hidden bias that allows us to estimate TE labels of hypothesis sentences without context information given by premise sentences.
 NN models trained on the SNLI corpus do not work as RTE models on the empirical easy test set 𝐸𝑒, but as TE label prediction models.
Questions and answers in my presentation
 Did you try more complex label prediction models?
   I tried an NN model and a decision tree model as the TE label prediction model. However, the performance differences between them and the NB model were quite small. Thus, I reported the results of the NB model.
 How about the other data sets?
   I have evaluation results for several corpora, including the MultiNLI corpus. Because the MultiNLI corpus is constructed with a procedure similar to that of the SNLI corpus, it also has a hidden bias.
Examples using “proximity”
E
  Premise: A bride and groom dance surrounded by people at the reception.
  Hypothesis: A married couple is in the proximity of other humans.

  Premise: Many people are dunking to support special olympics.
  Hypothesis: Several people are in close proximity to each other.

  Premise: A bull charges at a man within a stadium while an audience watches.
  Hypothesis: Onlookers view a person and an animal in close proximity to each other.
N
  Premise: Child playing in waves with sun on the horizon.
  Hypothesis: A child is playing in the water with her mother in close proximity.

“Proximity” is convenient for human workers composing entailment sentences when multiple person entities appear in premise sentences.
Statistics of the SNLI corpus and the SICK corpus
                                                                       SNLI     SICK
# of training pairs                                                    55K      4500
# of development pairs                                                 10K      500
# of test pairs                                                        10K      4927
Vocabulary size of training pairs                                      36427    2178
OOV ratio of test pairs (vs. training pairs)                           0.24%    0.29%
OOV ratio of test pairs (vs. training pairs of the opposite corpus)    10.3%    0.15%
• The SNLI training set is large enough to cover the SICK test set as well as the SNLI test set.
• The SICK training set covers its own test set, but does not cover the SNLI test set.
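An out-of-vocabulary (OOV) ratio of this kind can be computed as sketched below. The slides do not state whether the ratio is counted over tokens or over distinct word types, nor how sentences were tokenized; this sketch assumes a token-level ratio with lowercased whitespace tokens, and the example sentences are only toy data.

```python
def oov_ratio(train_sentences, test_sentences):
    """Fraction of test tokens that never occur in the training sentences."""
    vocab = {w for s in train_sentences for w in s.lower().split()}
    test_tokens = [w for s in test_sentences for w in s.lower().split()]
    if not test_tokens:
        raise ValueError("test_sentences contains no tokens")
    unseen = sum(1 for w in test_tokens if w not in vocab)
    return unseen / len(test_tokens)


# Toy data reusing example sentences from the slides.
train = ["Two boys are swimming in the pool.",
         "Two girls are playing the basketball."]
test = ["Two children are swimming in the pool."]
print(f"OOV ratio: {oov_ratio(train, test):.2%}")
```

Swapping the roles of the two corpora in the call gives the cross-corpus ratio that the last table row reports.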

Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment (LREC2018)

  • 1. Performance Impact Caused by Hidden Bias of Training Data for RTE
      Masatoshi Tsuchiya (Toyohashi University of Technology)
  • 2. Brief summary of my presentation
      The SNLI corpus, which is widely used for the English RTE task, has a HIDDEN BIAS. This bias enables us to estimate the TE labels of hypothesis sentences even when no context information is given by premise sentences.
  • 3. Definition of RTE task for SemEval/SNLI
      A task to partition relationships between a premise sentence and a hypothesis sentence into three categories:
        • E: Entailment
        • N: Neutral
        • C: Contradiction
      Example (label of each premise against the hypothesis $h$):
        $s_1$  Two boys are swimming in the pool.        E
        $s_2$  Two girls are playing the basketball.     N
        $s_3$  Two women are swimming in the pool.       C
        $h$    Two children are swimming in the pool.
  • 4. Consider one hypothesis sentence paired with three premise sentences
      When $s_1$ is given as the premise sentence, the relationship between $s_1$ and $h$ is labeled E.
  • 5. Consider one hypothesis sentence paired with three premise sentences
      When $s_2$ is given as the premise sentence, the relationship between $s_2$ and $h$ is labeled N.
  • 6. Consider one hypothesis sentence paired with three premise sentences
      When $s_3$ is given as the premise sentence, the relationship between $s_3$ and $h$ is labeled C.
  • 7. Consider one hypothesis sentence paired with three premise sentences
      These examples indicate that the TE label is determinable if and only if context information is given by a premise sentence.
  • 8. (Extremely unacceptable) Null hypothesis
      TE labels of hypothesis sentences are determinable even if no context information is given by premise sentences.
      If this hypothesis is not rejected for a certain corpus, the corpus has a hidden bias.
  • 9. TE label prediction model
      • It is designed to check the null hypothesis.
      • It estimates TE labels for hypothesis sentences without the context information given by premise sentences.
      • A Naive Bayes model is employed:
          $\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$
      • $y$ is a TE label.
      • $x_i$ is a feature. All word unigrams in a hypothesis sentence are used as features.
  • 10. Baseline model
      • Suppose that no information is given by either premise sentences or hypothesis sentences.
      • The baseline model assigns TE labels to hypothesis sentences based only on the explicit label bias:
          $\hat{y} = \arg\max_{y} P(y)$
  • 11. Explicit bias of TE labels
                        SNLI                      SICK
                        Train   Devel.  Test      Train   Devel.  Test
        Entailment      33.4%   33.8%   34.3%     28.9%   28.8%   28.7%
        Neutral         33.3%   32.9%   32.8%     56.4%   56.4%   56.7%
        Contradiction   33.4%   33.3%   33.0%     14.8%   14.8%   14.6%
      • The SNLI corpus is balanced for TE labels.
      • The SICK corpus is not balanced and has an explicit label bias for TE labels.
  • 12. Statistical test of the null hypothesis
      If the TE label prediction model performs statistically significantly better than the baseline model, the null hypothesis is not rejected.
  • 13. Experimental result of statistical test
      • The statistical test on the SNLI corpus does not reject the null hypothesis.
      • The statistical test on the SICK corpus rejects the null hypothesis.
        Corpus   TE label prediction model   Baseline model
        SNLI     63.3%                       34.3%
        SICK     56.7%                       56.7%
  • 14. Confusion matrices
      SNLI (rows: predicted labels; columns: corpus labels)
               E      N      C
        E   2275    644    706
        N    508   1976    563
        C    585    599   1968
      SICK (rows: predicted labels; columns: corpus labels)
               E      N      C
        E      3      3      2
        N   1411   2790    718
        C      0      0      0
      • The TE label prediction model trained and tested on the SNLI corpus tries to predict an appropriate TE label for each individual hypothesis sentence.
      • The TE label prediction model trained and tested on the SICK corpus simply outputs the majority TE label, “neutral”.
  • 15. Intermediate conclusion of the first half
      • The null hypothesis is that TE labels of hypothesis sentences are determinable without context information given by premise sentences.
      • It is not rejected for the SNLI corpus! Why?
      • It is rejected for the SICK corpus! Good!
  • 16. Topics of the latter half
      • Source of the hidden bias of the SNLI corpus
      • Performance impact caused by the hidden bias
  • 17. Difference between the SNLI corpus and the SICK corpus
      The difference lies in the construction procedure. Both corpora use Amazon Mechanical Turk.
  • 18. Comparison of construction procedure
      The SNLI corpus:
        ① Sentences of the Flickr corpus are provided to human workers as premise sentences.
        ② Human workers are asked to compose three hypothesis sentences for each premise sentence.
      The SICK corpus:
        ① Sentences of the Flickr corpus are simplified using hand-crafted rules.
        ② Sentence pairs selected based on similarity are provided to human workers.
        ③ Human workers are asked to classify the sentence pairs into three categories.
  • 19. Difference between the SNLI corpus and the SICK corpus
      • The two corpora differ in the task given to human workers.
      • For the SNLI corpus, human workers were asked to compose hypothesis sentences.
      • For the SICK corpus, human workers were asked to annotate sentence pairs.
      • Thus, hypothesis sentences of the SNLI corpus may contain human bias?
  • 20. Prominent words to estimate TE labels
      • The TE label prediction model is defined as follows:
          $\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$
      • $y$ is a TE label.
      • $x_i$ is a feature. All word unigrams of a hypothesis sentence are employed as features.
      • The TE label probability $P(y \mid x)$ conditioned on a word $x$ can be computed from this model using Bayes' rule.
      • Words with extreme values of $P(y \mid x)$, the prominent words, are useful for estimating TE labels.
  • 21. Prominent words on $P(y \mid x)$
                        Entailment             Neutral                Contradiction
        Top-5 words     proximity     0.9570   joyously      0.9871   nobody     0.9949
                        least         0.9318   impress       0.9563   alll       0.9718
                        bvoy          0.8848   championship  0.9398   mars       0.9630
                        interacting   0.8760   playoff       0.9371   mashed     0.9433
                        mammals       0.8712   siblings      0.9160   frowning   0.9388
        Bottom-5 words  funeral       0.0081   empty-handed  0.0267   mammals    0.0277
                        mars          0.0071   frowning      0.0242   impress    0.0241
                        joyously      0.0067   mute          0.0228   proximity  0.0152
                        championship  0.0032   alll          0.0129   least      0.0119
                        nobody        0.0009   nobody        0.0042   joyously   0.0062
  • 22. Prominent words on $P(y \mid x)$ (table repeated from slide 21)
  • 23. Examples using “nobody”
      C  Premise: A man and a woman are standing next to sculptures, talking while another man looks at other sculptures.
         Hypothesis: Nobody is standing.
      C  Premise: A woman is walking across the street eating a banana, while a man is following with his briefcase.
         Hypothesis: Nobody has food.
      C  Premise: A group of young girls playing jump rope in the street.
         Hypothesis: Nobody is playing jump rope.
      N  Premise: Three young girls posing for a picture in an outdoor amphitheater, surrounded by adults watching a conference.
         Hypothesis: Nobody is wearing a hat.
      E  Premise: Lacrosse players struggling for control of the ball.
         Hypothesis: Nobody is in control of the ball.
      The manual of the SNLI corpus prohibits human workers from composing contradiction sentences by inserting “not”. These examples suggest that this prohibition is insufficient.
  • 24. Prominent words on $P(y \mid x)$ (table repeated from slide 21)
  • 25. Examples using “championship”
      N  Premise: A soccer match between a team with white jerseys, and a team with yellow jerseys.
         Hypothesis: The teams are in a championship match.
      N  Premise: Two soccer teams are competing on a soccer field.
         Hypothesis: Two skilled soccer teams are competing against one another for the championship.
      N  Premise: There is a baseball player standing at home plate, the catcher behind him has his hand up in the air with his glove, and the umpire is standing behind him, and many people in the stands.
         Hypothesis: The final game of the championship is being played while many fans are in the stands.
      “Championship” is used to introduce new entities unrelated to the sports-game entities referred to in premise sentences.
  • 26. Several words provide negative clues
                     E        N        C
        funeral      0.0081   0.3804   0.6115
        stole        0.0106   0.5607   0.4287
        stationary   0.2668   0.0421   0.6911
        soaring      0.4340   0.0522   0.5139
        human        0.6296   0.3372   0.0332
        higher       0.4123   0.5477   0.0400
  • 27. Examples using “higher”
      E  Premise: A speed boat pulling a waterskier along a jump.
         Hypothesis: The skier is going higher in the water.
      E  Premise: Top of the stands looking down at the baseball stadium.
         Hypothesis: The baseball stadium seats are higher than the field.
      N  Premise: Two men, one in a circuit city t-shirt, the other in an M&Ms t-shirt, operate video game guns.
         Hypothesis: One man has a higher score than the other.
      N  Premise: A young smiling woman is having fun on a rustic looking swing.
         Hypothesis: A woman is trying to swing higher than her friend.
      C  Premise: Red objects fall on men standing behind a red wall.
         Hypothesis: The men are higher than the wall.
      Human workers can create entailment sentences and neutral sentences by using “higher” between two entities that are referred to in premise sentences.
  • 28. The source of hidden bias
      Human workers’ bias in their word selection when composing hypothesis sentences.
  • 29. Performance impact caused by the hidden bias of the SNLI corpus
      How much performance impact is caused by the hidden bias of the SNLI corpus?
  • 30. Empirical classification of the SNLI corpus using the TE label prediction model
      • $E_e$: Empirical easy test set. The subset of all test pairs whose TE labels are predicted correctly by the TE label prediction model.
      • $H_e$: Empirical hard test set. The complement of $E_e$.
                        $E_e$            $H_e$
        Entailment      2,275 (36.6%)    1,093 (30.3%)
        Neutral         1,976 (31.8%)    1,243 (34.5%)
        Contradiction   1,968 (31.6%)    1,269 (35.2%)
        Total           6,219            3,605
  • 31. NN models for RTE task
      • Encoder-decoder model (Tim Rocktäschel et al., ICLR 2016)
          • An LSTM encoder converts a premise sentence into a vector representation.
          • An LSTM decoder infers the label from that vector representation and a hypothesis sentence.
      • Attention Based Convolutional NN (Wenpeng Yin et al., TACL 2016)
      • Tree-based convolution model (LiLi et al., ACL 2016)
  • 32. Parallel LSTM Model (Bowman et al., 2015)
      [Figure: the premise words $p_1 \dots p_4$ and the hypothesis words $h_1 \dots h_4$ are each embedded with $W_e$ and fed to separate LSTMs for the premise sentence and the hypothesis sentence, whose outputs are used to predict the label.]
  • 33. Sequential LSTM Model (Rocktäschel et al., 2015)
      [Figure: a single LSTM chain reads the premise words $p_1 \dots p_4$ followed by the hypothesis words $h_1 \dots h_3$, then predicts the label.]
  • 34. Performance drop caused by hidden bias
                                $E_e \cup H_e$   $E_e$    $H_e$
        Parallel LSTM model     76.8%            87.8%    57.8%
        Sequential LSTM model   81.4%            90.1%    65.6%
      • Both NN models achieve high accuracy on the empirical easy test set $E_e$.
      • However, they achieve drastically lower accuracy on the empirical hard test set $H_e$.
      • These results suggest that a large portion of the high accuracy on the whole test set $E_e \cup H_e$ comes from the empirical easy test set $E_e$.
  • 35. Replace all premise words with UNK symbols to remove context information
      [Figure: the same LSTM model, in which every word of the premise sentence is replaced by UNK before the embedding $W_e$, while the hypothesis words $h_1 \dots h_4$ are kept; the model then predicts the label.]
  • 36. Performance of NN models for RTE when context information is removed
      Performance of NN models for RTE:
                                $E_e \cup H_e$   $E_e$    $H_e$
        Parallel LSTM model     54.1%            66.0%    33.7%
        Sequential LSTM model   48.6%            56.7%    34.7%
      Statistics of empirical classification:
                        $E_e$            $H_e$
        Entailment      2,275 (36.6%)    1,093 (30.3%)
        Neutral         1,976 (31.8%)    1,243 (34.5%)
        Contradiction   1,968 (31.6%)    1,269 (35.2%)
        Total           6,219            3,605
      • The accuracies on $H_e$ are close to the chance ratios.
      • The accuracies on $E_e$ are different from the chance ratios.
  • 37. Conclusion
      • The SNLI corpus has a hidden bias that allows us to estimate TE labels of hypothesis sentences without context information given by premise sentences.
      • NN models trained on the SNLI corpus do not work as RTE models on the empirical easy test set $E_e$, but as TE label prediction models.
  • 39. Questions and answers in my presentation
      • Did you try more complex label prediction models?
          I tried an NN model and a decision tree model as the TE label prediction model. However, the performance differences between them and the NB model were quite small. Thus, I reported the results of the NB model.
      • How about other data sets?
          I have evaluation results for several corpora, including the MultiNLI corpus. Because the MultiNLI corpus is constructed with a procedure similar to that of the SNLI corpus, it also has a hidden bias.
  • 40. Prominent words on $P(y \mid x)$ (table repeated from slide 21)
  • 41. Examples using “proximity”
      E  Premise: A bride and groom dance surrounded by people at the reception.
         Hypothesis: A married couple is in the proximity of other humans.
      E  Premise: Many people are dunking to support special olympics.
         Hypothesis: Several people are in close proximity to each other.
      E  Premise: A bull charges at a man within a stadium while an audience watches.
         Hypothesis: Onlookers view a person and an animal in close proximity to each other.
      N  Premise: Child playing in waves with sun on the horizon.
         Hypothesis: A child is playing in the water with her mother in close proximity.
      “Proximity” is convenient for human workers to compose entailment sentences when multiple person entities appear in premise sentences.
  • 42. Statistics of the SNLI corpus and the SICK corpus
                                                                              SNLI     SICK
        # of training pairs                                                   55K      4500
        # of development pairs                                                10K      500
        # of test pairs                                                       10K      4927
        Vocabulary size of training pairs                                     36427    2178
        OOV ratio of test pairs (vs. training pairs)                          0.24%    0.29%
        OOV ratio of test pairs (vs. training pairs of the opposite corpus)   10.3%    0.15%
      • The SNLI training set is large enough to cover the SICK test set as well as the SNLI test set.
      • The SICK training set covers its own test set, but does not cover the SNLI test set.
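
The TE label prediction model (slide 9), the baseline model (slide 10), and the empirical easy/hard split (slide 30) can be sketched in a few lines of Python. This is only a minimal illustration, not the author's implementation: the add-one smoothing, the regex tokenizer, the handling of unseen words, the tie-breaking rule, the tiny training set (built from hypothesis sentences quoted in the slides), and all function names are assumptions made for this example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumed tokenizer: lowercase word unigrams, punctuation dropped.
    return re.findall(r"[a-z'-]+", text.lower())


def train_nb(hypotheses, labels):
    """Estimate counts for P(y) and P(x_i | y) from hypothesis sentences only."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, y in zip(hypotheses, labels):
        for w in tokenize(text):
            word_counts[y][w] += 1
            vocab.add(w)
    return label_counts, word_counts, vocab


def predict_nb(model, hypothesis):
    """Return argmax_y P(y) * prod_i P(x_i | y); the premise is never used."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for y in sorted(label_counts):  # sorted order makes ties deterministic
        score = math.log(label_counts[y] / total)
        denom = sum(word_counts[y].values()) + len(vocab)
        for w in tokenize(hypothesis):
            if w in vocab:  # assumption: words unseen in training are skipped
                # Add-one smoothing is an assumption; the slides do not specify it.
                score += math.log((word_counts[y][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = y, score
    return best_label


def predict_baseline(model):
    """Return argmax_y P(y), i.e. the most frequent training label."""
    label_counts = model[0]
    return max(sorted(label_counts), key=label_counts.get)


def split_easy_hard(model, pairs):
    """Split (premise, hypothesis, gold) triples into the sets E_e and H_e."""
    easy, hard = [], []
    for premise, hypothesis, gold in pairs:
        bucket = easy if predict_nb(model, hypothesis) == gold else hard
        bucket.append((premise, hypothesis, gold))
    return easy, hard


# Toy training data (hypothetical selection of hypotheses from the slides).
TRAIN_H = [
    "Nobody is standing.",
    "Nobody has food.",
    "The teams are in a championship match.",
    "Two skilled soccer teams are competing for the championship.",
    "Several people are in close proximity to each other.",
    "A married couple is in the proximity of other humans.",
]
TRAIN_Y = ["C", "C", "N", "N", "E", "E"]
MODEL = train_nb(TRAIN_H, TRAIN_Y)

TEST_PAIRS = [
    ("A group of young girls playing jump rope in the street.",
     "Nobody is playing jump rope.", "C"),
    ("Three young girls posing for a picture in an outdoor amphitheater.",
     "Nobody is wearing a hat.", "N"),
]
EASY, HARD = split_easy_hard(MODEL, TEST_PAIRS)
```

With this toy data, the first test pair lands in `EASY` because “nobody” alone pushes the prediction to contradiction, while the second lands in `HARD` because the same cue misleads the model. On a real corpus, the accuracy of `predict_nb` would be compared against `predict_baseline` on the test set, as in the statistical test described on slide 12.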