The document discusses how the hidden bias in the training data of the Stanford Natural Language Inference (SNLI) corpus impacts performance on the recognizing textual entailment (RTE) task. The bias enables predicting entailment labels for hypothesis sentences without using premise context, violating the assumptions of the RTE task. This is due to human annotators introducing bias when composing hypothesis sentences for the SNLI corpus. Neural network models for RTE show significant performance drops on sentence pairs not affected by this bias, demonstrating the negative effect of hidden bias in training data on model performance.
Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment (LREC 2018)
1. Performance Impact Caused by
Hidden Bias of Training Data for RTE
Masatoshi Tsuchiya (Toyohashi University of Technology)
2. Brief summary of my presentation
The SNLI corpus, which is widely used for the English RTE task, has a HIDDEN BIAS: it enables us to estimate TE labels of hypothesis sentences even when no context information is given by premise sentences.
3. Definition of RTE task for SemEval/SNLI
A task to partition relationships between a premise sentence and a hypothesis sentence into three categories:
E: Entailment
N: Neutral
C: Contradiction

     Sentence                                  Label (w.r.t. ℎ)
𝑠1   Two boys are swimming in the pool.        E
𝑠2   Two girls are playing basketball.         N
𝑠3   Two women are swimming in the pool.       C
ℎ    Two children are swimming in the pool.
4. Suppose the same hypothesis sentence is paired with three premise sentences

When 𝑠1 is given as the premise sentence, the relationship between 𝑠1 and ℎ is labeled as E.
5. Suppose the same hypothesis sentence is paired with three premise sentences

When 𝑠2 is given as the premise sentence, the relationship between 𝑠2 and ℎ is labeled as N.
6. Suppose the same hypothesis sentence is paired with three premise sentences

When 𝑠3 is given as the premise sentence, the relationship between 𝑠3 and ℎ is labeled as C.
7. Suppose the same hypothesis sentence is paired with three premise sentences

These examples indicate that the TE label is determinable if and only if context information is given by a premise sentence.
8. (Extremely unacceptable) Null hypothesis
TE labels of hypothesis sentences are determinable, even if no context information is given by premise sentences.
If this hypothesis is not rejected for a certain corpus,
the corpus has a hidden bias.
9. TE label prediction model
It is designed to check the null hypothesis: it estimates TE labels for hypothesis sentences without context information given by premise sentences. A Naive-Bayes model is employed:

\hat{y} = \operatorname{argmax}_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)

where 𝑦 is a TE label and 𝑥𝑖 is a feature. All word unigrams in a hypothesis sentence are used as features.
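As a concrete illustration, here is a minimal sketch of this hypothesis-only model, assuming scikit-learn; the training sentences and variable names are illustrative placeholders, not the actual SNLI data pipeline.

```python
# Minimal sketch of the TE label prediction model: multinomial
# Naive Bayes over word unigrams of the hypothesis sentence only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative placeholders for SNLI hypothesis sentences and gold labels.
train_hypotheses = [
    "Two children are swimming in the pool.",
    "Nobody is playing jump rope.",
    "The teams are in a championship match.",
]
train_labels = ["E", "C", "N"]

# Word unigrams are the features x_i; fit() estimates P(y) and P(x_i|y).
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_hypotheses, train_labels)

# The premise sentence is never seen: prediction uses the hypothesis alone.
print(model.predict(["Two women are swimming in the pool."]))
```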
10. Baseline model
Suppose the case that no information is given by either premise sentences or hypothesis sentences. The baseline model assigns TE labels to hypothesis sentences based only on the explicit label bias:

\hat{y} = \operatorname{argmax}_{y} P(y)
11. Explicit bias of TE labels
                    SNLI                       SICK
               Train   Devel.  Test       Train   Devel.  Test
Entailment     33.4%   33.8%   34.3%      28.9%   28.8%   28.7%
Neutral        33.3%   32.9%   32.8%      56.4%   56.4%   56.7%
Contradiction  33.4%   33.3%   33.0%      14.8%   14.8%   14.6%

The SNLI corpus is balanced for TE labels.
The SICK corpus is not balanced, and has an explicit label bias for TE labels.
12. Statistical test of the null hypothesis
If the TE label prediction model achieves statistically significantly better performance than the baseline model, the null hypothesis is not rejected.
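The slides do not name the statistical test; one standard choice for comparing two classifiers on the same test pairs is McNemar's test, sketched below under that assumption using statsmodels. The function name and significance level are illustrative.

```python
# Hedged sketch: McNemar's test on paired per-pair correctness, assumed
# here as a stand-in for the (unnamed) test used in the presentation.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def prediction_model_beats_baseline(gold, pred_model, pred_base, alpha=0.01):
    gold, pred_model, pred_base = map(np.asarray, (gold, pred_model, pred_base))
    m_ok, b_ok = pred_model == gold, pred_base == gold
    # 2x2 table over (model correct?, baseline correct?) per test pair.
    table = [[np.sum(m_ok & b_ok), np.sum(m_ok & ~b_ok)],
             [np.sum(~m_ok & b_ok), np.sum(~m_ok & ~b_ok)]]
    result = mcnemar(table, exact=False, correction=True)
    # Significant difference plus higher accuracy = "significantly better".
    return result.pvalue < alpha and m_ok.mean() > b_ok.mean()
```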
13. Experimental result of statistical test
The statistical test of the SNLI corpus does not reject the null hypothesis.
The statistical test of the SICK corpus rejects the null hypothesis.

Corpus   TE label prediction model   Baseline model
SNLI     63.3%                       34.3%
SICK     56.7%                       56.7%
14. Confusion matrices
SNLI (rows: predicted labels; columns: corpus labels)

     E      N      C
E    2275   644    706
N    508    1976   563
C    585    599    1968

SICK (rows: predicted labels; columns: corpus labels)

     E      N      C
E    3      3      2
N    1411   2790   718
C    0      0      0

The TE label prediction model trained and tested on the SNLI corpus tries to predict an appropriate TE label for each individual hypothesis sentence. The TE label prediction model trained and tested on the SICK corpus simply outputs the majority TE label, “neutral”.
15. Intermediate conclusion of the first half
The null hypothesis is that TE labels of hypothesis sentences are determinable without context information given by premise sentences.
It is not rejected for the SNLI corpus! Why?
It is rejected for the SICK corpus! Good!
16. Topics of the latter half
Source of the hidden bias of the SNLI corpus
Performance impact caused by the hidden bias
17. Difference between
the SNLI corpus and the SICK corpus
They differ in construction procedure, although both of them use Amazon Mechanical Turk.
18. Comparison of construction procedure
The SNLI corpus:
① Sentences of the Flickr corpus are provided to human workers as premise sentences.
② Human workers are asked to compose three hypothesis sentences for each premise sentence.

The SICK corpus:
① Sentences of the Flickr corpus are simplified using hand-crafted rules.
② Sentence pairs selected based on similarity are provided to human workers.
③ Human workers are asked to classify the sentence pairs into three categories.
19. Difference between
the SNLI corpus and the SICK corpus
They differ in the work asked of human workers.
For the SNLI corpus, human workers were asked to compose hypothesis sentences.
For the SICK corpus, human workers were asked to annotate sentence pairs.
Does this mean that hypothesis sentences of the SNLI corpus contain human bias?
20. Prominent words to estimate TE labels
The TE label prediction model is defined as follows:

\hat{y} = \operatorname{argmax}_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)

where 𝑦 is a TE label and 𝑥𝑖 is a feature. All word unigrams of a hypothesis sentence are employed as features.
The TE label probability 𝑃(𝑦|𝑥) conditioned on a word can be computed from this model using Bayes' rule. Such prominent words are useful to estimate TE labels.
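Spelled out (a standard derivation, not shown on the slide), the per-word posterior follows from Bayes' rule applied to the model's estimates:

```latex
P(y \mid x) = \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}
```

The rows of the table on slide 26 are posteriors of this form, which is why each row sums to one.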
23. Examples using “nobody”
C  Premise: A man and a woman are standing next to sculptures, talking while another man looks at other sculptures.
   Hypothesis: Nobody is standing.
C  Premise: A woman is walking across the street eating a banana, while a man is following with his briefcase.
   Hypothesis: Nobody has food.
C  Premise: A group of young girls playing jump rope in the street.
   Hypothesis: Nobody is playing jump rope.
N  Premise: Three young girls posing for a picture in an outdoor amphitheater, surrounded by adults watching a conference.
   Hypothesis: Nobody is wearing a hat.
E  Premise: Lacrosse players struggling for control of the ball.
   Hypothesis: Nobody is in control of the ball.

The manual of the SNLI corpus prohibits human workers from composing contradiction sentences by inserting “not”. These examples suggest that this prohibition is insufficient.
25. Examples using “championship”
N  Premise: A soccer match between a team with white jerseys, and a team with yellow jerseys.
   Hypothesis: The teams are in a championship match.
N  Premise: Two soccer teams are competing on a soccer field.
   Hypothesis: Two skilled soccer teams are competing against one another for the championship.
N  Premise: There is a baseball player standing at home plate, the catcher behind him has his hand up in the air with his glove, and the umpire is standing behind him, and many people in the stands.
   Hypothesis: The final game of the championship is being played while many fans are in the stands.

“Championship” is used to introduce entities unrelated to the sports games referred to in premise sentences.
26. Several words provide negative clues
            E       N       C
funeral     0.0081  0.3804  0.6115
stole       0.0106  0.5607  0.4287
stationary  0.2668  0.0421  0.6911
soaring     0.4340  0.0522  0.5139
human       0.6296  0.3372  0.0332
higher      0.4123  0.5477  0.0400
27. Examples using “higher”
E  Premise: A speed boat pulling a waterskier along a jump.
   Hypothesis: The skier is going higher in the water.
E  Premise: Top of the stands looking down at the baseball stadium.
   Hypothesis: The baseball stadium seats are higher than the field.
N  Premise: Two men, one in a circuit city t-shirt, the other in an M&Ms t-shirt, operate video game guns.
   Hypothesis: One man has a higher score than the other.
N  Premise: A young smiling woman is having fun on a rustic looking swing.
   Hypothesis: A woman is trying to swing higher than her friend.
C  Premise: Red objects fall on men standing behind a red wall.
   Hypothesis: The men are higher than the wall.

Human workers can create entailment sentences and neutral sentences by using “higher” between two entities referred to in premise sentences.
28. The source of hidden bias
Human workers' bias in their word selection when composing hypothesis sentences.
29. Performance impact caused by
the hidden bias of the SNLI corpus
How much performance impact is caused by the hidden bias of the SNLI corpus?
30. Empirical classification of the SNLI corpus
using the TE label prediction model
𝐸𝑒: empirical easy test set, the subset of all test pairs whose TE labels are predicted correctly by the TE label prediction model.
𝐻𝑒: empirical hard test set, the complement of 𝐸𝑒.

               𝐸𝑒              𝐻𝑒
Entailment     2,275 (36.6%)   1,093 (30.3%)
Neutral        1,976 (31.8%)   1,243 (34.5%)
Contradiction  1,968 (31.6%)   1,269 (35.2%)
Total          6,219           3,605
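A minimal sketch of this empirical classification, reusing the hypothesis-only model from slide 9; `test_pairs` and its tuple layout are illustrative assumptions.

```python
# Split the test set: pairs whose TE label the hypothesis-only model
# predicts correctly go to the easy set E_e, the rest to the hard set H_e.
def split_easy_hard(model, test_pairs):
    easy, hard = [], []
    for premise, hypothesis, gold in test_pairs:
        predicted = model.predict([hypothesis])[0]  # premise is ignored
        (easy if predicted == gold else hard).append((premise, hypothesis, gold))
    return easy, hard
```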
31. NN models for RTE task
Encoder-decoder model (Tim Rocktäschel et al., ICLR 2016)
  An encoder LSTM converts a premise sentence into a vector representation.
  A decoder LSTM performs inference based on that vector representation and a hypothesis sentence.
Attention-Based Convolutional NN (Wenpeng Yin et al., TACL 2016)
Tree-based convolution model (Lili Mou et al., ACL 2016)
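To make the sequential (conditional-encoding) idea concrete, here is a rough PyTorch sketch in the spirit of Rocktäschel et al.: the hypothesis LSTM is initialized with the premise LSTM's final state, and the final hidden state is classified into E/N/C. The class name and all dimensions are illustrative assumptions, not the paper's settings.

```python
import torch.nn as nn

class SequentialLSTMRTE(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.premise_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.hypothesis_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 3)  # E / N / C logits

    def forward(self, premise_ids, hypothesis_ids):
        # Encode the premise; keep only its final (h, c) state.
        _, state = self.premise_lstm(self.embed(premise_ids))
        # Condition the hypothesis encoder on the premise's final state.
        out, _ = self.hypothesis_lstm(self.embed(hypothesis_ids), state)
        return self.classifier(out[:, -1, :])
```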
34. Performance drop caused by hidden bias
Both NN models achieve high accuracy on the empirical easy test set 𝐸𝑒. However, they achieve drastically lower accuracy on the empirical hard test set 𝐻𝑒. These results suggest that a large portion of the high accuracy on the whole test set 𝐸𝑒 ∪ 𝐻𝑒 comes from the empirical easy test set 𝐸𝑒.

                        𝐸𝑒 ∪ 𝐻𝑒   𝐸𝑒      𝐻𝑒
Parallel LSTM model     76.8%      87.8%   57.8%
Sequential LSTM model   81.4%      90.1%   65.6%
35. Replace all premise words with UNK symbols to remove context information
[Figure: LSTM-based RTE model in which all premise words are replaced with UNK symbols before the embedding layer 𝑊𝑒, while the hypothesis sentence is fed in unchanged; the final hidden state predicts the label.]
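A one-line sketch of the ablation in the figure; `unk_id` is an illustrative name for the UNK symbol's vocabulary index.

```python
def mask_premise(premise_ids, unk_id=0):
    """Replace every premise word id with UNK, keeping sentence length."""
    return [unk_id] * len(premise_ids)

# e.g. mask_premise([12, 57, 3, 901]) -> [0, 0, 0, 0]
```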
36. Performance of NN models for RTE
when context information is removed
Performance with all premise words replaced by UNK:

                        𝐸𝑒 ∪ 𝐻𝑒   𝐸𝑒      𝐻𝑒
Parallel LSTM model     54.1%      66.0%   33.7%
Sequential LSTM model   48.6%      56.7%   34.7%

Statistics of the empirical classification (repeated from slide 30):

               𝐸𝑒              𝐻𝑒
Entailment     2,275 (36.6%)   1,093 (30.3%)
Neutral        1,976 (31.8%)   1,243 (34.5%)
Contradiction  1,968 (31.6%)   1,269 (35.2%)
Total          6,219           3,605

The accuracies on 𝐻𝑒 are close to the chance ratios, while the accuracies on 𝐸𝑒 remain different from the chance ratios.
37. Conclusion
The SNLI corpus has a hidden bias which allows us to estimate TE labels of hypothesis sentences without context information given by premise sentences.
NN models trained on the SNLI corpus do not work as RTE models on the empirical easy test set 𝐸𝑒; instead, they work as TE label prediction models.
39. Questions and answers in my presentation
Q: Did you try more complex label prediction models?
A: I tried an NN model and a decision tree model for the TE label prediction model. However, the performance differences between them and the NB model were quite small, so I reported the results of the NB model.

Q: How about other data sets?
A: I have evaluation results for several corpora, including the MultiNLI corpus. Because the MultiNLI corpus was constructed by a procedure similar to that of the SNLI corpus, it also has a hidden bias.
41. Examples using “proximity”
E  Premise: A bride and groom dance surrounded by people at the reception.
   Hypothesis: A married couple is in the proximity of other humans.
E  Premise: Many people are dunking to support special olympics.
   Hypothesis: Several people are in close proximity to each other.
E  Premise: A bull charges at a man within a stadium while an audience watches.
   Hypothesis: Onlookers view a person and an animal in close proximity to each other.
N  Premise: Child playing in waves with sun on the horizon.
   Hypothesis: A child is playing in the water with her mother in close proximity.

“Proximity” is convenient for human workers to compose entailment sentences when multiple person entities appear in premise sentences.
42. Statistics of
the SNLI corpus and the SICK corpus
                                                    SNLI     SICK
# of training pairs                                 550K     4,500
# of development pairs                              10K      500
# of test pairs                                     10K      4,927
Vocabulary size of training pairs                   36,427   2,178
OOV ratio of test pairs (vs. training pairs)        0.24%    0.29%
OOV ratio of test pairs (vs. training pairs
of the opposite corpus)                             10.3%    0.15%

• The SNLI training set is large enough to cover the SICK test set as well as the SNLI test set.
• The SICK training set covers its own test set, but does not cover the SNLI test set.
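A small sketch of how the OOV ratios in this table can be computed; whether the slide counts word tokens or word types is not stated, so token-level counting is assumed here.

```python
def oov_ratio(test_sentences, train_vocab):
    """Fraction of test word tokens that never appear in the training vocabulary."""
    tokens = [w for s in test_sentences for w in s.split()]
    return sum(w not in train_vocab for w in tokens) / len(tokens)
```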