Significance tests

Significance Tests in NLP
Presented by Jinho D. Choi
University of Colorado at Boulder
September 15th, 2010

Data Type
• Continuous data
• Outputs are from infinitely many possible values (regression).

• e.g., temperatures, document relevancies.

• Values are related to one another.

• One sample t-test, Paired two sample t-test.

• Categorical data
• Outputs are from finitely defined categories (classification).

• e.g., POS tags, dependency labels.

• Values are not related to one another.

• Wilcoxon’s signed-rank test, Fisher’s exact test, Pearson’s chi-square
test, McNemar’s test

One sample t-test
• One sample t-test
• The true mean is known, and the normal distribution is assumed.

• Null hypothesis: difference between true mean and our mean is zero.

• Example
• Average ITA score = 84.31% (true mean)
    be      say     get     know    see     our mean
    90.88%  89.75%  84.11%  87.57%  88.19%  90.25%

• Calculate t-score:

• Use the t-score to find p-value in the distribution table.
• Degrees of freedom: the minimum number of values needed to determine all the data points (n - 1 for a sample of n values).

• p ≤ 0.01 → the difference is statistically significant with over 99% confidence.
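
The t-score step above can be sketched in a few lines of Python. This is a minimal sketch that assumes the standard one-sample formula, t = (sample mean - true mean) / (s / sqrt(n)); the function name `one_sample_t_score` is hypothetical. It treats the five per-verb scores as the whole sample. Their unweighted mean (88.10%) differs from the 90.25% "our mean" in the example, so the resulting t-score is illustrative only.

```python
import math
import statistics

def one_sample_t_score(sample, true_mean):
    """Return (t-score, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    # statistics.stdev uses the n - 1 (sample) denominator.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return (sample_mean - true_mean) / std_err, n - 1

# Per-verb scores from the example (be, say, get, know, see), in percent.
# Assumption: these five values are the full sample.
scores = [90.88, 89.75, 84.11, 87.57, 88.19]
t_score, dof = one_sample_t_score(scores, true_mean=84.31)
print(f"t = {t_score:.2f}, degrees of freedom = {dof}")
```

The t-score and degrees of freedom are then looked up in a t-distribution table, as described above.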

Paired two sample t-test
• Paired two sample t-test
• Each sample is tested by two players or a player twice.

• Null hypothesis: mean difference between two normally distributed
populations is zero.

• Example
        EBC    EBN    SIN    XIN    WEB    WSJ    Mean
LTH     83.36  86.32  86.80  85.50  85.53  87.15  85.88
Clear   84.06  86.77  86.55  85.41  85.70  87.58  86.09

• Calculate t-score:

• Find p-value.
• p = 0.1701 → the difference is not statistically significant.
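
A minimal sketch of the paired t-score calculation, assuming the standard formula t = (mean of differences) / (s_d / sqrt(n)) applied to the six per-corpus differences (Clear - LTH). The function name `paired_t_score` is hypothetical, and the p-value lookup is left to a t-distribution table.

```python
import math
import statistics

def paired_t_score(first, second):
    """Return (t-score, degrees of freedom) for a paired two-sample t-test."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    # Standard error of the mean difference (n - 1 denominator in stdev).
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_err, n - 1

# Scores per corpus: EBC, EBN, SIN, XIN, WEB, WSJ.
lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]
t_score, dof = paired_t_score(lth, clear)
print(f"t = {t_score:.3f}, degrees of freedom = {dof}")
```

With these numbers the t-score comes out near 1.60 with 5 degrees of freedom, which is consistent with the two-tailed p = 0.1701 above.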

NLP data is often not normally distributed.

Wilcoxon signed-rank test
• Wilcoxon signed-rank test
• Non-parametric test: no distribution is assumed.

• Null hypothesis: median difference between pairs of observations is zero

• Example
              EBC    EBN    SIN    XIN    WEB    WSJ
LTH           83.36  86.32  86.80  85.50  85.53  87.15
Clear         84.06  86.77  86.55  85.41  85.70  87.58
Clear - LTH    0.70   0.45  -0.25  -0.09   0.17   0.43
Signed rank    6      5     -3     -1      2      4

• W+ = 2 + 4 + 5 + 6 = 17, W- = |-1| + |-3| = 4

• Use the min(W+, W-) to find p-value.
• p ≤ 0.2188 → the difference is not statistically significant.

• cf. paired two sample t-test: p = 0.1701.
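
The ranking and rank sums above can be reproduced with a short sketch. It assumes no zero differences and no tied absolute differences (true for this example), and it computes an exact two-sided p-value by enumerating all sign assignments rather than using a table. The function name `wilcoxon_signed_rank` is hypothetical.

```python
from itertools import combinations

def wilcoxon_signed_rank(first, second):
    """Return (W+, W-, exact two-sided p-value) for paired samples.

    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    # Rank the pairs by absolute difference, smallest first (rank 1).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = {index: rank for rank, index in enumerate(order, start=1)}
    w_plus = sum(rank for i, rank in ranks.items() if diffs[i] > 0)
    w_minus = sum(rank for i, rank in ranks.items() if diffs[i] < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis each rank is equally likely to be + or -,
    # so count the rank subsets whose sum is at most w (out of 2^n).
    count = sum(
        1
        for k in range(n + 1)
        for subset in combinations(range(1, n + 1), k)
        if sum(subset) <= w
    )
    p_value = min(1.0, 2 * count / 2 ** n)
    return w_plus, w_minus, p_value

lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]
w_plus, w_minus, p_value = wilcoxon_signed_rank(lth, clear)
print(w_plus, w_minus, round(p_value, 4))
```

The enumeration is only practical for small n; for larger samples a table or a normal approximation would be the usual choice.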

Fisher's exact test
• Fisher's exact test
• Comparing binary outputs produced by two methods.

• The significance of the deviation can be calculated exactly.

• Null hypothesis: output difference between two methods is zero.
           Method 1   Method 2   Total
Class 1    a          b          a+b
Class 2    c          d          c+d
Total      a+c        b+d        n
• Example
             Clear      LTH        Total
Correct      142,731    142,375    285,106
Incorrect     23,055     23,411     46,466
Total        165,786    165,786    331,572
Really?
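
A minimal sketch of how the exact p-value could be computed for a 2×2 table laid out as a, b, c, d above. It assumes the common two-sided definition (sum the hypergeometric probabilities of all tables with the same margins that are no more likely than the observed one) and uses log-gamma so the large counts in the example do not overflow. The function names are hypothetical.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    log_observed = log_prob(a)
    tolerance = 1e-7  # absorbs rounding error when comparing equal probabilities
    total = sum(
        math.exp(lp)
        for lp in (log_prob(x) for x in range(lowest, highest + 1))
        if lp <= log_observed + tolerance
    )
    return min(1.0, total)

# Small sanity check: rows (3, 1) and (1, 3) give 34/70.
p_small = fisher_exact_two_sided(3, 1, 1, 3)
# Example table: Clear vs. LTH, correct vs. incorrect.
p_example = fisher_exact_two_sided(142731, 142375, 23055, 23411)
print(p_small, p_example)
```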

Pearson's chi-square test
• Pearson's chi-square test
• Each observation is independent from one another.

• The chi-square distribution is assumed.

• Null hypothesis: difference between observed frequency distribution and
true distribution is zero.
• Example
             Clear (observed)   LTH (true)   X2
Correct          142,731         142,375     0.89
Incorrect         23,055          23,411     5.41
Total            165,786         165,786     6.3

• Calculate X2-score:

• Use the X2-score to find p-value.

• p = 0.0121 → the difference is statistically significant with 98.79% confidence.
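
The X2-score in the example can be reproduced with the usual sum of (observed - true)² / true over the categories, taking Clear as observed and LTH as true. The sketch below assumes one degree of freedom (two categories with a fixed total), for which the chi-square tail probability can be written with `math.erfc`. The function names are hypothetical.

```python
import math

def chi_square_score(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p_value_df1(x2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # For 1 degree of freedom, P(X > x2) = P(|Z| > sqrt(x2)) = erfc(sqrt(x2 / 2)).
    return math.erfc(math.sqrt(x2 / 2))

clear = [142731, 23055]  # observed: correct, incorrect
lth = [142375, 23411]    # treated as the true distribution
x2 = chi_square_score(clear, lth)
p_value = chi_square_p_value_df1(x2)
print(f"X2 = {x2:.2f}, p = {p_value:.4f}")
```

The two per-row terms are about 0.89 and 5.41, matching the X2 column of the example.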

McNemar's test
• McNemar's test
• Applied to 2×2 contingency tables with binary outputs.

• Non-parametric test: no distribution is assumed.

• Null hypothesis: p(b) = p(c)
               Method 1: +   Method 1: -
Method 2: +         a             b
Method 2: -         c             d
• Example
             Clear 1: +   Clear 1: -   Total
LTH 2: +      138,402        3,973    142,375
LTH 2: -        4,329       19,082     23,411
Total         142,731       23,055    165,786

• Calculate X2-score:

• Use the X2-score to find p-value.
• p < 0.0001 → the difference is statistically significant with over 99.99% confidence.
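
A minimal sketch of the McNemar calculation on the discordant cells b and c. The slide does not show which form of the statistic was used, so this assumes the uncorrected form, X2 = (b - c)² / (b + c), with an optional continuity correction, (|b - c| - 1)² / (b + c), as a hypothetical `continuity` flag. The p-value uses the chi-square tail with one degree of freedom.

```python
import math

def mcnemar(b, c, continuity=False):
    """Return (X2-score, p-value) for McNemar's test on discordant counts b and c."""
    diff = abs(b - c) - (1 if continuity else 0)
    x2 = diff ** 2 / (b + c)
    # Chi-square upper tail with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(x2 / 2))
    return x2, p_value

# From the example: b = LTH correct / Clear incorrect,
#                   c = LTH incorrect / Clear correct.
x2, p_value = mcnemar(3973, 4329)
print(f"X2 = {x2:.2f}, p = {p_value:.6f}")
```

Only the cells where the two methods disagree enter the statistic; the 138,402 and 19,082 agreements do not affect it.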
