Significance Tests
     in NLP
    Presented by Jinho D. Choi
 University of Colorado at Boulder
      September 15th, 2010
Data Type
•   Continuous data
    •   Outputs are from infinitely many possible values (regression).

    •   e.g., temperatures, document relevancies.

    •   Values are related to one another (e.g., ordered on a scale).

    •   One sample t-test, Paired two sample t-test.


•   Categorical data
    •   Outputs are from finitely defined categories (classification).

    •   e.g., POS tags, dependency labels.

    •   Values are not related to one another (categories have no inherent scale).

    •   Wilcoxon’s signed-rank test, Fisher’s exact test, Pearson’s chi-square
        test, McNemar’s test
One sample t-test
•   One sample t-test
    •   The true mean is known, and the normal distribution is assumed.

    •   Null hypothesis: difference between true mean and our mean is zero.

•   Example
    •   Average ITA score = 84.31% (true mean)
              be          say         get        know          see      our mean
            90.88%     89.75%       84.11%      87.57%      88.19%        90.25%

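    •   Calculate t-score: $t = (\bar{x} - \mu) / (s / \sqrt{n})$, where $\bar{x}$ is our mean, $\mu$ the true mean, $s$ the sample standard deviation, and $n$ the number of samples.
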
    •   Use the t-score to find p-value in the distribution table.
        •    Degrees of freedom: the minimal number of values needed to determine all the data points ($n - 1$ for a one sample t-test).

        •    p ≤ 0.01 → the difference is statistically significant with over 99% confidence.
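
The t-score step above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `one_sample_t` is hypothetical. It uses only the five per-verb scores printed on the slide, whose average is 88.10% and not the 90.25% given as "our mean", so the slide's mean presumably covers more data and the t-score here is illustrative rather than a reproduction of the slide's result.

```python
import math
import statistics


def one_sample_t(scores, true_mean):
    """Return (t-score, degrees of freedom) for a one sample t-test."""
    n = len(scores)
    sample_mean = statistics.mean(scores)
    sample_sd = statistics.stdev(scores)  # sample standard deviation (n - 1)
    t = (sample_mean - true_mean) / (sample_sd / math.sqrt(n))
    return t, n - 1


# Assumption: only the five scores shown on the slide are used here.
verb_scores = [90.88, 89.75, 84.11, 87.57, 88.19]
t_score, dof = one_sample_t(verb_scores, true_mean=84.31)
print(f"t = {t_score:.3f}, degrees of freedom = {dof}")
```

The resulting t-score and degrees of freedom are then looked up in a t-distribution table, as the slide describes.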
Paired two sample t-test
•   Paired two sample t-test
    •    Each sample is tested by two players or a player twice.

    •    Null hypothesis: mean difference between two normally distributed
         populations is zero.

•   Example
                   EBC        EBN       SIN        XIN       WEB            WSJ    Mean
        LTH       83.36      86.32     86.80      85.50      85.53         87.15   85.88
        Clear     84.06      86.77     86.55      85.41      85.70         87.58   86.09


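    •    Calculate t-score: $t = \bar{d} / (s_d / \sqrt{n})$, where $\bar{d}$ and $s_d$ are the mean and standard deviation of the paired differences and $n$ is the number of pairs.

    •    Find p-value.
        •    p = 0.1701 → the difference is not statistically significant.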


            NLP data is often not normally distributed.
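
As a rough check of the paired example, the sketch below computes the t-score from the six per-genre differences (Clear minus LTH). It uses only the standard library, and the function name `paired_t` is hypothetical. The p-value itself still has to come from a t-distribution table or a statistics package.

```python
import math
import statistics


def paired_t(first, second):
    """Return (t-score, degrees of freedom) for a paired two sample t-test."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Scores from the slide, in the order EBC, EBN, SIN, XIN, WEB, WSJ.
lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]
t_score, dof = paired_t(lth, clear)
print(f"t = {t_score:.3f}, degrees of freedom = {dof}")
```

With 5 degrees of freedom, a t-score of this size corresponds to a two-sided p-value of about 0.17, in line with the figure on the slide.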
Wilcoxon signed-rank test
•   Wilcoxon signed-rank test
    •    Non-parametric test: no distribution is assumed.

    •    Null hypothesis: median difference between pairs of observations is zero

•   Example
                        EBC         EBN         SIN         XIN       WEB      WSJ
            LTH        83.36       86.32       86.80       85.50      85.53   87.15
           Clear       84.06       86.77       86.55       85.41      85.70   87.58
        Clear - LTH     0.7        0.45        -0.25       -0.09      0.17    0.43
        Signed rank      6           5           -3          -1         2       4

    •    W+ = 2 + 4 + 5 + 6 = 17, W- = |-1| + |-3| = 4

    •    Use the min(W+, W-) to find p-value.
        •   p ≤ 0.2188 → the difference is not statistically significant.

        •   cf. paired two sample t-test: p = 0.1701.
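
The ranking and the exact p-value can be sketched as follows. This is a minimal standard-library version, assuming no zero differences and no tied absolute differences (true for this example); the function name `wilcoxon_exact` is hypothetical. The p-value is obtained by counting how many of the 2^n possible sign assignments give a rank sum no larger than min(W+, W-).

```python
def wilcoxon_exact(first, second):
    """Exact two-sided Wilcoxon signed-rank test.

    Assumes no zero differences and no ties among absolute differences.
    Returns (W+, W-, p-value).
    """
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    # Rank the differences by absolute value, smallest first.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = {i: r for r, i in enumerate(order, start=1)}
    w_plus = sum(ranks[i] for i, d in enumerate(diffs) if d > 0)
    w_minus = sum(ranks[i] for i, d in enumerate(diffs) if d < 0)
    w = min(w_plus, w_minus)
    # counts[s] = number of subsets of the ranks {1..n} whose sum is s.
    counts = [1] + [0] * (n * (n + 1) // 2)
    for r in range(1, n + 1):
        for s in range(len(counts) - 1, r - 1, -1):
            counts[s] += counts[s - r]
    p = min(1.0, 2 * sum(counts[: w + 1]) / 2 ** n)
    return w_plus, w_minus, p


lth = [83.36, 86.32, 86.80, 85.50, 85.53, 87.15]
clear = [84.06, 86.77, 86.55, 85.41, 85.70, 87.58]
w_plus, w_minus, p = wilcoxon_exact(lth, clear)
print(w_plus, w_minus, round(p, 4))  # 17 4 0.2188
```

For six pairs there are 64 sign assignments, and 7 of them give a rank sum of 4 or less, so the two-sided p-value is 14/64 = 0.2188.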
Fisher's exact test
•   Fisher's exact test
    •   Comparing binary outputs produced by two methods.

    •   The significance of the deviation can be calculated exactly.

    •   Null hypothesis: output difference between two methods is zero.
                      Method 1 Method 2    Total
          Class 1        a        b         a+b
          Class 2        c        d        c+d
           Total        a+c      b+d         n
•   Example
                            Clear       LTH          Total
           Correct        142,731    142,375       285,106
          Incorrect        23,055     23,411        46,466
            Total         165,786    165,786       331,572
                                                                 Really?
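
The slide does not show how the exact p-value is computed. One common approach, sketched below under that assumption, sums the hypergeometric probabilities of every table with the same margins that is no more likely than the observed one. The function names are hypothetical, and log-gamma is used so that the large counts in the example do not overflow.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def log_prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return (log_comb(col1, x) + log_comb(n - col1, row1 - x)
                - log_comb(n, row1))

    observed = log_prob(a)
    low, high = max(0, row1 + col1 - n), min(row1, col1)
    total = 0.0
    for x in range(low, high + 1):
        lp = log_prob(x)
        if lp <= observed + 1e-7:  # small tolerance for floating-point noise
            total += math.exp(lp)
    return min(1.0, total)


# Small sanity check: the table [[3, 1], [1, 3]] gives 34/70.
p_small = fisher_exact_two_sided(3, 1, 1, 3)
# Table from the slide: rows Correct/Incorrect, columns Clear/LTH.
p_slide = fisher_exact_two_sided(142731, 142375, 23055, 23411)
print(round(p_small, 4), p_slide)
```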
Pearson's chi-square test
•   Pearson's chi-square test
    •   Each observation is independent from one another.

    •   The chi-square distribution is assumed.

    •   Null hypothesis: difference between observed frequency distribution and
        true distribution is zero.
•   Example
                     Clear (observed)    LTH (true)       X2
         Correct         142,731          142,375        0.89
        Incorrect         23,055           23,411        5.41
          Total          165,786          165,786        6.30

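    •   Calculate X2-score: $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, where $O_i$ is the observed count and $E_i$ the true (expected) count for class $i$.

    •   Use the X2-score to find p-value.

        •   p = 0.0121 → the difference is statistically significant with 98.79% confidence.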
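
The X2 column of the example can be reproduced with the short sketch below, which treats the Clear counts as observed and the LTH counts as true. It assumes one degree of freedom (two classes), for which the chi-square tail probability has the closed form erfc(sqrt(x/2)); the function names are hypothetical.

```python
import math


def pearson_chi_square(observed, expected):
    """Return (per-class terms, total) of Pearson's chi-square statistic."""
    terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    return terms, sum(terms)


def chi_square_p_value_df1(x2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x2 / 2))


clear_counts = [142731, 23055]  # observed: correct, incorrect
lth_counts = [142375, 23411]    # true: correct, incorrect
terms, x2 = pearson_chi_square(clear_counts, lth_counts)
p = chi_square_p_value_df1(x2)
print([round(t, 2) for t in terms], round(x2, 2), round(p, 4))
```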
McNemar's test
•   McNemar's test
    •   Applied to 2×2 contingency tables with binary outputs.

    •   Non-parametric test: no distribution is assumed.

    •   Null hypothesis: p(b) = p(c)
                            Method 1:+     Method 1:-
              Method 2:+        a              b
              Method 2:-        c              d
•   Example
                        Clear 1: +       Clear 1: -        Total
        LTH 2: +            138,402            3,973         142,375
        LTH 2: -               4,329          19,082          23,411
          Total             142,731           23,055         165,786


    •   Calculate X2-score:

    •   Use the X2-score to find p-value.
        •   p < 0.0001 → the difference is statistically significant with over 99.99% confidence.
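
A minimal sketch of the McNemar computation follows. The slide does not show its formula, so this assumes the common form X2 = (b - c)^2 / (b + c) with one degree of freedom, with an optional continuity correction; the function name `mcnemar` and the `continuity` parameter are hypothetical. Only the two discordant cells b and c enter the statistic.

```python
import math


def mcnemar(b, c, continuity=False):
    """Return (chi-square score, p-value) for McNemar's test.

    b and c are the discordant cell counts of the 2x2 table.
    The p-value uses the chi-square distribution with 1 degree of freedom.
    """
    diff = abs(b - c) - (1 if continuity else 0)
    x2 = diff ** 2 / (b + c)
    p = math.erfc(math.sqrt(x2 / 2))
    return x2, p


# Discordant cells from the slide: 3,973 and 4,329.
x2, p = mcnemar(3973, 4329)
print(round(x2, 2), p < 0.0001)  # 15.27 True
```

Either variant of the score gives p < 0.0001 for this table, so the conclusion on the slide does not depend on the continuity correction.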
