A. spanos slides ch14-2013 (4)

CHAPTER 14: Frequentist Hypothesis Testing:
a Coherent Account
Aris Spanos [Fall 2013]

1 Inherent difficulties in learning statistical testing
Statistical testing is arguably the most important, but also the most
difficult and confusing chapter of statistical inference for several reasons, including the following.
(i) The need to introduce numerous new notions, concepts and
procedures before one can paint — even in broad brushes — a coherent
picture of hypothesis testing.
(ii) The current textbook discussion of statistical testing is both
highly confusing and confused. There are several sources of confusion.
I (a) Testing is conceptually one of the most sophisticated sub-fields
of any scientific discipline.
I (b) Inadequate knowledge by textbook writers who often do not
have the technical skills to read and understand the original sources,
and have to rely on second hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of
these textbooks hypothesis testing is poorly explained as an idiot’s
guide to combining off-the-shelf formulae with statistical tables like the
Normal, the Student’s t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the
background.
I (c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact the latter has much
greater affinity with the Bayesian rather than the frequentist inference.
I (d) A deliberate attempt to distort and cannibalize frequentist
testing by certain Bayesian statisticians who revel in (unfairly) maligning frequentist inference in their misguided attempt to motivate their
preferred viewpoint of statistical inference. You will often hear such
Bayesians tell you that "what you really want (require) is probabilities
attached to hypotheses" and other such misadvised promptings! In
frequentist inference probabilities are always attached to the different
possible values of the sample x∈R ; never to hypotheses!

(iii) The discussion of frequentist testing is rather incomplete in so
far as it has been beleaguered by serious foundational problems since
1

the 1930s. As a result, different applied fields have generated their own
secondary literatures attempting to address these problems, but often
making things much worse! Indeed, in some fields like psychology it
has reached the stage where one has to correct the ‘corrections’ of those
chastising the initial ‘correctors’!
In an attempt to alleviate problem (i), the discussion that follows
uses a sketchy historical development of frequentist testing. To ameliorate problem (ii), the discussion includes ‘red flag’ pointers (¥) designed
to highlight important points that shed light on certain erroneous interpretations or misleading arguments. The discussion will pay special
attention to (iii), addressing some of the key foundational problems.

2

Francis Edgeworth

A typical example of a testing procedure at the end of the 19th century
is provided by Edgeworth (1885). From today’s perspective his testing
takes the form of viewing data x0 :=(11  12      1 ; 21  22   2 ) in
the context of the following bivariate simple Normal model:
¶
µµ
¶ µ 2
¶¶
µ
1
 0
1
v NIID


=1 2   
X:=
2
2
0 2
(1 )=1  (2)=2   (1 )= (2 ) =  2  (1  2 )=0
The hypothesis of interest concerns the equality of the two means:
Hypothesis of interest: 1 = 2 
The 2-dimensional sample is: X:=(11  12      1 ; 21  22   2 )
Common sense, combined with the statistical knowledge at the time
suggested using the difference between the estimated means:
P
P
1 − 2  where 1 =  =1 1  2 =  =1 2 
b
b
b 1
b 1

as a basis for deciding whether the two means are different or not.
To render the difference (b1 − 2 ) free of the units of measurement,

b
Edgeworth divided it by its standard deviation:
q


p
P
P
b2 1

b2 1

 (b1 −b2 )= (b2 + 2 )  1 =  (1 −b1 )2   2 =  (2 −b1 )2 
 
 1 b1
=1

2

=1

|ˆ 
to define the distance function: (X)= √1 −ˆ 2 |2 
2
(1 +1 )
 

The last issue he needed to address is how big the observed distance
(x0 ) should be to justify inferring that the two means are different.
Edgeworth argued that (1 −2 )6=0 could not be ‘accidental’ (due to
chance) if:
√
|ˆ 
(X) = √1 −ˆ2 |2  2 2
(1)
2
(1 +1 )
 

√
I Where did the threshold 2 2 come from? The tail area of N(0 1)
√
beyond ±2 2 is approximately 005, which was viewed as a reasonable
value for the probability of a ‘chance’ error, i.e. the error of inferring
a significant discrepancy. But why Normal? At the time statistical
inference relied heavily on large sample size  (asymptotic) results by
invoking the Central Limit Theorem.
In summary, Edgeworth introduced 3 features of statistical testing:
(i) a hypothesis of interest: 1 = 2 
(ii) the notion of a standardized distance: (X),√
(iii) a threshold value for significance: (X)  2 2

3

Karl Pearson

The Pearson approach to statistics can be summarized as follows:
Histogram
Data
=⇒
x0 :=(1    )

=⇒

Fitting a frequency curve
from the Pearson family
 (; b1  b2  b3  b4 )
   

The Pearson family of frequency curves can be expressed in terms
of the following differential equation:
 ln  ()


=

(−1 )

2 +3 +4 2

(2)

Depending on the values taken by the parameters (1  2  3  4 ) this
equation can generate numerous frequency curves such as the Normal,
the Student’s , the Beta, the Gamma, the Laplace, the Pareto, etc.

3

The first four raw data moments 01 , 02 , 03 and 04 are sufficient to
b b b
b
determine (b1  b2  b3  b4 ) and that enables one can select the particular
   
 () from the Pearson family.
Having selected a particular frequency curve 0 () on the basis of
b
 (; b1  b2  b3  b4 )= (; θ) Karl Pearson proposed to assess the appro   
priateness of this choice. His hypothesis of interest is the form:
0 () =  ∗ ()∈Pearson(1  2  3  4 )  ∗ ()-true density.

Pearson proposed the standardized distance function:
X ( − )2
X (( )−( ))2




(X)=
=
v 2 ()

( )
=1

=1



(3)

b
where (  =1 2  ) and ( =1 2  ) denote the empirical and
assumed (as specified by 0 ()) frequencies, and introduced the notion
of a p-value:
P((X)  (x0 )) = (x0 )
(4)
as a basis for inferring whether the choice of 0 () was appropriate or
not. The smaller the (x0 ) the worst the fit.
¥ It is important to Note that this hypothesis of interest 0 ()= ∗ ()
is not about a particular parameter, but about the adequacy of the
choice of the probability model. In this sense, this was the first MisSpecification (M-S) test for a distributional assumption!
Example. Mendel’s cross-breeding experiments (1865-6) were based
on pea-plants with different shape and color, which can be framed in
terms of two Bernoulli random variables:
R

W

z }| {
z }| {
(round)=0 (wrinkled)=1

Y

G
z }| {
z }| {
 (yellow)=0  (green)=1

His theory of heredity was based on two assumptions:
(i) the two random variables  and  are independent,
(ii) round (=0) and yellow ( =0) are dominant traits, but ‘winkled’ (=1) and ‘green’ ( =1) are recessive traits with probabilities:
P(=0)=P( =0) = 75 P(=1)=P( =1) = 25
This substantive theory gives rise to Mendel’s model based on the
bivariate distribution below:
4


0
1

0
1
 ()
5625 1875 750
1875 0625 250

 () 750 250 100
Mendel’s theory model  ( )
Data. The 4 gene-pair experiments Mendel carried out with =556,
gave rise the observed frequencies and relative frequencies given below:
R,Y

R,G

W,Y

W,G

(0 0) (0 1) (1 0) (1 1)
315
108
101
32
Observed frequencies

⇒


0
1

b
 ()
0
1
5666 1942 7608
1817 0576 2393

b
 () 7483 2518 1001
Observed relative frequencies

How adequate is Mendel’s theory in light of the data? One can answer
that question using Pearson’s goodness-of-fit chi-square test.
X (( )−( ))2


The chi-square test statistic (X)=
compares
( )
=1
how close are the observed to the expected relative frequencies.
Using the above data this test statistic yields:
³
´
(5666−5625)2
(1942−1875)2
(1817−1875)2
(0576−0625)2
(X)=556
+ 1875 + 1875 + 0625
=463
5625

In light of the fact that the tail area of 2 (3) yields: P((X)  0470)= 927
suggests accordance with Mendel’s theory.
Karl Pearson’s primary contributions to testing were:
(a) the broadening of the scope of the hypothesis of interest,
initiating misspecification testing with a goodness-of-fit test,
(b) a distance function whose distribution is known
(at least asymptotically), and
(c) the use of the tail probability as a basis of deciding how good
the fit with data x0 is.

5

4

William Gosset (aka ‘Student’)

Gosset’s 1908 seminal paper provided the cornerstone upon which Fisher
founded modern statistical inference. At that time it was known that in
the case when (1  2    ) are NIID (the simple Normal model),
P
1
the MLE estimator  :=  =  =1  had the following ‘sampling’
b
distribution:
³ 2´
√
  v N   ⇒ (  −) v N (0 1) 



It was also known that in the case where 2 is replaced by its estimator:
P
1
2 = −1 =1 ( −ˆ  )2 

√
(  −)


the distribution of
but asymptotically it can

√
?
(  −)
is unknown, i.e.
vD (0 1) 

√
(  −)
v N (0 1) 
be approximated by:


?

Gosset (1908) derived the unknown D ()  showing that for any 1:
√
(  −)


v St(−1)

(5)

where St(−1) denotes a Student’s t distribution with (−1) degrees
of freedom. The subtle question that needs to be answered is: what
does the result mean in light of the fact that  is unknown.
I This was the first finite sample result [a distributional result that
is valid for any   1] that inspired R. A. Fisher to pioneer modern
frequentist inference.

5

R. A. Fisher’s significance testing

The result (5) was formally proved and extended by Fisher (1915) and
used subsequently as a basis for several tests of hypotheses associated
with a number of different statistical models in a series of papers.
A key element in Fisher’s framework was the introduction of the
notion of a statistical model, whose generic form:
M (x)={ (x; θ) θ∈Θ} x∈R 

6

(6)

where  (x; θ) x∈R denotes the (joint) distribution of the sample

X:=(1    ) that encapsulates the prespecified probabilistic structure of the underlying stochastic process {  ∈N} The link to phenomenon of interest comes from viewing data x0 :=(1  2       ) as a
‘truly typical’ realization of the process {  ∈N}.
Fisher used the result (5) to construct a test of significance for the:
null hypothesis: 0 : =0 

(7)

in the context of the simple Normal model (table 1).
Table 1 - The simple Normal model
Statistical GM:
 =  +   ∈N
[1] Normal:
 v N( )
[2] Constant mean:
( ) =  for all ∈N
[3] Constant variance:  ( ) = 2  for all ∈N
[4] Independence:
{  ∈N} - independent process.
¥ Fisher must have realized that the result in (5) by Gosset could
not be used directly as a basis for a test because it involved the unknown
parameter  One can only speculate, but it must have dawn on him
that the result can be rendered operational if it is interpreted in terms
of factual reasoning in the sense that it holds under the True State
of Nature (TSN) (=∗ the true value). Note that, in general, the
expression ‘∗ denotes the true value of ’ is a shorthand for saying
that ‘data x0 constitute a realization of the sample X with distribution
 (x; ∗ )’. That is, (5) is a pivot (pivotal quantity):
√
∗ TSN
 (X; )= ( − ) v

St(−1)

(8)

in the sense that it is a function of both X and  whose distribution
under TSN is known.
Fisher (1956), p. 47, was very explicit about the nature of reasoning
underlying his significance testing:
“In general, tests of significance are based on hypothetical probabilities
calculated from their null hypotheses.” [emphasis in the original].
His problem was to adapt the result in (8) to hypothetical reasoning
with a view to construct a test. Fisher accomplished that in three steps.
7

Step 1. He replaced the unknown ∗ in (8) with the known null
value 0 
transforming the pivot  (X; ) into a statistic

√
 (X)= ( −0 ) 

A statistic is only a function of the sample X:=(1  2    )
Step 2. He adapted the factual reasoning underlying the pivot (8)
into hypothetical reasoning by evaluating the test statistic under 0 :
√
0
 (X)= ( −0 ) v

St(−1)

(9)

Step 3. He employed (9) to define the -value:
P( (X)   (x0 ); 0 ) = (x0 )

(10)

as an indicator of disagreement (inconsistency, contradiction) between data
x0 and 0 ; the bigger the value of  (x0 ) the smaller the -value.
£ It is crucial to note that it is a serious mistake — committed often
by misguided authors — to interpret the above probabilistic statement
defining the p-value as conditional on the null hypotheses 0 . Avoid
the highly misleading notation,
P( (X)   (x0 ) | 0 ) = (x0 ) ×
of the vertical line (|) instead of a semi-colon (;). Conditioning on 0
makes no sense in frequentist inference since  is not a random variable;
the evaluation in (10) is made under the scenario that 0 : =0 is true.
Although it is impossible to pin down Fisher’s interpretation of
the p-value because his views changed over time and he held several
interpretations at any one time, there is one fixed point among his
numerous articulations (Fisher, 1925, 1955):
‘a small p-value can be interpreted as a simple logical disjunction:
either an extremely rare event has occurred or 0 is not true’.
The focus on "a small p-value" needs to be read in conjunction
with Fisher’s falsificationist stance about testing in the sense that
significance tests can falsify but never verify hypotheses (Fisher, 1955):
“... tests of significance, when used accurately, are capable of
rejecting or invalidating hypotheses, in so far as these are contradicted by the data; but that they are never capable of establishing
them as certainly true.”
8

His logical disjunction combined with the falsificationist stance are
not helpful in shedding light on the key issue of learning from data:
what do data x0 convey about the truth or falsity of 0 ?
Occasionally, Fisher would go further and express what sounds more
like an aspiration for an evidential interpretation:
“The actual value of  ... indicates the strength of evidence
against the hypothesis” (see Fisher, 1925, p. 80),
but he never articulated a coherent evidential interpretation based
on the p-value. Indeed, as late as 1956 he offered ‘a degree of rational
disbelief’ interpretation (Fisher, 1956, pp. 46-47).
¥ A more pertinent interpretation of the -value (section 8) is :
‘the p-value is the probability – evaluated under the scenario that
0 is true – of all possible outcomes x∈R that accord less well with

0 than x0 does.
In light of the fact that data x0 were generated by the ‘true’ (θ=θ∗ )
statistical Data Generating Mechanism (DGM):
M∗ (x)={ (x; θ∗ )} x∈R 

√

∗
↓

and the test statistic  (X)= ( −0 ) measures the standardized difference between ∗ and 0  the p-value P( (X)   (x0 ); 0 )=(x0 ) seeks
to measure the discordance between 0 and M∗ (x).
This precludes several erroneous interpretations of the p-value:
£ The p-value is the probability that 0 is true.
£ 1−(x0 ) is the probability that the 1 is true.
£ The p-value is the probability that the particular result is due to
‘chance error’.
£ The p-value is an indicator of the size or the substantive importance
of the observed effect.
£ The p-value is the conditional probability of obtaining a test statistic (x) more extreme than (x0 ) given 0 ; there is no conditioning
involved.
Example 1. Consider testing the null hypothesis: 0 :  = 70
in the context of the simple Normal model (see table 1), using the
exam score data in table 1.6 (see chapter 1) where:
 =71686 2 =13606 and =70. The test statistic (9) yields:
ˆ
√
√
 (x0 )= 70(71686−70) =3824
13606

P( (X)  3824; 0 =70)=00007
9

where (x0 )=00007 is found from the St(69) tables. The tiny -value
suggests that x0 indicate strong discordance with 0 .
Table 2 - The simple Bernoulli model
Statistical GM:
 =  +   ∈N
[1] Bernoulli:
 v Ber( )
[2] Constant mean:
( ) =  for all ∈N
[3] Constant variance:  ( )=(1 − ) for all ∈N
[4] Independence:
Example 2. Arbuthnot’s 1710 conjecture: the ratio of males
to females in newborns might not be ‘fair’. This can be tested using
the statistical null hypothesis:
0 :  = 0 

where 0 =5 denotes ‘fair’

in the context of the simple Bernoulli model (table 2), based on the
random variable  deﬁned by: {male}={=1} {female}={=0}
P
1
Using the MLE of  ˆ :=  =  =1   whose sampling distribu
tion:
³
´
(1−)
  v Bin   ;  
one can derive the test statistic:
√
0
(
(X)= √  −0 ) v Bin (0 1; ) 
(11)
0 (1−0 )

Data: =30762 newborns during the period 1993-5 in Cyprus, out
of which 16029 were boys and 14833 girls.
The test statistic takes the form (11) and ˆ = 16029 =521
 30762
(x0 )=

√
30762(521−5)

√

5(5)

=7366 P((X)  7366; =5)=00000017

The tiny -value indicates strong discourdance with 0 . In general one
needs to specify a certain threshold for deciding how small the p-value is
‘small enough’ to falsify 0 . Fisher suggested several such thresholds,
01 025 05, 1, but insisted that the choice has to be made on a case
by case basis.

10

The main elements of a Fisher significance test { (X) (x0 )}.
Components of a Fisher significance test
(i) a prespecified statistical model: M (x)
(ii) a null (0 : =0 ) hypothesis,
(iii) a test statistic (distance function)  (X)
(iv) the distribution of  (X) under 0 
(v) the -value P( (X)   (x0 ); 0 )=(x0 )
(vi) a threshold value 0 [e.g.01 025 05], such that:
(x0 )  0 ⇒ x0 falsifies (rejects) 0 
Example. Consider a Fisher test based on a simple (one parameter)
Normal model (table 3).
Table 3 - The simple (one parameter) Normal model
Statistical GM:
 =  +   ∈N
[1] Normal:
 v N( )
[2] Constant mean:
( ) =  for all ∈N
[3] Constant variance:  ( )= 2 -known, for all ∈N
[4] Independence:

(i) M (x) :  v N ( 2 ) [2 is known],  = 1 2   
(ii) Null hypothesis: 0 :  = 0 (e.g. 0 = 0)
√

(iii) a test statistic:  (X) = ( −0 )


0
(iv) the distribution of  (X) under 0 :  (X) v N (0 1)
(v) the -value: P( (X)   (x0 ); 0 )=(x0 )
(vi) a threshold value for discordance, say 0 = 05
Distribution Plot

Distribution Plot

Normal, Mean=0, StDev=1

Normal, Mean=0, StDev=1

0.3
Density

0.4

0.3
Density

0.4

0.2

0.1

0.2

0.1
0.05

0.0

0.025
-1.64

0.0

0
X

Normal: P(  −164) = 05

0
X

1.96

Normal: P(  196) = 025
11

6

The Neyman-Pearson (N-P) framework

Neyman and Pearson (1933) motivated their own approach to testing
as an attempt to improve upon Fisher’s testing by addressing what
they consider as ad hoc features of that approach:
[a] Fisher’s ad hoc choice of a test statistics (X) pertaining to the
optimality of tests,
[b] Fisher’s use of a post-data [(x0 ) is known] threshold for the value to indicate falsification (rejection) of 0  and
[c] Fisher’s denial that (x0 ) can indicate confirmation (accordance)
of 0 
Neyman-Pearson (N-P) proposed solutions for [a]-[c] by:
¥ [i] Introducing the notion of an alternative hypothesis 1 as
the complement to the null with respect to Θ, within the same statistical
model:
M (x)={ (x; θ) θ∈Θ} x∈R 
(12)


[ii] Replacing the post-data p-value and its threshold with a pre-data
[before (x0 ) is available] significance level  determining the decision
rules:
(i) Reject 0  (ii) Accept 0 
which are calibrated using the error probabilities:
type I: P(Reject 0  when 0 is true)=,
type II: P(Accept 0  when 0 false)=.

These error probabilities enabled Neyman and Pearson to introduce the
key notion of an optimal N-P test: one that minimizes  subject to
a fixed , i.e.
[iii] Selecting the particular (X) in conjection with a rejection region for a given  that has the highest pre-data capacity (1 − ) to
detect discrepancies in the direction of 1 
In this sense, the N-P proposed solutions to the perceived weaknesses of Fisher’s significance testing, have changed the original Fisher
framework in four important respects:
¥ (a) the N-P testing takes place within a prespecified M (x) by
partitioning it into the null and alternative hypotheses,
12

(b) Fisher’s post-data p-value has been replaced with a pre-data 
and 
(c) the combination of (a)-(b) render ascertainable the pre-data capacity (power) of the test to detect discrepancies in the direction of
1 ,
(d) Fisher’s evidential aspirations for the p-value based on discordance, were replaced by the more behavioristic accept/reject decision
rules that aim to provide input for alternative actions: when accept 0
take action  when reject 0 take action .
As claimed by Neyman and Pearson (1928):
“the tests themselves give no final verdict, but as tools help the worker
who is using them to form his final decision.”
6.1

The archetypal N-P hypotheses specification

In N-P testing there has been a long debate concerning the proper way
to specify the null and alternative hypotheses. It is often thought to
be rather arbitrary and subject to abuse. This view stems primarily
from inadequate understanding of the role of statistical vs. substantive
information.
¥ It is argued that there is nothing arbitrary about specifying the
null and alternative hypotheses in N-P testing. The default alternative is always the complement to the null relative to the parameter
space of M (x).
Consider a generic statistical model:
M (x)={ (x; θ) θ∈Θ} x∈X:=R 


The archetypal way to specify the null and alternative hypotheses
for N-P testing is:
0 : ∈Θ0 vs. 1 : ∈Θ1 

(13)

where Θ0 and Θ1 constitute a partition of the parameter space Θ:
Θ0 ∩ Θ1 =∅ Θ0 ∪ Θ1 =Θ

In the case where the sets Θ0 or Θ1 contain a single point, say Θ0 ={0 }
that determines  (x; 0 ) the hypothesis is said to be simple, otherwise
it is composite.
Example. For the simple Bernoulli model, 0 :  = 0 is simple,
but 1 :  6= 0 is composite.
13

¥ An equivalent formulation that brings out the fact that N-P testing poses questions pertaining to the ‘true’ statistical Data Generating
Mechanism (DGM) M∗ (x)={ ∗ (x)} x∈X most clearly is to re-write
(13) in the equivalent form:
0 :  ∗ (x)∈M0 (x) vs. 1 :  ∗ (x)∈M1 (x)
where  ∗ (x) denotes the ‘true’ distribution of the sample, and:
M0 (x)={ (x; θ) θ∈Θ0 } or M1 (x)={ (x; θ) θ∈Θ1 } x∈X:=R ?


constitute a partition of M (x) P(x) denotes the set of all possible
statistical models that could have given rise to data x0 :=(1  2       ).
P (x )

H0
H1

N-P testing within M (x)

¥ When properly understood in the context of a statistical model
M (x) N-P testing leaves very little leeway in the specification of 0
and 1 hypotheses. This is because the whole of Θ [all possible values
of ] is relevant on statistical grounds, despite the fact that only a
small subset is often deemed relevant on substantive grounds.
The reasoning underlying this argument is that N-P tests invariably
pose hypothetical questions — framed in terms of ∈Θ — pertaining
to the true statistical DGM M∗ (x)={ ∗ (x)} x∈X Partitioning
M (x) provides the most effective way to learn from data by zeroing
in on the true  in the same way one can zero in on an unknown
prespecified city on a map using sequential partitioning.
¥ This foregrounds the importance of securing the statistical adequacy of M (x) [validating its probabilistic assumptions] before it
provides the basis of any inference. Hence, whenever the union of Θ0
and Θ1 the values of  postulated by 0 and 1 , respectively, do not
14

exhaust Θ there is always the possibility that the true  lies in the
complement: Θ − (Θ0 ∪ Θ1 ).
What constitutes an N-P test?
A misleading idea that needs to be done away with immediately is
that a N-P test is not a formula with some associated statistical tables!
It is a combination of a test statistic whose sampling distribution under
the null and alternative hypotheses is tractable in conjunction with a
rejection region.
Test statistic. Like an estimator, an N-P test statistic is a mapping
from the sample (X:=R ) to the parameter space (Θ):

() : X → R
that partitions the sample space into an acceptance 0 and a rejection region 1 : 0 ∩ 1 =∅ 0 ∪ 1 =X
in a way that correspond to Θ0 and Θ1  respectively:
¾
½
0 ↔ Θ0
=Θ
X=
1 ↔ Θ1
The N-P decision rules, based on a test statistic (X) take the form:
[i] if x0 ∈0  accept 0 

[ii] if x0 ∈1  reject 0 

(14)

and the associated error probabilities are:
type I:

P(x0 ∈1 ; 0 () true)=() for ∈Θ0 

type II: P(x0 ∈0 ; 1 (1 ) true)=(1 ) for 1 ∈Θ1 
True State of Nature
N-P rule
Accept 0
Reject 0

0 true
√
Type I error

0 false
Type √ error
II

£ Some misinformed statistics books go into great lengths to explain
that the above error probabilities should be viewed as conditional on
0 or 1  and they used the misleading notation (|):
P(x0 ∈1 |0 () true)=() for ∈Θ0 
15

Such conditioning makes no sense in a frequentist framework because 
is treated as an unknown constant. Instead, they should be interpreted
as evaluations of the sampling distribution of the test statistic under
diﬀerent scenarios pertaining to diﬀerent hypothesized values of 
¥ Some statistics books make a big deal of the distinction ‘accept
0 ’ vs. ‘fail to reject 0 ’, in their commendable attempt to bring out
the problem of misinterpreting ‘accept 0 ’ as tantamount to ‘there is
evidence for 0 ’; this is known as the fallacy of acceptance. By
the same token there is an analogous distinction between ‘reject 0 ’
vs. ‘accept 1 ’, that highlights the problem of misinterpreting ‘reject
0 ’ as tantamount to ‘there is evidence for 1 ’; this is known as the
fallacy of rejection.
What is objectionable about this practice is that the verbal distinctions by themselves do nothing but perpetuate these fallacies; we need
to address them, not play clever with words!
Table 3 - The simple (one parameter) Normal model
Statistical GM:
 =  +   ∈N
[1] Normal:
 v N( )
[2] Constant mean:
( ) =  for all ∈N
[3] Constant variance:  ( )= 2 -known, for all ∈N
[4] Independence:
Example 1. Consider the following hypotheses in the context of
the simple (one parameter) Normal model:
0 :  ≤ 0  vs. 1 :   0 
Let us choose a test  :={(X) 1 ()} of the form:
√
P

1
test statistic: (X)= ( −0 )  where   =  =1  
rejection region: 1 ()={x : (x)   }

(15)

(16)

To evaluate the two types of error probabilities the distribution of (X)
under both the null and alternatives are needed:
√
0 ( )

[I] (X)= ( −0 ) v 0 N(0 1)
√
1 ( )

[II] (X)= ( −0 ) v 1 N( 1  1)

16

(17)
for all 1  0 

where the non-zero mean  1 takes the form:
√
1
 1 = ( −0 )  0 for all 1  0 
The evaluation of the type I error probability is based on (I):
= max≤0 P((X)  ; 0 ())=P((X)   ; =0 )
Notice that in cases where the null is for the form 0 :  ≤ 0  the
type I error probability is deﬁned as the maximum over all values in
the interval (−∞ 0 ] which turns out to be the one evaluated at the
last point of the interval  = 0  Hence, the same test will be optimal
also in the case where the null is of the form 0 :  = 0 
The evaluation of type II error probabilities is based on (II):
(1 )=P((X) ≤  ; 1 (1 )) for all 1  0 
The power [rejecting the null when false] is equal to 1−(1 ) i.e.
(1 )=P((X)   ; 1 (1 )) for all 1  0 
Why do we care about power? The power (1 ) measures the pre-data
(generic) capacity of test  :={(X) 1 ()} to detect a discrepancy,
say =1 −0 = 1 when present. Hence, when (1 )=35 this test has
very low capacity to detect such a discrepancy.

How is this information helpful in practice? If =1 is the discrepancy
of substantive interest, this test is practically useless for that purposes
because we know beforehand that this test does not enough capacity
to detect  even if present!
17

What can one do in such a case? The power of the above test is
√
1
monotonically increasing with  1 = ( −0 )  and thus, increasing the
sample size , increases the power.
Hypothesis testing reasoning.
¥ As with Fisher’s significance testing, the reasoning underlying NP testing is hypothetical in the sense that both error probabilities are
based on evaluating the relevant test statistic under several hypothetical
scenarios, e.g. under 0 (=0 ) or under 1 (=1 ) for all 1 0  These
hypothetical sampling distributions are then used to compare 0 or 1
via (x0 ) to the True State of Nature (TSN) represented by data x0 
This is in contrast to frequentist estimation which is factual in
nature, i.e. the sampling distributions of estimators is evaluated under
the TSN, i.e.  = ∗ 
Significance level vs. -value
It is important to note that there is a mathematical relationship
between the type I error probability (significance level) and the p-value.
Placing them side by side:
P(type I error): P((X)   ; =0 ) = 
(18)
-value: P((X)  (x0 ); =0 )=(x0 )
it becomes obvious that:
(a) they are both specified in terms of the distribution of the same
test statistic (X) under 0 
(b) they both evaluate the probability of tail events of the form
{x: (x)  } x∈X, but
(c) they differ in terms of the threshold :  vs. (x0 );
in that sense  is a pre-data and (x0 ) is a post-data error probability.
In light of (a)-(c), three comments are called for.
First, the p-value can be viewed as the smallest significance level 
at which 0 would have been rejected with data x0 . For that reason the
p-value is often referred to as the observed significance level. Hence, it
should come as no surprise to learn that the above N-P decision rules
(14) could be recast in terms of the p-value:
[i]* if (x0 )   accept 0 

[ii]* if (x0 ) ≤  reject 0 
18

Indeed, practitioners often prefer to use the rules [i]*-[ii]* because the
p-value conveys additional information when it is not close to the predesignated threshold . For instance, rejecting 0 with (x0 )=0001
seems more informative than just 0 was rejected at  = 05
¥ Second, the p-value seems to have an implicit alternative built
into the choice of the tail area. For instance, a 2-sided alternative 1 :
6=0 would require one to evaluate P(|(X)|  |(x0 )|; =0 ) How
did Fisher get away with such an obvious breach of his own preaching?
Speculating on that, his answer might have been: 2-sided evaluation
of the p-value makes no sense, because, post-data, the sign of the test
statistic (x0 ) determines the direction of departure!
¥ Third, neither the signiﬁcance level  nor the p-value can be
interpreted as probabilities attached to particular values of  associated
with 0 or 1 , since the probabilities in (18) are ﬁrmly attached to
the sample realizations x∈X. Indeed, attaching probabilities to the
unknown constant  makes no sense in frequentist statistics.
As argued by Fisher (1921), p. 25: “We may discuss the probability
of occurrence of quantities which can be observed or deduced from observations, in relation to any hypotheses which may be suggested to explain
these observations. We can know nothing of the probability of hypotheses...”
This scotches another two erroneous interpretations of the p-value:
£ (x0 ) = 005 does not mean that there is a 5% chance of a Type I
error (i.e. false positive).
£ (x0 ) = 005 does not mean that there is a 95% chance that the
results would replicate if the study were repeated.
Example 1 (continued). Consider applying test :={(X) 1 ()}
for: 0 =10 =1 =100  =104  = 05 ⇒  =1645
√

√


(x0 )= ( −0 ) = 100(10175−10) =175   
1
The p-value is: P((X)  175; =0 )=04

Reject 0 

standard Normal tables
One-sided values
Two-sided values
=100 :  =128
=100 :  =1645
=050 :  =1645
=050 :  =196
=025 :  =196
=025 :  =200
=010 :  =233
=010 :  =258
19

The trade-off between type I and II error probabilities
I How does the introduction of type I and II error probabilities addresses
the arbitrariness associated with Fisher’s choice of a test statistic?
The ideal test is one whose error probabilities are zero, but no such
test exist for a given ; an ideal test, like the ideal estimator, exists as
 → ∞! Worse, there is a trade-off between the above error probabilities: as one increases the other decreases and vice versa.
Example. In the case of test  :={(X) 1 ()} in (16), decreasing the probability of type I error from =05 to =01 increases the
threshold from  =1645 to  =233 which makes it easier to accept
0  and this increases the probability of type II error.
To address this trade-off Neyman and Pearson (1933) proposed:
(i) to specify 0 and 1 in such a way so as to render the
type I error the more serious one, and then:
(ii) to choose an  significance level test {(X) 1 ()}
so as to maximize its power for all ∈Θ1 
N-P rationale. Consider the analogy with a criminal offense trial,
where the jury in such a trial are instructed by the judge to find the
defendant ‘not guilty’ unless they have been convinced ‘beyond any
reasonable doubt’ by the evidence:
0 : not guilty, vs. 1 : guilty.
The clause ‘beyond any reasonable doubt’ amounts to fixing the type
I error to a very small value, to reduce the risk of sending innocent
people to prison, or even death. At the same time, one would want the
system to minimize the risk of letting guilty people get off scot-free".
More formally, the choice of an optimal N-P test is based on fixing
an upper bound  for the type I error probability:
P(x∈1 ; 0 () true) ≤  for all ∈Θ0 

and then select a test {(X) 1 ()} that minimizes the type II error
probability, or equivalently, maximizes the power:
()=P(x0 ∈1 ; 1 () true)=1−() for all ∈Θ1 

I The general rule is that in selecting an optimal N-P test the whole
of the parameter space Θ is relevant. This is why partitioning both Θ
and the sample space X:=R using a test statistic (X) provides the

key to N-P testing.
20

Optimal properties of tests
The property that sets the gold standard for optimal test is:
[1] Uniformly Most Powerful (UMP): A test
 :={(X) 1 ()} is said to be UMP if it has higher power than any
e
other -level test   for all values ∈Θ1 , i.e.
e
(;  ) ≥ (;  ) for all ∈Θ1 
Example. Test  :={(X) 1 ()} in (16) is UMP; Lehmann (1986).
Additional properties of N-P tests
[2] Unbiasedness: A test  :={(X) 1 ()} is said to be unbiased
if the probability of rejecting 0 when false is always greater than that
of rejecting 0 when true, i.e.
max P(x0 ∈1 ; 0 ()) ≤   P(x0 ∈1 ; 1 ()) for all ∈Θ1 
∈Θ0

[3] Consistency: A test  :={(X) 1 ()} is said to be consistent
if its power goes to one for all ∈Θ1 as  → ∞, i.e.
lim   () = P(x0 ∈1 ; 1 () true)=1 for all ∈Θ1 
→∞

Like in estimation, this is a minimal property for tests.
Evaluating power. How does one evaluate the power of test
 :={(X) 1 ()} for diﬀerent discrepancies =1 −0 ?
Step 1. In light of the fact that for the relevant sampling distribution for the evaluation of (1 ) is:
√

1 ( )


[II] (X)= ( −0 ) v 1 N( 1  1) for all 1  0 
the use the N(0 1) tables requires on to split the test statistic into:
√
√
√
(  −0 )

1
= ( −1 ) +  1   1 = ( −0 )

since under 1 (1 ) the distribution of the ﬁrst component is:
³√
´  ( )
√
1 1
(  −1 )
(  −0 )
=
− 1
v N(0 1) for 1  0 



Step 2. Evaluating of power the test  :={(X) 1 ()} for different discrepancies yields:
=1 −0
=1
=2
=3

√
1
 1 = ( −0 ) :

=1:
=2:
=3:

√


(1 )=P( ( −1 )  − 1 + ; 1 )
(101)=P(  −1 + 1645) = 259
(102)=P(  −2 + 1645) = 639
(103)=P(  −3 + 1645) = 913

21

where  is a generic standard Normal r.v., i.e.  v N(0 1)

The power of this test is typical of an optimal test since (1 )
√
1
increases with the non-zero mean  1 = ( −0 )  and thus:
(a) the power increases with the sample size 
(b) the power increases with the discrepancy =(1 −0 ) and
(c) the power decreases with 
It is very important to emphasize three features of an optimal test.
First, a test is not just a formula associated with particular statistical
tables. It is a combination of a test statistic and a rejection region;
hence the notation :={(X) 1 ()}.
Second, the optimality of an N-P test is inextricably bound up with
the optimality of the estimator the test statistic is based on. Hence,
it is no accident that most optimal N-P tests are based on consistent,
fully eﬃcient and suﬃcient estimators. Example. In the case of the
simple (one parameter) Normal model, consider replacing   with the
2
unbiased estimator 2 =(1 + )2 vN( 2 ) The resulting test:
b
ˇ
 :={(X) 1 ()} where (X)=

√
2(2 −0 )



and 1 ()={x: (x)   }

will not be optimal because its power is much lower than that of  , and
ˇ
 is inconsistent! This is because the non-zero mean of the sampling
√
1
distribution under =1 will be  = 2(−0 ) which does not change as
 → ∞
Third, it is important to note that by changing the rejection region
one can render an optimal N-P test useless! For instance, replacing the
22

rejection region of :={(X) 1 ()} with:
 1 ()={x: (x)   }
the resulting test T :={(X)  1 ()} is practically useless because it
¸
is biased and its power decreases as the discrepancy  increases.
Example 2 - the Student’s t test. Consider the 1-sided hypotheses:
(1-s): 0 :  = 0  vs. 1 :   0
(19)
or (1-s*): 0 :  ≤ 0  vs. 1 :   0 
in the context of the simple Normal model (table 1).
In this case, a UMP (consistent) test  :={ (X) 1 ()} exists.
It is the well-known Student’s t test, and takes the form:
test statistic:

√
 (X)= ( −0 ) 

2



1
= −1


P

( −  )2 

=1

(20)

rejection region: 1 ()={x :  (x)   }

where  can be evaluated using the Student’s t tables.
To evaluate the two types of error probabilities the distribution of
 (X) under both the null and alternatives are needed:
√
(  −0 ) 0 (0 )
v

√
1 ( )
 (X)= ( −0 ) v 1

[I]*  (X)=
[II]*

where  1 =

√
(1 −0 )


St(−1)

(21)

St( 1 ; −1) for all 1  0 

is the non-centrality parameter.

Student’s t tables (=60)
One-sided values
Two-sided values
=100 :  =1296
=100 :  =1671
=050 :  =1671
=050 :  =2000
=010 :  =2390
=010 :  =2660
=001 :  =3232
=001 :  =3460
In summary, the main components of a Neyman-Pearson (N-P) test
{ (X) 1 ()} are given below.

23

Components of a Neyman-Pearson (N-P) test
(i)
a prespecified statistical model: M (x)
(ii) a null (0 ) and the alternative (1 ) within M (x),
(iii) a test statistic (distance function)  (X)
(iv) the distribution of  (X) under 0 
(v) prespecifying the significance level  [.01, .025, .05],
(vi) the rejection region 1 ()
(vii) the distribution of  (X) under 1 
6.2

The N-P Lemma and its extensions

The cornerstone of the Neyman-Pearson (N-P) approach is
the Neyman-Pearson lemma. Contemplate the simple generic statistical model:
M (x)={ (x; )} ∈Θ:={0  1 }} x∈R 
(22)

and consider the problem of testing the simple hypotheses:
0 :  = 0 vs. 1 :  = 1 

(23)
¥ The fact that the assumed parameter space is Θ:={0  1 } and (23)
constitute a partition, is often left out from most statistics textbook
discussions of this famous lemma!
Existence. There is exists an -significance level Uniformly Most
Powerful (UMP) [-UMP] test, whose generic form is:
(X)=(  (x;1 ) )
 (x;0 )

1 ()={x: (x)   }

(24)

where () is a monotone function.
Sufficiency. If an -level test of the form (24) exists, then it is UMP for
testing (23).
Necessity. If {(X) 1 ()} is -UMP test, then it will be given by (24).
At first sight the N-P lemma seems rather contrived because it is an
existence result for a simple statistical model M (x) whose parameter space is artificial Θ:={0  1 }, but fits perfectly into the archetypal
formulation. In addition, to implement this result one would need to
find () yielding a test statistic (X) whose distribution is known under
24

both 0 and 1 . Worse, this lemma is often misconstrued as suggesting that for an -UMP test to exist one needs to conﬁne testing to
simple-vs-simple cases even when Θ is uncountable!
¥ The truth of the matter is that the construction of an -UMP
test in more realistic cases has nothing to do with simple-vs-simple
hypotheses, but instead it is invariably based on the archetypal N-P
testing formulation in (13), and relies primarily on monotone likelihood
ratios and other features of the prespeciﬁed statistical model M (x).
Example. To illustrate these comments consider the simple-vssimple hypotheses:
(i) (1-1): 0 : =0 vs. 1 : =1 
(25)
in the context of a simple Normal (one parameter) model (table 3).
Applying the N-P lemma requires setting up the ratio:
©
ª
 (x;1 )

= exp 2 (1 − 0 )  − 22 (2 − 2 ) 
(26)
1
0
 (x;0 )

which is clearly not a test statistic, as it stands. However, there exists
a monotone function () which transforms (26) into a familiar test
statistic (Spanos, 1999, pp. 708-9):
h
i √
(x;1 )
 (x;1 )
1

1
(X)=( (x;0 ) )= ( 1 ) ln(  (x;0 ) )+ 2 = ( −0 ) 
with ascertainable type I and II error probabilities based on:
=

=

(X) v 0 N(0 1) (X) v 1 N( 1  1)  1 =

√
(1 −0 )



In this particular case, it is clear that the Neyman-Pearson lemma
does not apply because the hypotheses in (25) do not constitute a partition of the parameter space Θ=R, √ in (13). However, it turns out
as

that when the test statistic (X)= ( −0 ) is combined with information relating to the values 0 and 1 , it can provide the basis for
constructing several optimal tests (Lehmann, 1986):


[1] For 1  0  the test  :={(X) 1 ()} where

1 ()={x: (x)   } is a -UMP for the hypotheses:
(ii) (1-s≥): 0 :  ≤ 0 vs.

1 :   0 

(iii) (1-s): 0 :  = 0 vs.

1 :   0 

25

Despite the difference between (ii) and (iii), the relevant error probabilities coincide. The type I error probability for (ii) is defined as the
maximum over all  ≤ 0 , but coincides with that of (iii) (=0 ):
= max≤0 P((X)  ; 0 ())=P((X)   ; =0 )

Similarly, the p-value is the same because:
 (x0 )= max P((X)(x0 ); 0 ())=P((X)(x0 ); =0 )
≤0



[2] For 1  0  the test  :={(X) 1 ()} where

1 ()={x: (x)   } is a -UMP for the hypotheses:

(iv) (1-s≤): 0 :  ≥ 0 vs.

1 :   0 

(v) (1-s): 0 :  = 0 vs. 1 :   0 
Again, the type I error probability and p-value are defined by:
= max P((X)   ; 0 ())=P((X)  ; =0 )
≥0

 (x0 )= max P((X)(x0 ); 0 ())=P((X)(x0 ); =0 )
≥0

The existence of these -UMP tests extends the N-P lemma to more
realistic cases by invoking two regularity conditions:
¥ [A] The ratio (26) is a monotone function of the statistic    in
the sense that for any two values 1 0  (x;1 ) changes monotonically
(x;0 )
 (x;1 )
with    This implies that  (x;0 )  if and only if    for some 
This regularity condition is valid for most statistical models of interest in practice, including the one parameter Exponential family of
distributions [Normal, Gamma, Beta, Binomial, Negative Binomial,
Poisson, etc.], the Uniform, the Exponential, the Logistic, the Hypergeometric etc.; Lehmann (1986).
¥ [B] The parameter space under 1  say Θ1  is convex, i.e. for any
two values (1  2 ) ∈Θ1  their convex combinations 1 + (1 − )2 ∈Θ1 
for any 0 ≤  ≤ 1
When convexity does not hold, like the 2-sided alternative:
(vi) (2-s): 0 :  = 0 vs.

1 :  6= 0 

[3] the test  :={(X) 1 ()} 1 ()={x: |(x)|    }
2
is -UMPU (Unbiased); the -level and p-value are:
=P(|(X)|    ; =0 ) q (x0 )=P(|(X)| |(x0 )|; =0 )
2
26

6.3

Constructing optimal tests: Likelihood Ratio

The likelihood ratio test procedure can be viewed as a generalization
of the Neyman-Pearson lemma to more realistic cases where the null
and/or the alternative are composite hypotheses.
(a) The hypothesis of interest are of the general form:
0 : θ∈Θ0 vs. 1 : θ∈Θ1 
(b) The test statistic is related to the ‘likelihood’ ratio:
max∈Θ
 (X)= max∈Θ (;X) =
(;X)
0


(;X)

(;X)

(27)

Note that the max in the numerator is over all θ∈Θ [yielding the MLE
b
θ], but that of the denominator is conﬁned to all values under 0 :
e
θ∈Θ0 [yielding the constrained MLE θ].
(c) The rejection region is deﬁned in terms of a transformed ratio,
where () is chosen to yield a known sampling distribution under 0 :
1 ()= {x:  (X)= ( (X))   } 

(28)

In terms of procedures to construct optimal tests, the LR procedure
can be shown to yield several optimal tests; Lehmann (1986). Moreover,
in cases where no such () can be found, one can use the asymptotic
distribution which states that under certain restrictions (Wilks, 1938):
³
´
b X− ln (θ; X) 0 2 ()
e
v
2 ln  (X) =2 ln (θ;




0
where ∼ reads “under 0 is asymptotically distributed as” and  de
notes the number of restrictions involved in Θ0 
Example. Consider testing the hypotheses:

0 :  2 =2 vs. 1 :  2 6=2 
0
0
in the context of the simple Normal model (table 1). From the previous
chapter we know that the Maximum Likelihood method gives rise to:
P
P
b
ˆ 1
b 1

max(θ; x)=(θ; x) ⇒  =  =1  and  2 =  =1 ( −ˆ  )2 
∈Θ

P
e
max(θ; x)=(θ; x) ⇒  =  =1  and  2 =2 
ˆ 1
0
∈Θ0

27

Moreover, the two estimated likelihoods are:

¡
2 ¢− 2
max(θ; x)= 2b


∈Θ
n
o

2

2 −2
max(θ; x)= (2 0 ) exp − 22 
∈Θ0

0

Hence, the likelihood ratio is:
−
2

2

(;X) (20 )
 (X)= (;X) =


exp{−(2 22 )}

0
−
2


(22 )

h 2
i
2


2

= ( 2 ) exp{−( 2 )+1} 
0

0

The function () that transforms this ratio into a test statistic is:
2



0

(X)=( (X))=( 2 ) v 2 (−1)

0

and a rejection region of the form:
1 ={x: (X)  2 or (X)  1 }

where for a size  test, the constants 1  2 are chosen in such as way
R∞
R 1
so as:
() = 2 () =  
0
2
() denotes the chi-square density function. The test defined by
{(X) 1 ()} turns out to be UMP Unbiased (UMPU) (see Lehmann,
1986).
6.4

Substantive vs. statistical significance in N-P testing

We consider two substantive values of interest that concern the ratio
of Boys (B) to Girls (G) in newborns:
Arbuthnot: #B=#G, Bernoulli: 18B to 17G.

(29)

Statistical hypotheses. The Arbuthnot and Bernoulli conjectured
values for the proportion can be embedded into a simple Bernoulli
model (table 2), where =P(=1)=P(Boy):
Arbuthnot:  = 1 
2

Bernoulli:  =

18

35

(30)

How should one proceed to appraise  and  vis-a-vis the above data?
The archetypal specification (13) suggests probing each value separately using the difference ( −  ) as the discrepancy of interest.
28

Despite the fact that on substantive grounds the only relevant values of
 are (  ), on statistical inference grounds, the rest of the parameter
space is relevant; M (x) provides the relevant inductive premises.
This argument, however, has been questioned in the literature on the
grounds that N-P testing should probe these two values using simple-vssimple hypotheses? After all, the argument goes, the Neyman-Pearson
lemma would secure a -UMP test!
This argument shows insuﬃcient understanding of the NeymanPearson lemma, because the parameter space of the simple Bernoulli
model is not Θ={   } but Θ=[0 1] Moreover, the fact that the
simple Bernoulli model has a monotone likelihood ratio, i.e.:
³
´ ³
´ 
=1
 (x;1 )
1 (1−0 )
= (1−1 )
 for (0  1 ) s.t. 00 1 1
 (x;0 )
(1−0 )
0 (1−1 )
P
is a monotonically increasing function of the statistic   = =1 
since 1 (1−0 )  1 ensuring the existence of several -UMP tests.
0 (1−1 )


In particular, [i]  :={(X) C1 ()}:
√
(  −0 ) 0

(X)= √

0 (1−0 )


v Bin (0 1; )  C1 ()={x: (x)   }

(31)

is a -UMP test for the N-P hypotheses:
(ii) (1-s≥): 0 :  ≤ 0 vs. 1 :   0 
(iii) (1-s): 0 :  = 0 vs. 1 :   0 


and [ii]  :={(X) C1 ()}:
√
(  −0 ) 0

(X)= √

0 (1−0 )


v Bin (0 1; )  C1 ()={x: (x)   }

(32)

(iv) (1-s≤): 0 :  ≥ 0 vs. 1 :   0 
(v) (1-s): 0 :  = 0 vs. 1 :   0 
is a -UMP test for the N-P hypotheses. These results stem from the
fact that the formulations (ii)-(v) ensure the convexity of the parameter
space under 1 
Moreover, in the case of a two-sided hypotheses:
(i) (2-s): 0 :  = 0 vs.
29

1 :  6= 0 

the test [iii]  :={(X) C1 ()}:
√



0
(
(X)= √  −0 ) v Bin (0 1; )  C1 ()={x: |(x)|   }

0 (1−0 )

(33)

is a -UMP, Unbiased; see Lehmann (1986).
¥ Randomization? No! In cases where the test statistic has a
discrete sampling distribution under 0  as in (32)-(33), one might
not be able to deﬁne  exactly.
The traditional way of dealing with this problem is to randomize,
but this ‘solution’ raises more problems than it solves. Instead, the
best way to address the discreteness issue is to select a value  that
is attained for the given  or approximate the discrete distribution
with a continuous one to circumvent the problem. In practice, there
are several continuous distributions one can use, depending on whether
the original distribution is symmetric or not. In the above case, even for
moderately small size sizes, say =20 the Normal distribution provides
an excellent approximation for values of  around .5; see graph. With
=30762, the approximation is nearly exact.
D is tr ib u tio n P lo t
0 .2 0

D is tr ib u tio n
B in o m ia l

n
20

D is tr ib u tio n
N o rm al

p
0.5
M ean
10

S tD e v
2.236

Density

0 .1 5

0 .1 0

0 .0 5

0 .0 0

2

4

6

8

10
X

12

14

16

18

Normal approx. of Binomial:  (; =5 =20)
What about the choice between one-sided vs. two-sided?
The answer depends crucially on whether there exists reliable substantive information that renders part of the parameter space Θ irrelevant for both substantive and statistical purposes.
6.5

N-P testing of Arbuthnot’s value

In this case there is reliable substantive information that the probability
of the newborn being a male, =P(=1), is systematically slightly
above 1 ; the human sex ratio at birth is approximately  =512 (Hardy,
2
30

2002). Hence, a statistically more informative way to assess =5 might
be to use one-sided (1-s) directional probing.
Data: =30762 newborns during the period 1993-5 in Cyprus, out
of which 16029 were boys and 14833 girls.
In view of the huge sample size it is advisable to choose a smaller
signiﬁcance level, say =01 ⇒  =2326
Case 1. For testing Arbuthnot’s value 0 = 1 the relevant hypotheses
2
are:
(1-s): 0 :  = 1 vs. 1 :   1 
2
2
(34)
1
or (1-s*≤): 0 :  ≤ 2 vs. 1 :   1 
2


and  := {(X) C1 ()} is a -UMP test. In view of the fact that:

(x0 )=

√
30762( 16029 − 1 )
30762 2

√

5(5)

=7389

(35)

(x0 )=P((X)7389; 0 ) = 00000
the null hypothesis 0 : = 1 is strongly rejected.
2
What if one were to ignore the substantive information and apply a
2-sided test?
Case 2. For testing Arbuthnot’s value this takes the form:
(2-s): 0 :  =

1
2

vs.

1 :  6= 1 
2

(36)

 :={(X) C1 ()} in (33) is a -UMP test; for =05   =2576
2
In view of the fact that:
(x0 )=

√
30762( 16029 − 1 )
30762 2

√

5(5)

=7389  2576

q (x0 )=P(|(X)|  7389; 0 )=00000
the null hypothesis 0 : = 1 is strongly rejected.
2
6.6

N-P testing of Bernoulli’s value

The ﬁrst question we need to answer is the appropriate form of the
alternative in testing Bernoulli’s conjecture = 18 .
35
Using the same substantive information that indicates  is systematically slightly above 1 (Hardy, 2002), it seems reasonable to use a
2
1-sided alternative in the direction of 1 .
2
Case 1. For probing Bernoulli’s value the relevant hypotheses are:
(1-s): 0 :  =

18
35

31

vs. 1 : 1 

18

35

(37)



The optimal test (UMP) for (37) is  := {(X) C1 ()}, where for
=01 ⇒  = −2326. Using the same data:
√

16029

18

30762(
− )
(x0 )= √ 18 30762 35 =2379  (x0 )=P((X)  2379; 0 )=991
18
35 (1− 35 )

the null hypothesis 0 : = 18 is accepted with a high p-value.
35
What if one were to ignore the substantive information as pertaining
only to Arbuthnot’s value, and apply a 2-sided test?
Case 2. For testing Bernoulli’s value using the 2-s probing:
(2-s): 0 :  = 18 vs. 1 :  6= 18 
(38)
35
35
based on the UMPU test  :={(X) C1 ()} in (33):
√
30762( 16029 − 18 )
30762 35

(x0 )= √ 18

18
35 (1− 35 )

=2379  2576

(39)

q (x0 )=P(|(X)|  2379; 0 )=0174
yields accepting 0 , but the p-value is considerably smaller than
 (x0 )=991 for the 1-sided test.
The question that arises is which one is the relevant p-value?
Indeed, a moments reﬂection suggests that these two are not the
only choices. There is also the p-value for:
(1-s): 0 :  = 18 vs. 1 :   18 
(40)
35
35
which yields a much smaller p-value:
(x0 )=P((X)  2379; 0 )=009

(41)

reversing the previous results and leading to rejecting 0 .
¥ This raises the question:
On what basis should one choose the relevant p-value?
The short answer ‘on substantive grounds’ seems questionable because,
ﬁrst, it ignores the fact that the whole of the parameter space is relevant on statistical grounds, and second, one applies a test to assess
the validity of substantive information, not foist it on the data at
the outset. Even more questionable is the argument that one should
choose a simple alternative on substantive grounds in order to apply the
Neyman-Pearson lemma that guarantees a -UMP test; this is based
on a misapprehension of the lemma.
The longer answer, that takes account the fact that post-data (x0 )
is known, and thus there is often a clear direction of departure indicated
by data x0 associated with the sign of (x0 ) will be elaborated upon
in section 8.
32

7

Foundational issues raised by N-P testing

Let us now appraise how successful the N-P framework was in addressing the perceived weaknesses of Fisher’s testing:
[a] his ad hoc choice of a test statistics (X)
[b] his use of a post-data [(x0 ) is known] threshold for the -value
that indicates discordance (reject) with 0  and
[c] his denial that (x0 ) can indicate accordance with 0 
Very briefly, the N-P framework partly succeeded in addressing
[a] and [c], but did not provide a coherent evidential account that
answers the basic question (Mayo, 1996):
when do data x0 provide evidence for or against
(42)
a hypothesis or a claim ?
This is primarily because one could not interpret the decision results
‘Accept 0 ’ (‘Reject 0 ’) as data x0 provide evidence for (against) 0
[and thus provide evidence for 1 ]. Why?
(i) A particular test  :={(X) C1 ()} could have led to ‘Accept
0 ’ because the power of that test to detect an existing discrepancy
 was very low; the test had no generic capacity to detect an existing
discrepancy. This can easily happen when the sample size  is small.
(ii) A particular test  :={(X) C1 ()} could have led to ‘Reject
0 ’ simply because the power of that test was high enough to detect
‘trivial’ discrepancies from 0 , i.e. the generic capacity of the test was
extremely high. This can easily happen when the sample size  is very
large.
In light of the fact that the N-P framework did not provide a bridge
between the accept/reject rules and what scientific inquiry is after,
i.e. when data x0 provide evidence for or against a hypothesis ,
it should come as no surprise to learn that most practitioners sought
such answers by trying unsuccessfully (and misleadingly) to distill an
evidential interpretation out of Fisher’s p-value. In their eyes the predesignated  is equally vulnerable to manipulation [pre-designation to
keep practitioners ‘honest’ is unenforceable in practice], but the p-value
is more informative and data-specific.
Is Fisher’s p-value better in answering the question (42)? No. A
small (large) — relative to a certain threhold  — p-value could not
be interpreted as evidence for the presence (absence) of a substantive
33

discrepancy  for the same reasons as (i)-(ii). The generic capacity
of the test (its power) affects the p-value. For instance, a very small
p-value can easily arise in the case of a very large sample size .
Fallacies of Acceptance and Rejection
The issues with the accept/reject 0 and p-value results raised above
can be formalized into two classic fallacies.
(a) The fallacy of acceptance: no evidence against 0 is misinterpreted
as evidence for 0 .
This fallacy can easily arise in cases where the test in question has
low power to detect discrepancies of interest.
(b) The fallacy of rejection: evidence against 0 is misinterpreted as
evidence for a particular 1 .
This fallacy arises in cases where the test in question has high power
to detect substantively minor discrepancies. Since the power of a test
increases with the sample size  this renders N-P rejections, as well as
tiny p-values, with large  highly susceptible to this fallacy.
In the statistics literature, as well as in the secondary literatures in
several applied fields, there have been numerous attempts to circumvent
these two fallacies, but none succeeded.

8

Post-data severity evaluation

A moment’s reflection about the inability of the accept/reject 0 and
p-value results to provide a satisfactory answer to the above question
in (42) suggests that it is primarily due to the fact that there is a
problem when such results are detached from the test itself, and are
treated as providing the same evidence for a particular hypothesis 
(0 or 1 ), regardless of the generic capacity (the power) of the test
in question. That is, whether data x0 provide evidence for or against a
particular hypothesis  (0 or 1 ) depends crucially on the generic
capacity of the test (power) in question to detect discrepancies from
0 . This stems from the intuition that a small p-value or a rejection
of 0 based on a test with low power (e.g. a small ) for detecting a
particular discrepancy  provides stronger evidence for the presence of a particular discrepancy  than using a test with much higher
power (e.g. a large ). Mayo (1996) proposed a frequentist evidential
account based on harnessing this intuition in the form of a post-data
34

severity evaluation of the accept/reject results. This is based on
custom-tailoring the generic capacity of the test to establish the discrepancy  warranted by data x0 . This evidential account can be used
to circumvent the above fallacies, as well as other (misplaced) charges
against frequentist testing.
The severity evaluation is a post-data appraisal of the accept/reject
and p-value results that revolves around the discrepancy  from 0
warranted by data x0 
¥ A hypothesis  passes a severe test  with data x0 if:
(S-1) x0 accords with , and
(S-2) with very high probability, test  would have produced a result
that accords less well with  than x0 does, if  were false.
Severity can be viewed as an feature of a test  as it relates to
a particular data x0 and a speciﬁc claim  being considered. Hence,
the severity function has three arguments,  ( x0  ) denoting the
severity with which  passes  with x0 ; see Mayo and Spanos (2006).
Example. To explain how the above severity evaluation can be
applied in practice, let us return to the problem of assessing the two
substantive values of interest:
Arbuthnot: = 1  Bernoulli:  = 18 
2
35
When these values are viewed in the context of the error statistical perspective, it becomes clear that the way to frame the probing is to choose
one of the values as the null hypothesis, and let the diﬀerence between
them ( − ) represent the discrepancy of substantive interest.
Data: =30762 newborns during the period 1993-5 in Cyprus,
16029 are boys and 14833 girls.
Let us return to probing the Arbuthnot value  based on:
(1-s): 0 : 0 = 1 vs. 1 : 0  1
2
2
that gave rise to the observed test statistic:
(x0 )=

√
30762( 16029 − 1 )
30762 2

√

5(5)

=7.389,

(43)

leading to a rejection of 0 at =01 ⇒ =2326 Indeed, 0 would
have also been rejected by a (2-s) test since =2576.
An important feature of the severity evaluation is that it is postdata, and thus the sign of the observed test statistic (x0 ) provides
35

directional information that indicates the directional inferential claims
that ‘passed’. In relation to the above example, the severity ‘accordance’ condition (S-1) implies that the rejection of 0 = 1 with (x0 )=7389 
2
0, indicates that the form of the inferential claim that ‘passed’ is of the
generic form:
  1 = 0 + for some  ≥ 0
(44)
The directional feature of the severity evaluation is very important
in addressing several criticisms of N-P testing, including the potential
arbitrariness and possible abuse of:
[a] switching between one-sided, two-sided or simple-vs-simple hypotheses,
[b] interchanging the null and alternative hypotheses, and
[c] manipulating the level of signiﬁcance in an attempt to get the
desired testing result.
To establish the particular discrepancy  warranted by data x0 , the
severity post-data ‘discordance’ condition (S-2) calls for evaluating the
probability of the event: "outcomes x that accord less well with   1
than x0 does", i.e. [x: (x) ≤ (x0 )] yielding:

 ( ;   1 ) = P(x : (x) ≤ (x0 );   1 is false)

(45)

The evaluation of this probability is based on the sampling distribution:
√

ˆ

=

(
(X)= √  −0 ) v1 Bin ((1 )  (1 ); )  for 1  0
0 (1−0 )

(46)

√

(1 )= √(1 −0 ) ≥ 0  (1 )= 1 (1−1 )  0   (1 ) ≤ 1
0 (1−0 )
0 (1−0 )

Since the hypothesis that ‘passed’ is of the form   1 =0 +, the

primary objective of SEV( ;   1 ) is to determine the largest discrepancy  ≥ 0 warranted by data x0 .
Table 4 evaluates SEV(;   1 ) for diﬀerent values of  with the
evaluations based on the standardized (X):
[(X)−(1 )] =1

√

 (1 )

v Bin (0 1; ) ' N(0 1)

For example, when 1 = 5142 :
√ ˆ
( −0 )

√

0 (1−0 )

− (1 )=7389 −
36

√
30762(5142−5)

√

5(5)

=2408

(47)

and Φ( ≤ 2408)=992 where Φ() is the cumulative distribution function (cdf) of the standard Normal distribution. Note that the scaling
p
 (1 )=9996 ' 1 can be ignored.
Table 4: Reject 0 :  = 5 vs. 1 :   5
Relevant claim
Severity

1 =[5+]
P(x: (X)≤(x0 ); 1 )
010
  0510
999
013
  0513
997
0142
  05142
992
0143
  05143
991
015
  0515
983
016
  0516
962
017
  0517
923
018
  0518
859
020
  0520
645
021
  0521
500
025
  0525
084
An evidential interpretation. Taking a very high probability,
say 95 as a threshold, the largest discrepancy from the null warranted
by this data is:

 ≤ 01637 since  ( ;   51637) = 950
Is this discrepancy substantively significant? In general, to answer this
question one needs to appeal to substantive subject matter information to assess the warranted discrepancy on substantive grounds. In
human biology it is commonly accepted that the sex ratio at birth is
approximately 105 boys to 100 girls; see Hardy (2002). Translating this
ratio in terms of  yields  = 105 =5122 which suggests that the above
205
warranted discrepancy  ≥ 01637 is substantively significant since this
outputs  ≥ 5173 which exceeds  . In light of that, the substantive
discrepancy of interest between  and  :
 ∗ =(1835) − 5 = 0142857

yields SEV( ;   5143)=991 indicating that there is excellent evidence for the claim   1 =0 + ∗ .
I In terms of the ultimate objective of statistical inference, learning
from data, this evidential interpretation seems highly effective because
it narrows down an infinite set Θ to a very small subset!
37

8.1

Revisiting issues pertaining to N-P testing

The above evidential interpretation based on the severity assessment
can also be used to shed light on a number of issues raised in the
previous sections.
(a) The arbitrariness of N-P hypotheses specification
The question that naturally arises at this stage is whether having
good evidence for inferring   18 depends on the particular way one
35
has chosen to specify the hypotheses for the N-P test. Intuitively,
one would expect that the evidence should not be dependent on the
specification of the null and alternative hypotheses as such, but on x0
and the statistical model M (x) x∈R .

To explore that issue let us focus on probing the Bernoulli value:
0 : 0 = 18  where the test statistic yields:
35
√

16029

18

30762(
− )
(x0 )= √ 18 30762 35 =2379
18
35 (1− 35 )

(48)

This value leads to rejecting 0 at =01 when one uses a (1-s) test
[ =2326], but accepting 0 when one uses a (2-s) test [ =2576].
Which result should one believe?
Irrespective of these conflicting results, the severity ‘accordance’
condition (S-1) implies that, in light of the observed test statistic
(x0 )=23790, the directional claim that ‘passed’ on the basis of (48)
is of the form:
  1 =0 +
To evaluate the particular discrepancy  warranted by data x0 , condition (S-2) calls for the evaluation of the probability of the event:
‘outcomes x that accord less well with   1 than x0 does’, i.e. [x:
(x) ≤ (x0 )] giving rise to (45), but with a different 0 .

Table 5 provides several such evaluations of SEV( ;   1 ) for
different values of ; note that these evaluations are also based on (46).
A number of negative values of  are included in order to address the
original substantive hypotheses of interest, as well as bring out the fact
that the results of table 4 and 5 are identical when viewed as inferences
pertaining to ; they simply have a different null values 0  and thus
different discrepancies, but identical 1 values.
38

The severity evaluations in tables 4 and 5, not only render the choice
between (2-s), (1-s) and simple-vs-simple irrelevant, they also scotch
the well-rehearsed argument pertaining to the asymmetry between the
null and alternative hypotheses. The N-P convention of selecting a
small  is generally viewed as reflecting a strong bias against rejection
of the null. On the other hand, Bayesian critics of frequentist testing
argue the exact opposite: N-P testing is biased against accepting the
null; see Lindley (1993). The evaluations in tables 4 and 5 demonstrate
that the evidential interpretation of frequentist testing based on severe
testing addresses all asymmetries, notional or real.
Table 5: Accept/Reject 0 : = 18 vs. 1 : ≷ 18
35
35
Relevant claim
Severity

1 =[5143+] P(x: (x)≤(x0 ); 1 )
−0043
  0510
999
−0013
  0513
997
−0001
  05142
992
0000
  05143
991
0007
  0515
983
0017
  0516
962
0027
  0517
923
0037
  0518
859
0057
  0520
645
0067
  0521
500
0107
  0525
084
(b) Addressing the Fallacy of Rejection
The potential arbitrariness of the N-P specification of the null and
alternative hypotheses and the associated p-values is brought out in
probing the Bernoulli value  = 18 using different alternatives:
35
(1-s):  (x0 )=P((X)  2379; 0 )=991
(2-s):

q (x0 )=P(|(X)|  2379; 0 )=017

(1-s):  (x0 )=P((X)  2379; 0 )=009
Can the above severity evaluation explain away these highly conflicting and
confusing results?
39

The ﬁrst choice 1 :   18 was driven solely by substantive infor35
mation relating to = 1 , which is often a bad idea because it ignores
2
the statistical dimension of inference. The second choice was based on
lack of information about the direction of departure, which makes sense
pre-data, but not post-data. The third choice reﬂects the post-data direction of departure indicated by (x0 )=2379  0 In light of that, the
severity evaluation renders  (x0 )=009 the relevant p-value, after all!
¥ The key problem with the p-value. In addition, the severity
evaluation brings out the vulnerability of both the N-P reject rule and
the p-value to the fallacy of rejection in cases where  is large. Viewing
the relevant p-value ((x0 )=009) from the severity vantage point, it
is directly related to 1 passing a severe test because: the probability
that test  would have produced a result that accords less well with
1 than x0 does (x: (x)  (x0 )), if 1 were false (0 true):

Sev( ; x0 ; 0 ) =P((X)  (x0 );  ≤ 0 ) =

=1−P((X)(x0 ); =0 )=991

is very high; Mayo (1996). The key problem with the p-value, however,
is that is establishes the existence of some discrepancy  ≥ 0 but
provides no information concerning the magnitude licensed by x0  The

severity evaluation remedies that by relating Sev( ; x0 ;   0 )=991
to the inferential claim   18  with the implicit discrepancy associated
35
with the p-value being =0.
(c) Addressing the Fallacy of Acceptance
Example. Let us return to the hypotheses:
(1-s): 0 : 0 = 1 vs. 1 : 0  1
2
2
in the context of the simple Bernoulli model (table 2), and consider

applying test  with =025 ⇒  =196 to the case where data x0
are based on =2000 and yield =0503:
√
√
(x0 )= 2000(503−5) =268 P((x)  268; 0 ) = 394
5(5)

(x0 ) = 268 leads to accepting 0 , and the p-value indicates no rejection of 0 
Does this mean that data x0 provide evidence for 0 ? or more accurately, does x0 provide evidence for no substantive discrepancy from
0 ? Not necessarily!
40

To answer that question one needs to apply the post-data severe
testing reasoning to evaluate the discrepancy  ≥ 0 for 1 =0 + war

ranted by data x0 . Test  :={(X) 1 ()} passes 0 , and the idea is
to establish the smallest discrepancy  ≥ 0 from 0 warranted by data
x0 by evaluating (post-data) the claim  ≤ 1 for 1 =0 +
Condition (S-1) of severity is satisﬁed since x0 accords with 0 because (x0 )  , but (S-2) requires one to evaluate the probability

of "all outcomes x for which test  accords less well with 0 than
x0 does", i.e. (x: (x)  (x0 )) under the hypothetical scenario that
" ≤ 1 is false" or equivalently "  1 is true":

 ( ; x0 ;  ≤ 1 )=P(x: (x)  (x0 );   1 ) for  ≥ 0

(49)

The evaluation of (49) relies on the sampling distribution (46), and
for diﬀerent discrepancies ( ≥ 0) yields the results in table 6. Note

that Sev( ; x0 ;  ≤ 1 ) is evaluated at =1 because the probability
increases with .
The numerical evaluations are based on the standardized (X) from
(47). For example, , when 1 = 515 :
√ ˆ

(
√  −0 ) − (1 )=268 −
0 (1−0 )

√
2000(515−5)

√

5(5)

= −1074

Φ(  −1074)=859 where Φ() is the cumulative distribution function
q
p
(cdf) of N(0 1). Note that the scaling factor  (1 )= 515(1−515) =
5(5)
9996 can be ignored because its value is so close to 1.

Table 6 — Severity Evaluation of ‘Accept 0 ’ with ( ; x0 )

 ≤ 1 = 0 + 

Relevant claim:

Sev( ≤ 1 )

.001 .005

.01

.015 .017 .0173 .018

.429 .571 .734 .859 .895

.900

.02

.025

.910 .936 .975 .992

Assuming a severity threshold of, say 90 is high enough, the above
results indicate that the ‘smallest’ discrepancy warranted by x0 , is  ≥
0173 since:
 ( ; x0 ;  ≤ 5173)=9
41

.03

That is, data x0  in conjunction with test   provide evidence (with
severity at least 9) for the presence of a discrepancy as large as  ≥
0173. Is this discrepancy substantively insignificant? No!
As mentioned above, the accepted ratio at birth in human biology
is  =512 which suggests that the warranted discrepancy  ≥ 0173
is, indeed, substantively significant since this outputs  ≥ 5173 which
exceeds  .
This is an example where the post-data severity evaluation can be
used to address the fallacy of acceptance. The severity assessment indicates that the statistically insignificant result (x0 )=268 at =025,
actually provides evidence against 0 :  ≤ 5 and for  ≥ 5173 with
severity at least 90
(d) The arbitrariness of choosing the significance level
It is often argued by critics of frequentist testing that the choice
of the significance level in the context of the N-P approach, or the
threshold for the p-value in the case of Fisherian significance testing,
is totally arbitrary and manipulable.
To see how the severity evaluations can address this problem, let us
try to ‘manipulate’ the original significance level (=01) to a different
one that would alter the accept/reject results for 0 : = 18  Looking
35
back at results for Bernoulli’s conjectured value using the data from
Cyprus, it is easy to see that choosing a larger significance level, say
=02 ⇒  =2053 would lead to rejecting 0 for both N-P formulations:
(2-s) (1 : 6= 18 ) and (1-s) (1 :  18 )
35
35
since:

(x0 )=2379  2053
(2-s): (x0 )=P(|(X)|  2379; 0 )=0174
(1-s):  (x0 )=P((X)  2379; 0 )=009

How does this change the original severity evaluations? The short
answer is: it doesn’t! The severity evaluations remain invariant to
any changes to  because they depend on on (x0 )=2379  0 and
not on   which indicates that the generic form of the hypothesis that
‘passed’ is   1 =0 + and the same evaluations apply.

42

9

Summary and conclusions

In frequentist inference learning from data x0 about the stochastic
phenomenon of interest is accomplished by applying optimal inference
procedures with ascertainable error probabilities in the context of a statistical model:
M (x)={ (x; θ) θ∈Θ} x∈R 
(50)

Hypothesis testing gives rise to learning from data by partitioning
M (x) into two subsets:

M0 (x)={ (x; θ) θ∈Θ0 } or M1 (x)={ (x; θ) θ∈Θ1 } x∈R ?


(51)

framed in terms of the parameter(s):
0 : ∈Θ0  vs. 0 : ∈Θ1 
(52)
and inquiring x0 for evidence in favor one of the two subsets (51) using
hypothetical reasoning.
Note that one could specify (52) more perceptively as:
∈Θ

∈Θ

0
1
z
}|
{
z
}|
{
∗
∗
0 :  (x)∈M0 (x)={ (x; θ) θ∈Θ0 } vs. 1 :  (x)∈M1 (x)={ (x; θ) θ∈Θ1 }

where  ∗ (x)= (x; θ∗ ) denotes the ‘true’ distribution of the sample. This
notation elucidates the basic question posed in hypothesis testing:
I Given that the true M∗ (x) lies within the boundaries of M (x)
can x0 be used to narrow it down to a smaller subset M0 (x)?
A test  :={(X) 1 ()} is deﬁned in terms of a test statistic
((X)) and a rejection region (1 ()), and its optimality is calibrated
in terms of the relevant error probabilities:
type I:

P(x0 ∈1 ; 0 () true) ≤ () for ∈Θ0 

type II: P(x0 ∈0 ; 1 (1 ) true)=(1 ) for 1 ∈Θ1 
These error probabilities specify how often these procedures lead to
erroneous inferences, and thus determine the pre-data capacity (power)
of the test in question to give rise to valid inferences:
(1 )=P(x0 ∈1 ; 1 (1 ) true)=(1 ) for all 1 ∈Θ1 
An inference is reached by an inductive procedure which, with high
probability, will reach true conclusions from valid (or approximately
43

valid) premises (statistical model). Hence, the trustworthiness of frequentist inference depends on two interrelated pre-conditions:
(a) adopting optimal inference procedures, in the context of:
(b) a statistically adequate model.
I How does hypothesis testing related to the other forms of frequentist
inference?
It turns out that testing is inextricably bound up with estimation
(point and interval) in the sense that an optimal estimator is often the
cornerstone of an optimal test and an optimal confidence interval. For
instance, consistent, fully efficient and minimal sufficient estimators
provide the basis for most of the optimal tests and confidence intervals!
Having said that, optimal point estimation is considered rather noncontroversial, but optimal interval estimation and hypothesis testing
have raised several foundational issues that have bedeviled frequentist inference since the 1930s. In particular, the use and abuse of the
relevant error probabilities calibrating these latter forms of inductive
inference have contributed significantly to the general confusion and
controversy when applying these procedures.
Addressing foundational issues
The first such issue concerns the presupposition of the (approximately) true premises, i.e. the true M∗ (x) lies within the boundaries
of M (x) This can be addressed by securing the statistical adequacy
of M (x) vis-a-vis data x0  using trenchant Mis-Specification (MS) testing, before applying frequentist testing; see chapter 15. As
shown above, when M (x) is misspecified, the nominal and actual error probabilities can be very different, rendering the reliability of the
test in question, at best unknown and at worst highly misleading. That
is, statistical adequacy ensures the ascertainability of the relevant error
probabilities, and hence the reliability of inference.
The second issue concerns the form and nature of the evidence
x0 can provide for ∈Θ0 or ∈Θ1  Neither Fisher’s p-value, nor the
N-P accept/reject rules can provide such an evidential interpretation,
primarily because they are vulnerable to two serious fallacies:
(c) Fallacy of acceptance: no evidence against the null is misinterpreted as evidence for it.
(d) Fallacy of rejection: evidence against the null is misinterpreted
as evidence for a specific alternative.
44

As discussed above, these fallacies can be circumvented by supplementing the accept/reject rules (or the p-value) with a post-data evaluation of inference based on severe testing with a view to determined
the discrepancy  from the null warranted by data x0  This establishes
the inferential claim warranted [and thus, the unwarranted ones], in
particular cases, so that it gives rise to learning from data.
The severity perspective sheds ample light on the various controversies relating to p-values vs. type I and II error probabilities, delineating
that the p-value is a post-data and the type I and II are pre-data error probabilities, and although inadequate for the task of furnishing
an evidential interpretation, they were intended to fulﬁll complementary roles. Pre-data error probabilities, like the type I and II are used
to appraise the generic capacity of testing procedures, and post-data
error probabilities, like severity, are used to bridge the gap between
the coarse ‘accept/reject’ and evidence provided by data x0 for the
warranted discrepancy from the null, by custom-tailoring the pre-data
capacity in light of (x0 ).
This reﬁnement/extension of the Fisher-Neyman-Pearson framework can be used to address, not only the above mentioned fallacies,
but several criticisms that have beleaguered frequentist testing since
the 1930s, including:
(e) the arbitrariness of selecting the thresholds,
(f) the framing of the null and alternative hypotheses in the context
of a statistical model M (x), and
(g) the numerous misinterpretations of the p-value.

45

10

Appendix: Revisiting Confidence Intervals (CIs)

Mathematical Duality between Testing and Cls
From a purely mathematical perspective a point estimator is a mapping () of the form:
() : R → Θ

Similarly, an interval estimator is also a mapping () of the form:
(): R → Θ, such that −1 ()={x: ∈(x)}⊂F for ∈Θ.

Hence, one can assign a probability to −1 ()=(x)={x: ∈(x)} in
the sense that:
P (x: ∈(x); =∗ ) :=P (∈(X); =∗ ) 
denotes the probability that the random set (X) contains ∗ -the true
value of  The expression ‘∗ denotes the true value of ’ is a shorthand
for saying that ‘data x0 constitute a realization of the sample X with
distribution  (x; ∗ )’. One can define a (1−) Confidence Interval (CI)
for  by attaching a lower bound to this probability:
P (∈(X); =∗ ) ≥ (1 − )
Neyman (1937) utilized a mathematical duality between the N-P
Hypothesis Testing (HT) and Confidence Intervals (CIs) to put forward
an optimal theory for the latter.
The -level test of the hypotheses:
0 : =0  vs. 1 : 6=0 
takes the form {(X;0 ) 1 (0 ; )} (X;0 )-the test statistic and the
rejection region defined by:
1 (0 ; )={x: |(x;0 )|    } ⊂ R 

2
The complement of the latter with respect to the sample space (R ),

defines the acceptance region:
0 (0 ; )={x: |(x;0 )| ≤   } = R − 1 (0 ; )

2
The mathematical duality between HT and CIs stems from the fact
that for each x∈R  one can define a corresponding set (x;) on the

parameter space by:
(x;)={: x∈0 (0 ; )} ∈Θ e.g. (x;)={: |(x;0 )| ≤  }
2
46

Conversely, for each (x;) there is a CI corresponding to the acceptance region 0 (0 ; ) such that:
0 (0 ; )={x: 0 ∈(x;)} x∈R  e.g. 0 (0 ; )={x: |(x;0 | ≤  }

2
This equivalence is encapsulated by:
for each x∈R x∈0 (0 ; ) ⇔ 0 ∈(x; )

for each 0 ∈Θ

When the lower b (X) and upper bound b (X) of a CI, i.e.


´
³
b (X) ≤  ≤ b (X) =1−

P 

are monotone increasing functions of X the equivalence is:
´
³
´
³
b (x) ≤  ≤ b (x) ⇔ x∈ b−1 () ≤ x ≤ b−1 () 



∈ 
Optimality of Confidence Intervals

The mathematical duality between HT and CIs is particularly useful
in defining an optimal CI that is analogous to an optimal N-P test.
The key result is that the acceptance region of a -significance level
test  that is Uniformly Most Powerful (UMP), gives rise to (1−) CI
that is Uniformly Most Accurate (UMA), i.e. the error coverage probability that (x; ) contains a value  6= 0  where 0 is assumed to be the true value, is minimized; Lehmann (1986). Similarly,
a −UMPU test gives rise to a UMAU Confidence Interval.
Moreover, it can be shown that such optimal CIs have
the shortest length in the following sense:
if [b (x) b (x)] is a (1−)-UMAU CI, then its length [width] is less


than or equal to that of any other (1−) CI, say [e (x) e (x)]:






∗ (b (x) − b (x)) ≤ ∗ (e (x) − e (x))

Underlying reasoning: CI vs. HT

Does this mathematical duality render HT and CIs equivalent in
terms of their respective inference results? No!
47

The primary source of the confusion on this issue is that they share
a common pivotal function, and their error probabilities are evaluated
using the same tail areas. But that’s where the analogy ends.
To bring out how their inferential claims and respective error probabilities are fundamentally different, consider the simple (one parameter)
Normal model (table 3), where the common pivot is:
(X; )=

√
(  −)


v N (0 1) 

(53)

Since  is unknown, (53) is meaningless as it stands, and it crucial to
understand how the CI and HT procedures render it meaningful using
different types of reasoning, referred to as factual and hypothetical,
respectively:
√
∗ TSN

(X;)= ( − ) v N(0 1)
(54)
√
0

(X)= ( −0 ) v N(0 1)
Putting the (1−) CI side-by-side with the (1−) acceptance region:
´
³


∗
P   −  ( √ ) ≤  ≤   +  ( √ ); = =1−
2
2
´
³
(55)


P 0 −  ( √ ) ≤   ≤ 0 +  ( √ ); =0 =1−
2
2

it is clear that their inferential claims are crucially different:
[a] The CI claims that with probability (1−) the random bounds

[  ±   ( √ )] will cover (overlay) the true ∗ [whatever that happens
2
to be], under the TSN.
[b] The test claims that with probability (1−)  test  would yield
a result that accords equally well or better with 0 (x: |(x;0 )| ≤   ),
2
if 0 were true.
Hence, their claims pertain to the parameter vs. the sample space,
and the evaluation under the TSN (=∗ ) vs. under the null (=0 ),
render the two statements inferentially very different. The idea that a
given (known) 0 and the true ∗ are interchangeable is an illusion.
This can be seen most clearly in the simple Bernoulli case where 
P
needs to be estimated using b =  =1  :=  for the UMAU (1−)
 1
CI:
³
´
 (1− )
 (1− )
∗




P   −  ( √ ) ≤     +  ( √ ); = =1−
2
2
48

but the acceptance region of -UMPU test is defined in terms of the
hypothesized value 0 :
´
³
0 (1−0 )
0 (1−0 )
(
√
(
√
P 0 − 2
) ≤    0 + 2
); =0 =1−



The fact is that there is only one ∗ and an infinity of values 0  As
a result, the HT hypothetical reasoning is equally well-defined pre-data
and post data. In contrast, there is only one TSN which has already
played out post-data, rendering coverage error probability degenerate,
i.e. the observed CI:


[ −   ( √ )  +   ( √ )]
2
2

(56)

either includes or excludes the true value ∗ .
Observed Confidence Intervals and Severity
In addition to addressing the fallacies of acceptance and rejection,
the post-data severity evaluation can be used to address the issue of
degenerate post-data error probabilities and the inability to distinguish
between different values of  within an observed CI; Mayo and Spanos
(2006). This is accomplished by making two key changes:
(a) Replace the factual reasoning underlying CIs, which becomes
degenerate post-data, with the hypothetical reasoning underlying the
post-data severity evaluation.
(b) Replace any claims pertaining to overlaying the true ∗ with
severity-based inferential claims of the form:
  1 =0 +  ≤ 1 =0 + for some  ≥ 0
(57)
This can be achieved by relating the observed bounds:

 ±   ( √ )
(58)
2


to particular values of 1 associated with (57), say 1 = −  ( √ ) and
2
evaluating the post-data severity of the claim:

  1 = −  ( √ )
2

A moment’s reflection, however, suggests that the connection between establishing the warranted discrepancy  from the null value 0
and the observed CI (58) is more apparent than real, because the severity evaluation probability is attached to the inferential claim (57), and
not to the particular value 1 . Hence, at best one would be creating
49

the illusion that the severity assessment on the boundary of a 1-sided
observed CI is related to the confidence level; it does not! The truth of
the matter is that the inferential claim (58) does not pertain directly
to ∗  but to the warranted discrepancy  in light of x0 .
Fallacious arguments for using Confidence Intervals
A. Confidence Intervals are more reliable than p-values
It is often argued in the social science statistical literature that CIs
are more reliable than p-values because the latter are vulnerable to the
large  problem; as the sample size  → ∞ the p-value becomes smaller
and smaller. Hence, one would reject any null hypothesis given enough
observations. What these critics do not seem to realize is that a CI
like:
³
´


∗
 ( √ ) ≤  ≤  +  ( √ ); =
P   − 2 
=1−


2
is equally vulnerable to the large  problem because the length of the
interval:
h
i h
i



 ( √ ) −  −  ( √ )
  + 2 
= 2  ( √ )


2
2

shrinks to zero as  → ∞

B. The middle of a Confidence Interval is more probable?
Despite the obvious fact that post-data the probability of coverage
is either zero or one, there have been numerous recurring attempts in
the literature to discriminate among the different values of  within an
observed interval:
“... the difference between population means is much more likely to
be near the middle of the confidence interval than towards the extremes.”
(Gardner and Altman, 2000, p. 22)
This claim is clearly fallacious. A moment’s reflection suggests
that viewing all possible observed CIs (56) as realizations from a single
distribution in (54), leads one to the conclusion that, unless the particular  happens (by accident) to coincide with the true value ∗  values
to the right or the left of  will be more likely depending on whether
 is to the left or the right of ∗  Given that ∗ is unknown, however,
such an evaluation cannot be made in practice with a single realization.
This is illustrated below with a CI for  in the case of the simple (one
parameter) Normal model.
50

1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.

∗
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a
` − − − −  − − − − a

51

A. spanos slides ch14-2013 (4)

A. spanos slides ch14-2013 (4)

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (19)

Andere mochten auch

Andere mochten auch (7)

Ähnlich wie A. spanos slides ch14-2013 (4)

Ähnlich wie A. spanos slides ch14-2013 (4) (20)

Mehr von jemille6

Mehr von jemille6 (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

A. spanos slides ch14-2013 (4)