CHAPTER 14: Frequentist Hypothesis Testing:
a Coherent Account
Aris Spanos [Fall 2013]

1 Inherent difficulties in learning statistical testing
Statistical testing is arguably the most important, but also the most
difficult and confusing chapter of statistical inference for several reasons, including the following.
(i) The need to introduce numerous new notions, concepts and procedures before one can paint — even with a broad brush — a coherent picture of hypothesis testing.
(ii) The current textbook discussion of statistical testing is both
highly confusing and confused. There are several sources of confusion.
I (a) Testing is conceptually one of the most sophisticated sub-fields
of any scientific discipline.
I (b) Inadequate knowledge by textbook writers, who often do not have the technical skills to read and understand the original sources, and have to rely on second-hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of
these textbooks hypothesis testing is poorly explained as an idiot’s
guide to combining off-the-shelf formulae with statistical tables like the
Normal, the Student’s t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the
background.
I (c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact the decision-theoretic perspective has much greater affinity with Bayesian than with frequentist inference.
I (d) A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian statisticians who revel in (unfairly) maligning frequentist inference in their misguided attempt to motivate their preferred viewpoint of statistical inference. You will often hear such Bayesians tell you that "what you really want (require) is probabilities attached to hypotheses" and other such misadvised promptings! In frequentist inference probabilities are always attached to the different possible values of the sample x∈R^n_X; never to hypotheses!

(iii) The discussion of frequentist testing is rather incomplete in so far as it has been beleaguered by serious foundational problems since the 1930s. As a result, different applied fields have generated their own secondary literatures attempting to address these problems, but often making things much worse! Indeed, in some fields like psychology it has reached the stage where one has to correct the 'corrections' of those chastising the initial 'correctors'!
In an attempt to alleviate problem (i), the discussion that follows
uses a sketchy historical development of frequentist testing. To ameliorate problem (ii), the discussion includes ‘red flag’ pointers (¥) designed
to highlight important points that shed light on certain erroneous interpretations or misleading arguments. The discussion will pay special
attention to (iii), addressing some of the key foundational problems.

2 Francis Edgeworth

A typical example of a testing procedure at the end of the 19th century is provided by Edgeworth (1885). From today's perspective his testing takes the form of viewing data x0 := (x11, x12, ..., x1n; x21, x22, ..., x2n) in the context of the following bivariate simple Normal model:

$$\mathbf{X}_t:=\begin{pmatrix}X_{1t}\\X_{2t}\end{pmatrix}\sim\mathsf{NIID}\left(\begin{pmatrix}\mu_1\\\mu_2\end{pmatrix},\begin{pmatrix}\sigma^2&0\\0&\sigma^2\end{pmatrix}\right),\quad t=1,2,\ldots,n,$$

i.e. E(X1t)=μ1, E(X2t)=μ2, Var(X1t)=Var(X2t)=σ², Cov(X1t, X2t)=0.
The hypothesis of interest concerns the equality of the two means:
Hypothesis of interest: μ1 = μ2.
The 2-dimensional sample is: X:=(11  12      1 ; 21  22   2 )
Common sense, combined with the statistical knowledge at the time
suggested using the difference between the estimated means:
P
P
1 − 2  where 1 =  =1 1  2 =  =1 2 
b
b
b 1
b 1

as a basis for deciding whether the two means are different or not.
To render the difference (b1 − 2 ) free of the units of measurement,

b
Edgeworth divided it by its standard deviation:
q


p
P
P
b2 1

b2 1

 (b1 −b2 )= (b2 + 2 )  1 =  (1 −b1 )2   2 =  (2 −b1 )2 
 
 1 b1
=1

2

=1
|ˆ 
to define the distance function: (X)= √1 −ˆ 2 |2 
2
(1 +1 )
 

The last issue he needed to address is how big the observed distance d(x0) should be to justify inferring that the two means are different. Edgeworth argued that (μ̂1 − μ̂2) ≠ 0 could not be 'accidental' (due to chance) if:

$$d(\mathbf{X})=\frac{|\hat{\mu}_1-\hat{\mu}_2|}{\sqrt{\tfrac{1}{n}(\hat{\sigma}_1^2+\hat{\sigma}_2^2)}}>2\sqrt{2}.\qquad (1)$$

I Where did the threshold 2√2 come from? The tail area of N(0,1) beyond ±2√2 is approximately .005, which was viewed as a reasonable value for the probability of a 'chance' error, i.e. the error of inferring a significant discrepancy when none exists. But why Normal? At the time statistical inference relied heavily on large sample size n (asymptotic) results invoking the Central Limit Theorem.
In summary, Edgeworth introduced 3 features of statistical testing:
(i) a hypothesis of interest: μ1 = μ2,
(ii) the notion of a standardized distance: d(X),
(iii) a threshold value for significance: d(X) > 2√2.
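To make the procedure concrete, here is a minimal sketch of Edgeworth's test in Python; the two samples are hypothetical placeholders, and the variance estimators are the biased (divide-by-n) ones used above.

```python
# A sketch of Edgeworth's (1885) testing procedure on two hypothetical samples.
import numpy as np

def edgeworth_test(x1, x2):
    """Standardized distance |mu1_hat - mu2_hat|/SD vs. the threshold 2*sqrt(2)."""
    n = len(x1)
    mu1_hat, mu2_hat = x1.mean(), x2.mean()
    s1, s2 = x1.var(), x2.var()        # numpy default ddof=0: divide-by-n, as in the text
    d = abs(mu1_hat - mu2_hat) / np.sqrt((s1 + s2) / n)
    return d, d > 2 * np.sqrt(2)

rng = np.random.default_rng(0)
x1 = rng.normal(10.0, 2.0, size=50)    # hypothetical sample 1
x2 = rng.normal(10.9, 2.0, size=50)    # hypothetical sample 2
d, significant = edgeworth_test(x1, x2)
print(f"d(x0) = {d:.3f}, exceeds 2*sqrt(2) = {2*np.sqrt(2):.3f}: {significant}")
```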

3 Karl Pearson

The Pearson approach to statistics can be summarized as follows:

Data x0 := (x1, ..., xn) =⇒ Histogram =⇒ Fitting a frequency curve from the Pearson family f(x; θ̂1, θ̂2, θ̂3, θ̂4).
The Pearson family of frequency curves can be expressed in terms of the following differential equation:

$$\frac{d\ln f(x)}{dx}=\frac{(x-\theta_1)}{\theta_2+\theta_3 x+\theta_4 x^2}.\qquad (2)$$

Depending on the values taken by the parameters (θ1, θ2, θ3, θ4), this equation can generate numerous frequency curves such as the Normal, the Student's t, the Beta, the Gamma, the Laplace, the Pareto, etc.

The first four raw data moments μ̂'1, μ̂'2, μ̂'3 and μ̂'4 are sufficient to determine (θ̂1, θ̂2, θ̂3, θ̂4), and that enables one to select the particular f(x) from the Pearson family.
Having selected a particular frequency curve f0(x) on the basis of f(x; θ̂1, θ̂2, θ̂3, θ̂4) = f(x; θ̂), Karl Pearson proposed to assess the appropriateness of this choice. His hypothesis of interest is of the form:

f0(x) = f*(x) ∈ Pearson(θ1, θ2, θ3, θ4), where f*(x) denotes the true density.

Pearson proposed the standardized distance function:

$$\chi^2(\mathbf{X})=\sum_{i=1}^{m}\frac{(O_i-E_i)^2}{E_i}=\sum_{i=1}^{m}\frac{n(\hat{p}(x_i)-p(x_i))^2}{p(x_i)}\ \sim\ \chi^2(m-1),\qquad (3)$$

where (O_i, i=1,2,...,m) and (E_i = n·p(x_i), i=1,2,...,m) denote the empirical and assumed (as specified by f0(x)) frequencies, and introduced the notion of a p-value:

$$\mathbb{P}(\chi^2(\mathbf{X})>\chi^2(\mathbf{x}_0))=p(\mathbf{x}_0),\qquad (4)$$

as a basis for inferring whether the choice of f0(x) was appropriate or not. The smaller the p(x0), the worse the fit.
¥ It is important to note that this hypothesis of interest, f0(x) = f*(x), is not about a particular parameter, but about the adequacy of the choice of the probability model. In this sense, this was the first Mis-Specification (M-S) test for a distributional assumption!
Example. Mendel's cross-breeding experiments (1865-6) were based on pea-plants with different shape and color, which can be framed in terms of two Bernoulli random variables:

X (shape): X(round)=0, X(wrinkled)=1;  Y (color): Y(yellow)=0, Y(green)=1.

His theory of heredity was based on two assumptions:
(i) the two random variables X and Y are independent,
(ii) round (X=0) and yellow (Y=0) are dominant traits, but 'wrinkled' (X=1) and 'green' (Y=1) are recessive traits, with probabilities:

P(X=0)=P(Y=0)=.75, P(X=1)=P(Y=1)=.25.

This substantive theory gives rise to Mendel's model based on the bivariate distribution below:
 x\y      y=0     y=1    fx(x)
 x=0     .5625   .1875   .750
 x=1     .1875   .0625   .250
 fy(y)    .750    .250   1.00

Mendel's theory model f(x, y)
Data. The 4 gene-pair experiments Mendel carried out with n=556 gave rise to the observed frequencies and relative frequencies given below:

          (R,Y)  (R,G)  (W,Y)  (W,G)
          (0,0)  (0,1)  (1,0)  (1,1)
 counts    315    108    101     32
Observed frequencies

 x\y      y=0     y=1    f̂x(x)
 x=0     .5666   .1942   .7608
 x=1     .1817   .0576   .2393
 f̂y(y)   .7483   .2518   1.001
Observed relative frequencies

How adequate is Mendel's theory in light of the data? One can answer that question using Pearson's goodness-of-fit chi-square test. The chi-square test statistic

$$\chi^2(\mathbf{X})=\sum_{i=1}^{m}\frac{n(\hat{p}(x_i)-p(x_i))^2}{p(x_i)}$$

compares how close the observed relative frequencies are to the expected ones.
Using the above data this test statistic yields:

$$\chi^2(\mathbf{x}_0)=556\left(\tfrac{(.5666-.5625)^2}{.5625}+\tfrac{(.1942-.1875)^2}{.1875}+\tfrac{(.1817-.1875)^2}{.1875}+\tfrac{(.0576-.0625)^2}{.0625}\right)=.463.$$

In light of the fact that the tail area of χ²(3) yields P(χ²(X) > .470) = .927, the result suggests accordance with Mendel's theory.
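As a quick check, the statistic and its tail area can be reproduced from the raw counts; the sketch below uses scipy only to evaluate the χ²(3) tail probability.

```python
# Pearson's goodness-of-fit test for Mendel's data (counts from the text).
import numpy as np
from scipy.stats import chi2

observed = np.array([315, 108, 101, 32])         # (R,Y), (R,G), (W,Y), (W,G)
expected = 556 * np.array([9, 3, 3, 1]) / 16     # Mendel's 9:3:3:1 probabilities

stat = np.sum((observed - expected) ** 2 / expected)   # Pearson's distance (3)
p_value = chi2.sf(stat, df=3)                          # tail area of chi-square(3)
print(f"chi2(x0) = {stat:.3f}, p-value = {p_value:.3f}")   # about .47 and .93
```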
Karl Pearson's primary contributions to testing were:
(a) the broadening of the scope of the hypothesis of interest, initiating misspecification testing with a goodness-of-fit test,
(b) a distance function whose distribution is known (at least asymptotically), and
(c) the use of the tail probability as a basis for deciding how good the fit with data x0 is.

4 William Gosset (aka 'Student')

Gosset’s 1908 seminal paper provided the cornerstone upon which Fisher
founded modern statistical inference. At that time it was known that in
the case when (1  2    ) are NIID (the simple Normal model),
P
1
the MLE estimator  :=  =  =1  had the following ‘sampling’
b
distribution:
³ 2´
√
  v N   ⇒ (  −) v N (0 1) 



It was also known that in the case where 2 is replaced by its estimator:
P
1
2 = −1 =1 ( −ˆ  )2 

√
(  −)


the distribution of
but asymptotically it can

√
?
(  −)
is unknown, i.e.
vD (0 1) 

√
(  −)
v N (0 1) 
be approximated by:


?

Gosset (1908) derived the unknown D ()  showing that for any 1:
√
(  −)


v St(−1)

(5)

where St(−1) denotes a Student’s t distribution with (−1) degrees
of freedom. The subtle question that needs to be answered is: what
does the result mean in light of the fact that  is unknown.
I This was the first finite sample result [a distributional result that
is valid for any   1] that inspired R. A. Fisher to pioneer modern
frequentist inference.
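A quick Monte Carlo check of (5), under nothing more than the NIID Normal assumptions, shows that the studentized mean follows St(n−1) exactly even for very small n; the numbers below are purely illustrative.

```python
# Monte Carlo illustration of Gosset's result (5) for a small sample size.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
n, mu, sigma, reps = 5, 10.0, 2.0, 200_000
x = rng.normal(mu, sigma, size=(reps, n))
tau = np.sqrt(n) * (x.mean(axis=1) - mu) / x.std(axis=1, ddof=1)

# Simulated tail frequencies vs. the St(n-1) tail areas
for c in (1.0, 2.0, 3.0):
    print(f"P(tau > {c}): simulated {np.mean(tau > c):.4f}, St({n-1}): {t.sf(c, n-1):.4f}")
```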

5 R. A. Fisher's significance testing

The result (5) was formally proved and extended by Fisher (1915) and used subsequently as a basis for several tests of hypotheses associated with a number of different statistical models in a series of papers.
A key element in Fisher's framework was the introduction of the notion of a statistical model, whose generic form is:

$$\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\theta),\ \theta\in\Theta\},\ \mathbf{x}\in\mathbb{R}^n_X,\qquad (6)$$

where f(x; θ), x∈R^n_X, denotes the (joint) distribution of the sample X := (X1, ..., Xn) that encapsulates the prespecified probabilistic structure of the underlying stochastic process {Xt, t∈N}. The link to the phenomenon of interest comes from viewing data x0 := (x1, x2, ..., xn) as a 'truly typical' realization of the process {Xt, t∈N}.
Fisher used the result (5) to construct a test of significance for the:

null hypothesis: H0: μ = μ0, (7)

in the context of the simple Normal model (Table 1).

Table 1 - The simple Normal model
Statistical GM:        Xt = μ + ut, t∈N
[1] Normal:            Xt ~ N(·,·)
[2] Constant mean:     E(Xt) = μ, for all t∈N
[3] Constant variance: Var(Xt) = σ², for all t∈N
[4] Independence:      {Xt, t∈N} is an independent process.
¥ Fisher must have realized that the result in (5) by Gosset could not be used directly as a basis for a test because it involved the unknown parameter μ. One can only speculate, but it must have dawned on him that the result can be rendered operational if it is interpreted in terms of factual reasoning, in the sense that it holds under the True State of Nature (TSN) (μ=μ*, the true value). Note that, in general, the expression 'μ* denotes the true value of μ' is a shorthand for saying that 'data x0 constitute a realization of the sample X with distribution f(x; μ*)'. That is, (5) is a pivot (pivotal quantity):

$$\tau(\mathbf{X};\mu^{*})=\frac{\sqrt{n}(\bar{X}_n-\mu^{*})}{s}\ \overset{\text{TSN}}{\sim}\ \mathsf{St}(n-1),\qquad (8)$$

in the sense that it is a function of both X and μ whose distribution under the TSN is known.
Fisher (1956), p. 47, was very explicit about the nature of reasoning
underlying his significance testing:
“In general, tests of significance are based on hypothetical probabilities
calculated from their null hypotheses.” [emphasis in the original].
His problem was to adapt the result in (8) to hypothetical reasoning with a view to constructing a test. Fisher accomplished that in three steps.
Step 1. He replaced the unknown μ* in (8) with the known null value μ0, transforming the pivot τ(X; μ) into a statistic τ(X) = √n(X̄n − μ0)/s. A statistic is only a function of the sample X := (X1, X2, ..., Xn).
Step 2. He adapted the factual reasoning underlying the pivot (8) into hypothetical reasoning by evaluating the test statistic under H0:

$$\tau(\mathbf{X})=\frac{\sqrt{n}(\bar{X}_n-\mu_0)}{s}\ \overset{H_0}{\sim}\ \mathsf{St}(n-1).\qquad (9)$$
Step 3. He employed (9) to define the p-value:

$$\mathbb{P}(\tau(\mathbf{X})>\tau(\mathbf{x}_0);\ H_0)=p(\mathbf{x}_0),\qquad (10)$$

as an indicator of disagreement (inconsistency, contradiction) between data x0 and H0; the bigger the value of τ(x0), the smaller the p-value.
£ It is crucial to note that it is a serious mistake — committed often by misguided authors — to interpret the above probabilistic statement defining the p-value as conditional on the null hypothesis H0. Avoid the highly misleading notation P(τ(X) > τ(x0) | H0) = p(x0), which uses the vertical line (|) instead of a semi-colon (;). Conditioning on H0 makes no sense in frequentist inference since μ is not a random variable; the evaluation in (10) is made under the scenario that H0: μ=μ0 is true.
Although it is impossible to pin down Fisher's interpretation of the p-value, because his views changed over time and he held several interpretations at any one time, there is one fixed point among his numerous articulations (Fisher, 1925, 1955):
'a small p-value can be interpreted as a simple logical disjunction: either an extremely rare event has occurred or H0 is not true'.
The focus on "a small p-value" needs to be read in conjunction with Fisher's falsificationist stance about testing, in the sense that significance tests can falsify but never verify hypotheses (Fisher, 1955):
"... tests of significance, when used accurately, are capable of rejecting or invalidating hypotheses, in so far as these are contradicted by the data; but that they are never capable of establishing them as certainly true."
His logical disjunction combined with his falsificationist stance is not helpful in shedding light on the key issue of learning from data: what do data x0 convey about the truth or falsity of H0?
Occasionally, Fisher would go further and express what sounds more like an aspiration for an evidential interpretation:
"The actual value of p ... indicates the strength of evidence against the hypothesis" (see Fisher, 1925, p. 80),
but he never articulated a coherent evidential interpretation based on the p-value. Indeed, as late as 1956 he offered 'a degree of rational disbelief' interpretation (Fisher, 1956, pp. 46-47).
¥ A more pertinent interpretation of the p-value (section 8) is:
'the p-value is the probability — evaluated under the scenario that H0 is true — of all possible outcomes x∈R^n_X that accord less well with H0 than x0 does.'
In light of the fact that data x0 were generated by the 'true' (θ=θ*) statistical Data Generating Mechanism (DGM): M*(x) = {f(x; θ*)}, x∈R^n_X, and the test statistic τ(X) = √n(X̄n − μ0)/s measures the standardized difference between μ* and μ0, the p-value P(τ(X) > τ(x0); H0) = p(x0) seeks to measure the discordance between H0 and M*(x).
This precludes several erroneous interpretations of the p-value:
£ The p-value is the probability that H0 is true.
£ 1−p(x0) is the probability that H1 is true.
£ The p-value is the probability that the particular result is due to 'chance error'.
£ The p-value is an indicator of the size or the substantive importance of the observed effect.
£ The p-value is the conditional probability of obtaining a test statistic τ(x) more extreme than τ(x0) given H0; there is no conditioning involved.
Example 1. Consider testing the null hypothesis H0: μ = 70 in the context of the simple Normal model (see Table 1), using the exam score data in table 1.6 (see chapter 1), where: x̄n = 71.686, s² = 13.606 and n = 70. The test statistic (9) yields:

$$\tau(\mathbf{x}_0)=\frac{\sqrt{70}(71.686-70)}{\sqrt{13.606}}=3.824,\quad \mathbb{P}(\tau(\mathbf{X})>3.824;\ \mu_0=70)=.00007,$$

where p(x0) = .00007 is found from the St(69) tables. The tiny p-value suggests that x0 indicates strong discordance with H0.
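The arithmetic can be checked directly; the sketch below computes τ(x0) and evaluates the St(69) tail area numerically rather than from printed tables.

```python
# Example 1: tau(x0) = sqrt(n)(xbar - mu0)/s, evaluated against St(n-1).
import numpy as np
from scipy.stats import t

n, xbar, s2, mu0 = 70, 71.686, 13.606, 70.0
tau = np.sqrt(n) * (xbar - mu0) / np.sqrt(s2)
p_value = t.sf(tau, df=n - 1)        # P(tau(X) > tau(x0); H0)
print(f"tau(x0) = {tau:.3f}, p-value = {p_value:.5f}")
```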
Table 2 - The simple Bernoulli model
Statistical GM:        Xt = θ + ut, t∈N
[1] Bernoulli:         Xt ~ Ber(·,·)
[2] Constant mean:     E(Xt) = θ, for all t∈N
[3] Constant variance: Var(Xt) = θ(1−θ), for all t∈N
[4] Independence:      {Xt, t∈N} is an independent process.
Example 2. Arbuthnot's 1710 conjecture: the ratio of males to females in newborns might not be 'fair'. This can be tested using the statistical null hypothesis:

H0: θ = θ0, where θ0 = .5 denotes 'fair',

in the context of the simple Bernoulli model (Table 2), based on the random variable X defined by: {male} = {X=1}, {female} = {X=0}.
Using the MLE of θ, θ̂ := X̄n = (1/n)Σ_{t=1}^n Xt, whose sampling distribution is X̄n ~ Bin(θ, θ(1−θ)/n; n), one can derive the test statistic:

$$d(\mathbf{X})=\frac{\sqrt{n}(\bar{X}_n-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}\ \overset{H_0}{\sim}\ \mathsf{Bin}(0,1;n).\qquad (11)$$

Data: n = 30762 newborns during the period 1993-5 in Cyprus, out of which 16029 were boys and 14733 girls.
The test statistic takes the form (11), with θ̂n = 16029/30762 = .521:

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}(.521-.5)}{\sqrt{.5(.5)}}=7.366,\quad \mathbb{P}(d(\mathbf{X})>7.366;\ \theta=.5)=.0000017.$$

The tiny p-value indicates strong discordance with H0. In general one needs to specify a certain threshold for deciding when the p-value is 'small enough' to falsify H0. Fisher suggested several such thresholds, .01, .025, .05, .1, but insisted that the choice has to be made on a case by case basis.
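A sketch of this calculation follows; note that it uses the exact ratio 16029/30762 rather than the rounded .521, which pushes the statistic to 7.389 (the value used in section 6.5 below), and the tail area is evaluated via the Normal approximation warranted by the huge n.

```python
# The Bernoulli test statistic (11) for Arbuthnot's data.
import numpy as np
from scipy.stats import norm

n, boys, theta0 = 30762, 16029, 0.5
theta_hat = boys / n
d = np.sqrt(n) * (theta_hat - theta0) / np.sqrt(theta0 * (1 - theta0))
print(f"d(x0) = {d:.3f}, p-value = {norm.sf(d):.2e}")   # numerically ~0
```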

The main elements of a Fisher significance test are {τ(X), p(x0)}.

Components of a Fisher significance test
(i) a prespecified statistical model: M_θ(x),
(ii) a null hypothesis (H0: μ = μ0),
(iii) a test statistic (distance function) τ(X),
(iv) the distribution of τ(X) under H0,
(v) the p-value: P(τ(X) > τ(x0); H0) = p(x0),
(vi) a threshold value p0 [e.g. .01, .025, .05], such that:
p(x0) < p0 ⇒ x0 falsifies (rejects) H0.
Example. Consider a Fisher test based on a simple (one parameter) Normal model (Table 3).

Table 3 - The simple (one parameter) Normal model
Statistical GM:        Xt = μ + ut, t∈N
[1] Normal:            Xt ~ N(·,·)
[2] Constant mean:     E(Xt) = μ, for all t∈N
[3] Constant variance: Var(Xt) = σ², known, for all t∈N
[4] Independence:      {Xt, t∈N} is an independent process.

(i) M_θ(x): Xt ~ N(μ, σ²) [σ² is known], t = 1, 2, ..., n,
(ii) null hypothesis: H0: μ = μ0 (e.g. μ0 = 0),
(iii) a test statistic: d(X) = √n(X̄n − μ0)/σ,
(iv) the distribution of d(X) under H0: d(X) ~ N(0,1),
(v) the p-value: P(d(X) > d(x0); H0) = p(x0),
(vi) a threshold value for discordance, say p0 = .05.
[Figure: two standard Normal density plots illustrating tail areas: P(Z ≤ −1.64) = .05 and P(Z > 1.96) = .025.]
6 The Neyman-Pearson (N-P) framework

Neyman and Pearson (1933) motivated their own approach to testing as an attempt to improve upon Fisher's testing by addressing what they considered ad hoc features of that approach:
[a] Fisher's ad hoc choice of a test statistic τ(X), pertaining to the optimality of tests,
[b] Fisher's use of a post-data [p(x0) is known] threshold for the p-value to indicate falsification (rejection) of H0, and
[c] Fisher's denial that p(x0) can indicate confirmation (accordance) of H0.
Neyman-Pearson (N-P) proposed solutions for [a]-[c] by:
¥ [i] Introducing the notion of an alternative hypothesis H1 as the complement to the null with respect to Θ, within the same statistical model:

$$\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\theta),\ \theta\in\Theta\},\ \mathbf{x}\in\mathbb{R}^n_X.\qquad (12)$$

[ii] Replacing the post-data p-value and its threshold with a pre-data [before d(x0) is available] significance level α determining the decision rules:
(i) Reject H0, (ii) Accept H0,
which are calibrated using the error probabilities:
type I: P(Reject H0 when H0 is true) = α,
type II: P(Accept H0 when H0 is false) = β.
These error probabilities enabled Neyman and Pearson to introduce the key notion of an optimal N-P test: one that minimizes β subject to a fixed α, i.e.
[iii] Selecting the particular d(X), in conjunction with a rejection region for a given α, that has the highest pre-data capacity (1 − β) to detect discrepancies in the direction of H1.
In this sense, the N-P proposed solutions to the perceived weaknesses of Fisher's significance testing changed the original Fisher framework in four important respects:
¥ (a) N-P testing takes place within a prespecified M_θ(x) by partitioning it into the null and alternative hypotheses,
(b) Fisher's post-data p-value has been replaced with a pre-data α and β,
(c) the combination of (a)-(b) renders ascertainable the pre-data capacity (power) of the test to detect discrepancies in the direction of H1,
(d) Fisher's evidential aspirations for the p-value based on discordance were replaced by the more behavioristic accept/reject decision rules that aim to provide input for alternative actions: when accepting H0 take one action, when rejecting H0 take another.
As claimed by Neyman and Pearson (1928):
"the tests themselves give no final verdict, but as tools help the worker who is using them to form his final decision."
6.1 The archetypal N-P hypotheses specification

In N-P testing there has been a long debate concerning the proper way to specify the null and alternative hypotheses. It is often thought to be rather arbitrary and subject to abuse. This view stems primarily from inadequate understanding of the role of statistical vs. substantive information.
¥ It is argued that there is nothing arbitrary about specifying the null and alternative hypotheses in N-P testing. The default alternative is always the complement to the null relative to the parameter space of M_θ(x).
Consider a generic statistical model:

$$\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\theta),\ \theta\in\Theta\},\ \mathbf{x}\in\mathbb{X}:=\mathbb{R}^n_X.$$

The archetypal way to specify the null and alternative hypotheses for N-P testing is:

H0: θ∈Θ0 vs. H1: θ∈Θ1, (13)

where Θ0 and Θ1 constitute a partition of the parameter space Θ: Θ0 ∩ Θ1 = ∅, Θ0 ∪ Θ1 = Θ.
In the case where the set Θ0 or Θ1 contains a single point, say Θ0 = {θ0}, which determines f(x; θ0), the hypothesis is said to be simple; otherwise it is composite.
Example. For the simple Bernoulli model, H0: θ = θ0 is simple, but H1: θ ≠ θ0 is composite.
¥ An equivalent formulation, which brings out most clearly the fact that N-P testing poses questions pertaining to the 'true' statistical Data Generating Mechanism (DGM) M*(x) = {f*(x)}, x∈X, is to re-write (13) in the equivalent form:

H0: f*(x)∈M0(x) vs. H1: f*(x)∈M1(x),

where f*(x) denotes the 'true' distribution of the sample, and:

M0(x) = {f(x; θ), θ∈Θ0}, M1(x) = {f(x; θ), θ∈Θ1}, x∈X := R^n_X,

constitute a partition of M_θ(x). P(x) denotes the set of all possible statistical models that could have given rise to data x0 := (x1, x2, ..., xn).

[Figure: N-P testing within M_θ(x), depicted as the partition of M_θ(x) into H0 and H1 inside the set P(x) of all possible statistical models.]
¥ When properly understood in the context of a statistical model M_θ(x), N-P testing leaves very little leeway in the specification of the H0 and H1 hypotheses. This is because the whole of Θ [all possible values of θ] is relevant on statistical grounds, despite the fact that only a small subset is often deemed relevant on substantive grounds.
The reasoning underlying this argument is that N-P tests invariably pose hypothetical questions — framed in terms of θ∈Θ — pertaining to the true statistical DGM M*(x) = {f*(x)}, x∈X. Partitioning M_θ(x) provides the most effective way to learn from data by zeroing in on the true θ, in the same way one can zero in on an unknown prespecified city on a map using sequential partitioning.
¥ This foregrounds the importance of securing the statistical adequacy of M_θ(x) [validating its probabilistic assumptions] before it provides the basis of any inference. Hence, whenever Θ0 and Θ1, the sets of values of θ postulated by H0 and H1 respectively, do not exhaust Θ, there is always the possibility that the true θ lies in the complement: Θ − (Θ0 ∪ Θ1).
What constitutes an N-P test?
A misleading idea that needs to be done away with immediately is that an N-P test is a formula with some associated statistical tables. It is not! It is a combination of a test statistic, whose sampling distribution under the null and alternative hypotheses is tractable, in conjunction with a rejection region.
Test statistic. Like an estimator, an N-P test statistic is a mapping from the sample space X := R^n_X to the real line:

d(·): X → R,

that partitions the sample space into an acceptance region C0 and a rejection region C1, C0 ∩ C1 = ∅, C0 ∪ C1 = X, in a way that corresponds to Θ0 and Θ1, respectively:

C0 ↔ Θ0, C1 ↔ Θ1.

The N-P decision rules, based on a test statistic d(X), take the form:

[i] if x0∈C0, accept H0; [ii] if x0∈C1, reject H0, (14)

and the associated error probabilities are:

type I:  P(x0∈C1; H0(θ) true) = α(θ), for θ∈Θ0,
type II: P(x0∈C0; H1(θ1) true) = β(θ1), for θ1∈Θ1.

                    True State of Nature
 N-P rule       H0 true         H0 false
 Accept H0      correct (√)     Type II error
 Reject H0      Type I error    correct (√)
£ Some misinformed statistics books go to great lengths to explain that the above error probabilities should be viewed as conditional on H0 or H1, and they use the misleading notation (|):

P(x0∈C1 | H0(θ) true) = α(θ), for θ∈Θ0.

Such conditioning makes no sense in a frequentist framework because θ is treated as an unknown constant. Instead, these should be interpreted as evaluations of the sampling distribution of the test statistic under different scenarios pertaining to different hypothesized values of θ.
¥ Some statistics books make a big deal of the distinction 'accept H0' vs. 'fail to reject H0', in their commendable attempt to bring out the problem of misinterpreting 'accept H0' as tantamount to 'there is evidence for H0'; this is known as the fallacy of acceptance. By the same token there is an analogous distinction between 'reject H0' vs. 'accept H1', which highlights the problem of misinterpreting 'reject H0' as tantamount to 'there is evidence for H1'; this is known as the fallacy of rejection.
What is objectionable about this practice is that the verbal distinctions by themselves do nothing but perpetuate these fallacies; we need to address them, not play clever with words!
Table 3 - The simple (one parameter) Normal model
Statistical GM:        Xt = μ + ut, t∈N
[1] Normal:            Xt ~ N(·,·)
[2] Constant mean:     E(Xt) = μ, for all t∈N
[3] Constant variance: Var(Xt) = σ², known, for all t∈N
[4] Independence:      {Xt, t∈N} is an independent process.
Example 1. Consider the following hypotheses in the context of the simple (one parameter) Normal model:

H0: μ ≤ μ0 vs. H1: μ > μ0. (15)

Let us choose a test T_α := {d(X), C1(α)} of the form:

test statistic: d(X) = √n(X̄n − μ0)/σ, where X̄n = (1/n)Σ_{t=1}^n Xt,
rejection region: C1(α) = {x: d(x) > c_α}. (16)

To evaluate the two types of error probabilities, the distribution of d(X) under both the null and the alternatives is needed:

[I]  d(X) = √n(X̄n − μ0)/σ ~ N(0,1), under H0 (μ = μ0),
[II] d(X) = √n(X̄n − μ0)/σ ~ N(δ1, 1), under H1 (μ = μ1), for all μ1 > μ0, (17)

where the non-zero mean δ1 takes the form: δ1 = √n(μ1 − μ0)/σ > 0, for all μ1 > μ0.
The evaluation of the type I error probability is based on [I]:

α = max_{μ≤μ0} P(d(X) > c_α; H0(μ)) = P(d(X) > c_α; μ = μ0).

Notice that in cases where the null is of the form H0: μ ≤ μ0, the type I error probability is defined as the maximum over all values of μ in the interval (−∞, μ0], which turns out to be the one evaluated at the end point of the interval, μ = μ0. Hence, the same test will be optimal also in the case where the null is of the form H0: μ = μ0.
The evaluation of type II error probabilities is based on [II]:

β(μ1) = P(d(X) ≤ c_α; H1(μ1)), for all μ1 > μ0.

The power [the probability of rejecting the null when false] is equal to 1 − β(μ1), i.e.

π(μ1) = P(d(X) > c_α; H1(μ1)), for all μ1 > μ0.

Why do we care about power? The power π(μ1) measures the pre-data (generic) capacity of test T_α := {d(X), C1(α)} to detect a discrepancy, say γ = μ1 − μ0 = .1, when present. Hence, when π(μ1) = .35 this test has very low capacity to detect such a discrepancy.
How is this information helpful in practice? If γ = .1 is the discrepancy of substantive interest, this test is practically useless for that purpose because we know beforehand that it does not have enough capacity to detect γ even if present!
What can one do in such a case? The power of the above test is monotonically increasing with δ1 = √n(μ1 − μ0)/σ, and thus increasing the sample size n increases the power.
Hypothesis testing reasoning.
¥ As with Fisher's significance testing, the reasoning underlying N-P testing is hypothetical, in the sense that both error probabilities are based on evaluating the relevant test statistic under several hypothetical scenarios, e.g. under H0 (μ = μ0) or under H1 (μ = μ1), for all μ1 > μ0. These hypothetical sampling distributions are then used to compare H0 or H1, via d(x0), with the True State of Nature (TSN) represented by data x0.
This is in contrast to frequentist estimation, which is factual in nature, i.e. the sampling distributions of estimators are evaluated under the TSN, i.e. μ = μ*.
Significance level vs. p-value
It is important to note that there is a mathematical relationship between the type I error probability (significance level) and the p-value. Placing them side by side:

P(type I error): P(d(X) > c_α; μ = μ0) = α, (18)
p-value: P(d(X) > d(x0); μ = μ0) = p(x0),

it becomes obvious that:
(a) they are both specified in terms of the distribution of the same test statistic d(X) under H0,
(b) they both evaluate the probability of tail events of the form {x: d(x) > c}, x∈X, but
(c) they differ in terms of the threshold c: c_α vs. d(x0); in that sense α is a pre-data and p(x0) a post-data error probability.
In light of (a)-(c), three comments are called for.
First, the p-value can be viewed as the smallest significance level α at which H0 would have been rejected with data x0. For that reason the p-value is often referred to as the observed significance level. Hence, it should come as no surprise to learn that the above N-P decision rules (14) can be recast in terms of the p-value:

[i]* if p(x0) > α, accept H0; [ii]* if p(x0) ≤ α, reject H0.

Indeed, practitioners often prefer to use the rules [i]*-[ii]* because the p-value conveys additional information when it is not close to the predesignated threshold α. For instance, rejecting H0 with p(x0) = .0001 seems more informative than just 'H0 was rejected at α = .05'.
¥ Second, the p-value seems to have an implicit alternative built into the choice of the tail area. For instance, a 2-sided alternative H1: μ ≠ μ0 would require one to evaluate P(|d(X)| > |d(x0)|; μ = μ0). How did Fisher get away with such an obvious breach of his own preaching? Speculating on that, his answer might have been: a 2-sided evaluation of the p-value makes no sense because, post-data, the sign of the test statistic d(x0) determines the direction of departure!
¥ Third, neither the significance level α nor the p-value can be interpreted as probabilities attached to particular values of μ associated with H0 or H1, since the probabilities in (18) are firmly attached to the sample realizations x∈X. Indeed, attaching probabilities to the unknown constant μ makes no sense in frequentist statistics.
As argued by Fisher (1921), p. 25: "We may discuss the probability of occurrence of quantities which can be observed or deduced from observations, in relation to any hypotheses which may be suggested to explain these observations. We can know nothing of the probability of hypotheses..."
This scotches another two erroneous interpretations of the p-value:
£ p(x0) = .05 does not mean that there is a 5% chance of a Type I error (i.e. false positive).
£ p(x0) = .05 does not mean that there is a 95% chance that the results would replicate if the study were repeated.
Example 1 (continued). Consider applying test T_α := {d(X), C1(α)} for: μ0 = 10, σ = 1, n = 100, x̄n = 10.175, α = .05 ⇒ c_α = 1.645:

$$d(\mathbf{x}_0)=\frac{\sqrt{n}(\bar{x}_n-\mu_0)}{\sigma}=\frac{\sqrt{100}(10.175-10)}{1}=1.75>c_\alpha\ \Rightarrow\ \text{Reject }H_0.$$

The p-value is: P(d(X) > 1.75; μ = μ0) = .04.

Standard Normal tables
 One-sided values       Two-sided values
 α=.100: c_α=1.28       α=.100: c_α=1.645
 α=.050: c_α=1.645      α=.050: c_α=1.96
 α=.025: c_α=1.96       α=.025: c_α=2.24
 α=.010: c_α=2.33       α=.010: c_α=2.58
The trade-off between type I and II error probabilities
I How does the introduction of type I and II error probabilities address the arbitrariness associated with Fisher's choice of a test statistic?
The ideal test is one whose error probabilities are zero, but no such test exists for a given n; an ideal test, like the ideal estimator, exists only as n → ∞! Worse, there is a trade-off between the above error probabilities: as one increases, the other decreases, and vice versa.
Example. In the case of test T_α := {d(X), C1(α)} in (16), decreasing the probability of type I error from α = .05 to α = .01 increases the threshold from c_α = 1.645 to c_α = 2.33, which makes it easier to accept H0, and this increases the probability of type II error.
To address this trade-off Neyman and Pearson (1933) proposed:
(i) to specify H0 and H1 in such a way as to render the type I error the more serious one, and then
(ii) to choose an α significance level test {d(X), C1(α)} so as to maximize its power for all θ∈Θ1.
N-P rationale. Consider the analogy with a criminal offense trial, where the jury are instructed by the judge to find the defendant 'not guilty' unless they have been convinced 'beyond any reasonable doubt' by the evidence:

H0: not guilty, vs. H1: guilty.

The clause 'beyond any reasonable doubt' amounts to fixing the type I error to a very small value, to reduce the risk of sending innocent people to prison, or even to death. At the same time, one would want the system to minimize the risk of letting guilty people get off scot-free.
More formally, the choice of an optimal N-P test is based on fixing an upper bound α for the type I error probability:

P(x∈C1; H0(θ) true) ≤ α, for all θ∈Θ0,

and then selecting a test {d(X), C1(α)} that minimizes the type II error probability, or equivalently, maximizes the power:

π(θ) = P(x0∈C1; H1(θ) true) = 1 − β(θ), for all θ∈Θ1.

I The general rule is that in selecting an optimal N-P test the whole of the parameter space Θ is relevant. This is why partitioning both Θ and the sample space X := R^n_X using a test statistic d(X) provides the key to N-P testing.
Optimal properties of tests
The property that sets the gold standard for an optimal test is:
[1] Uniformly Most Powerful (UMP): A test T_α := {d(X), C1(α)} is said to be UMP if it has higher power than any other α-level test T̃_α, for all values θ∈Θ1, i.e.

π(θ; T_α) ≥ π(θ; T̃_α), for all θ∈Θ1.

Example. Test T_α := {d(X), C1(α)} in (16) is UMP; Lehmann (1986).
Additional properties of N-P tests
[2] Unbiasedness: A test T_α := {d(X), C1(α)} is said to be unbiased if the probability of rejecting H0 when false is always greater than that of rejecting H0 when true, i.e.

max_{θ∈Θ0} P(x0∈C1; H0(θ)) ≤ α < P(x0∈C1; H1(θ)), for all θ∈Θ1.

[3] Consistency: A test T_α := {d(X), C1(α)} is said to be consistent if its power goes to one for all θ∈Θ1 as n → ∞, i.e.

lim_{n→∞} π_n(θ) = P(x0∈C1; H1(θ) true) = 1, for all θ∈Θ1.

As in estimation, this is a minimal property for tests.
Evaluating power. How does one evaluate the power of test T_α := {d(X), C1(α)} for different discrepancies γ = μ1 − μ0?
Step 1. In light of the fact that the relevant sampling distribution for the evaluation of π(μ1) is:

[II] d(X) = √n(X̄n − μ0)/σ ~ N(δ1, 1), under H1(μ1), for all μ1 > μ0,

the use of the N(0,1) tables requires one to split the test statistic into:

$$\frac{\sqrt{n}(\bar{X}_n-\mu_0)}{\sigma}=\frac{\sqrt{n}(\bar{X}_n-\mu_1)}{\sigma}+\delta_1,\qquad \delta_1=\frac{\sqrt{n}(\mu_1-\mu_0)}{\sigma},$$

since under H1(μ1) the distribution of the first component is:

$$\frac{\sqrt{n}(\bar{X}_n-\mu_1)}{\sigma}=\frac{\sqrt{n}(\bar{X}_n-\mu_0)}{\sigma}-\delta_1\ \overset{H_1(\mu_1)}{\sim}\ \mathsf{N}(0,1),\ \text{for }\mu_1>\mu_0.$$

Step 2. Evaluating the power of the test T_α := {d(X), C1(α)} for different discrepancies γ = μ1 − μ0, via π(μ1) = P(√n(X̄n − μ1)/σ > −δ1 + c_α; H1(μ1)) [here with μ0 = 10, σ = 1, n = 100, α = .05], yields:

γ = .1: δ1 = 1: π(10.1) = P(Z > −1 + 1.645) = .259,
γ = .2: δ1 = 2: π(10.2) = P(Z > −2 + 1.645) = .639,
γ = .3: δ1 = 3: π(10.3) = P(Z > −3 + 1.645) = .913,

where Z is a generic standard Normal r.v., i.e. Z ~ N(0,1).

The power of this test is typical of an optimal test, since π(μ1) increases with the non-zero mean δ1 = √n(μ1 − μ0)/σ, and thus:
(a) the power increases with the sample size n,
(b) the power increases with the discrepancy γ = (μ1 − μ0), and
(c) the power decreases with σ.
It is very important to emphasize three features of an optimal test.
First, a test is not just a formula associated with particular statistical tables. It is a combination of a test statistic and a rejection region; hence the notation T_α := {d(X), C1(α)}.
Second, the optimality of an N-P test is inextricably bound up with the optimality of the estimator the test statistic is based on. Hence, it is no accident that most optimal N-P tests are based on consistent, fully efficient and sufficient estimators.
Example. In the case of the simple (one parameter) Normal model, consider replacing X̄n with the unbiased estimator μ̂2 = (X1 + Xn)/2 ~ N(μ, σ²/2). The resulting test Ť_α := {ď(X), C1(α)}, where

ď(X) = √2(μ̂2 − μ0)/σ and C1(α) = {x: ď(x) > c_α},

will not be optimal because its power is much lower than that of T_α, and Ť_α is inconsistent! This is because the non-zero mean of its sampling distribution under μ = μ1 is δ = √2(μ1 − μ0)/σ, which does not change as n → ∞.
Third, it is important to note that by changing the rejection region one can render an optimal N-P test useless! For instance, replacing the rejection region of T_α := {d(X), C1(α)} with:

C̄1(α) = {x: d(x) ≤ c_α},

the resulting test Ṫ_α := {d(X), C̄1(α)} is practically useless because it is biased and its power decreases as the discrepancy γ increases.
Example 2 - the Student's t test. Consider the 1-sided hypotheses:

(1-s): H0: μ = μ0 vs. H1: μ > μ0, (19)
or (1-s*): H0: μ ≤ μ0 vs. H1: μ > μ0,

in the context of the simple Normal model (Table 1).
In this case, a UMP (consistent) test T_α := {τ(X), C1(α)} exists. It is the well-known Student's t test, and takes the form:

test statistic: τ(X) = √n(X̄n − μ0)/s, s² = (1/(n−1))Σ_{t=1}^n (Xt − X̄n)², (20)
rejection region: C1(α) = {x: τ(x) > c_α},

where c_α can be evaluated using the Student's t tables.
To evaluate the two types of error probabilities, the distribution of τ(X) under both the null and the alternatives is needed:

[I]*  τ(X) = √n(X̄n − μ0)/s ~ St(n−1), under H0(μ0),
[II]* τ(X) = √n(X̄n − μ0)/s ~ St(δ1; n−1), under H1(μ1), for all μ1 > μ0, (21)

where δ1 = √n(μ1 − μ0)/σ is the non-centrality parameter.

Student's t tables (n=60)
 One-sided values       Two-sided values
 α=.100: c_α=1.296      α=.100: c_α=1.671
 α=.050: c_α=1.671      α=.050: c_α=2.000
 α=.010: c_α=2.390      α=.010: c_α=2.660
 α=.001: c_α=3.232      α=.001: c_α=3.460
In summary, the main components of a Neyman-Pearson (N-P) test {τ(X), C1(α)} are given below.

Components of a Neyman-Pearson (N-P) test
(i) a prespecified statistical model: M_θ(x),
(ii) a null (H0) and an alternative (H1) within M_θ(x),
(iii) a test statistic (distance function) τ(X),
(iv) the distribution of τ(X) under H0,
(v) a prespecified significance level α [.01, .025, .05],
(vi) the rejection region C1(α),
(vii) the distribution of τ(X) under H1.
6.2 The N-P Lemma and its extensions

The cornerstone of the Neyman-Pearson (N-P) approach is the Neyman-Pearson lemma. Contemplate the simple generic statistical model:

$$\mathcal{M}_{\theta}(\mathbf{x})=\{f(\mathbf{x};\theta),\ \theta\in\Theta:=\{\theta_0,\theta_1\}\},\ \mathbf{x}\in\mathbb{R}^n_X,\qquad (22)$$

and consider the problem of testing the simple hypotheses:

H0: θ = θ0 vs. H1: θ = θ1. (23)

¥ The fact that the assumed parameter space is Θ := {θ0, θ1} and (23) constitutes a partition is often left out of most statistics textbook discussions of this famous lemma!
Existence. There exists an α-significance level Uniformly Most Powerful (UMP) [α-UMP] test, whose generic form is:

$$d(\mathbf{X})=h\!\left(\frac{f(\mathbf{x};\theta_1)}{f(\mathbf{x};\theta_0)}\right),\qquad C_1(\alpha)=\{\mathbf{x}:d(\mathbf{x})>c_\alpha\},\qquad (24)$$

where h(·) is a monotone function.
Sufficiency. If an α-level test of the form (24) exists, then it is UMP for testing (23).
Necessity. If {d(X), C1(α)} is an α-UMP test, then it will be given by (24).
At first sight the N-P lemma seems rather contrived, because it is an existence result for a simple statistical model M_θ(x) whose parameter space Θ := {θ0, θ1} is artificial, but it fits perfectly into the archetypal formulation. In addition, to implement this result one would need to find an h(·) yielding a test statistic d(X) whose distribution is known under both H0 and H1. Worse, this lemma is often misconstrued as suggesting that for an α-UMP test to exist one needs to confine testing to simple-vs-simple cases even when Θ is uncountable!
¥ The truth of the matter is that the construction of an α-UMP test in more realistic cases has nothing to do with simple-vs-simple hypotheses; instead it is invariably based on the archetypal N-P testing formulation in (13), and relies primarily on monotone likelihood ratios and other features of the prespecified statistical model M_θ(x).
Example. To illustrate these comments consider the simple-vs-simple hypotheses:

(i) (1-1): H0: μ = μ0 vs. H1: μ = μ1, (25)

in the context of a simple Normal (one parameter) model (Table 3). Applying the N-P lemma requires setting up the ratio:

$$\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}=\exp\left\{\frac{n}{\sigma^2}(\mu_1-\mu_0)\bar{X}_n-\frac{n}{2\sigma^2}(\mu_1^2-\mu_0^2)\right\},\qquad (26)$$

which is clearly not a test statistic as it stands. However, there exists a monotone function h(·) which transforms (26) into a familiar test statistic (Spanos, 1999, pp. 708-9):

$$d(\mathbf{X})=h\!\left(\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}\right)=\frac{1}{\delta_1}\ln\!\left(\frac{f(\mathbf{x};\mu_1)}{f(\mathbf{x};\mu_0)}\right)+\frac{\delta_1}{2}=\frac{\sqrt{n}(\bar{X}_n-\mu_0)}{\sigma},$$

with ascertainable type I and II error probabilities based on:

d(X) ~ N(0,1) under μ = μ0, d(X) ~ N(δ1, 1) under μ = μ1, δ1 = √n(μ1 − μ0)/σ.

In this particular case, it is clear that the Neyman-Pearson lemma does not apply, because the hypotheses in (25) do not constitute a partition of the parameter space Θ = R, as in (13). However, it turns out that when the test statistic d(X) = √n(X̄n − μ0)/σ is combined with information relating to the values μ0 and μ1, it can provide the basis for constructing several optimal tests (Lehmann, 1986):

[1] For μ1 > μ0, the test T_α> := {d(X), C1>(α)}, where C1>(α) = {x: d(x) > c_α}, is α-UMP for the hypotheses:

(ii) (1-s≥): H0: μ ≤ μ0 vs. H1: μ > μ0,
(iii) (1-s): H0: μ = μ0 vs. H1: μ > μ0.

Despite the difference between (ii) and (iii), the relevant error probabilities coincide. The type I error probability for (ii) is defined as the maximum over all μ ≤ μ0, but coincides with that of (iii) (μ = μ0):

α = max_{μ≤μ0} P(d(X) > c_α; H0(μ)) = P(d(X) > c_α; μ = μ0).

Similarly, the p-value is the same because:

p>(x0) = max_{μ≤μ0} P(d(X) > d(x0); H0(μ)) = P(d(X) > d(x0); μ = μ0).

[2] For μ1 < μ0, the test T_α< := {d(X), C1<(α)}, where C1<(α) = {x: d(x) < c_α}, is α-UMP for the hypotheses:

(iv) (1-s≤): H0: μ ≥ μ0 vs. H1: μ < μ0,
(v) (1-s): H0: μ = μ0 vs. H1: μ < μ0.

Again, the type I error probability and p-value are defined by:

α = max_{μ≥μ0} P(d(X) < c_α; H0(μ)) = P(d(X) < c_α; μ = μ0),
p<(x0) = max_{μ≥μ0} P(d(X) < d(x0); H0(μ)) = P(d(X) < d(x0); μ = μ0).

The existence of these α-UMP tests extends the N-P lemma to more realistic cases by invoking two regularity conditions:
¥ [A] The ratio (26) is a monotone function of the statistic X̄n, in the sense that for any two values μ1 > μ0, f(x;μ1)/f(x;μ0) changes monotonically with X̄n. This implies that f(x;μ1)/f(x;μ0) > k if and only if X̄n > k₀, for some k₀. This regularity condition is valid for most statistical models of interest in practice, including the one parameter Exponential family of distributions [Normal, Gamma, Beta, Binomial, Negative Binomial, Poisson, etc.], the Uniform, the Exponential, the Logistic, the Hypergeometric, etc.; Lehmann (1986).
¥ [B] The parameter space under H1, say Θ1, is convex, i.e. for any two values (θ1, θ2)∈Θ1, their convex combinations λθ1 + (1−λ)θ2 ∈ Θ1, for any 0 ≤ λ ≤ 1.
When convexity does not hold, as with the 2-sided alternative:

(vi) (2-s): H0: μ = μ0 vs. H1: μ ≠ μ0,

[3] the test T_α≠ := {d(X), C1≠(α)}, C1≠(α) = {x: |d(x)| > c_{α/2}}, is α-UMPU (Unbiased); the α-level and p-value are:

α = P(|d(X)| > c_{α/2}; μ = μ0), p≠(x0) = P(|d(X)| > |d(x0)|; μ = μ0).
6.3 Constructing optimal tests: Likelihood Ratio

The likelihood ratio test procedure can be viewed as a generalization of the Neyman-Pearson lemma to more realistic cases where the null and/or the alternative are composite hypotheses.
(a) The hypotheses of interest are of the general form:

H0: θ∈Θ0 vs. H1: θ∈Θ1.

(b) The test statistic is related to the 'likelihood' ratio:

$$\lambda(\mathbf{X})=\frac{\max_{\theta\in\Theta}L(\theta;\mathbf{X})}{\max_{\theta\in\Theta_0}L(\theta;\mathbf{X})}=\frac{L(\hat{\theta};\mathbf{X})}{L(\tilde{\theta};\mathbf{X})}.\qquad (27)$$

Note that the max in the numerator is over all θ∈Θ [yielding the MLE θ̂], but that of the denominator is confined to the values under H0: θ∈Θ0 [yielding the constrained MLE θ̃].
(c) The rejection region is defined in terms of a transformed ratio, where h(·) is chosen to yield a known sampling distribution under H0:

$$C_1(\alpha)=\{\mathbf{x}:\ v(\mathbf{X})=h(\lambda(\mathbf{X}))>c_\alpha\}.\qquad (28)$$

In terms of procedures to construct optimal tests, the LR procedure can be shown to yield several optimal tests; Lehmann (1986). Moreover, in cases where no such h(·) can be found, one can use the asymptotic distribution, which states that under certain restrictions (Wilks, 1938):

$$2\ln\lambda(\mathbf{X})=2\left(\ln L(\hat{\theta};\mathbf{X})-\ln L(\tilde{\theta};\mathbf{X})\right)\ \overset{H_0}{\underset{n\to\infty}{\sim}}\ \chi^2(r),$$

where the relation reads 'under H0 is asymptotically distributed as' and r denotes the number of restrictions involved in Θ0.
Example. Consider testing the hypotheses:

H0: σ² = σ0² vs. H1: σ² ≠ σ0²,

in the context of the simple Normal model (Table 1). From the previous chapter we know that the Maximum Likelihood method gives rise to:

$$\max_{\theta\in\Theta}L(\theta;\mathbf{x})=L(\hat{\theta};\mathbf{x})\ \Rightarrow\ \hat{\mu}=\tfrac{1}{n}\sum_{t=1}^{n}X_t,\ \ \hat{\sigma}^2=\tfrac{1}{n}\sum_{t=1}^{n}(X_t-\hat{\mu})^2,$$
$$\max_{\theta\in\Theta_0}L(\theta;\mathbf{x})=L(\tilde{\theta};\mathbf{x})\ \Rightarrow\ \tilde{\mu}=\tfrac{1}{n}\sum_{t=1}^{n}X_t,\ \ \tilde{\sigma}^2=\sigma_0^2.$$

Moreover, the two estimated likelihoods are:

$$\max_{\theta\in\Theta}L(\theta;\mathbf{x})=(2\pi\hat{\sigma}^2)^{-\frac{n}{2}}e^{-\frac{n}{2}},\qquad \max_{\theta\in\Theta_0}L(\theta;\mathbf{x})=(2\pi\sigma_0^2)^{-\frac{n}{2}}\exp\left\{-\frac{n\hat{\sigma}^2}{2\sigma_0^2}\right\}.$$

Hence, the likelihood ratio is:

$$\lambda(\mathbf{X})=\frac{L(\hat{\theta};\mathbf{X})}{L(\tilde{\theta};\mathbf{X})}=\frac{(2\pi\hat{\sigma}^2)^{-\frac{n}{2}}e^{-\frac{n}{2}}}{(2\pi\sigma_0^2)^{-\frac{n}{2}}\exp\{-\frac{n\hat{\sigma}^2}{2\sigma_0^2}\}}=\left[\left(\frac{\hat{\sigma}^2}{\sigma_0^2}\right)\exp\left\{-\frac{\hat{\sigma}^2}{\sigma_0^2}+1\right\}\right]^{-\frac{n}{2}}.$$

The function h(·) that transforms this ratio into a test statistic is:

$$v(\mathbf{X})=h(\lambda(\mathbf{X}))=\frac{n\hat{\sigma}^2}{\sigma_0^2}\ \overset{H_0}{\sim}\ \chi^2(n-1),$$

and a rejection region of the form:

$$C_1=\{\mathbf{x}:\ v(\mathbf{X})>c_2\ \text{or}\ v(\mathbf{X})<c_1\},$$

where, for a size α test, the constants c1, c2 are chosen in such a way so that:

$$\int_{0}^{c_1}f(v)\,dv=\int_{c_2}^{\infty}f(v)\,dv=\tfrac{\alpha}{2},$$

where f(v) denotes the chi-square density function. The test defined by {v(X), C1(α)} turns out to be UMP Unbiased (UMPU) (see Lehmann, 1986).
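A minimal sketch of this variance test, with equal tail probabilities and a hypothetical sample:

```python
# LR-based chi-square variance test: v(X) = n*sigma_hat^2/sigma0^2 ~ chi2(n-1)
# under H0, with an equal-tail two-sided rejection region.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
sigma0_sq, n, alpha = 4.0, 50, 0.05
x = rng.normal(0.0, 2.5, size=n)               # hypothetical sample
sigma_hat_sq = np.mean((x - x.mean()) ** 2)    # MLE of sigma^2 (divide by n)
v = n * sigma_hat_sq / sigma0_sq

c1, c2 = chi2.ppf(alpha / 2, n - 1), chi2.ppf(1 - alpha / 2, n - 1)
print(f"v(x0) = {v:.2f}, c1 = {c1:.2f}, c2 = {c2:.2f}, reject H0: {v < c1 or v > c2}")
```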
6.4 Substantive vs. statistical significance in N-P testing

We consider two substantive values of interest that concern the ratio of Boys (B) to Girls (G) in newborns:

Arbuthnot: #B = #G, Bernoulli: 18B to 17G. (29)

Statistical hypotheses. The Arbuthnot and Bernoulli conjectured values for the proportion can be embedded into a simple Bernoulli model (Table 2), where θ = P(X=1) = P(Boy):

Arbuthnot: θ_A = 1/2, Bernoulli: θ_B = 18/35. (30)

How should one proceed to appraise θ_A and θ_B vis-a-vis the above data? The archetypal specification (13) suggests probing each value separately, using the difference (θ_A − θ_B) as the discrepancy of interest.
Despite the fact that on substantive grounds the only relevant values of θ are (θ_A, θ_B), on statistical inference grounds the rest of the parameter space is relevant; M_θ(x) provides the relevant inductive premises.
This argument, however, has been questioned in the literature on the grounds that N-P testing should probe these two values using simple-vs-simple hypotheses. After all, the argument goes, the Neyman-Pearson lemma would secure an α-UMP test!
This argument shows insufficient understanding of the Neyman-Pearson lemma, because the parameter space of the simple Bernoulli model is not Θ = {θ_A, θ_B} but Θ = [0, 1]. Moreover, the fact that the simple Bernoulli model has a monotone likelihood ratio, i.e.:

$$\frac{f(\mathbf{x};\theta_1)}{f(\mathbf{x};\theta_0)}=\left(\frac{1-\theta_1}{1-\theta_0}\right)^{n}\left(\frac{\theta_1(1-\theta_0)}{\theta_0(1-\theta_1)}\right)^{\sum_{t=1}^{n}X_t},\ \text{for }(\theta_0,\theta_1)\ \text{s.t.}\ 0<\theta_0<\theta_1<1,$$

which is a monotonically increasing function of the statistic Y_n = Σ_{t=1}^n Xt, since θ1(1−θ0)/(θ0(1−θ1)) > 1, ensures the existence of several α-UMP tests.


In particular, [i]  :={(X) C1 ()}:
√
(  −0 ) 0

(X)= √

0 (1−0 )


v Bin (0 1; )  C1 ()={x: (x)   }

(31)

is a -UMP test for the N-P hypotheses:
(ii) (1-s≥): 0 :  ≤ 0 vs. 1 :   0 
(iii) (1-s): 0 :  = 0 vs. 1 :   0 


and [ii]  :={(X) C1 ()}:
√
(  −0 ) 0

(X)= √

0 (1−0 )


v Bin (0 1; )  C1 ()={x: (x)   }

(32)

(iv) (1-s≤): 0 :  ≥ 0 vs. 1 :   0 
(v) (1-s): 0 :  = 0 vs. 1 :   0 
is a -UMP test for the N-P hypotheses. These results stem from the
fact that the formulations (ii)-(v) ensure the convexity of the parameter
space under 1 
Moreover, in the case of a two-sided hypotheses:
(i) (2-s): 0 :  = 0 vs.
the test [iii] T_α≠ := {d(X), C1≠(α)}:

$$d(\mathbf{X})=\frac{\sqrt{n}(\bar{X}_n-\theta_0)}{\sqrt{\theta_0(1-\theta_0)}}\ \overset{H_0}{\sim}\ \mathsf{Bin}(0,1;n),\quad C_1^{\neq}(\alpha)=\{\mathbf{x}:|d(\mathbf{x})|>c_{\alpha/2}\},\qquad (33)$$

is α-UMP Unbiased; see Lehmann (1986).
¥ Randomization? No! In cases where the test statistic has a discrete sampling distribution under H0, as in (31)-(33), one might not be able to attain a prespecified α exactly.
The traditional way of dealing with this problem is to randomize, but this 'solution' raises more problems than it solves. Instead, the best way to address the discreteness issue is to select a value of α that is attained for the given n, or to approximate the discrete distribution with a continuous one to circumvent the problem. In practice, there are several continuous distributions one can use, depending on whether the original distribution is symmetric or not. In the above case, even for moderately small sample sizes, say n = 20, the Normal distribution provides an excellent approximation for values of θ around .5; see the figure below. With n = 30762, the approximation is nearly exact.
[Figure: Normal approximation of the Binomial, f(y; θ=.5, n=20): Bin(n=20, p=.5) overlaid with N(mean=10, StDev=2.236).]
What about the choice between one-sided and two-sided alternatives? The answer depends crucially on whether there exists reliable substantive information that renders part of the parameter space Θ irrelevant for both substantive and statistical purposes.
6.5 N-P testing of Arbuthnot's value

In this case there is reliable substantive information that the probability of a newborn being male, θ = P(X=1), is systematically slightly above 1/2; the human sex ratio at birth is approximately θ = .512 (Hardy, 2002). Hence, a statistically more informative way to assess θ = .5 might be to use one-sided (1-s) directional probing.
Data: n = 30762 newborns during the period 1993-5 in Cyprus, out of which 16029 were boys and 14733 girls.
In view of the huge sample size it is advisable to choose a smaller significance level, say α = .01 ⇒ c_α = 2.326.
Case 1. For testing Arbuthnot's value θ0 = 1/2 the relevant hypotheses are:

(1-s): H0: θ = 1/2 vs. H1: θ > 1/2, (34)
or (1-s*≤): H0: θ ≤ 1/2 vs. H1: θ > 1/2,

and T_α> := {d(X), C1>(α)} is an α-UMP test. In view of the fact that:

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}\left(\frac{16029}{30762}-\frac{1}{2}\right)}{\sqrt{.5(.5)}}=7.389,\qquad (35)$$

p>(x0) = P(d(X) > 7.389; H0) = .0000,
the null hypothesis H0: θ = 1/2 is strongly rejected.
What if one were to ignore the substantive information and apply a 2-sided test?
Case 2. For testing Arbuthnot's value this takes the form:

(2-s): H0: θ = 1/2 vs. H1: θ ≠ 1/2, (36)

and T_α≠ := {d(X), C1≠(α)} in (33) is an α-UMPU test; for α = .01, c_{α/2} = 2.576. In view of the fact that:

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}\left(\frac{16029}{30762}-\frac{1}{2}\right)}{\sqrt{.5(.5)}}=7.389>2.576,$$

p≠(x0) = P(|d(X)| > 7.389; H0) = .0000,
the null hypothesis H0: θ = 1/2 is strongly rejected.
6.6 N-P testing of Bernoulli's value

The first question we need to answer is the appropriate form of the alternative in testing Bernoulli's conjecture θ = 18/35.
Using the same substantive information that indicates θ is systematically slightly above 1/2 (Hardy, 2002), it seems reasonable to use a 1-sided alternative in the direction of 1/2.
Case 1. For probing Bernoulli's value the relevant hypotheses are:

(1-s): H0: θ = 18/35 vs. H1: θ < 18/35. (37)

The optimal (UMP) test for (37) is T_α< := {d(X), C1<(α)}, where for α = .01 ⇒ c_α = −2.326. Using the same data:

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}\left(\frac{16029}{30762}-\frac{18}{35}\right)}{\sqrt{\frac{18}{35}\left(1-\frac{18}{35}\right)}}=2.379,\qquad p_{<}(\mathbf{x}_0)=\mathbb{P}(d(\mathbf{X})<2.379;\ H_0)=.991,$$

so the null hypothesis H0: θ = 18/35 is accepted with a high p-value.
What if one were to ignore the substantive information as pertaining only to Arbuthnot's value and apply a 2-sided test?
Case 2. For testing Bernoulli's value using the 2-s probing:

(2-s): H0: θ = 18/35 vs. H1: θ ≠ 18/35, (38)

based on the UMPU test T_α≠ := {d(X), C1≠(α)} in (33):

$$d(\mathbf{x}_0)=\frac{\sqrt{30762}\left(\frac{16029}{30762}-\frac{18}{35}\right)}{\sqrt{\frac{18}{35}\left(1-\frac{18}{35}\right)}}=2.379<2.576,\qquad (39)$$

p≠(x0) = P(|d(X)| > 2.379; H0) = .0174,
which yields accepting H0, but the p-value is considerably smaller than p<(x0) = .991 for the 1-sided test.
The question that arises is: which one is the relevant p-value?
Indeed, a moment's reflection suggests that these two are not the only choices. There is also the p-value for:

(1-s): H0: θ = 18/35 vs. H1: θ > 18/35, (40)

which yields a much smaller p-value:

p>(x0) = P(d(X) > 2.379; H0) = .009, (41)

reversing the previous results and leading to rejecting H0.
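The three candidate p-values can be computed side by side; a sketch using the Normal approximation:

```python
# Three candidate p-values for Bernoulli's value theta0 = 18/35, Cyprus data.
import numpy as np
from scipy.stats import norm

n, boys, theta0 = 30762, 16029, 18 / 35
d = np.sqrt(n) * (boys / n - theta0) / np.sqrt(theta0 * (1 - theta0))   # 2.379
print(f"1-sided (H1: theta < 18/35): p = {norm.cdf(d):.3f}")        # .991
print(f"2-sided (H1: theta != 18/35): p = {2 * norm.sf(abs(d)):.4f}")   # .0174
print(f"1-sided (H1: theta > 18/35): p = {norm.sf(d):.3f}")         # .009
```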
¥ This raises the question: on what basis should one choose the relevant p-value?
The short answer, 'on substantive grounds', seems questionable because, first, it ignores the fact that the whole of the parameter space is relevant on statistical grounds, and second, one applies a test to assess the validity of substantive information, not to foist it on the data at the outset. Even more questionable is the argument that one should choose a simple alternative on substantive grounds in order to apply the Neyman-Pearson lemma that guarantees an α-UMP test; this is based on a misapprehension of the lemma.
The longer answer, which takes into account the fact that post-data d(x0) is known, and thus there is often a clear direction of departure indicated by data x0, associated with the sign of d(x0), will be elaborated upon in section 8.
7 Foundational issues raised by N-P testing

Let us now appraise how successful the N-P framework was in addressing the perceived weaknesses of Fisher’s testing:
[a] his ad hoc choice of a test statistics (X)
[b] his use of a post-data [(x0 ) is known] threshold for the -value
that indicates discordance (reject) with 0  and
[c] his denial that (x0 ) can indicate accordance with 0 
Very briefly, the N-P framework partly succeeded in addressing
[a] and [c], but did not provide a coherent evidential account that
answers the basic question (Mayo, 1996):
when do data x0 provide evidence for or against
(42)
a hypothesis or a claim ?
This is primarily because one could not interpret the decision results
‘Accept 0 ’ (‘Reject 0 ’) as data x0 provide evidence for (against) 0
[and thus provide evidence for 1 ]. Why?
(i) A particular test  :={(X) C1 ()} could have led to ‘Accept
0 ’ because the power of that test to detect an existing discrepancy
 was very low; the test had no generic capacity to detect an existing
discrepancy. This can easily happen when the sample size  is small.
(ii) A particular test  :={(X) C1 ()} could have led to ‘Reject
0 ’ simply because the power of that test was high enough to detect
‘trivial’ discrepancies from 0 , i.e. the generic capacity of the test was
extremely high. This can easily happen when the sample size  is very
large.
In light of the fact that the N-P framework did not provide a bridge between the accept/reject rules and what scientific inquiry is after, i.e. when data x0 provide evidence for or against a hypothesis H, it should come as no surprise to learn that most practitioners sought such answers by trying unsuccessfully (and misleadingly) to distill an evidential interpretation out of Fisher's p-value. In their eyes the pre-designated α is equally vulnerable to manipulation [pre-designation to keep practitioners 'honest' is unenforceable in practice], but the p-value is more informative and data-specific.
Is Fisher's p-value better in answering the question (42)? No. A small (large) p-value, relative to a certain threshold, could not be interpreted as evidence for the presence (absence) of a substantive discrepancy γ, for the same reasons as (i)-(ii). The generic capacity of the test (its power) affects the p-value. For instance, a very small p-value can easily arise in the case of a very large sample size n.
Fallacies of Acceptance and Rejection
The issues with the accept/reject H0 and p-value results raised above can be formalized into two classic fallacies.
(a) The fallacy of acceptance: no evidence against H0 is misinterpreted as evidence for H0.
This fallacy can easily arise in cases where the test in question has low power to detect discrepancies of interest.
(b) The fallacy of rejection: evidence against H0 is misinterpreted as evidence for a particular H1.
This fallacy arises in cases where the test in question has high power to detect substantively minor discrepancies. Since the power of a test increases with the sample size n, this renders N-P rejections, as well as tiny p-values, with large n highly susceptible to this fallacy.
In the statistics literature, as well as in the secondary literatures in
several applied fields, there have been numerous attempts to circumvent
these two fallacies, but none succeeded.

8 Post-data severity evaluation
A moment's reflection about the inability of the accept/reject H0 and p-value results to provide a satisfactory answer to the above question in (42) suggests that it is primarily due to the fact that there is a problem when such results are detached from the test itself, and are treated as providing the same evidence for a particular hypothesis H (H0 or H1), regardless of the generic capacity (the power) of the test in question. That is, whether data x0 provide evidence for or against a particular hypothesis H (H0 or H1) depends crucially on the generic capacity of the test (power) in question to detect discrepancies from H0. This stems from the intuition that a small p-value or a rejection of H0 based on a test with low power (e.g. a small n) for detecting a particular discrepancy γ provides stronger evidence for the presence of that discrepancy than using a test with much higher power (e.g. a large n). Mayo (1996) proposed a frequentist evidential account based on harnessing this intuition in the form of a post-data severity evaluation of the accept/reject results. This is based on custom-tailoring the generic capacity of the test to establish the discrepancy γ warranted by data x0. This evidential account can be used
to circumvent the above fallacies, as well as other (misplaced) charges
against frequentist testing.
The severity evaluation is a post-data appraisal of the accept/reject and p-value results that revolves around the discrepancy γ from H0 warranted by data x0.
¥ A hypothesis H passes a severe test Tα with data x0 if:
(S-1) x0 accords with H, and
(S-2) with very high probability, test Tα would have produced a result that accords less well with H than x0 does, if H were false.
Severity can be viewed as a feature of a test Tα as it relates to particular data x0 and a specific claim H being considered. Hence, the severity function has three arguments, SEV(Tα, x0, H), denoting the severity with which H passes Tα with x0; see Mayo and Spanos (2006).
Example. To explain how the above severity evaluation can be applied in practice, let us return to the problem of assessing the two substantive values of interest:

Arbuthnot: θ = 1/2,  Bernoulli: θ = 18/35.

When these values are viewed in the context of the error statistical perspective, it becomes clear that the way to frame the probing is to choose one of the values as the null hypothesis, and let the difference between them (18/35 − 1/2) represent the discrepancy of substantive interest.
Data: n = 30762 newborns during the period 1993-5 in Cyprus, of which 16029 are boys and 14833 girls.
Let us return to probing the Arbuthnot value θ = 1/2 based on:

(1-s): H0: θ = 1/2 vs. H1: θ > 1/2,
that gave rise to the observed test statistic:

d(x0) = √30762 (16029/30762 − 1/2) / √(.5(.5)) = 7.389,   (43)

leading to a rejection of H0 at α = .01 ⇒ cα = 2.326. Indeed, H0 would have also been rejected by a (2-s) test since cα/2 = 2.576.
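As a quick numerical check, a few lines of Python (same hypothetical setup as the sketch above) reproduce (43) and both rejections:

    from math import sqrt
    from scipy.stats import norm

    n, boys = 30762, 16029
    theta_hat, theta0 = boys / n, 0.5     # Arbuthnot's value as the null

    d_x0 = sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))
    c_1s = norm.ppf(1 - 0.01)             # 1-sided alpha = .01: 2.326
    c_2s = norm.ppf(1 - 0.01 / 2)         # 2-sided alpha = .01: 2.576

    print(f"d(x0) = {d_x0:.3f}")          # 7.389
    print(f"reject (1-s): {d_x0 > c_1s}, reject (2-s): {abs(d_x0) > c_2s}")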
An important feature of the severity evaluation is that it is post-data, and thus the sign of the observed test statistic d(x0) provides directional information that indicates the directional inferential claims that 'passed'. In relation to the above example, the severity 'accordance' condition (S-1) implies that the rejection of θ0 = 1/2 with d(x0) = 7.389 > 0 indicates that the form of the inferential claim that 'passed' is of the generic form:

θ > θ1 = θ0 + γ, for some γ ≥ 0.   (44)
The directional feature of the severity evaluation is very important
in addressing several criticisms of N-P testing, including the potential
arbitrariness and possible abuse of:
[a] switching between one-sided, two-sided or simple-vs-simple hypotheses,
[b] interchanging the null and alternative hypotheses, and
[c] manipulating the level of significance in an attempt to get the
desired testing result.
To establish the particular discrepancy γ warranted by data x0, the severity post-data 'discordance' condition (S-2) calls for evaluating the probability of the event: "outcomes x that accord less well with θ > θ1 than x0 does", i.e. [x: d(x) ≤ d(x0)], yielding:

SEV(Tα; θ > θ1) = P(x: d(x) ≤ d(x0); θ > θ1 is false).   (45)

The evaluation of this probability is based on the sampling distribution:

d(X) = √n(θ̂n − θ0)/√(θ0(1−θ0))  ~(θ=θ1)  Bin(δ(θ1), σ²(θ1); n), for θ1 > θ0,   (46)

δ(θ1) = √n(θ1 − θ0)/√(θ0(1−θ0)) ≥ 0,  σ²(θ1) = θ1(1−θ1)/(θ0(1−θ0)) > 0, σ²(θ1) ≤ 1.

Since the hypothesis that 'passed' is of the form θ > θ1 = θ0 + γ, the primary objective of SEV(Tα; θ > θ1) is to determine the largest discrepancy γ ≥ 0 warranted by data x0.
Table 4 evaluates SEV(Tα; θ > θ1) for different values of γ, with the evaluations based on the standardized d(X):

[d(X) − δ(θ1)]/σ(θ1)  ~(θ=θ1)  Bin(0, 1; n) ≃ N(0, 1).

For example, when θ1 = .5142:

√n(θ̂n − θ0)/√(θ0(1−θ0)) − δ(θ1) = 7.389 − √30762(.5142 − .5)/√(.5(.5)) = 2.408,   (47)

and Φ(Z ≤ 2.408) = .992, where Φ(·) is the cumulative distribution function (cdf) of the standard Normal distribution. Note that the scaling σ(θ1) = .9996 ≃ 1 can be ignored.
Table 4: Reject H0: θ = .5 vs. H1: θ > .5

  γ        Relevant claim        Severity: P(x: d(X) ≤ d(x0); θ=θ1)
           θ1 = [.5 + γ]
  .010     θ > .510              .999
  .013     θ > .513              .997
  .0142    θ > .5142             .992
  .0143    θ > .5143             .991
  .015     θ > .515              .983
  .016     θ > .516              .962
  .017     θ > .517              .923
  .018     θ > .518              .859
  .020     θ > .520              .645
  .021     θ > .521              .500
  .025     θ > .525              .084
An evidential interpretation. Taking a very high probability, say .95, as a threshold, the largest discrepancy from the null warranted by this data is:

γ ≤ .01637, since SEV(Tα; θ > .51637) = .950.

Is this discrepancy substantively significant? In general, to answer this question one needs to appeal to substantive subject matter information to assess the warranted discrepancy on substantive grounds. In human biology it is commonly accepted that the sex ratio at birth is approximately 105 boys to 100 girls; see Hardy (2002). Translating this ratio in terms of θ yields 105/205 = .5122, which suggests that the above warranted discrepancy γ = .01637 is substantively significant, since it outputs θ > .51637, which exceeds the substantive benchmark .5122. In light of that, the substantive discrepancy of interest between 18/35 and 1/2:

γ* = (18/35) − .5 = .0142857

yields SEV(Tα; θ > .5143) = .991, indicating that there is excellent evidence for the claim θ > θ1 = θ0 + γ*.
I In terms of the ultimate objective of statistical inference, learning from data, this evidential interpretation seems highly effective because it narrows down an infinite set Θ to a very small subset!
8.1 Revisiting issues pertaining to N-P testing

The above evidential interpretation based on the severity assessment can also be used to shed light on a number of issues raised in the previous sections.
(a) The arbitrariness of N-P hypotheses specification
The question that naturally arises at this stage is whether having good evidence for inferring θ > 18/35 depends on the particular way one has chosen to specify the hypotheses for the N-P test. Intuitively, one would expect that the evidence should not be dependent on the specification of the null and alternative hypotheses as such, but on x0 and the statistical model Mθ(x), x∈Rⁿ.

To explore that issue let us focus on probing the Bernoulli value H0: θ0 = 18/35, where the test statistic yields:

d(x0) = √30762 (16029/30762 − 18/35) / √((18/35)(1 − 18/35)) = 2.379.   (48)

This value leads to rejecting H0 at α = .01 when one uses a (1-s) test [cα = 2.326], but accepting H0 when one uses a (2-s) test [cα/2 = 2.576].
Which result should one believe?
Irrespective of these conflicting results, the severity 'accordance' condition (S-1) implies that, in light of the observed test statistic d(x0) = 2.379 > 0, the directional claim that 'passed' on the basis of (48) is of the form:

θ > θ1 = θ0 + γ.

To evaluate the particular discrepancy γ warranted by data x0, condition (S-2) calls for the evaluation of the probability of the event: 'outcomes x that accord less well with θ > θ1 than x0 does', i.e. [x: d(x) ≤ d(x0)], giving rise to (45), but with a different θ0.

Table 5 provides several such evaluations of SEV(Tα; θ > θ1) for different values of γ; note that these evaluations are also based on (46). A number of negative values of γ are included in order to address the original substantive hypotheses of interest, as well as to bring out the fact that the results of tables 4 and 5 are identical when viewed as inferences pertaining to θ; they simply have different null values θ0, and thus different discrepancies, but identical θ1 values.
The severity evaluations in tables 4 and 5 not only render the choice between (2-s), (1-s) and simple-vs-simple irrelevant, they also scotch the well-rehearsed argument pertaining to the asymmetry between the null and alternative hypotheses. The N-P convention of selecting a small α is generally viewed as reflecting a strong bias against rejection of the null. On the other hand, Bayesian critics of frequentist testing argue the exact opposite: N-P testing is biased against accepting the null; see Lindley (1993). The evaluations in tables 4 and 5 demonstrate that the evidential interpretation of frequentist testing based on severe testing addresses all asymmetries, notional or real.
Table 5: Accept/Reject H0: θ = 18/35 vs. H1: θ ≷ 18/35

  γ         Relevant claim        Severity: P(x: d(x) ≤ d(x0); θ=θ1)
            θ1 = [.5143 + γ]
  −.0043    θ > .510              .999
  −.0013    θ > .513              .997
  −.0001    θ > .5142             .992
   .0000    θ > .5143             .991
   .0007    θ > .515              .983
   .0017    θ > .516              .962
   .0027    θ > .517              .923
   .0037    θ > .518              .859
   .0057    θ > .520              .645
   .0067    θ > .521              .500
   .0107    θ > .525              .084
(b) Addressing the Fallacy of Rejection
The potential arbitrariness of the N-P specification of the null and alternative hypotheses and the associated p-values is brought out in probing the Bernoulli value θ = 18/35 using different alternatives:

(1-s, H1: θ < 18/35):  p<(x0) = P(d(X) < 2.379; H0) = .991,
(2-s, H1: θ ≠ 18/35):  q(x0) = P(|d(X)| > 2.379; H0) = .0174,
(1-s, H1: θ > 18/35):  p>(x0) = P(d(X) > 2.379; H0) = .009.

Can the above severity evaluation explain away these highly conflicting and confusing results?
39
The first choice 1 :   18 was driven solely by substantive infor35
mation relating to = 1 , which is often a bad idea because it ignores
2
the statistical dimension of inference. The second choice was based on
lack of information about the direction of departure, which makes sense
pre-data, but not post-data. The third choice reflects the post-data direction of departure indicated by (x0 )=2379  0 In light of that, the
severity evaluation renders  (x0 )=009 the relevant p-value, after all!
¥ The key problem with the p-value. In addition, the severity evaluation brings out the vulnerability of both the N-P reject rule and the p-value to the fallacy of rejection in cases where n is large. Viewing the relevant p-value (p>(x0) = .009) from the severity vantage point, it is directly related to H1 passing a severe test, because the probability that test Tα would have produced a result that accords less well with H1 than x0 does (x: d(x) ≤ d(x0)), if H1 were false (H0 true):

SEV(Tα; x0; θ > θ0) = P(d(X) ≤ d(x0); θ ≤ θ0) = 1 − P(d(X) > d(x0); θ = θ0) = .991,

is very high; Mayo (1996). The key problem with the p-value, however, is that it establishes the existence of some discrepancy γ ≥ 0, but provides no information concerning the magnitude licensed by x0. The severity evaluation remedies that by relating SEV(Tα; x0; θ > θ0) = .991 to the inferential claim θ > 18/35, with the implicit discrepancy associated with the p-value being γ = 0.
(c) Addressing the Fallacy of Acceptance
Example. Let us return to the hypotheses:

(1-s): H0: θ = 1/2 vs. H1: θ > 1/2,

in the context of the simple Bernoulli model (table 2), and consider applying test Tα with α = .025 ⇒ cα = 1.96 to the case where data x0 are based on n = 2000 and yield θ̂n = .503:

d(x0) = √2000(.503 − .5)/√(.5(.5)) = .268,  P(d(X) > .268; H0) = .394.

d(x0) = .268 leads to accepting H0, and the p-value indicates no rejection of H0.
Does this mean that data x0 provide evidence for H0? Or, more accurately, does x0 provide evidence for no substantive discrepancy from H0? Not necessarily!
To answer that question one needs to apply the post-data severe testing reasoning to evaluate the discrepancy γ ≥ 0, for θ1 = θ0 + γ, warranted by data x0. Test Tα := {d(X), C1(α)} passes H0, and the idea is to establish the smallest discrepancy γ ≥ 0 from θ0 warranted by data x0 by evaluating (post-data) the claim θ ≤ θ1 = θ0 + γ.
Condition (S-1) of severity is satisfied since x0 accords with H0 because d(x0) < cα, but (S-2) requires one to evaluate the probability of "all outcomes x for which test Tα accords less well with H0 than x0 does", i.e. (x: d(x) > d(x0)), under the hypothetical scenario that "θ ≤ θ1 is false", or equivalently "θ > θ1 is true":

SEV(Tα; x0; θ ≤ θ1) = P(x: d(x) > d(x0); θ > θ1), for γ ≥ 0.   (49)

The evaluation of (49) relies on the sampling distribution (46), and for different discrepancies (γ ≥ 0) yields the results in table 6. Note that SEV(Tα; x0; θ ≤ θ1) is evaluated at θ = θ1 because the probability increases with θ.
The numerical evaluations are based on the standardized d(X) from (47). For example, when θ1 = .515:

√n(θ̂n − θ0)/√(θ0(1−θ0)) − δ(θ1) = .268 − √2000(.515 − .5)/√(.5(.5)) = −1.074,

and Φ(Z > −1.074) = .859, where Φ(·) is the cumulative distribution function (cdf) of N(0, 1). Note that the scaling factor σ(θ1) = √(.515(1−.515)/(.5(.5))) = .9996 can be ignored because its value is so close to 1.

Table 6: Severity Evaluation of 'Accept H0' with SEV(Tα; x0; θ ≤ θ1 = θ0 + γ)

  γ:             .001  .005  .01   .015  .017  .0173  .018  .02   .025  .03
  SEV(θ ≤ θ1):   .429  .571  .734  .859  .895  .900   .910  .936  .975  .992
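Table 6 can be reproduced the same way; in the 'Accept H0' case the relevant event is d(X) > d(x0), so the sketch below (illustrative code, with the scaling again ignored) evaluates Φ(δ(θ1) − d(x0)):

    from math import sqrt
    from scipy.stats import norm

    n, theta0, theta_hat = 2000, 0.5, 0.503
    d_x0 = sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))   # 0.268

    def severity_le(gamma):
        # SEV(theta <= theta0 + gamma) = P(d(X) > d(x0); theta = theta0 + gamma)
        delta = sqrt(n) * gamma / sqrt(theta0 * (1 - theta0))
        return norm.cdf(delta - d_x0)

    for gamma in (.001, .005, .01, .015, .017, .0173, .018, .02, .025, .03):
        print(f"gamma = {gamma:<6}  SEV(theta <= {theta0 + gamma:.4f}) = {severity_le(gamma):.3f}")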
Assuming a severity threshold of, say, .90 is high enough, the above results indicate that the 'smallest' discrepancy warranted by x0 is γ = .0173, since:

SEV(Tα; x0; θ ≤ .5173) = .9.

That is, data x0, in conjunction with test Tα, provide evidence (with severity at least .9) for the claim θ ≤ .5173; they cannot rule out the presence of a discrepancy as large as γ = .0173. Is this discrepancy substantively insignificant? No! As mentioned above, the accepted ratio at birth in human biology translates into θ = .512, which suggests that the warranted discrepancy γ = .0173 is, indeed, substantively significant, since it outputs θ ≤ .5173, a bound which exceeds that substantive value.
This is an example where the post-data severity evaluation can be used to address the fallacy of acceptance. The severity assessment indicates that the statistically insignificant result d(x0) = .268 at α = .025 provides no evidence for the absence of a substantive discrepancy from H0: θ = .5, since discrepancies as large as γ = .0173 remain unruled out with severity at least .90.
(d) The arbitrariness of choosing the significance level
It is often argued by critics of frequentist testing that the choice
of the significance level in the context of the N-P approach, or the
threshold for the p-value in the case of Fisherian significance testing,
is totally arbitrary and manipulable.
To see how the severity evaluations can address this problem, let us try to 'manipulate' the original significance level (α = .01) to a different one that would alter the accept/reject results for H0: θ = 18/35. Looking back at the results for Bernoulli's conjectured value using the data from Cyprus, it is easy to see that choosing a larger significance level, say α = .02 ⇒ cα = 2.053, would lead to rejecting H0 for both N-P formulations, (2-s) (H1: θ ≠ 18/35) and (1-s) (H1: θ > 18/35), since:

d(x0) = 2.379 > 2.053,
(2-s): q(x0) = P(|d(X)| > 2.379; H0) = .0174,
(1-s): p>(x0) = P(d(X) > 2.379; H0) = .009.

How does this change the original severity evaluations? The short answer is: it doesn't! The severity evaluations remain invariant to any changes to α because they depend on d(x0) = 2.379 > 0 and not on cα, which indicates that the generic form of the hypothesis that 'passed' is θ > θ1 = θ0 + γ, and the same evaluations apply.

9 Summary and conclusions

In frequentist inference, learning from data x0 about the stochastic phenomenon of interest is accomplished by applying optimal inference procedures with ascertainable error probabilities in the context of a statistical model:

Mθ(x) = {f(x; θ), θ∈Θ}, x∈Rⁿ.   (50)

Hypothesis testing gives rise to learning from data by partitioning Mθ(x) into two subsets:

M0(x) = {f(x; θ), θ∈Θ0} or M1(x) = {f(x; θ), θ∈Θ1}, x∈Rⁿ,   (51)
framed in terms of the parameter(s):

H0: θ∈Θ0 vs. H1: θ∈Θ1,   (52)

and inquiring x0 for evidence in favor of one of the two subsets (51) using hypothetical reasoning.
Note that one could specify (52) more perceptively as:

H0: f*(x)∈M0(x) = {f(x; θ), θ∈Θ0} vs. H1: f*(x)∈M1(x) = {f(x; θ), θ∈Θ1},

where f*(x) = f(x; θ*) denotes the 'true' distribution of the sample. This notation elucidates the basic question posed in hypothesis testing:
I Given that the true M*(x) lies within the boundaries of Mθ(x), can x0 be used to narrow it down to a smaller subset M0(x)?
A test Tα := {d(X), C1(α)} is defined in terms of a test statistic (d(X)) and a rejection region (C1(α)), and its optimality is calibrated in terms of the relevant error probabilities:

type I:  P(x0∈C1; H0(θ) true) ≤ α(θ), for θ∈Θ0,
type II: P(x0∈C0; H1(θ1) true) = β(θ1), for θ1∈Θ1.

These error probabilities specify how often these procedures lead to erroneous inferences, and thus determine the pre-data capacity (power) of the test in question to give rise to valid inferences:

π(θ1) = P(x0∈C1; H1(θ1) true) = 1 − β(θ1), for all θ1∈Θ1.
An inference is reached by an inductive procedure which, with high probability, will reach true conclusions from valid (or approximately valid) premises (statistical model). Hence, the trustworthiness of frequentist inference depends on two interrelated pre-conditions:
(a) adopting optimal inference procedures, in the context of:
(b) a statistically adequate model.
I How does hypothesis testing relate to the other forms of frequentist inference?
It turns out that testing is inextricably bound up with estimation (point and interval) in the sense that an optimal estimator is often the cornerstone of an optimal test and an optimal confidence interval. For instance, consistent, fully efficient and minimal sufficient estimators provide the basis for most of the optimal tests and confidence intervals!
Having said that, optimal point estimation is considered rather non-controversial, but optimal interval estimation and hypothesis testing have raised several foundational issues that have bedeviled frequentist inference since the 1930s. In particular, the use and abuse of the relevant error probabilities calibrating these latter forms of inductive inference have contributed significantly to the general confusion and controversy when applying these procedures.
Addressing foundational issues
The first such issue concerns the presupposition of the (approximately) true premises, i.e. that the true M*(x) lies within the boundaries of Mθ(x). This can be addressed by securing the statistical adequacy of Mθ(x) vis-a-vis data x0, using trenchant Mis-Specification (M-S) testing, before applying frequentist testing; see chapter 15. As shown above, when Mθ(x) is misspecified, the nominal and actual error probabilities can be very different, rendering the reliability of the test in question at best unknown and at worst highly misleading. That is, statistical adequacy ensures the ascertainability of the relevant error probabilities, and hence the reliability of inference.
The second issue concerns the form and nature of the evidence x0 can provide for θ∈Θ0 or θ∈Θ1. Neither Fisher's p-value, nor the N-P accept/reject rules can provide such an evidential interpretation, primarily because they are vulnerable to two serious fallacies:
(c) Fallacy of acceptance: no evidence against the null is misinterpreted as evidence for it.
(d) Fallacy of rejection: evidence against the null is misinterpreted as evidence for a specific alternative.
As discussed above, these fallacies can be circumvented by supplementing the accept/reject rules (or the p-value) with a post-data evaluation of inference based on severe testing, with a view to determining the discrepancy γ from the null warranted by data x0. This establishes the warranted inferential claim [and thus, the unwarranted ones] in particular cases, so that it gives rise to learning from data.
The severity perspective sheds ample light on the various controversies relating to p-values vs. type I and II error probabilities, delineating that the p-value is a post-data error probability while the type I and II are pre-data error probabilities; although inadequate for the task of furnishing an evidential interpretation, they were intended to fulfill complementary roles. Pre-data error probabilities, like the type I and II, are used to appraise the generic capacity of testing procedures, and post-data error probabilities, like severity, are used to bridge the gap between the coarse 'accept/reject' results and the evidence provided by data x0 for the warranted discrepancy from the null, by custom-tailoring the pre-data capacity in light of d(x0).
This refinement/extension of the Fisher-Neyman-Pearson framework can be used to address not only the above mentioned fallacies, but several criticisms that have beleaguered frequentist testing since the 1930s, including:
(e) the arbitrariness of selecting the thresholds,
(f) the framing of the null and alternative hypotheses in the context of a statistical model Mθ(x), and
(g) the numerous misinterpretations of the p-value.
10 Appendix: Revisiting Confidence Intervals (CIs)

Mathematical Duality between Testing and CIs
From a purely mathematical perspective a point estimator is a mapping g(·) of the form:

g(·): Rⁿ → Θ.

Similarly, an interval estimator is also a mapping C(·) of the form:

C(·): Rⁿ → Θ, such that C⁻¹(θ) = {x: θ∈C(x)} ⊂ F, for θ∈Θ.

Hence, one can assign a probability to C⁻¹(θ) = {x: θ∈C(x)} in the sense that:

P(x: θ∈C(x); θ=θ*) := P(θ∈C(X); θ=θ*)

denotes the probability that the random set C(X) contains θ*, the true value of θ. The expression 'θ* denotes the true value of θ' is a shorthand for saying that 'data x0 constitute a realization of the sample X with distribution f(x; θ*)'. One can define a (1−α) Confidence Interval (CI) for θ by attaching a lower bound to this probability:

P(θ∈C(X); θ=θ*) ≥ (1−α).
Neyman (1937) utilized a mathematical duality between N-P Hypothesis Testing (HT) and Confidence Intervals (CIs) to put forward an optimal theory for the latter.
The α-level test of the hypotheses:

H0: θ = θ0 vs. H1: θ ≠ θ0

takes the form {d(X; θ0), C1(θ0; α)}, with d(X; θ0) the test statistic and the rejection region defined by:

C1(θ0; α) = {x: |d(x; θ0)| > cα/2} ⊂ Rⁿ.

The complement of the latter with respect to the sample space (Rⁿ) defines the acceptance region:

C0(θ0; α) = {x: |d(x; θ0)| ≤ cα/2} = Rⁿ − C1(θ0; α).

The mathematical duality between HT and CIs stems from the fact that for each x∈Rⁿ one can define a corresponding set C(x; α) on the parameter space by:

C(x; α) = {θ: x∈C0(θ0; α)}, θ∈Θ, e.g. C(x; α) = {θ: |d(x; θ0)| ≤ cα/2}.

Conversely, for each C(x; α) there is a CI corresponding to the acceptance region C0(θ0; α) such that:

C0(θ0; α) = {x: θ0∈C(x; α)}, x∈Rⁿ, e.g. C0(θ0; α) = {x: |d(x; θ0)| ≤ cα/2}.

This equivalence is encapsulated by:

x∈C0(θ0; α) ⇔ θ0∈C(x; α), for each x∈Rⁿ and each θ0∈Θ.

When the lower bound θ̂L(X) and upper bound θ̂U(X) of a CI, i.e.

P(θ̂L(X) ≤ θ ≤ θ̂U(X)) = 1−α,

are monotone increasing functions of X, the equivalence is:

θ∈(θ̂L(x) ≤ θ ≤ θ̂U(x)) ⇔ x∈(θ̂U⁻¹(θ) ≤ x ≤ θ̂L⁻¹(θ)).
Optimality of Confidence Intervals

The mathematical duality between HT and CIs is particularly useful in defining an optimal CI that is analogous to an optimal N-P test. The key result is that the acceptance region of an α-significance level test Tα that is Uniformly Most Powerful (UMP) gives rise to a (1−α) CI that is Uniformly Most Accurate (UMA), i.e. the error coverage probability that C(x; α) contains a value θ ≠ θ0, where θ0 is assumed to be the true value, is minimized; Lehmann (1986). Similarly, an α-UMPU test gives rise to a UMAU Confidence Interval.
Moreover, it can be shown that such optimal CIs have the shortest length in the following sense: if [θ̂L(x), θ̂U(x)] is a (1−α)-UMAU CI, then its length [width] is less than or equal to that of any other (1−α) CI, say [θ̃L(x), θ̃U(x)]:

E_θ*(θ̂U(x) − θ̂L(x)) ≤ E_θ*(θ̃U(x) − θ̃L(x)).
Underlying reasoning: CI vs. HT

Does this mathematical duality render HT and CIs equivalent in
terms of their respective inference results? No!
The primary source of the confusion on this issue is that they share
a common pivotal function, and their error probabilities are evaluated
using the same tail areas. But that’s where the analogy ends.
To bring out how their inferential claims and respective error probabilities are fundamentally different, consider the simple (one parameter) Normal model (table 3), where the common pivot is:

d(X; μ) = √n(X̄n − μ)/σ ~ N(0, 1).   (53)

Since μ is unknown, (53) is meaningless as it stands, and it is crucial to understand how the CI and HT procedures render it meaningful using different types of reasoning, referred to as factual and hypothetical, respectively:

(factual, under the True State of Nature (TSN), μ=μ*):  d(X; μ*) = √n(X̄n − μ*)/σ ~ N(0, 1),
(hypothetical, under H0, μ=μ0):  d(X) = √n(X̄n − μ0)/σ ~ N(0, 1).   (54)
Putting the (1−α) CI side-by-side with the (1−α) acceptance region:

P(X̄n − cα/2(σ/√n) ≤ μ ≤ X̄n + cα/2(σ/√n); μ=μ*) = 1−α,
P(μ0 − cα/2(σ/√n) ≤ X̄n ≤ μ0 + cα/2(σ/√n); μ=μ0) = 1−α,   (55)

it is clear that their inferential claims are crucially different:
[a] The CI claims that with probability (1−α) the random bounds [X̄n ± cα/2(σ/√n)] will cover (overlay) the true μ* [whatever that happens to be], under the TSN.
[b] The test claims that with probability (1−α), test Tα would yield a result that accords equally well or better with H0 (x: |d(x; μ0)| ≤ cα/2), if H0 were true.
Hence, their claims pertain to the parameter vs. the sample space, and the evaluation under the TSN (μ=μ*) vs. under the null (μ=μ0) renders the two statements inferentially very different. The idea that a given (known) μ0 and the true μ* are interchangeable is an illusion.
This can be seen most clearly in the simple Bernoulli case where θ needs to be estimated using θ̂n = (1/n)Σₖ Xₖ := X̄n for the UMAU (1−α) CI:

P(X̄n − cα/2 √(X̄n(1−X̄n)/n) ≤ θ < X̄n + cα/2 √(X̄n(1−X̄n)/n); θ=θ*) = 1−α,

but the acceptance region of the α-UMPU test is defined in terms of the hypothesized value θ0:

P(θ0 − cα/2 √(θ0(1−θ0)/n) ≤ X̄n < θ0 + cα/2 √(θ0(1−θ0)/n); θ=θ0) = 1−α.
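The duality, and where it breaks down, can be seen numerically. The sketch below (illustrative code, using the Cyprus data; note that the CI standard error uses X̄(1−X̄) while the test uses θ0(1−θ0), so the correspondence between 'θ0 inside the observed CI' and '2-s test accepts θ0' is only approximate near the endpoints):

    from math import sqrt
    from scipy.stats import norm

    n, boys, alpha = 30762, 16029, 0.01
    x_bar = boys / n
    c = norm.ppf(1 - alpha / 2)                    # c_{alpha/2} = 2.576

    half = c * sqrt(x_bar * (1 - x_bar) / n)       # observed CI half-length
    lo, hi = x_bar - half, x_bar + half
    print(f"observed .99 CI for theta: [{lo:.4f}, {hi:.4f}]")

    for theta0 in (0.5, 18 / 35):                  # Arbuthnot, Bernoulli
        d = sqrt(n) * (x_bar - theta0) / sqrt(theta0 * (1 - theta0))
        print(f"theta0 = {theta0:.4f}: inside CI: {lo <= theta0 <= hi}, "
              f"2-s test accepts: {abs(d) <= c}")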
The fact is that there is only one θ* and an infinity of values θ0. As a result, the HT hypothetical reasoning is equally well-defined pre-data and post-data. In contrast, there is only one TSN, which has already played out post-data, rendering the coverage error probability degenerate, i.e. the observed CI:

[x̄n − cα/2(σ/√n), x̄n + cα/2(σ/√n)]   (56)

either includes or excludes the true value μ*.
Observed Confidence Intervals and Severity
In addition to addressing the fallacies of acceptance and rejection, the post-data severity evaluation can be used to address the issue of degenerate post-data error probabilities and the inability to distinguish between different values of μ within an observed CI; Mayo and Spanos (2006). This is accomplished by making two key changes:
(a) Replace the factual reasoning underlying CIs, which becomes degenerate post-data, with the hypothetical reasoning underlying the post-data severity evaluation.
(b) Replace any claims pertaining to overlaying the true μ* with severity-based inferential claims of the form:

μ > μ1 = μ0 + γ, or μ ≤ μ1 = μ0 + γ, for some γ ≥ 0.   (57)

This can be achieved by relating the observed bounds:

x̄n ± cα/2(σ/√n)   (58)

to particular values of μ1 associated with (57), say μ1 = x̄n − cα/2(σ/√n), and evaluating the post-data severity of the claim:

μ > μ1 = x̄n − cα/2(σ/√n).

A moment's reflection, however, suggests that the connection between establishing the warranted discrepancy γ from the null value μ0 and the observed CI (58) is more apparent than real, because the severity evaluation probability is attached to the inferential claim (57), and not to the particular value μ1. Hence, at best one would be creating the illusion that the severity assessment on the boundary of a 1-sided observed CI is related to the confidence level; it is not! The truth of the matter is that the inferential claim (57) does not pertain directly to μ*, but to the warranted discrepancy γ in light of x0.
Fallacious arguments for using Confidence Intervals
A. Confidence Intervals are more reliable than p-values
It is often argued in the social science statistical literature that CIs are more reliable than p-values because the latter are vulnerable to the large n problem; as the sample size n → ∞ the p-value becomes smaller and smaller. Hence, one would reject any null hypothesis given enough observations. What these critics do not seem to realize is that a CI like:

P(X̄n − cα/2(σ/√n) ≤ μ ≤ X̄n + cα/2(σ/√n); μ=μ*) = 1−α

is equally vulnerable to the large n problem because the length of the interval:

[X̄n + cα/2(σ/√n)] − [X̄n − cα/2(σ/√n)] = 2cα/2(σ/√n)

shrinks to zero as n → ∞.
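A sketch of this point (illustrative values: a fixed observed θ̂ = .51 probed against θ0 = .5 at α = .05): as n grows, the p-value and the CI length collapse together.

    from math import sqrt
    from scipy.stats import norm

    theta_hat, theta0, alpha = 0.51, 0.5, 0.05
    c = norm.ppf(1 - alpha / 2)

    for n in (100, 1_000, 10_000, 100_000, 1_000_000):
        d = sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))
        p_two = 2 * norm.sf(abs(d))                              # 2-sided p-value
        length = 2 * c * sqrt(theta_hat * (1 - theta_hat) / n)   # CI length
        print(f"n = {n:>9}: p = {p_two:.2e}, CI length = {length:.5f}")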

B. The middle of a Confidence Interval is more probable?
Despite the obvious fact that post-data the probability of coverage is either zero or one, there have been numerous recurring attempts in the literature to discriminate among the different values of μ within an observed interval:
"... the difference between population means is much more likely to be near the middle of the confidence interval than towards the extremes." (Gardner and Altman, 2000, p. 22)
This claim is clearly fallacious. A moment's reflection suggests that viewing all possible observed CIs (56) as realizations from the single distribution in (54) leads one to the conclusion that, unless the particular x̄ happens (by accident) to coincide with the true value μ*, values to the right or the left of x̄ will be more likely depending on whether x̄ is to the left or the right of μ*. Given that μ* is unknown, however, such an evaluation cannot be made in practice with a single realization. This is illustrated below with a CI for μ in the case of the simple (one parameter) Normal model.
[Figure: 20 observed confidence intervals, x̄ ± cα/2(σ/√n), from 20 sample realizations, plotted against the true value μ* (vertical line); each observed interval either covers μ* or it does not.]
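The illustration can be reproduced by simulation; a minimal sketch (assuming numpy is available and σ is known) draws 20 samples of size n from N(μ*, σ²) and reports which observed .95 CIs cover μ*:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(14)
    mu_star, sigma, n, alpha = 0.0, 1.0, 25, 0.05
    c = norm.ppf(1 - alpha / 2)
    se = sigma / np.sqrt(n)

    for i in range(1, 21):
        x_bar = rng.normal(mu_star, se)        # draw from the sampling distribution of the mean
        lo, hi = x_bar - c * se, x_bar + c * se
        print(f"{i:2d}: [{lo:+.3f}, {hi:+.3f}]  covers mu*: {lo <= mu_star <= hi}")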
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Kürzlich hochgeladen (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

A. spanos slides ch14-2013 (4)

  • 1. CHAPTER 14: Frequentist Hypothesis Testing: a Coherent Account Aris Spanos [Fall 2013] 1 Inherent difficulties in learning statistical testing Statistical testing is arguably the most important, but also the most difficult and confusing chapter of statistical inference for several reasons, including the following. (i) The need to introduce numerous new notions, concepts and procedures before one can paint — even in broad brushes — a coherent picture of hypothesis testing. (ii) The current textbook discussion of statistical testing is both highly confusing and confused. There are several sources of confusion. I (a) Testing is conceptually one of the most sophisticated sub-fields of any scientific discipline. I (b) Inadequate knowledge by textbook writers who often do not have the technical skills to read and understand the original sources, and have to rely on second hand accounts of previous textbook writers that are often misleading or just outright erroneous. In most of these textbooks hypothesis testing is poorly explained as an idiot’s guide to combining off-the-shelf formulae with statistical tables like the Normal, the Student’s t, the chi-square, etc., where the underlying statistical model that gives rise to the testing procedure is hidden in the background. I (c) The misleading portrayal of Neyman-Pearson testing as essentially decision-theoretic in nature, when in fact the latter has much greater affinity with the Bayesian rather than the frequentist inference. I (d) A deliberate attempt to distort and cannibalize frequentist testing by certain Bayesian statisticians who revel in (unfairly) maligning frequentist inference in their misguided attempt to motivate their preferred viewpoint of statistical inference. You will often hear such Bayesians tell you that "what you really want (require) is probabilities attached to hypotheses" and other such misadvised promptings! In frequentist inference probabilities are always attached to the different possible values of the sample x∈R ; never to hypotheses!  (iii) The discussion of frequentist testing is rather incomplete in so far as it has been beleaguered by serious foundational problems since 1
Hypothesis of interest: \mu_1 = \mu_2.

The two-dimensional sample is X := (X_{11}, X_{12}, \ldots, X_{1n}; X_{21}, X_{22}, \ldots, X_{2n}).

Common sense, combined with the statistical knowledge of the time, suggested using the difference between the estimated means,

\hat{\mu}_1 - \hat{\mu}_2, \quad \hat{\mu}_1 = \frac{1}{n}\sum_{k=1}^{n} X_{1k}, \quad \hat{\mu}_2 = \frac{1}{n}\sum_{k=1}^{n} X_{2k},

as a basis for deciding whether the two means are different or not. To render the difference (\hat{\mu}_1 - \hat{\mu}_2) free of the units of measurement, Edgeworth divided it by its standard deviation:

\sqrt{Var(\hat{\mu}_1 - \hat{\mu}_2)} = \sqrt{(\hat{\sigma}_1^2 + \hat{\sigma}_2^2)/n}, \quad \hat{\sigma}_1^2 = \frac{1}{n}\sum_{k=1}^{n}(X_{1k} - \hat{\mu}_1)^2, \quad \hat{\sigma}_2^2 = \frac{1}{n}\sum_{k=1}^{n}(X_{2k} - \hat{\mu}_2)^2,

to define the distance function:

\tau(X) = \frac{|\hat{\mu}_1 - \hat{\mu}_2|}{\sqrt{(\hat{\sigma}_1^2 + \hat{\sigma}_2^2)/n}}.

The last issue he needed to address is how big the observed distance \tau(x_0) should be to justify inferring that the two means are different. Edgeworth argued that (\mu_1 - \mu_2) \neq 0 could not be 'accidental' (due to chance) if:

\tau(X) = \frac{|\hat{\mu}_1 - \hat{\mu}_2|}{\sqrt{(\hat{\sigma}_1^2 + \hat{\sigma}_2^2)/n}} > 2\sqrt{2}.   (1)

I Where did the threshold 2\sqrt{2} come from? The tail area of N(0,1) beyond \pm 2\sqrt{2} is approximately .005, which was viewed as a reasonable value for the probability of a 'chance' error, i.e. the error of inferring a significant discrepancy when none exists. But why Normal? At the time statistical inference relied heavily on large sample size n (asymptotic) results by invoking the Central Limit Theorem.

In summary, Edgeworth introduced 3 features of statistical testing:
(i) a hypothesis of interest: \mu_1 = \mu_2,
(ii) the notion of a standardized distance: \tau(X),
(iii) a threshold value for significance: \tau(X) > 2\sqrt{2}.
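To make Edgeworth's procedure concrete, here is a minimal Python sketch (the function name and the simulated data are illustrative assumptions, not part of Edgeworth's account): it computes the standardized distance \tau(x_0) for two samples and compares it with the threshold 2\sqrt{2}.

```python
import numpy as np

def edgeworth_distance(x1, x2):
    """Edgeworth's standardized distance between two sample means.

    Uses the (biased, divide-by-n) variance estimators of the period.
    """
    n = len(x1)  # assumes equal sample sizes, as in the bivariate model above
    m1, m2 = x1.mean(), x2.mean()
    s1, s2 = ((x1 - m1) ** 2).mean(), ((x2 - m2) ** 2).mean()
    return abs(m1 - m2) / np.sqrt((s1 + s2) / n)

rng = np.random.default_rng(0)
x1 = rng.normal(loc=0.0, scale=1.0, size=100)   # hypothetical data
x2 = rng.normal(loc=0.5, scale=1.0, size=100)

tau = edgeworth_distance(x1, x2)
print(tau, tau > 2 * np.sqrt(2))  # 'significant' if tau exceeds 2*sqrt(2)
```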
3 Karl Pearson

The Pearson approach to statistics can be summarized as follows:

Data x_0 := (x_1, \ldots, x_n) =⇒ Histogram =⇒ Fitting a frequency curve from the Pearson family f(x; \hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3, \hat{\theta}_4).

The Pearson family of frequency curves can be expressed in terms of the following differential equation:

\frac{d \ln f(x)}{dx} = \frac{x - \theta_1}{\theta_2 + \theta_3 x + \theta_4 x^2}.   (2)

Depending on the values taken by the parameters (\theta_1, \theta_2, \theta_3, \theta_4), this equation can generate numerous frequency curves, such as the Normal, the Student's t, the Beta, the Gamma, the Laplace, the Pareto, etc.
The first four raw data moments \hat{\mu}_1', \hat{\mu}_2', \hat{\mu}_3', \hat{\mu}_4' are sufficient to determine (\hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3, \hat{\theta}_4), and that enables one to select the particular f(x) from the Pearson family.

Having selected a particular frequency curve f_0(x) on the basis of f(x; \hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3, \hat{\theta}_4) = f(x; \hat{\theta}), Karl Pearson proposed to assess the appropriateness of this choice. His hypothesis of interest is of the form:

f_0(x) = f^*(x) \in Pearson(\theta_1, \theta_2, \theta_3, \theta_4), where f^*(x) denotes the true density.

Pearson proposed the standardized distance function:

\eta(X) = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{m} \frac{n(\hat{f}(x_i) - f(x_i))^2}{f(x_i)} \;\overset{n \to \infty}{\sim}\; \chi^2(m-1),   (3)

where (\hat{f}(x_i), i = 1, 2, \ldots, m) and (f(x_i), i = 1, 2, \ldots, m) denote the empirical and assumed (as specified by f_0(x)) relative frequencies, and introduced the notion of a p-value:

P(\eta(X) > \eta(x_0)) = p(x_0),   (4)

as a basis for inferring whether the choice of f_0(x) was appropriate or not. The smaller the p(x_0), the worse the fit.

¥ It is important to note that this hypothesis of interest, f_0(x) = f^*(x), is not about a particular parameter, but about the adequacy of the choice of the probability model. In this sense, this was the first Mis-Specification (M-S) test for a distributional assumption!

Example. Mendel's cross-breeding experiments (1865-6) were based on pea-plants with different shape and color, which can be framed in terms of two Bernoulli random variables:

X(round) = 0, X(wrinkled) = 1; Y(yellow) = 0, Y(green) = 1.

His theory of heredity was based on two assumptions: (i) the two random variables X and Y are independent, (ii) round (X=0) and yellow (Y=0) are dominant traits, but 'wrinkled' (X=1) and 'green' (Y=1) are recessive traits, with probabilities:

P(X=0) = P(Y=0) = .75, P(X=1) = P(Y=1) = .25.

This substantive theory gives rise to Mendel's model based on the bivariate distribution below:
Mendel's theory model f(x, y):

    f(x,y)      y=0      y=1     f_x(x)
    x=0        .5625    .1875     .750
    x=1        .1875    .0625     .250
    f_y(y)      .750     .250    1.000

Data. The 4 gene-pair experiments Mendel carried out, with n = 556, gave rise to the observed frequencies below:

    (R,Y)=(0,0): 315,  (R,G)=(0,1): 108,  (W,Y)=(1,0): 101,  (W,G)=(1,1): 32.

Observed relative frequencies:

    f̂(x,y)      y=0      y=1     f̂_x(x)
    x=0        .5666    .1942     .7608
    x=1        .1817    .0576     .2393
    f̂_y(y)     .7483    .2518    1.0001

How adequate is Mendel's theory in light of the data? One can answer that question using Pearson's goodness-of-fit chi-square test. The chi-square test statistic \eta(X) = \sum_{i=1}^{4} n(\hat{f}_i - f_i)^2 / f_i compares how close the observed relative frequencies are to those expected under the theory. Using the above data, this test statistic yields:

\eta(x_0) = 556\left[\frac{(.5666-.5625)^2}{.5625} + \frac{(.1942-.1875)^2}{.1875} + \frac{(.1817-.1875)^2}{.1875} + \frac{(.0576-.0625)^2}{.0625}\right] = .463.

In light of the fact that the tail area of \chi^2(3) yields P(\eta(X) > .463) = .927, the data suggest accordance with Mendel's theory.

Karl Pearson's primary contributions to testing were:
(a) the broadening of the scope of the hypothesis of interest, initiating misspecification testing with a goodness-of-fit test,
(b) a distance function whose distribution is known (at least asymptotically), and
(c) the use of the tail probability as a basis for deciding how good the fit with data x_0 is.
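As a check on the arithmetic in the Mendel example above, here is a short Python sketch (a hypothetical script, not part of Pearson's or Mendel's apparatus) that recomputes the goodness-of-fit statistic from the raw counts; working from the exact counts rather than the rounded relative frequencies gives \eta(x_0) \approx .47 with a tail area \approx .93, in line with the figures above.

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([315, 108, 101, 32])          # (R,Y), (R,G), (W,Y), (W,G)
theory = np.array([9, 3, 3, 1]) / 16              # Mendel's 9:3:3:1 probabilities
n = observed.sum()                                # 556 plants
expected = n * theory

eta = ((observed - expected) ** 2 / expected).sum()
p_value = chi2.sf(eta, df=len(observed) - 1)      # tail area of chi-square(3)
print(f"eta(x0) = {eta:.3f}, P(eta(X) > eta(x0)) = {p_value:.3f}")
# eta(x0) = 0.470, P = 0.925: no indication of discordance with the theory
```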
4 William Gosset (aka 'Student')

Gosset's 1908 seminal paper provided the cornerstone upon which Fisher founded modern statistical inference. At that time it was known that in the case where (X_1, X_2, \ldots, X_n) are NIID (the simple Normal model), the MLE estimator \bar{X}_n := \frac{1}{n}\sum_{k=1}^{n} X_k had the following 'sampling' distribution:

\bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right) \;\Rightarrow\; \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \sim N(0, 1).

It was also known that in the case where \sigma^2 is replaced by its estimator s^2 = \frac{1}{n-1}\sum_{k=1}^{n}(X_k - \bar{X}_n)^2, the distribution of \frac{\sqrt{n}(\bar{X}_n - \mu)}{s} is unknown, i.e.

\frac{\sqrt{n}(\bar{X}_n - \mu)}{s} \sim D(0, 1) [unknown], but asymptotically it can be approximated by \frac{\sqrt{n}(\bar{X}_n - \mu)}{s} \;\overset{n \to \infty}{\sim}\; N(0, 1).

Gosset (1908) derived the unknown D(\cdot), showing that for any n > 1:

\frac{\sqrt{n}(\bar{X}_n - \mu)}{s} \sim St(n-1),   (5)

where St(n-1) denotes a Student's t distribution with (n-1) degrees of freedom. The subtle question that needs to be answered is: what does this result mean in light of the fact that \mu is unknown?

I This was the first finite sample result [a distributional result that is valid for any n > 1], and it inspired R. A. Fisher to pioneer modern frequentist inference.
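A quick way to appreciate Gosset's result (5) is to simulate it; the following sketch (illustrative, with arbitrary choices of \mu, \sigma and n) compares the simulated tail frequencies of the studentized mean with the St(n-1) tail areas.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
mu, sigma, n, reps = 10.0, 2.0, 10, 100_000

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)                  # divide-by-(n-1) estimator
tau = np.sqrt(n) * (xbar - mu) / s

# Compare simulated tail frequencies with the St(n-1) tail areas:
for c in (1.0, 2.0, 3.0):
    print(c, (tau > c).mean(), t.sf(c, df=n - 1))
# The two columns agree closely, illustrating (5).
```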
5 R. A. Fisher's significance testing

The result (5) was formally proved and extended by Fisher (1915), and used subsequently as a basis for several tests of hypotheses associated with a number of different statistical models in a series of papers. A key element in Fisher's framework was the introduction of the notion of a statistical model, whose generic form is:

M_\theta(x) = \{f(x; \theta), \theta \in \Theta\}, \; x \in R_X^n,   (6)

where f(x; \theta), x \in R_X^n, denotes the (joint) distribution of the sample X := (X_1, \ldots, X_n) that encapsulates the prespecified probabilistic structure of the underlying stochastic process \{X_t, t \in N\}. The link to the phenomenon of interest comes from viewing data x_0 := (x_1, x_2, \ldots, x_n) as a 'truly typical' realization of the process \{X_t, t \in N\}.

Fisher used the result (5) to construct a test of significance for the:

null hypothesis: H_0: \mu = \mu_0,   (7)

in the context of the simple Normal model (table 1).

Table 1 - The simple Normal model
Statistical GM: X_t = \mu + u_t, t \in N
[1] Normal: X_t \sim N(., .)
[2] Constant mean: E(X_t) = \mu, for all t \in N
[3] Constant variance: Var(X_t) = \sigma^2, for all t \in N
[4] Independence: \{X_t, t \in N\} - independent process.

¥ Fisher must have realized that the result in (5) by Gosset could not be used directly as a basis for a test because it involves the unknown parameter \mu. One can only speculate, but it must have dawned on him that the result can be rendered operational if it is interpreted in terms of factual reasoning, in the sense that it holds under the True State of Nature (TSN) (\mu = \mu^*, the true value). Note that, in general, the expression '\mu^* denotes the true value of \mu' is a shorthand for saying that 'data x_0 constitute a realization of the sample X with distribution f(x; \mu^*)'. That is, (5) is a pivot (pivotal quantity):

\tau(X; \mu) = \frac{\sqrt{n}(\bar{X}_n - \mu^*)}{s} \;\overset{TSN}{\sim}\; St(n-1),   (8)

in the sense that it is a function of both X and \mu whose distribution under the TSN is known. Fisher (1956), p. 47, was very explicit about the nature of the reasoning underlying his significance testing: "In general, tests of significance are based on hypothetical probabilities calculated from their null hypotheses." [emphasis in the original]. His problem was to adapt the result in (8) to hypothetical reasoning with a view to constructing a test. Fisher accomplished that in three steps.
Step 1. He replaced the unknown \mu^* in (8) with the known null value \mu_0, transforming the pivot \tau(X; \mu) into a statistic \tau(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{s}. A statistic is a function of the sample X := (X_1, X_2, \ldots, X_n) only.

Step 2. He adapted the factual reasoning underlying the pivot (8) into hypothetical reasoning by evaluating the test statistic under H_0:

\tau(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{s} \;\overset{H_0}{\sim}\; St(n-1).   (9)

Step 3. He employed (9) to define the p-value:

P(\tau(X) > \tau(x_0); H_0) = p(x_0),   (10)

as an indicator of disagreement (inconsistency, contradiction) between data x_0 and H_0; the bigger the value of \tau(x_0), the smaller the p-value.

£ It is crucial to note that it is a serious mistake (committed often by misguided authors) to interpret the above probabilistic statement defining the p-value as conditional on the null hypothesis H_0. Avoid the highly misleading notation

P(\tau(X) > \tau(x_0) \,|\, H_0) = p(x_0), ×

which uses the vertical line (|) instead of a semi-colon (;). Conditioning on H_0 makes no sense in frequentist inference, since \mu is not a random variable; the evaluation in (10) is made under the scenario that H_0: \mu = \mu_0 is true.

Although it is impossible to pin down Fisher's interpretation of the p-value, because his views changed over time and he held several interpretations at any one time, there is one fixed point among his numerous articulations (Fisher, 1925, 1955): 'a small p-value can be interpreted as a simple logical disjunction: either an extremely rare event has occurred or H_0 is not true'. The focus on "a small p-value" needs to be read in conjunction with Fisher's falsificationist stance about testing, in the sense that significance tests can falsify but never verify hypotheses (Fisher, 1955): "... tests of significance, when used accurately, are capable of rejecting or invalidating hypotheses, in so far as these are contradicted by the data; but that they are never capable of establishing them as certainly true."
His logical disjunction combined with the falsificationist stance is not helpful in shedding light on the key issue of learning from data: what do data x_0 convey about the truth or falsity of H_0? Occasionally, Fisher would go further and express what sounds more like an aspiration for an evidential interpretation: "The actual value of p ... indicates the strength of evidence against the hypothesis" (see Fisher, 1925, p. 80), but he never articulated a coherent evidential interpretation based on the p-value. Indeed, as late as 1956 he offered a 'degree of rational disbelief' interpretation (Fisher, 1956, pp. 46-47).

¥ A more pertinent interpretation of the p-value (section 8) is: 'the p-value is the probability, evaluated under the scenario that H_0 is true, of all possible outcomes x \in R_X^n that accord less well with H_0 than x_0 does.'

In light of the fact that data x_0 were generated by the 'true' (\theta = \theta^*) statistical Data Generating Mechanism (DGM):

M^*(x) = \{f(x; \theta^*)\}, \; x \in R_X^n,

and the test statistic \tau(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{s} measures the standardized difference between \mu^* and \mu_0, the p-value P(\tau(X) > \tau(x_0); H_0) = p(x_0) seeks to measure the discordance between H_0 and M^*(x). This precludes several erroneous interpretations of the p-value:

£ The p-value is the probability that H_0 is true.
£ 1 - p(x_0) is the probability that H_1 is true.
£ The p-value is the probability that the particular result is due to 'chance error'.
£ The p-value is an indicator of the size or the substantive importance of the observed effect.
£ The p-value is the conditional probability of obtaining a test statistic \tau(x) more extreme than \tau(x_0) given H_0; there is no conditioning involved.

Example 1. Consider testing the null hypothesis H_0: \mu = 70, in the context of the simple Normal model (see table 1), using the exam score data in table 1.6 (see chapter 1), where \bar{x}_n = 71.686, s^2 = 13.606 and n = 70. The test statistic (9) yields:

\tau(x_0) = \frac{\sqrt{70}(71.686 - 70)}{\sqrt{13.606}} = 3.824, \quad P(\tau(X) > 3.824; \mu_0 = 70) = .0001,
where p(x_0) = .0001 is found from the St(69) tables. The tiny p-value suggests that x_0 indicates strong discordance with H_0.

Table 2 - The simple Bernoulli model
Statistical GM: X_t = \theta + u_t, t \in N
[1] Bernoulli: X_t \sim Ber(\theta, .)
[2] Constant mean: E(X_t) = \theta, for all t \in N
[3] Constant variance: Var(X_t) = \theta(1-\theta), for all t \in N
[4] Independence: \{X_t, t \in N\} - independent process.

Example 2. Arbuthnot's 1710 conjecture: the ratio of males to females in newborns might not be 'fair'. This can be tested using the statistical null hypothesis:

H_0: \theta = \theta_0, where \theta_0 = .5 denotes 'fair',

in the context of the simple Bernoulli model (table 2), based on the random variable X defined by: {male} = {X = 1}, {female} = {X = 0}. Using the MLE of \theta, \hat{\theta}_n := \bar{X}_n = \frac{1}{n}\sum_{k=1}^{n} X_k, whose sampling distribution is \bar{X}_n \sim Bin\left(\theta, \frac{\theta(1-\theta)}{n}; n\right), one can derive the test statistic:

\tau(X) = \frac{\sqrt{n}(\bar{X}_n - \theta_0)}{\sqrt{\theta_0(1-\theta_0)}} \;\overset{H_0}{\sim}\; Bin(0, 1; n).   (11)

Data: n = 30762 newborns during the period 1993-5 in Cyprus, out of which 16029 were boys and 14833 girls. The test statistic (11), with \hat{\theta}_n = \frac{16029}{30762} = .521, yields:

\tau(x_0) = \frac{\sqrt{30762}(.521 - .5)}{\sqrt{.5(.5)}} = 7.366, \quad P(\tau(X) > 7.366; \theta = .5) \approx .0000000000001.

The tiny p-value indicates strong discordance with H_0. In general, one needs to specify a certain threshold for deciding when the p-value is 'small enough' to falsify H_0. Fisher suggested several such thresholds, .01, .025, .05, .1, but insisted that the choice has to be made on a case by case basis.
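The two p-value computations above are easy to reproduce; here is a minimal Python sketch (the Normal approximation in Example 2 stands in for the exact standardized Binomial tail, which is indistinguishable at this sample size).

```python
import numpy as np
from scipy.stats import t, norm

# Example 1: exam scores, H0: mu = 70
n, xbar, s2, mu0 = 70, 71.686, 13.606, 70.0
tau1 = np.sqrt(n) * (xbar - mu0) / np.sqrt(s2)
print(f"tau = {tau1:.3f}, p = {t.sf(tau1, df=n - 1):.4f}")   # tau = 3.824, p = 0.0001

# Example 2: Arbuthnot's conjecture, H0: theta = .5
n, boys, theta0 = 30762, 16029, 0.5
theta_hat = boys / n
tau2 = np.sqrt(n) * (theta_hat - theta0) / np.sqrt(theta0 * (1 - theta0))
print(f"tau = {tau2:.3f}, p = {norm.sf(tau2):.2e}")          # tau ~ 7.4, p ~ 1e-13
```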
The main elements of a Fisher significance test \{\tau(X), p(x_0)\}:

Components of a Fisher significance test
(i) a prespecified statistical model: M_\theta(x),
(ii) a null hypothesis: H_0: \theta = \theta_0,
(iii) a test statistic (distance function): \tau(X),
(iv) the distribution of \tau(X) under H_0,
(v) the p-value: P(\tau(X) > \tau(x_0); H_0) = p(x_0),
(vi) a threshold value p_0 [e.g. .01, .025, .05], such that: p(x_0) < p_0 ⇒ x_0 falsifies (rejects) H_0.

Example. Consider a Fisher test based on the simple (one parameter) Normal model (table 3).

Table 3 - The simple (one parameter) Normal model
Statistical GM: X_t = \theta + u_t, t \in N
[1] Normal: X_t \sim N(., .)
[2] Constant mean: E(X_t) = \theta, for all t \in N
[3] Constant variance: Var(X_t) = \sigma^2 (known), for all t \in N
[4] Independence: \{X_t, t \in N\} - independent process.

(i) M_\theta(x): X_t \sim N(\theta, \sigma^2) [\sigma^2 is known], t = 1, 2, \ldots, n,
(ii) null hypothesis: H_0: \theta = \theta_0 (e.g. \theta_0 = 0),
(iii) a test statistic: \tau(X) = \frac{\sqrt{n}(\bar{X}_n - \theta_0)}{\sigma},
(iv) the distribution of \tau(X) under H_0: \tau(X) \sim N(0, 1),
(v) the p-value: P(\tau(X) > \tau(x_0); H_0) = p(x_0),
(vi) a threshold value for discordance, say p_0 = .05.

[Two standard Normal density plots illustrating the tail areas P(Z < -1.64) = .05 and P(Z > 1.96) = .025.]
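The components (i)-(vi) translate almost line by line into code. Below is a minimal sketch of the known-\sigma Normal case (the function name, defaults and simulated data are illustrative assumptions).

```python
import numpy as np
from scipy.stats import norm

def fisher_significance_test(x, theta0, sigma, p0=0.05):
    """Fisher significance test of H0: theta = theta0 in the simple
    (one parameter) Normal model with known sigma (components (i)-(vi))."""
    n = len(x)
    tau = np.sqrt(n) * (x.mean() - theta0) / sigma    # (iii) distance function
    p = norm.sf(tau)                                  # (v) tail area under H0
    return tau, p, p < p0                             # (vi) falsified if p < p0

rng = np.random.default_rng(2)
x = rng.normal(0.3, 1.0, size=50)                     # hypothetical data
print(fisher_significance_test(x, theta0=0.0, sigma=1.0))
```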
6 The Neyman-Pearson (N-P) framework

Neyman and Pearson (1933) motivated their own approach to testing as an attempt to improve upon Fisher's testing by addressing what they considered to be ad hoc features of that approach:

[a] Fisher's ad hoc choice of a test statistic \tau(X), pertaining to the optimality of tests,
[b] Fisher's use of a post-data [p(x_0) is known] threshold for the p-value to indicate falsification (rejection) of H_0, and
[c] Fisher's denial that p(x_0) can indicate confirmation (accordance) of H_0.

Neyman-Pearson (N-P) proposed solutions for [a]-[c] by:

¥ [i] Introducing the notion of an alternative hypothesis H_1 as the complement to the null with respect to \Theta, within the same statistical model:

M_\theta(x) = \{f(x; \theta), \theta \in \Theta\}, \; x \in R_X^n.   (12)

[ii] Replacing the post-data p-value and its threshold with a pre-data [before \tau(x_0) is available] significance level \alpha determining the decision rules: (i) Reject H_0, (ii) Accept H_0, which are calibrated using the error probabilities:

type I: P(Reject H_0, when H_0 is true) = \alpha,
type II: P(Accept H_0, when H_0 is false) = \beta.

These error probabilities enabled Neyman and Pearson to introduce the key notion of an optimal N-P test: one that minimizes \beta subject to a fixed \alpha, i.e.

[iii] Selecting the particular \tau(X), in conjunction with a rejection region, for a given \alpha, that has the highest pre-data capacity (1 - \beta) to detect discrepancies in the direction of H_1.

In this sense, the N-P proposed solutions to the perceived weaknesses of Fisher's significance testing changed the original Fisher framework in four important respects:

¥ (a) N-P testing takes place within a prespecified M_\theta(x) by partitioning it into the null and alternative hypotheses,
(b) Fisher's post-data p-value has been replaced with a pre-data \alpha and \beta,
(c) the combination of (a)-(b) renders ascertainable the pre-data capacity (power) of the test to detect discrepancies in the direction of H_1,
(d) Fisher's evidential aspirations for the p-value, based on discordance, were replaced by the more behavioristic accept/reject decision rules that aim to provide input for alternative actions: when you accept H_0 take action A; when you reject H_0 take action B. As claimed by Neyman and Pearson (1928): "the tests themselves give no final verdict, but as tools help the worker who is using them to form his final decision."

6.1 The archetypal N-P hypotheses specification

In N-P testing there has been a long debate concerning the proper way to specify the null and alternative hypotheses. It is often thought to be rather arbitrary and subject to abuse. This view stems primarily from an inadequate understanding of the role of statistical vs. substantive information.

¥ It is argued that there is nothing arbitrary about specifying the null and alternative hypotheses in N-P testing. The default alternative is always the complement to the null relative to the parameter space of M_\theta(x). Consider a generic statistical model:

M_\theta(x) = \{f(x; \theta), \theta \in \Theta\}, \; x \in X := R_X^n.

The archetypal way to specify the null and alternative hypotheses for N-P testing is:

H_0: \theta \in \Theta_0 \;\text{vs.}\; H_1: \theta \in \Theta_1,   (13)

where \Theta_0 and \Theta_1 constitute a partition of the parameter space \Theta:

\Theta_0 \cap \Theta_1 = \emptyset, \quad \Theta_0 \cup \Theta_1 = \Theta.

In the case where the set \Theta_0 or \Theta_1 contains a single point, say \Theta_0 = \{\theta_0\}, which determines f(x; \theta_0), the hypothesis is said to be simple; otherwise it is composite. Example. For the simple Bernoulli model, H_0: \theta = \theta_0 is simple, but H_1: \theta \neq \theta_0 is composite.
¥ An equivalent formulation, one that brings out most clearly the fact that N-P testing poses questions pertaining to the 'true' statistical Data Generating Mechanism (DGM), M^*(x) = \{f^*(x)\}, x \in X, is to re-write (13) in the form:

H_0: f^*(x) \in M_0(x) \;\text{vs.}\; H_1: f^*(x) \in M_1(x),

where f^*(x) denotes the 'true' distribution of the sample, and:

M_0(x) = \{f(x; \theta), \theta \in \Theta_0\} \;\text{and}\; M_1(x) = \{f(x; \theta), \theta \in \Theta_1\}, \; x \in X := R_X^n,

constitute a partition of M_\theta(x). P(x) denotes the set of all possible statistical models that could have given rise to data x_0 := (x_1, x_2, \ldots, x_n).

[Figure: the set P(x), with M_\theta(x) partitioned into H_0 and H_1; N-P testing takes place within M_\theta(x).]

¥ When properly understood in the context of a statistical model M_\theta(x), N-P testing leaves very little leeway in the specification of the H_0 and H_1 hypotheses. This is because the whole of \Theta [all possible values of \theta] is relevant on statistical grounds, despite the fact that only a small subset is often deemed relevant on substantive grounds. The reasoning underlying this argument is that N-P tests invariably pose hypothetical questions, framed in terms of \theta \in \Theta, pertaining to the true statistical DGM M^*(x) = \{f^*(x)\}, x \in X. Partitioning M_\theta(x) provides the most effective way to learn from data by zeroing in on the true f^*(x), in the same way one can zero in on an unknown prespecified city on a map using sequential partitioning.

¥ This foregrounds the importance of securing the statistical adequacy of M_\theta(x) [validating its probabilistic assumptions] before it provides the basis of any inference. Hence, whenever the union of \Theta_0 and \Theta_1, the values of \theta postulated by H_0 and H_1 respectively, does not
exhaust \Theta, there is always the possibility that the true \theta lies in the complement: \Theta - (\Theta_0 \cup \Theta_1).

What constitutes an N-P test? A misleading idea that needs to be done away with immediately is the notion that an N-P test is a formula with some associated statistical tables! It is a combination of a test statistic, whose sampling distribution under the null and alternative hypotheses is tractable, in conjunction with a rejection region.

Test statistic. An N-P test statistic is a mapping from the sample space (X := R_X^n) to the real line:

\tau(\cdot): X \to R,

that partitions the sample space into an acceptance region C_0 and a rejection region C_1:

C_0 \cap C_1 = \emptyset, \quad C_0 \cup C_1 = X,

in a way that corresponds to \Theta_0 and \Theta_1, respectively:

X = C_0 \cup C_1, \;\text{with}\; C_0 \leftrightarrow \Theta_0 \;\text{and}\; C_1 \leftrightarrow \Theta_1, \quad \Theta = \Theta_0 \cup \Theta_1.

The N-P decision rules, based on a test statistic \tau(X), take the form:

[i] if x_0 \in C_0, accept H_0; [ii] if x_0 \in C_1, reject H_0,   (14)

and the associated error probabilities are:

type I: P(x_0 \in C_1; H_0(\theta) \text{ true}) = \alpha(\theta), for \theta \in \Theta_0,
type II: P(x_0 \in C_0; H_1(\theta_1) \text{ true}) = \beta(\theta_1), for \theta_1 \in \Theta_1.

    True State of Nature    Accept H_0        Reject H_0
    H_0 true                correct           Type I error
    H_0 false               Type II error     correct

£ Some misinformed statistics books go to great lengths to explain that the above error probabilities should be viewed as conditional on H_0 or H_1, and they use the misleading notation (|):

P(x_0 \in C_1 \,|\, H_0(\theta) \text{ true}) = \alpha(\theta), for \theta \in \Theta_0.
Such conditioning makes no sense in a frequentist framework because \theta is treated as an unknown constant. Instead, these should be interpreted as evaluations of the sampling distribution of the test statistic under different scenarios pertaining to different hypothesized values of \theta.

¥ Some statistics books make a big deal of the distinction 'accept H_0' vs. 'fail to reject H_0', in their commendable attempt to bring out the problem of misinterpreting 'accept H_0' as tantamount to 'there is evidence for H_0'; this is known as the fallacy of acceptance. By the same token, there is an analogous distinction between 'reject H_0' vs. 'accept H_1' that highlights the problem of misinterpreting 'reject H_0' as tantamount to 'there is evidence for H_1'; this is known as the fallacy of rejection. What is objectionable about this practice is that the verbal distinctions by themselves do nothing but perpetuate these fallacies; we need to address them, not play clever with words!

Table 3 - The simple (one parameter) Normal model
Statistical GM: X_t = \mu + u_t, t \in N
[1] Normal: X_t \sim N(., .)
[2] Constant mean: E(X_t) = \mu, for all t \in N
[3] Constant variance: Var(X_t) = \sigma^2 (known), for all t \in N
[4] Independence: \{X_t, t \in N\} - independent process.

Example 1. Consider the following hypotheses in the context of the simple (one parameter) Normal model:

H_0: \mu \leq \mu_0 \;\text{vs.}\; H_1: \mu > \mu_0.   (15)

Let us choose a test T_\alpha := \{d(X), C_1(\alpha)\} of the form:

test statistic: d(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma}, where \bar{X}_n = \frac{1}{n}\sum_{k=1}^{n} X_k,
rejection region: C_1(\alpha) = \{x : d(x) > c_\alpha\}.   (16)

To evaluate the two types of error probabilities, the distribution of d(X) under both the null and the alternatives is needed:

[I] d(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} \;\overset{\mu = \mu_0}{\sim}\; N(0, 1),
[II] d(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} \;\overset{\mu = \mu_1}{\sim}\; N(\delta_1, 1), \;\text{for all}\; \mu_1 > \mu_0,   (17)
where the non-zero mean \delta_1 takes the form:

\delta_1 = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma} > 0, \;\text{for all}\; \mu_1 > \mu_0.

The evaluation of the type I error probability is based on [I]:

\alpha = \max_{\mu \leq \mu_0} P(d(X) > c_\alpha; H_0(\mu)) = P(d(X) > c_\alpha; \mu = \mu_0).

Notice that in cases where the null is of the form H_0: \mu \leq \mu_0, the type I error probability is defined as the maximum over all values of \mu in the interval (-\infty, \mu_0], which turns out to be the one evaluated at the end point of the interval, \mu = \mu_0. Hence, the same test will also be optimal in the case where the null is of the form H_0: \mu = \mu_0.

The evaluation of the type II error probabilities is based on [II]:

\beta(\mu_1) = P(d(X) \leq c_\alpha; H_1(\mu_1)), \;\text{for all}\; \mu_1 > \mu_0.

The power [the probability of rejecting the null when false] is equal to 1 - \beta(\mu_1), i.e.

\pi(\mu_1) = P(d(X) > c_\alpha; H_1(\mu_1)), \;\text{for all}\; \mu_1 > \mu_0.

Why do we care about power? The power \pi(\mu_1) measures the pre-data (generic) capacity of the test T_\alpha := \{d(X), C_1(\alpha)\} to detect a discrepancy, say \gamma = \mu_1 - \mu_0 = .1, when present. Hence, when \pi(\mu_1) = .35 this test has very low capacity to detect such a discrepancy. How is this information helpful in practice? If \gamma = .1 is the discrepancy of substantive interest, this test is practically useless for that purpose, because we know beforehand that it does not have enough capacity to detect \gamma even if present! A Monte Carlo check of both error probabilities is sketched below.
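The two error probabilities are straightforward to approximate by simulation; the following sketch (illustrative parameter choices) estimates the type I error at \mu = \mu_0 and the power at a chosen discrepancy for the test T_\alpha in (16).

```python
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha, reps = 0.0, 1.0, 100, 0.05, 200_000
c_alpha = norm.ppf(1 - alpha)                      # 1.645 for alpha = .05
rng = np.random.default_rng(3)

def rejection_rate(mu):
    """Frequency with which d(X) > c_alpha when the true mean is mu."""
    # simulate the sample mean directly: Xbar ~ N(mu, sigma^2/n)
    xbar = rng.normal(mu, sigma / np.sqrt(n), size=reps)
    d = np.sqrt(n) * (xbar - mu0) / sigma
    return (d > c_alpha).mean()

print(rejection_rate(mu0))        # ~ .05: the type I error probability
print(rejection_rate(mu0 + 0.1))  # ~ .26: the power at discrepancy gamma = .1
```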
What can one do in such a case? The power of the above test is monotonically increasing with \delta_1 = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}, and thus increasing the sample size n increases the power.

Hypothesis testing reasoning. ¥ As with Fisher's significance testing, the reasoning underlying N-P testing is hypothetical, in the sense that both error probabilities are based on evaluating the relevant test statistic under several hypothetical scenarios, e.g. under H_0 (\mu = \mu_0) or under H_1 (\mu = \mu_1), for all \mu_1 > \mu_0. These hypothetical sampling distributions are then used to compare H_0 or H_1, via d(x_0), to the True State of Nature (TSN) represented by data x_0. This is in contrast to frequentist estimation, which is factual in nature, i.e. the sampling distributions of estimators are evaluated under the TSN, \mu = \mu^*.

Significance level vs. p-value. It is important to note that there is a mathematical relationship between the type I error probability (significance level) and the p-value. Placing them side by side:

P(type I error): P(d(X) > c_\alpha; \mu = \mu_0) = \alpha,
p-value: P(d(X) > d(x_0); \mu = \mu_0) = p(x_0),   (18)

it becomes obvious that:
(a) they are both specified in terms of the distribution of the same test statistic d(X) under H_0,
(b) they both evaluate the probability of tail events of the form \{x: d(x) > c\}, x \in X, but
(c) they differ in terms of the threshold c: c_\alpha vs. d(x_0); in that sense \alpha is a pre-data and p(x_0) a post-data error probability.

In light of (a)-(c), three comments are called for. First, the p-value can be viewed as the smallest significance level \alpha at which H_0 would have been rejected with data x_0. For that reason the p-value is often referred to as the observed significance level. Hence, it should come as no surprise to learn that the above N-P decision rules (14) can be recast in terms of the p-value:

[i]* if p(x_0) > \alpha, accept H_0; [ii]* if p(x_0) \leq \alpha, reject H_0.
Indeed, practitioners often prefer to use the rules [i]*-[ii]*, because the p-value conveys additional information when it is not close to the predesignated threshold \alpha. For instance, rejecting H_0 with p(x_0) = .0001 seems more informative than just reporting that H_0 was rejected at \alpha = .05.

¥ Second, the p-value seems to have an implicit alternative built into the choice of the tail area. For instance, a 2-sided alternative H_1: \mu \neq \mu_0 would require one to evaluate P(|d(X)| > |d(x_0)|; \mu = \mu_0). How did Fisher get away with such an obvious breach of his own preaching? Speculating on that, his answer might have been: a 2-sided evaluation of the p-value makes no sense because, post-data, the sign of the test statistic d(x_0) determines the direction of departure!

¥ Third, neither the significance level \alpha nor the p-value can be interpreted as probabilities attached to particular values of \mu associated with H_0 or H_1, since the probabilities in (18) are firmly attached to the sample realizations x \in X. Indeed, attaching probabilities to the unknown constant \mu makes no sense in frequentist statistics. As argued by Fisher (1921), p. 25: "We may discuss the probability of occurrence of quantities which can be observed or deduced from observations, in relation to any hypotheses which may be suggested to explain these observations. We can know nothing of the probability of hypotheses..." This scotches another two erroneous interpretations of the p-value:

£ p(x_0) = .05 does not mean that there is a 5% chance of a Type I error (i.e. a false positive).
£ p(x_0) = .05 does not mean that there is a 95% chance that the results would replicate if the study were repeated.

Example 1 (continued). Consider applying the test T_\alpha := \{d(X), C_1(\alpha)\} with:

\mu_0 = 10, \sigma = 1, n = 100, \bar{x}_n = 10.175, \alpha = .05 ⇒ c_\alpha = 1.645,

d(x_0) = \frac{\sqrt{100}(10.175 - 10)}{1} = 1.75 > 1.645, \;\text{so}\; H_0 \;\text{is rejected}.

The p-value is: P(d(X) > 1.75; \mu = \mu_0) = .04.

Standard Normal tables
One-sided values:                Two-sided values:
\alpha = .100: c_\alpha = 1.28      \alpha = .100: c_\alpha = 1.645
\alpha = .050: c_\alpha = 1.645     \alpha = .050: c_\alpha = 1.96
\alpha = .025: c_\alpha = 1.96      \alpha = .025: c_\alpha = 2.24
\alpha = .010: c_\alpha = 2.33      \alpha = .010: c_\alpha = 2.58
The trade-off between type I and II error probabilities. I How does the introduction of the type I and II error probabilities address the arbitrariness associated with Fisher's choice of a test statistic? The ideal test is one whose error probabilities are both zero, but no such test exists for a given n; an ideal test, like the ideal estimator, exists only as n → ∞! Worse, there is a trade-off between the above error probabilities: as one increases, the other decreases, and vice versa.

Example. In the case of the test T_\alpha := \{d(X), C_1(\alpha)\} in (16), decreasing the probability of a type I error from \alpha = .05 to \alpha = .01 increases the threshold from c_\alpha = 1.645 to c_\alpha = 2.33, which makes it easier to accept H_0, and this increases the probability of a type II error.

To address this trade-off, Neyman and Pearson (1933) proposed:
(i) to specify H_0 and H_1 in such a way as to render the type I error the more serious one, and then
(ii) to choose an \alpha-significance level test \{d(X), C_1(\alpha)\} so as to maximize its power for all \theta \in \Theta_1.

N-P rationale. Consider the analogy with a criminal offense trial, where the members of the jury are instructed by the judge to find the defendant 'not guilty' unless they have been convinced 'beyond any reasonable doubt' by the evidence:

H_0: not guilty, vs. H_1: guilty.

The clause 'beyond any reasonable doubt' amounts to fixing the type I error to a very small value, to reduce the risk of sending innocent people to prison, or even to death. At the same time, one would want the system to minimize the risk of letting guilty people get off scot-free.

More formally, the choice of an optimal N-P test is based on fixing an upper bound \alpha for the type I error probability:

P(x \in C_1; H_0(\theta) \text{ true}) \leq \alpha, \;\text{for all}\; \theta \in \Theta_0,

and then selecting a test \{d(X), C_1(\alpha)\} that minimizes the type II error probability, or equivalently, maximizes the power:

\pi(\theta) = P(x_0 \in C_1; H_1(\theta) \text{ true}) = 1 - \beta(\theta), \;\text{for all}\; \theta \in \Theta_1.

I The general rule is that in selecting an optimal N-P test the whole of the parameter space \Theta is relevant. This is why partitioning both \Theta and the sample space X := R_X^n using a test statistic d(X) provides the key to N-P testing.
Optimal properties of tests. The property that sets the gold standard for an optimal test is:

[1] Uniformly Most Powerful (UMP): A test T_\alpha := \{d(X), C_1(\alpha)\} is said to be UMP if it has higher power than any other \alpha-level test \tilde{T}_\alpha, for all values \theta \in \Theta_1, i.e.

\pi(\theta; T_\alpha) \geq \pi(\theta; \tilde{T}_\alpha), \;\text{for all}\; \theta \in \Theta_1.

Example. The test T_\alpha := \{d(X), C_1(\alpha)\} in (16) is UMP; see Lehmann (1986).

Additional properties of N-P tests

[2] Unbiasedness: A test T_\alpha := \{d(X), C_1(\alpha)\} is said to be unbiased if the probability of rejecting H_0 when false is always greater than that of rejecting H_0 when true, i.e.

\max_{\theta \in \Theta_0} P(x_0 \in C_1; H_0(\theta)) \leq \alpha \leq P(x_0 \in C_1; H_1(\theta)), \;\text{for all}\; \theta \in \Theta_1.

[3] Consistency: A test T_\alpha := \{d(X), C_1(\alpha)\} is said to be consistent if its power goes to one for all \theta \in \Theta_1 as n → ∞, i.e.

\lim_{n \to \infty} \pi_n(\theta) = \lim_{n \to \infty} P(x_0 \in C_1; H_1(\theta) \text{ true}) = 1, \;\text{for all}\; \theta \in \Theta_1.

As in estimation, this is a minimal property for tests.

Evaluating power. How does one evaluate the power of the test T_\alpha := \{d(X), C_1(\alpha)\} for different discrepancies \gamma = \mu_1 - \mu_0?

Step 1. In light of the fact that the relevant sampling distribution for the evaluation of \pi(\mu_1) is

[II] d(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} \;\overset{\mu = \mu_1}{\sim}\; N(\delta_1, 1), \;\text{for all}\; \mu_1 > \mu_0,

the use of the N(0,1) tables requires one to split the test statistic into:

\frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} = \frac{\sqrt{n}(\bar{X}_n - \mu_1)}{\sigma} + \delta_1, \quad \delta_1 = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma},

since under H_1(\mu_1) the distribution of the first component is:

\frac{\sqrt{n}(\bar{X}_n - \mu_1)}{\sigma} = \left(\frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} - \delta_1\right) \;\overset{H_1(\mu_1)}{\sim}\; N(0, 1), \;\text{for}\; \mu_1 > \mu_0.

Step 2. Evaluating the power of the test T_\alpha := \{d(X), C_1(\alpha)\} for different discrepancies, using \mu_0 = 10, \sigma = 1, n = 100, \alpha = .05 (so that c_\alpha = 1.645), and writing Z for a generic standard Normal r.v., yields \pi(\mu_1) = P(Z > -\delta_1 + c_\alpha; H_1(\mu_1)):

\gamma = .1 (\delta_1 = 1): \pi(10.1) = P(Z > -1 + 1.645) = .259,
\gamma = .2 (\delta_1 = 2): \pi(10.2) = P(Z > -2 + 1.645) = .639,
\gamma = .3 (\delta_1 = 3): \pi(10.3) = P(Z > -3 + 1.645) = .913,
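The power column above follows directly from \pi(\mu_1) = P(Z > c_\alpha - \delta_1); a minimal sketch reproducing it:

```python
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 10.0, 1.0, 100, 0.05
c_alpha = norm.ppf(1 - alpha)                      # 1.645

for gamma in (0.1, 0.2, 0.3):                      # discrepancies of interest
    delta1 = np.sqrt(n) * gamma / sigma            # non-centrality under H1
    power = norm.sf(c_alpha - delta1)              # P(Z > c_alpha - delta1)
    print(f"mu1 = {mu0 + gamma}: power = {power:.2f}")
# power ~ .26, .64, .91 for gamma = .1, .2, .3
```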
where Z \sim N(0, 1). The power of this test is typical of an optimal test, since \pi(\mu_1) increases with the non-zero mean \delta_1 = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}, and thus:

(a) the power increases with the sample size n,
(b) the power increases with the discrepancy \gamma = (\mu_1 - \mu_0), and
(c) the power decreases with \sigma.

It is very important to emphasize three features of an optimal test. First, a test is not just a formula associated with particular statistical tables. It is a combination of a test statistic and a rejection region; hence the notation T_\alpha := \{d(X), C_1(\alpha)\}.

Second, the optimality of an N-P test is inextricably bound up with the optimality of the estimator the test statistic is based on. Hence, it is no accident that most optimal N-P tests are based on consistent, fully efficient and sufficient estimators.

Example. In the case of the simple (one parameter) Normal model, consider replacing \bar{X}_n with the unbiased estimator \hat{\mu}_n = \frac{1}{2}(X_1 + X_n) \sim N\left(\mu, \frac{\sigma^2}{2}\right). The resulting test \check{T}_\alpha := \{\check{d}(X), C_1(\alpha)\}, where

\check{d}(X) = \frac{\sqrt{2}(\hat{\mu}_n - \mu_0)}{\sigma} \;\text{and}\; C_1(\alpha) = \{x: \check{d}(x) > c_\alpha\},

will not be optimal, because its power is much lower than that of T_\alpha, and \check{T}_\alpha is inconsistent! This is because the non-zero mean of its sampling distribution under \mu = \mu_1 will be \delta = \frac{\sqrt{2}(\mu_1 - \mu_0)}{\sigma}, which does not change as n → ∞.

Third, it is important to note that by changing the rejection region one can render an optimal N-P test useless! For instance, replacing the
rejection region of T_\alpha := \{d(X), C_1(\alpha)\} with:

\bar{C}_1(\alpha) = \{x: d(x) < c_\alpha\},

the resulting test \bar{T}_\alpha := \{d(X), \bar{C}_1(\alpha)\} is practically useless, because it is biased and its power decreases as the discrepancy \gamma increases.

Example 2 - the Student's t test. Consider the 1-sided hypotheses:

(1-s): H_0: \mu = \mu_0 \;\text{vs.}\; H_1: \mu > \mu_0,   (19)
or
(1-s*): H_0: \mu \leq \mu_0 \;\text{vs.}\; H_1: \mu > \mu_0,

in the context of the simple Normal model (table 1). In this case a UMP (consistent) test T_\alpha^* := \{\tau(X), C_1(\alpha)\} exists. It is the well-known Student's t test, and takes the form:

test statistic: \tau(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{s}, \quad s^2 = \frac{1}{n-1}\sum_{t=1}^{n}(X_t - \bar{X}_n)^2,   (20)
rejection region: C_1(\alpha) = \{x: \tau(x) > c_\alpha\},

where c_\alpha can be evaluated using the Student's t tables. To evaluate the two types of error probabilities, the distribution of \tau(X) under both the null and the alternatives is needed:

[I]* \tau(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{s} \;\overset{\mu = \mu_0}{\sim}\; St(n-1),   (21)
[II]* \tau(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{s} \;\overset{\mu = \mu_1}{\sim}\; St(\delta_1; n-1), \;\text{for all}\; \mu_1 > \mu_0,

where \delta_1 = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma} is the non-centrality parameter; the power can be computed from the non-central Student's t distribution, as sketched below.

Student's t tables (60 degrees of freedom)
One-sided values:                Two-sided values:
\alpha = .100: c_\alpha = 1.296     \alpha = .100: c_\alpha = 1.671
\alpha = .050: c_\alpha = 1.671     \alpha = .050: c_\alpha = 2.000
\alpha = .010: c_\alpha = 2.390     \alpha = .010: c_\alpha = 2.660
\alpha = .001: c_\alpha = 3.232     \alpha = .001: c_\alpha = 3.460
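Under [II]*, power calculations for the t test reduce to tail areas of the non-central Student's t; a minimal sketch (illustrative parameter values) using scipy's nct distribution:

```python
import numpy as np
from scipy.stats import t, nct

mu0, sigma, n, alpha = 0.0, 1.0, 61, 0.05          # 60 degrees of freedom
df = n - 1
c_alpha = t.ppf(1 - alpha, df)                     # 1.671, as in the table above

for gamma in (0.1, 0.2, 0.3):                      # discrepancies mu1 - mu0
    delta1 = np.sqrt(n) * gamma / sigma            # non-centrality parameter
    power = nct.sf(c_alpha, df, delta1)            # P(tau(X) > c_alpha; mu = mu1)
    print(f"gamma = {gamma}: power = {power:.3f}")
```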
In summary, the main components of a Neyman-Pearson (N-P) test T_\alpha := \{\tau(X), C_1(\alpha)\} are given below.

Components of a Neyman-Pearson (N-P) test
(i) a prespecified statistical model: M_\theta(x),
(ii) a null (H_0) and an alternative (H_1) hypothesis within M_\theta(x),
(iii) a test statistic (distance function): \tau(X),
(iv) the distribution of \tau(X) under H_0,
(v) a prespecified significance level \alpha [e.g. .01, .025, .05],
(vi) the rejection region C_1(\alpha),
(vii) the distribution of \tau(X) under H_1.

6.2 The N-P Lemma and its extensions

The cornerstone of the Neyman-Pearson (N-P) approach is the Neyman-Pearson lemma. Contemplate the simple generic statistical model:

M_\theta(x) = \{f(x; \theta), \theta \in \Theta := \{\theta_0, \theta_1\}\}, \; x \in R_X^n,   (22)

and consider the problem of testing the simple hypotheses:

H_0: \theta = \theta_0 \;\text{vs.}\; H_1: \theta = \theta_1.   (23)

¥ The fact that the assumed parameter space is \Theta := \{\theta_0, \theta_1\}, so that (23) constitutes a partition, is often left out of most statistics textbook discussions of this famous lemma!

Existence. There exists an \alpha-significance level Uniformly Most Powerful (UMP) [\alpha-UMP] test, whose generic form is:

d(X) = h\left(\frac{f(x; \theta_1)}{f(x; \theta_0)}\right), \quad C_1(\alpha) = \{x: d(x) > c_\alpha\},   (24)

where h(\cdot) is a monotone function.

Sufficiency. If an \alpha-level test of the form (24) exists, then it is UMP for testing (23).

Necessity. If \{d(X), C_1(\alpha)\} is an \alpha-UMP test, then it will be given by (24).

At first sight the N-P lemma seems rather contrived, because it is an existence result for a simple statistical model M_\theta(x) whose parameter space \Theta := \{\theta_0, \theta_1\} is artificial, but one that fits perfectly into the archetypal formulation. In addition, to implement this result one would need to find an h(\cdot) yielding a test statistic d(X) whose distribution is known under
both H_0 and H_1. Worse, this lemma is often misconstrued as suggesting that for an \alpha-UMP test to exist one needs to confine testing to simple-vs-simple cases, even when \Theta is uncountable!

¥ The truth of the matter is that the construction of an \alpha-UMP test in more realistic cases has nothing to do with simple-vs-simple hypotheses; instead, it is invariably based on the archetypal N-P testing formulation in (13), and relies primarily on monotone likelihood ratios and other features of the prespecified statistical model M_\theta(x).

Example. To illustrate these comments, consider the simple-vs-simple hypotheses:

(i) (1-1): H_0: \mu = \mu_0 \;\text{vs.}\; H_1: \mu = \mu_1,   (25)

in the context of the simple Normal (one parameter) model (table 3). Applying the N-P lemma requires setting up the ratio:

\frac{f(x; \mu_1)}{f(x; \mu_0)} = \exp\left\{\frac{n}{\sigma^2}(\mu_1 - \mu_0)\bar{X}_n - \frac{n}{2\sigma^2}(\mu_1^2 - \mu_0^2)\right\},   (26)

which is clearly not a test statistic as it stands. However, there exists a monotone function h(\cdot) which transforms (26) into a familiar test statistic (Spanos, 1999, pp. 708-9):

d(X) = h\left(\frac{f(x; \mu_1)}{f(x; \mu_0)}\right) = \frac{1}{\delta_1}\ln\left(\frac{f(x; \mu_1)}{f(x; \mu_0)}\right) + \frac{\delta_1}{2} = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma},

with ascertainable type I and II error probabilities based on:

d(X) \;\overset{\mu = \mu_0}{\sim}\; N(0, 1), \quad d(X) \;\overset{\mu = \mu_1}{\sim}\; N(\delta_1, 1), \quad \delta_1 = \frac{\sqrt{n}(\mu_1 - \mu_0)}{\sigma}.

In this particular case, it is clear that the Neyman-Pearson lemma does not apply, because the hypotheses in (25) do not constitute a partition of the parameter space \Theta = R, as in (13). However, it turns out that when the test statistic d(X) = \frac{\sqrt{n}(\bar{X}_n - \mu_0)}{\sigma} is combined with information relating to the values \mu_0 and \mu_1, it can provide the basis for constructing several optimal tests (Lehmann, 1986):

[1] For \mu_1 > \mu_0, the test T_\alpha^> := \{d(X), C_1^>(\alpha)\}, where C_1^>(\alpha) = \{x: d(x) > c_\alpha\}, is \alpha-UMP for the hypotheses:

(ii) (1-s≥): H_0: \mu \leq \mu_0 \;\text{vs.}\; H_1: \mu > \mu_0,
(iii) (1-s): H_0: \mu = \mu_0 \;\text{vs.}\; H_1: \mu > \mu_0.
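The claim above, that h(\cdot) turns the likelihood ratio (26) into d(X), is easy to verify numerically; the following sketch (arbitrary illustrative values) checks the identity \frac{1}{\delta_1}\ln\lambda + \frac{\delta_1}{2} = \frac{\sqrt{n}(\bar{x}_n - \mu_0)}{\sigma} on simulated data.

```python
import numpy as np

rng = np.random.default_rng(4)
mu0, mu1, sigma, n = 0.0, 0.5, 1.0, 25
x = rng.normal(0.2, sigma, size=n)                 # data from some true mean
xbar = x.mean()

delta1 = np.sqrt(n) * (mu1 - mu0) / sigma
log_lr = (n / sigma**2) * (mu1 - mu0) * xbar \
         - (n / (2 * sigma**2)) * (mu1**2 - mu0**2)   # log of the ratio (26)

d_via_h = log_lr / delta1 + delta1 / 2             # h applied to the ratio
d_direct = np.sqrt(n) * (xbar - mu0) / sigma
print(np.isclose(d_via_h, d_direct))               # True: the two coincide
```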
Despite the difference between (ii) and (iii), the relevant error probabilities coincide. The type I error probability for (ii) is defined as the maximum over all \mu \leq \mu_0, but it coincides with that of (iii) (\mu = \mu_0):

\alpha = \max_{\mu \leq \mu_0} P(d(X) > c_\alpha; H_0(\mu)) = P(d(X) > c_\alpha; \mu = \mu_0).

Similarly, the p-value is the same, because:

p(x_0) = \max_{\mu \leq \mu_0} P(d(X) > d(x_0); H_0(\mu)) = P(d(X) > d(x_0); \mu = \mu_0).

[2] For \mu_1 < \mu_0, the test T_\alpha^< := \{d(X), C_1^<(\alpha)\}, where C_1^<(\alpha) = \{x: d(x) < c_\alpha\}, is \alpha-UMP for the hypotheses:

(iv) (1-s≤): H_0: \mu \geq \mu_0 \;\text{vs.}\; H_1: \mu < \mu_0,
(v) (1-s): H_0: \mu = \mu_0 \;\text{vs.}\; H_1: \mu < \mu_0.

Again, the type I error probability and the p-value are defined by:

\alpha = \max_{\mu \geq \mu_0} P(d(X) < c_\alpha; H_0(\mu)) = P(d(X) < c_\alpha; \mu = \mu_0),
p(x_0) = \max_{\mu \geq \mu_0} P(d(X) < d(x_0); H_0(\mu)) = P(d(X) < d(x_0); \mu = \mu_0).

The existence of these \alpha-UMP tests extends the N-P lemma to more realistic cases by invoking two regularity conditions:

¥ [A] The ratio (26) is a monotone function of the statistic \bar{X}_n, in the sense that for any two values \mu_1 > \mu_0, \frac{f(x; \mu_1)}{f(x; \mu_0)} changes monotonically with \bar{X}_n. This implies that \frac{f(x; \mu_1)}{f(x; \mu_0)} > k if and only if \bar{X}_n > c, for some constant c. This regularity condition is valid for most statistical models of interest in practice, including the one parameter Exponential family of distributions [Normal, Gamma, Beta, Binomial, Negative Binomial, Poisson, etc.], the Uniform, the Exponential, the Logistic, the Hypergeometric, etc.; see Lehmann (1986).

¥ [B] The parameter space under H_1, say \Theta_1, is convex, i.e. for any two values (\theta_1, \theta_2) \in \Theta_1, their convex combinations \lambda\theta_1 + (1-\lambda)\theta_2 \in \Theta_1, for any 0 \leq \lambda \leq 1. When convexity does not hold, as with the 2-sided alternative:

(vi) (2-s): H_0: \mu = \mu_0 \;\text{vs.}\; H_1: \mu \neq \mu_0,

[3] the test T_\alpha^{\neq} := \{d(X), C_1^{\neq}(\alpha)\}, \quad C_1^{\neq}(\alpha) = \{x: |d(x)| > c_{\alpha/2}\},

is \alpha-UMPU (Unbiased); the \alpha-level and the p-value are:

\alpha = P(|d(X)| > c_{\alpha/2}; \mu = \mu_0), \quad p(x_0) = P(|d(X)| > |d(x_0)|; \mu = \mu_0).
6.3 Constructing optimal tests: the Likelihood Ratio
The likelihood ratio (LR) test procedure can be viewed as a generalization of the Neyman-Pearson lemma to more realistic cases where the null and/or the alternative are composite hypotheses.
(a) The hypotheses of interest are of the general form:
   H₀: θ ∈ Θ₀ vs. H₁: θ ∈ Θ₁.
(b) The test statistic is related to the 'likelihood' ratio:
   λ(X) = max_{θ∈Θ} L(θ;X) / max_{θ∈Θ₀} L(θ;X) = L(θ̂;X)/L(θ̃;X).   (27)
Note that the max in the numerator is over all θ∈Θ [yielding the MLE θ̂], but that of the denominator is confined to the values under H₀, θ∈Θ₀ [yielding the constrained MLE θ̃].
(c) The rejection region is defined in terms of a transformed ratio, where h(·) is chosen to yield a known sampling distribution under H₀:
   C₁(α) = {x: v(X) = h(λ(X)) > c_α}.   (28)
In terms of procedures to construct optimal tests, the LR procedure can be shown to yield several optimal tests; Lehmann (1986). Moreover, in cases where no such h(·) can be found, one can use the asymptotic result which states that, under certain regularity restrictions (Wilks, 1938):
   2 ln λ(X) = 2 ln[ L(θ̂;X)/L(θ̃;X) ] ∼ χ²(r), asymptotically under H₀,
where r denotes the number of restrictions defining Θ₀.
Example. Consider testing the hypotheses:
   H₀: σ² = σ₀² vs. H₁: σ² ≠ σ₀²,
in the context of the simple Normal model (table 1). From the previous chapter we know that the Maximum Likelihood method gives rise to:
   max_{θ∈Θ} L(θ;x) = L(θ̂;x), with μ̂ = (1/n)Σᵢ₌₁ⁿ xᵢ = x̄ₙ and σ̂² = (1/n)Σᵢ₌₁ⁿ (xᵢ − x̄ₙ)²,
   max_{θ∈Θ₀} L(θ;x) = L(θ̃;x), with μ̃ = x̄ₙ and σ̃² = σ₀².
Moreover, the two estimated likelihoods are:
   max_{θ∈Θ} L(θ;x) = (2πσ̂²)^(−n/2) exp{−n/2},
   max_{θ∈Θ₀} L(θ;x) = (2πσ₀²)^(−n/2) exp{−nσ̂²/(2σ₀²)}.
Hence, the likelihood ratio in (27) is:
   λ(x) = L(θ̂;x)/L(θ̃;x) = (2πσ̂²)^(−n/2) exp{−n/2} / [(2πσ₀²)^(−n/2) exp{−nσ̂²/(2σ₀²)}] = [ (σ₀²/σ̂²) exp{(σ̂²/σ₀²) − 1} ]^(n/2).
The function h(·) that transforms this ratio into a test statistic is such that:
   v(X) = h(λ(X)) = nσ̂²/σ₀² ∼ χ²(n−1) under H₀,
with a rejection region of the form:
   C₁(α) = {x: v(x) < c₁ or v(x) > c₂},
where for a size α test the constants c₁ and c₂ are chosen in such a way that:
   ∫_{c₂}^{∞} f_{χ²}(v)dv = ∫_{0}^{c₁} f_{χ²}(v)dv = α/2,
with f_{χ²}(v) denoting the chi-square density function. The test defined by {v(X), C₁(α)} turns out to be UMP Unbiased (UMPU); see Lehmann (1986).
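A minimal numerical sketch of the chi-square LR test just derived; the simulated data, n=50 and the hypothesized value σ₀²=1 are all hypothetical choices for illustration.

```python
# Sketch: LR test of H0: sigma^2 = sigma0^2 in the simple Normal model,
# using v(X) = n * sigma_hat^2 / sigma0^2 ~ chi^2(n-1) under H0.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, sigma0_sq, alpha = 50, 1.0, 0.05          # hypothetical choices
x = rng.normal(loc=0.0, scale=1.2, size=n)   # simulated data with sigma^2 = 1.44

sigma_hat_sq = np.mean((x - x.mean()) ** 2)  # unconstrained MLE of sigma^2
v = n * sigma_hat_sq / sigma0_sq             # the test statistic v(X)
c1, c2 = chi2.ppf(alpha / 2, n - 1), chi2.ppf(1 - alpha / 2, n - 1)
p_value = 2 * min(chi2.cdf(v, n - 1), chi2.sf(v, n - 1))  # equal-tail p-value
print(f"v = {v:.2f}, reject = {v < c1 or v > c2}, p-value = {p_value:.4f}")
```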
6.4 Substantive vs. statistical significance in N-P testing
We consider two substantive values of interest that concern the ratio of Boys (B) to Girls (G) in newborns:
   Arbuthnot: #B = #G,   Bernoulli: 18B to 17G.   (29)
Statistical hypotheses. The Arbuthnot and Bernoulli conjectured values for the proportion of boys can be embedded into a simple Bernoulli model (table 2), where θ = P(X=1) = P(Boy):
   Arbuthnot: θ_A = 1/2,   Bernoulli: θ_B = 18/35.   (30)
How should one proceed to appraise θ_A and θ_B vis-à-vis the data? The archetypal specification (13) suggests probing each value separately, using the difference (θ_B − θ_A) as the discrepancy of interest.
Despite the fact that on substantive grounds the only relevant values of θ are (θ_A, θ_B), on statistical inference grounds the rest of the parameter space is relevant; M_θ(x) provides the relevant inductive premises. This argument, however, has been questioned in the literature on the grounds that N-P testing should probe these two values using simple-vs-simple hypotheses; after all, the argument goes, the Neyman-Pearson lemma would secure an α-UMP test! This argument shows insufficient understanding of the Neyman-Pearson lemma, because the parameter space of the simple Bernoulli model is not Θ = {θ_A, θ_B} but Θ = [0,1]. Moreover, the fact that the simple Bernoulli model has a monotone likelihood ratio, i.e.:
   f(x;θ₁)/f(x;θ₀) = [(1−θ₁)/(1−θ₀)]ⁿ [θ₁(1−θ₀)/(θ₀(1−θ₁))]^{Σᵢ₌₁ⁿ xᵢ}, for any (θ₀, θ₁) s.t. 0 < θ₀ < θ₁ < 1,
which is a monotonically increasing function of the statistic Yₙ = Σᵢ₌₁ⁿ Xᵢ, since θ₁(1−θ₀)/(θ₀(1−θ₁)) > 1, ensures the existence of several α-UMP tests.
In particular, [i] the test T_α^> := {d(X), C₁^>(α)}:
   d(X) = √n(θ̂ₙ−θ₀)/√(θ₀(1−θ₀)) ∼ Bin(0,1;n) under θ=θ₀,   C₁^>(α) = {x: d(x) > c_α},   (31)
where Bin(0,1;n) denotes the standardized (mean 0, variance 1) Binomial distribution based on sample size n, is an α-UMP test for the N-P hypotheses:
   (ii) (1-s≥): H₀: θ ≤ θ₀ vs. H₁: θ > θ₀,
   (iii) (1-s>): H₀: θ = θ₀ vs. H₁: θ > θ₀;
and [ii] the test T_α^< := {d(X), C₁^<(α)}:
   d(X) = √n(θ̂ₙ−θ₀)/√(θ₀(1−θ₀)) ∼ Bin(0,1;n) under θ=θ₀,   C₁^<(α) = {x: d(x) < c_α},   (32)
is an α-UMP test for the N-P hypotheses:
   (iv) (1-s≤): H₀: θ ≥ θ₀ vs. H₁: θ < θ₀,
   (v) (1-s<): H₀: θ = θ₀ vs. H₁: θ < θ₀.
These results stem from the fact that the formulations (ii)-(v) ensure the convexity of the parameter space under H₁. Moreover, in the case of the two-sided hypotheses:
   (vi) (2-s): H₀: θ = θ₀ vs. H₁: θ ≠ θ₀,
the test [iii] T_α^≠ := {d(X), C₁^≠(α)}:
   d(X) = √n(θ̂ₙ−θ₀)/√(θ₀(1−θ₀)) ∼ Bin(0,1;n) under θ=θ₀,   C₁^≠(α) = {x: |d(x)| > c_{α/2}},   (33)
is α-UMP Unbiased; see Lehmann (1986).
¥ Randomization? No! In cases where the test statistic has a discrete sampling distribution under H₀, as in (31)-(33), one might not be able to attain a prespecified significance level α exactly. The traditional way of dealing with this problem is to randomize, but this 'solution' raises more problems than it solves. Instead, the best way to address the discreteness issue is either to select a level α that is attainable for the given n, or to approximate the discrete distribution with a continuous one, circumventing the problem. In practice there are several continuous distributions one can use, depending on whether the original distribution is symmetric or not. In the present case, even for moderately small sample sizes, say n=20, the Normal distribution provides an excellent approximation for values of θ around .5, as the figure below indicates. With n=30762, the approximation is nearly exact.
   [Figure: Normal approximation of the Binomial: the Bin(θ=.5, n=20) probability distribution overlaid with the N(mean=10, StDev=2.236) density.]
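The quality of this Normal approximation is easy to check numerically; a sketch comparing an exact Binomial tail probability with its continuity-corrected Normal counterpart at n=20 and n=30762.

```python
# Sketch: accuracy of the Normal approximation to the Binomial near theta = .5,
# comparing an exact tail probability with a continuity-corrected Normal tail.
import numpy as np
from scipy.stats import binom, norm

theta = 0.5
for n in [20, 30762]:
    mean, sd = n * theta, np.sqrt(n * theta * (1 - theta))
    k = int(mean + 2 * sd)                   # a point about two SDs above the mean
    exact = binom.sf(k, n, theta)            # exact P(Y > k)
    approx = norm.sf(k + 0.5, mean, sd)      # Normal tail, continuity corrected
    print(f"n = {n}: exact = {exact:.5f}, Normal approx. = {approx:.5f}")
```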
What about the choice between one-sided and two-sided hypotheses? The answer depends crucially on whether there exists reliable substantive information that renders part of the parameter space Θ irrelevant for both substantive and statistical purposes.
6.5 N-P testing of Arbuthnot's value
In this case there is reliable substantive information that the probability of a newborn being a boy, θ = P(X=1), is systematically slightly above 1/2; the human sex ratio at birth corresponds to approximately θ_H = .512 (Hardy, 2002). Hence, a statistically more informative way to assess θ = .5 might be to use one-sided (1-s) directional probing.
Data: n = 30762 newborns during the period 1993-5 in Cyprus, out of which 16029 were boys and 14733 girls.
In view of the huge sample size it is advisable to choose a smaller significance level, say α = .01 ⇒ c_α = 2.326.
Case 1. For testing Arbuthnot's value θ₀ = 1/2 the relevant hypotheses are:
   (1-s>): H₀: θ = 1/2 vs. H₁: θ > 1/2,   (34)
or (1-s*≤): H₀: θ ≤ 1/2 vs. H₁: θ > 1/2,
and T_α^> := {d(X), C₁^>(α)} is an α-UMP test. In view of the fact that:
   d(x₀) = √30762 (16029/30762 − 1/2)/√(.5(.5)) = 7.389,   (35)
   p_>(x₀) = P(d(X) > 7.389; θ = .5) = .0000,
the null hypothesis H₀: θ = 1/2 is strongly rejected.
What if one were to ignore the substantive information and apply a 2-sided test?
Case 2. For testing Arbuthnot's value this takes the form:
   (2-s): H₀: θ = 1/2 vs. H₁: θ ≠ 1/2,   (36)
and T_α^≠ := {d(X), C₁^≠(α)} in (33) is an α-UMPU test; for α = .01, c_{α/2} = 2.576. In view of the fact that:
   d(x₀) = 7.389 > 2.576,   p_≠(x₀) = P(|d(X)| > 7.389; θ = .5) = .0000,
the null hypothesis H₀: θ = 1/2 is again strongly rejected.
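The computations in (35)-(36) can be reproduced directly; a sketch using the Cyprus data and the Normal approximation to d(X).

```python
# Sketch: reproducing the test of Arbuthnot's value theta0 = 1/2 on the
# Cyprus data, using the Normal approximation to d(X).
import numpy as np
from scipy.stats import norm

n, boys = 30762, 16029
theta_hat = boys / n
d = np.sqrt(n) * (theta_hat - 0.5) / np.sqrt(0.5 * 0.5)   # as in (35): 7.389
p_one_sided = norm.sf(d)                 # P(d(X) > d(x0); theta = .5)
p_two_sided = 2 * norm.sf(abs(d))        # P(|d(X)| > |d(x0)|; theta = .5)
print(f"d(x0) = {d:.3f}, p_> = {p_one_sided:.2e}, p_!= = {p_two_sided:.2e}")
```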
6.6 N-P testing of Bernoulli's value
The first question we need to answer concerns the appropriate form of the alternative in testing Bernoulli's conjecture θ = 18/35. Using the same substantive information indicating that θ is systematically slightly above 1/2 (Hardy, 2002), it seems reasonable to use a 1-sided alternative in the direction of 1/2.
Case 1. For probing Bernoulli's value the relevant hypotheses are:
   (1-s<): H₀: θ = 18/35 vs. H₁: θ < 18/35.   (37)
The optimal (α-UMP) test for (37) is T_α^< := {d(X), C₁^<(α)}, where for α = .01 ⇒ c_α = −2.326. Using the same data:
   d(x₀) = √30762 (16029/30762 − 18/35)/√((18/35)(1 − 18/35)) = 2.379,
   p_<(x₀) = P(d(X) < 2.379; θ = 18/35) = .991,
and the null hypothesis H₀: θ = 18/35 is accepted with a high p-value.
What if one were to ignore the substantive information as pertaining only to Arbuthnot's value, and apply a 2-sided test?
Case 2. For testing Bernoulli's value using the 2-s probing:
   (2-s): H₀: θ = 18/35 vs. H₁: θ ≠ 18/35,   (38)
based on the UMPU test T_α^≠ := {d(X), C₁^≠(α)} in (33):
   d(x₀) = 2.379 < 2.576,   (39)
   p_≠(x₀) = P(|d(X)| > 2.379; θ = 18/35) = .0174,
the result is again to accept H₀, but the p-value is considerably smaller than p_<(x₀) = .991 for the 1-sided test. The question that arises is which one is the relevant p-value. Indeed, a moment's reflection suggests that these two are not the only choices. There is also the p-value for:
   (1-s>): H₀: θ = 18/35 vs. H₁: θ > 18/35,   (40)
which is much smaller:
   p_>(x₀) = P(d(X) > 2.379; θ = 18/35) = .009,   (41)
reversing the previous results and leading to rejecting H₀.
¥ This raises the question: on what basis should one choose the relevant p-value? The short answer, 'on substantive grounds', seems questionable because, first, it ignores the fact that the whole of the parameter space is relevant on statistical grounds, and second, one applies a test to assess the validity of substantive information, not to foist it on the data at the outset. Even more questionable is the argument that one should choose a simple alternative on substantive grounds in order to apply the Neyman-Pearson lemma that guarantees an α-UMP test; this is based on a misapprehension of the lemma. The longer answer, which takes into account the fact that post-data d(x₀) is known, and thus there is often a clear direction of departure indicated by data x₀ and associated with the sign of d(x₀), will be elaborated upon in section 8.
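The three conflicting p-values driving the discussion above can be reproduced as follows; a sketch.

```python
# Sketch: the three p-values for H0: theta = 18/35 under different alternatives.
import numpy as np
from scipy.stats import norm

n, boys, theta0 = 30762, 16029, 18 / 35
d = np.sqrt(n) * (boys / n - theta0) / np.sqrt(theta0 * (1 - theta0))
print(f"d(x0) = {d:.3f}")                  # 2.379
print(f"p_<  = {norm.cdf(d):.3f}")         # .991  (H1: theta < 18/35)
print(f"p_!= = {2 * norm.sf(abs(d)):.4f}") # .0174 (H1: theta != 18/35)
print(f"p_>  = {norm.sf(d):.3f}")          # .009  (H1: theta > 18/35)
```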
7 Foundational issues raised by N-P testing
Let us now appraise how successful the N-P framework was in addressing the perceived weaknesses of Fisher's testing:
[a] his ad hoc choice of a test statistic d(X),
[b] his use of a post-data [d(x₀) is known] threshold for the p-value that indicates discordance (reject) with H₀, and
[c] his denial that p(x₀) can indicate accordance with H₀.
Very briefly, the N-P framework partly succeeded in addressing [a] and [c], but did not provide a coherent evidential account that answers the basic question (Mayo, 1996):
   when do data x₀ provide evidence for or against a hypothesis or a claim H?   (42)
This is primarily because one could not interpret the decision results 'Accept H₀' ('Reject H₀') as data x₀ providing evidence for (against) H₀ [and correspondingly against (for) H₁]. Why?
(i) A particular test T_α := {d(X), C₁(α)} could have led to 'Accept H₀' because the power of that test to detect an existing discrepancy γ was very low, i.e. the test had insufficient generic capacity to detect the existing discrepancy. This can easily happen when the sample size n is small.
(ii) A particular test T_α := {d(X), C₁(α)} could have led to 'Reject H₀' simply because the power of that test was high enough to detect 'trivial' discrepancies from H₀, i.e. the generic capacity of the test was extremely high. This can easily happen when the sample size n is very large.
In light of the fact that the N-P framework did not provide a bridge between the accept/reject rules and what scientific inquiry is after, i.e. when data x₀ provide evidence for or against a hypothesis H, it should come as no surprise that most practitioners sought such answers by trying, unsuccessfully (and misleadingly), to distill an evidential interpretation out of Fisher's p-value. In their eyes the predesignated α is equally vulnerable to manipulation [pre-designation to keep practitioners 'honest' is unenforceable in practice], but the p-value is more informative and data-specific.
Is Fisher's p-value better at answering the question in (42)? No. A small (large) p-value, relative to a certain threshold, cannot be interpreted as evidence for the presence (absence) of a substantive discrepancy γ, for the same reasons as (i)-(ii).
The generic capacity of the test (its power) affects the p-value: for instance, a very small p-value can easily arise in the case of a very large sample size n.
Fallacies of acceptance and rejection
The issues with the accept/reject H₀ and p-value results raised above can be formalized into two classic fallacies.
(a) The fallacy of acceptance: no evidence against H₀ is misinterpreted as evidence for H₀. This fallacy can easily arise in cases where the test in question has low power to detect discrepancies of interest.
(b) The fallacy of rejection: evidence against H₀ is misinterpreted as evidence for a particular H₁. This fallacy arises in cases where the test in question has high power to detect substantively minor discrepancies. Since the power of a test increases with the sample size n, this renders N-P rejections, as well as tiny p-values, with large n highly susceptible to this fallacy.
In the statistics literature, as well as in the secondary literatures of several applied fields, there have been numerous attempts to circumvent these two fallacies, but none has succeeded.
8 Post-data severity evaluation
A moment's reflection about the inability of the accept/reject H₀ and p-value results to provide a satisfactory answer to the question in (42) suggests that the problem arises primarily when such results are detached from the test itself and treated as providing the same evidence for a particular hypothesis H (H₀ or H₁), regardless of the generic capacity (the power) of the test in question. That is, whether data x₀ provide evidence for or against a particular hypothesis H depends crucially on the generic capacity of the test in question to detect discrepancies from H₀. This stems from the intuition that a small p-value, or a rejection of H₀, based on a test with low power to detect a particular discrepancy γ (e.g. a small n) provides stronger evidence for the presence of that discrepancy than the same result from a test with much higher power (e.g. a large n). Mayo (1996) proposed a frequentist evidential account based on harnessing this intuition in the form of a post-data severity evaluation of the accept/reject results.
This is based on custom-tailoring the generic capacity of the test to establish the discrepancy γ warranted by data x₀. This evidential account can be used to circumvent the above fallacies, as well as other (misplaced) charges against frequentist testing. The severity evaluation is a post-data appraisal of the accept/reject and p-value results that revolves around the discrepancy γ from H₀ warranted by data x₀.
¥ A hypothesis H passes a severe test T_α with data x₀ if:
(S-1) x₀ accords with H, and
(S-2) with very high probability, test T_α would have produced a result that accords less well with H than x₀ does, if H were false.
Severity can be viewed as a feature of a test T_α as it relates to particular data x₀ and a specific claim H being considered. Hence, the severity function has three arguments, SEV(T_α; x₀; H), denoting the severity with which H passes T_α with x₀; see Mayo and Spanos (2006).
Example. To explain how the above severity evaluation can be applied in practice, let us return to the problem of assessing the two substantive values of interest:
   Arbuthnot: θ_A = 1/2,   Bernoulli: θ_B = 18/35.
When these values are viewed in the context of the error statistical perspective, it becomes clear that the way to frame the probing is to choose one of the values as the null hypothesis and let the difference between them (θ_B − θ_A) represent the discrepancy of substantive interest.
Data: n = 30762 newborns during the period 1993-5 in Cyprus, of which 16029 are boys and 14733 girls.
Let us return to probing the Arbuthnot value θ_A based on:
   (1-s>): H₀: θ = 1/2 vs. H₁: θ > 1/2,
which gave rise to the observed test statistic:
   d(x₀) = √30762 (16029/30762 − 1/2)/√(.5(.5)) = 7.389,   (43)
leading to a rejection of H₀ at α = .01 ⇒ c_α = 2.326. Indeed, H₀ would have also been rejected by a (2-s) test, since c_{α/2} = 2.576. An important feature of the severity evaluation is that it is post-data, and thus the sign of the observed test statistic d(x₀) provides directional information indicating the inferential claims that 'passed'.
In relation to the above example, the severity 'accordance' condition (S-1) implies that the rejection of θ₀ = 1/2 with d(x₀) = 7.389 > 0 indicates that the inferential claim that 'passed' is of the generic form:
   θ > θ₁ = θ₀ + γ, for some γ ≥ 0.   (44)
The directional feature of the severity evaluation is very important in addressing several criticisms of N-P testing, including the potential arbitrariness and possible abuse of:
[a] switching between one-sided, two-sided or simple-vs-simple hypotheses,
[b] interchanging the null and alternative hypotheses, and
[c] manipulating the level of significance in an attempt to get the desired testing result.
To establish the particular discrepancy γ warranted by data x₀, the severity post-data 'discordance' condition (S-2) calls for evaluating the probability of the event 'outcomes x that accord less well with θ > θ₁ than x₀ does', i.e. [x: d(x) ≤ d(x₀)], yielding:
   SEV(T_α; x₀; θ > θ₁) = P(x: d(x) ≤ d(x₀); θ > θ₁ is false).   (45)
The evaluation of this probability is based on the sampling distribution:
   d(X) = √n(θ̂ₙ−θ₀)/√(θ₀(1−θ₀)) ∼ Bin(δ(θ₁), V(θ₁); n) under θ = θ₁, for θ₁ > θ₀,   (46)
   where δ(θ₁) = √n(θ₁−θ₀)/√(θ₀(1−θ₀)) ≥ 0 and V(θ₁) = θ₁(1−θ₁)/(θ₀(1−θ₀)), with 0 < V(θ₁) ≤ 1.
Since the hypothesis that 'passed' is of the form θ > θ₁ = θ₀ + γ, the primary objective of SEV(T_α; x₀; θ > θ₁) is to determine the largest discrepancy γ ≥ 0 warranted by data x₀. Table 4 evaluates SEV(T_α; x₀; θ > θ₁) for different values of γ, with the evaluations based on the standardized d(X):
   [d(X) − δ(θ₁)]/√V(θ₁) ∼ Bin(0,1;n) ≃ N(0,1), under θ = θ₁.
For example, when θ₁ = .5142:
   d(x₀) − δ(θ₁) = 7.389 − √30762(.5142 − .5)/√(.5(.5)) = 2.408,   (47)
and Φ(Z ≤ 2.408) = .992, where Φ(·) is the cumulative distribution function (cdf) of the standard Normal distribution. Note that the scaling √V(θ₁) = .9996 ≃ 1 can be ignored.

Table 4: Reject H₀: θ = .5 vs. H₁: θ > .5
   γ        Relevant claim: θ > θ₁ = .5 + γ    Severity: P(x: d(X) ≤ d(x₀); θ = θ₁)
   .010     θ > .510                           .999
   .013     θ > .513                           .997
   .0142    θ > .5142                          .992
   .0143    θ > .5143                          .991
   .015     θ > .515                           .983
   .016     θ > .516                           .962
   .017     θ > .517                           .923
   .018     θ > .518                           .859
   .020     θ > .520                           .645
   .021     θ > .521                           .500
   .025     θ > .525                           .084

An evidential interpretation. Taking a very high probability, say .95, as a threshold, the largest discrepancy from the null warranted by this data is γ = .01637, since SEV(T_α; x₀; θ > .51637) = .950; every claim θ > .5 + γ with γ ≤ .01637 passes with severity at least .95. Is this discrepancy substantively significant? In general, to answer this question one needs to appeal to substantive subject matter information to assess the warranted discrepancy on substantive grounds. In human biology it is commonly accepted that the sex ratio at birth is approximately 105 boys to 100 girls; see Hardy (2002). Translating this ratio in terms of θ yields θ_H = 105/205 = .5122, which suggests that the above warranted discrepancy is substantively significant, since the warranted claim θ > .51637 exceeds θ_H. In light of that, the substantive discrepancy of interest between θ_B and θ_A, γ* = (18/35) − .5 = .0142857, yields SEV(T_α; x₀; θ > .5143) = .991, indicating that there is excellent evidence for the claim θ > θ₁ = θ₀ + γ*.
I In terms of the ultimate objective of statistical inference, learning from data, this evidential interpretation seems highly effective, because it narrows down an infinite set Θ to a very small subset!
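The entries of table 4, and the threshold value γ = .01637 discussed above, can be reproduced directly from the standardized form of (46); a sketch.

```python
# Sketch: post-data severity of the claims theta > theta1 = .5 + gamma,
# SEV = P(d(X) <= d(x0); theta = theta1), via the Normal approximation in (46).
import numpy as np
from scipy.stats import norm

n, boys, theta0 = 30762, 16029, 0.5
d_x0 = np.sqrt(n) * (boys / n - theta0) / np.sqrt(theta0 * (1 - theta0))  # 7.389
for gamma in [0.010, 0.0142857, 0.01637, 0.020, 0.025]:
    theta1 = theta0 + gamma
    delta1 = np.sqrt(n) * (theta1 - theta0) / np.sqrt(theta0 * (1 - theta0))
    v1 = np.sqrt(theta1 * (1 - theta1) / (theta0 * (1 - theta0)))  # scaling ~ 1
    sev = norm.cdf((d_x0 - delta1) / v1)
    print(f"gamma = {gamma:.5f}: SEV(theta > {theta1:.5f}) = {sev:.3f}")
```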
8.1 Revisiting issues pertaining to N-P testing
The above evidential interpretation based on the severity assessment can also be used to shed light on a number of issues raised in the previous sections.
(a) The arbitrariness of N-P hypotheses specification
The question that naturally arises at this stage is whether having good evidence for inferring θ > 18/35 depends on the particular way one has chosen to specify the hypotheses for the N-P test. Intuitively, one would expect that the evidence should not be dependent on the specification of the null and alternative hypotheses as such, but on x₀ and the statistical model M_θ(x), x∈R_X^n.
To explore that issue let us focus on probing the Bernoulli value H₀: θ₀ = 18/35, where the test statistic yields:
   d(x₀) = √30762 (16029/30762 − 18/35)/√((18/35)(1 − 18/35)) = 2.379.   (48)
This value leads to rejecting H₀ at α = .01 when one uses a (1-s>) test [c_α = 2.326], but accepting H₀ when one uses a (2-s) test [c_{α/2} = 2.576]. Which result should one believe?
Irrespective of these conflicting results, the severity 'accordance' condition (S-1) implies that, in light of the observed test statistic d(x₀) = 2.379 > 0, the directional claim that 'passed' on the basis of (48) is of the form θ > θ₁ = θ₀ + γ.
To evaluate the particular discrepancy γ warranted by data x₀, condition (S-2) calls for the evaluation of the probability of the event 'outcomes x that accord less well with θ > θ₁ than x₀ does', i.e. [x: d(x) ≤ d(x₀)], giving rise to (45), but with a different θ₀.
Table 5 provides several such evaluations of SEV(T_α; x₀; θ > θ₁) for different values of γ; note that these evaluations are also based on (46). A number of negative values of γ are included in order to address the original substantive hypotheses of interest, as well as to bring out the fact that the results of tables 4 and 5 are identical when viewed as inferences pertaining to θ; they simply involve a different null value θ₀, and thus different discrepancies, but identical θ₁ values.
The severity evaluations in tables 4 and 5 not only render the choice between (2-s), (1-s) and simple-vs-simple hypotheses irrelevant, they also scotch the well-rehearsed argument pertaining to the asymmetry between the null and alternative hypotheses. The N-P convention of selecting a small α is generally viewed as reflecting a strong bias against rejection of the null. On the other hand, Bayesian critics of frequentist testing argue the exact opposite: N-P testing is biased against accepting the null; see Lindley (1993). The evaluations in tables 4 and 5 demonstrate that the evidential interpretation of frequentist testing based on severe testing addresses all asymmetries, notional or real.

Table 5: Accept/Reject H₀: θ = 18/35 vs. H₁: θ ≷ 18/35
   γ         Relevant claim: θ > θ₁ = .5143 + γ   Severity: P(x: d(X) ≤ d(x₀); θ = θ₁)
   −.0043    θ > .510                             .999
   −.0013    θ > .513                             .997
   −.0001    θ > .5142                            .992
    .0000    θ > .5143                            .991
    .0007    θ > .515                             .983
    .0017    θ > .516                             .962
    .0027    θ > .517                             .923
    .0037    θ > .518                             .859
    .0057    θ > .520                             .645
    .0067    θ > .521                             .500
    .0107    θ > .525                             .084

(b) Addressing the Fallacy of Rejection
The potential arbitrariness of the N-P specification of the null and alternative hypotheses and the associated p-values is brought out in probing the Bernoulli value θ_B = 18/35 using different alternatives:
   (1-s<): p_<(x₀) = P(d(X) < 2.379; θ = θ₀) = .991,
   (2-s):  p_≠(x₀) = P(|d(X)| > 2.379; θ = θ₀) = .0174,
   (1-s>): p_>(x₀) = P(d(X) > 2.379; θ = θ₀) = .009.
Can the above severity evaluation explain away these highly conflicting and confusing results?
The first choice, H₁: θ < 18/35, was driven solely by substantive information relating to θ_A = 1/2, which is often a bad idea because it ignores the statistical dimension of inference. The second choice was based on lack of information about the direction of departure, which makes sense pre-data, but not post-data. The third choice reflects the post-data direction of departure indicated by d(x₀) = 2.379 > 0. In light of that, the severity evaluation renders p_>(x₀) = .009 the relevant p-value, after all!
¥ The key problem with the p-value. In addition, the severity evaluation brings out the vulnerability of both the N-P reject rule and the p-value to the fallacy of rejection in cases where n is large. Viewing the relevant p-value (p_>(x₀) = .009) from the severity vantage point, it is directly related to H₁ passing a severe test, because the probability that test T_α would have produced a result that accords less well with H₁ than x₀ does (x: d(x) ≤ d(x₀)), if H₁ were false (H₀ true):
   SEV(T_α; x₀; θ > θ₀) = P(d(X) ≤ d(x₀); θ ≤ θ₀) = 1 − P(d(X) > d(x₀); θ = θ₀) = .991,
is very high; Mayo (1996). The key problem with the p-value, however, is that it establishes the existence of some discrepancy γ ≥ 0, but provides no information concerning the magnitude licensed by x₀. The severity evaluation remedies that by relating SEV(T_α; x₀; θ > θ₀) = .991 to the inferential claim θ > 18/35, the implicit discrepancy associated with the p-value being γ = 0.
(c) Addressing the Fallacy of Acceptance
Example. Let us return to the hypotheses:
   (1-s>): H₀: θ = 1/2 vs. H₁: θ > 1/2,
  • 41. To answer that question one needs to apply the post-data severe testing reasoning to evaluate the discrepancy  ≥ 0 for 1 =0 + war  ranted by data x0 . Test  :={(X) 1 ()} passes 0 , and the idea is to establish the smallest discrepancy  ≥ 0 from 0 warranted by data x0 by evaluating (post-data) the claim  ≤ 1 for 1 =0 + Condition (S-1) of severity is satisfied since x0 accords with 0 because (x0 )  , but (S-2) requires one to evaluate the probability  of "all outcomes x for which test  accords less well with 0 than x0 does", i.e. (x: (x)  (x0 )) under the hypothetical scenario that " ≤ 1 is false" or equivalently "  1 is true":   ( ; x0 ;  ≤ 1 )=P(x: (x)  (x0 );   1 ) for  ≥ 0 (49) The evaluation of (49) relies on the sampling distribution (46), and for different discrepancies ( ≥ 0) yields the results in table 6. Note  that Sev( ; x0 ;  ≤ 1 ) is evaluated at =1 because the probability increases with . The numerical evaluations are based on the standardized (X) from (47). For example, , when 1 = 515 : √ ˆ ( √  −0 ) − (1 )=268 − 0 (1−0 ) √ 2000(515−5) √ 5(5) = −1074 Φ(  −1074)=859 where Φ() is the cumulative distribution function q p (cdf) of N(0 1). Note that the scaling factor  (1 )= 515(1−515) = 5(5) 9996 can be ignored because its value is so close to 1.  Table 6 — Severity Evaluation of ‘Accept 0 ’ with ( ; x0 )  ≤ 1 = 0 +  Relevant claim:  Sev( ≤ 1 ) .001 .005 .01 .015 .017 .0173 .018 .429 .571 .734 .859 .895 .900 .02 .025 .910 .936 .975 .992 Assuming a severity threshold of, say 90 is high enough, the above results indicate that the ‘smallest’ discrepancy warranted by x0 , is  ≥ 0173 since:  ( ; x0 ;  ≤ 5173)=9 41 .03
That is, assuming a severity threshold of, say, .90 is high enough, the above results indicate that the 'smallest' discrepancy warranted by x₀ is γ = .0173, since SEV(T_α; x₀; θ ≤ .5173) = .90; the claims θ ≤ θ₀ + γ are warranted (with severity at least .90) only for γ ≥ .0173. Is this discrepancy substantively insignificant? No! As mentioned above, the accepted ratio at birth in human biology corresponds to θ_H = .512, and the warranted discrepancy γ ≥ .0173 is, indeed, substantively significant, since it outputs θ₁ ≥ .5173, which exceeds θ_H. This is an example where the post-data severity evaluation can be used to address the fallacy of acceptance. The severity assessment indicates that the statistically insignificant result d(x₀) = .268 at α = .025 does not warrant the claim that there is no substantive discrepancy from H₀: θ = .5; the data only license (with severity at least .90) the claim θ ≤ .5173, a bound that exceeds θ_H, leaving substantively significant discrepancies unruled out.
(d) The arbitrariness of choosing the significance level
It is often argued by critics of frequentist testing that the choice of the significance level in the context of the N-P approach, or of the threshold for the p-value in the case of Fisherian significance testing, is totally arbitrary and manipulable. To see how the severity evaluations can address this problem, let us try to 'manipulate' the original significance level (α = .01) into a different one that would alter the accept/reject results for H₀: θ = 18/35. Looking back at the results for Bernoulli's conjectured value using the data from Cyprus, it is easy to see that choosing a larger significance level, say α = .02, would lead to rejecting H₀ for both N-P formulations, (2-s) (H₁: θ ≠ 18/35) [c_{α/2} = 2.326] and (1-s>) (H₁: θ > 18/35) [c_α = 2.054], since d(x₀) = 2.379 exceeds both thresholds:
   (2-s):  p_≠(x₀) = P(|d(X)| > 2.379; θ = θ₀) = .0174,
   (1-s>): p_>(x₀) = P(d(X) > 2.379; θ = θ₀) = .009.
How does this change the original severity evaluations? The short answer is: it doesn't! The severity evaluations remain invariant to any changes in α, because they depend on d(x₀) = 2.379 > 0 and not on c_α; the sign of d(x₀) indicates that the generic form of the hypothesis that 'passed' is θ > θ₁ = θ₀ + γ, and the same evaluations apply.
9 Summary and conclusions
In frequentist inference, learning from data x₀ about the stochastic phenomenon of interest is accomplished by applying optimal inference procedures with ascertainable error probabilities in the context of a statistical model:
   M_θ(x) = {f(x;θ), θ∈Θ}, x∈R_X^n.   (50)
Hypothesis testing gives rise to learning from data by partitioning M_θ(x) into two subsets:
   M₀(x) = {f(x;θ), θ∈Θ₀} or M₁(x) = {f(x;θ), θ∈Θ₁}, x∈R_X^n,   (51)
framed in terms of the parameter(s):
   H₀: θ∈Θ₀ vs. H₁: θ∈Θ₁,   (52)
and inquiring of x₀ for evidence in favor of one of the two subsets in (51) using hypothetical reasoning. Note that one could specify (52) more perceptively as:
   H₀: f*(x)∈M₀(x) = {f(x;θ), θ∈Θ₀} vs. H₁: f*(x)∈M₁(x) = {f(x;θ), θ∈Θ₁},
where f*(x) = f(x;θ*) denotes the 'true' distribution of the sample. This notation elucidates the basic question posed in hypothesis testing:
I Given that the true M*(x) lies within the boundaries of M_θ(x), can x₀ be used to narrow it down to a smaller subset M₀(x)?
A test T_α := {d(X), C₁(α)} is defined in terms of a test statistic d(X) and a rejection region C₁(α), and its optimality is calibrated in terms of the relevant error probabilities:
   type I:  P(x₀∈C₁(α); H₀(θ) true) ≤ α(θ), for θ∈Θ₀,
   type II: P(x₀∈C₀(α); H₁(θ₁) true) = β(θ₁), for θ₁∈Θ₁.
These error probabilities specify how often these procedures lead to erroneous inferences, and thus determine the pre-data capacity (power) of the test in question to give rise to valid inferences:
   π(θ₁) = P(x₀∈C₁(α); H₁(θ₁) true) = 1 − β(θ₁), for all θ₁∈Θ₁.
An inference is reached by an inductive procedure which, with high probability, will reach true conclusions from valid (or approximately valid) premises (statistical model).
Hence, the trustworthiness of frequentist inference depends on two interrelated pre-conditions:
(a) adopting optimal inference procedures, in the context of
(b) a statistically adequate model.
I How does hypothesis testing relate to the other forms of frequentist inference? It turns out that testing is inextricably bound up with estimation (point and interval), in the sense that an optimal estimator is often the cornerstone of an optimal test and an optimal confidence interval. For instance, consistent, fully efficient and minimal sufficient estimators provide the basis for most of the optimal tests and confidence intervals! Having said that, optimal point estimation is considered rather noncontroversial, but optimal interval estimation and hypothesis testing have raised several foundational issues that have bedeviled frequentist inference since the 1930s. In particular, the use and abuse of the relevant error probabilities calibrating these latter forms of inductive inference have contributed significantly to the general confusion and controversy surrounding the application of these procedures.
Addressing foundational issues
The first such issue concerns the presupposition of (approximately) true premises, i.e. that the true M*(x) lies within the boundaries of M_θ(x). This can be addressed by securing the statistical adequacy of M_θ(x) vis-à-vis data x₀, using trenchant Mis-Specification (M-S) testing before applying frequentist testing; see chapter 15. As shown above, when M_θ(x) is misspecified, the nominal and actual error probabilities can be very different, rendering the reliability of the test in question at best unknown and at worst highly misleading. That is, statistical adequacy ensures the ascertainability of the relevant error probabilities, and hence the reliability of inference.
The second issue concerns the form and nature of the evidence x₀ can provide for θ∈Θ₀ or θ∈Θ₁. Neither Fisher's p-value nor the N-P accept/reject rules can provide such an evidential interpretation, primarily because they are vulnerable to two serious fallacies:
(c) Fallacy of acceptance: no evidence against the null is misinterpreted as evidence for it.
(d) Fallacy of rejection: evidence against the null is misinterpreted as evidence for a specific alternative.
As discussed above, these fallacies can be circumvented by supplementing the accept/reject rules (or the p-value) with a post-data evaluation of inference based on severe testing, with a view to determining the discrepancy γ from the null warranted by data x₀. This establishes the inferential claims that are warranted [and, thus, the unwarranted ones] in particular cases, so that it gives rise to learning from data. The severity perspective sheds ample light on the various controversies relating to p-values vs. type I and II error probabilities, delineating that the p-value is a post-data error probability while the type I and II are pre-data error probabilities, and that, although inadequate for the task of furnishing an evidential interpretation, they were intended to fulfill complementary roles. Pre-data error probabilities, like the type I and II, are used to appraise the generic capacity of testing procedures; post-data error probabilities, like severity, are used to bridge the gap between the coarse accept/reject results and the evidence provided by data x₀ for the warranted discrepancy from the null, by custom-tailoring the pre-data capacity in light of d(x₀). This refinement/extension of the Fisher-Neyman-Pearson framework can be used to address not only the above-mentioned fallacies, but also several criticisms that have beleaguered frequentist testing since the 1930s, including:
(e) the arbitrariness of selecting the thresholds,
(f) the framing of the null and alternative hypotheses in the context of a statistical model M_θ(x), and
(g) the numerous misinterpretations of the p-value.
10 Appendix: Revisiting Confidence Intervals (CIs)
Mathematical duality between testing and CIs
From a purely mathematical perspective, a point estimator is a mapping g(·) of the form:
   g(·): R_X^n → Θ.
Similarly, an interval estimator is also a mapping C(·) of the form:
   C(·): R_X^n → Θ, such that C⁻¹(θ) = {x: θ∈C(x)} ⊂ F (the relevant σ-field), for θ∈Θ.
Hence, one can assign a probability to C⁻¹(θ) = {x: θ∈C(x)}, in the sense that:
   P(x: θ∈C(x); θ=θ*) := P(θ∈C(X); θ=θ*)
denotes the probability that the random set C(X) contains θ*, the true value of θ. The expression 'θ* denotes the true value of θ' is a shorthand for saying that 'data x₀ constitute a realization of the sample X with distribution f(x;θ*)'. One can define a (1−α) Confidence Interval (CI) for θ by attaching a lower bound to this probability:
   P(θ∈C(X); θ=θ*) ≥ 1−α.
Neyman (1937) utilized the mathematical duality between N-P Hypothesis Testing (HT) and Confidence Intervals (CIs) to put forward an optimal theory for the latter. The α-level test of the hypotheses:
   H₀: θ = θ₀ vs. H₁: θ ≠ θ₀
takes the form {d(X;θ₀), C₁(θ₀;α)}, with d(X;θ₀) the test statistic and the rejection region defined by:
   C₁(θ₀;α) = {x: |d(x;θ₀)| > c_{α/2}} ⊂ R_X^n.
The complement of the latter with respect to the sample space (R_X^n) defines the acceptance region:
   C₀(θ₀;α) = {x: |d(x;θ₀)| ≤ c_{α/2}} = R_X^n − C₁(θ₀;α).
The mathematical duality between HT and CIs stems from the fact that for each x∈R_X^n one can define a corresponding set C(x;α) on the parameter space by:
   C(x;α) = {θ: x∈C₀(θ;α)}, θ∈Θ, e.g. C(x;α) = {θ: |d(x;θ)| ≤ c_{α/2}}.
Conversely, for each C(x;α) there is an acceptance region C₀(θ₀;α) such that:
   C₀(θ₀;α) = {x: θ₀∈C(x;α)}, x∈R_X^n, e.g. C₀(θ₀;α) = {x: |d(x;θ₀)| ≤ c_{α/2}}.
This equivalence is encapsulated by:
   x∈C₀(θ₀;α) for each x∈R_X^n ⇔ θ₀∈C(x;α) for each θ₀∈Θ.
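The duality is easy to verify numerically for the simple Normal model with σ known; a sketch in which the simulated data and the grid of null values μ₀ are illustrative assumptions.

```python
# Sketch: x0 lies in the acceptance region of the alpha-level test of mu = mu0
# if and only if mu0 lies in the (1-alpha) CI, checked on a grid of mu0 values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, sigma, alpha = 30, 1.0, 0.05
x = rng.normal(0.2, sigma, n)            # simulated data; true mu = 0.2 (assumed)
xbar, c = x.mean(), norm.ppf(1 - alpha / 2)
lo, hi = xbar - c * sigma / np.sqrt(n), xbar + c * sigma / np.sqrt(n)
for mu0 in np.linspace(-0.5, 1.0, 7):
    accept = abs(np.sqrt(n) * (xbar - mu0) / sigma) <= c  # x0 in C0(mu0; alpha)
    assert accept == (lo <= mu0 <= hi)                    # mu0 in C(x0; alpha)
print(f"CI = ({lo:.3f}, {hi:.3f}); duality verified on the grid")
```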
When the lower bound θ_L(X) and upper bound θ_U(X) of a CI, i.e.
   P(θ_L(X) ≤ θ ≤ θ_U(X)) = 1−α,
are monotone increasing functions of X, the equivalence takes the form:
   θ_L(x) ≤ θ ≤ θ_U(x) ⇔ θ_U⁻¹(θ) ≤ x ≤ θ_L⁻¹(θ).
Optimality of Confidence Intervals
The mathematical duality between HT and CIs is particularly useful in defining an optimal CI that is analogous to an optimal N-P test. The key result is that the acceptance region of an α-significance level test T_α that is Uniformly Most Powerful (UMP) gives rise to a (1−α) CI that is Uniformly Most Accurate (UMA), i.e. one that minimizes the error coverage probability that C(x;α) contains a value θ ≠ θ₀, where θ₀ is assumed to be the true value; Lehmann (1986). Similarly, an α-UMPU test gives rise to a UMAU Confidence Interval. Moreover, it can be shown that such optimal CIs have the shortest length in the following sense: if [θ_L(x), θ_U(x)] is a (1−α)-UMAU CI, then its length [width] is less than or equal to that of any other (1−α) CI, say [θ̃_L(x), θ̃_U(x)]:
   E_{θ*}(θ_U(x) − θ_L(x)) ≤ E_{θ*}(θ̃_U(x) − θ̃_L(x)).
Underlying reasoning: CI vs. HT
Does this mathematical duality render HT and CIs equivalent in terms of their respective inference results? No!
The primary source of the confusion on this issue is that they share a common pivotal function, and their error probabilities are evaluated using the same tail areas. But that is where the analogy ends. To bring out how their inferential claims and respective error probabilities are fundamentally different, consider the simple (one parameter) Normal model (table 3), where the common pivot is:
   d(X;μ) = √n(X̄ₙ−μ)/σ ∼ N(0,1).   (53)
Since μ is unknown, (53) is meaningless as it stands, and it is crucial to understand how the CI and HT procedures render it meaningful using different types of reasoning, referred to as factual and hypothetical, respectively:
   CI (factual, under the True State of Nature (TSN)): d(X;μ*) = √n(X̄ₙ−μ*)/σ ∼ N(0,1),   (54)
   HT (hypothetical, under H₀): d(X) = √n(X̄ₙ−μ₀)/σ ∼ N(0,1).
Putting the (1−α) CI side-by-side with the (1−α) acceptance region:
   P(X̄ₙ − c_{α/2}(σ/√n) ≤ μ ≤ X̄ₙ + c_{α/2}(σ/√n); μ=μ*) = 1−α,   (55)
   P(μ₀ − c_{α/2}(σ/√n) ≤ X̄ₙ ≤ μ₀ + c_{α/2}(σ/√n); μ=μ₀) = 1−α,
it is clear that their inferential claims are crucially different:
[a] The CI claims that, with probability (1−α), the random bounds [X̄ₙ ± c_{α/2}(σ/√n)] will cover (overlay) the true μ* [whatever that happens to be], under the TSN.
[b] The test claims that, with probability (1−α), test T_α would yield a result that accords equally well or better with H₀ (x: |d(x;μ₀)| ≤ c_{α/2}), if H₀ were true.
Hence, their claims pertain to the parameter vs. the sample space, and the evaluation under the TSN (μ=μ*) vs. under the null (μ=μ₀) renders the two statements inferentially very different. The idea that a given (known) μ₀ and the unknown true μ* are interchangeable is an illusion. This can be seen most clearly in the simple Bernoulli case, where θ needs to be estimated using θ̂ₙ = (1/n)Σᵢ₌₁ⁿ Xᵢ for the UMAU (1−α) CI:
   P(θ̂ₙ − c_{α/2}√(θ̂ₙ(1−θ̂ₙ)/n) ≤ θ < θ̂ₙ + c_{α/2}√(θ̂ₙ(1−θ̂ₙ)/n); θ=θ*) = 1−α,
but the acceptance region of the α-UMPU test is defined in terms of the hypothesized value θ₀:
   P(θ₀ − c_{α/2}√(θ₀(1−θ₀)/n) ≤ θ̂ₙ < θ₀ + c_{α/2}√(θ₀(1−θ₀)/n); θ=θ₀) = 1−α.
The fact is that there is only one θ*, but an infinity of possible values θ₀. As a result, the HT hypothetical reasoning is equally well-defined pre-data and post-data. In contrast, there is only one TSN, which has already played out post-data, rendering the coverage error probability degenerate, i.e. the observed CI:
   [x̄ₙ − c_{α/2}(σ/√n), x̄ₙ + c_{α/2}(σ/√n)]   (56)
either includes or excludes the true value μ*.
Observed Confidence Intervals and severity
In addition to addressing the fallacies of acceptance and rejection, the post-data severity evaluation can be used to address the issue of degenerate post-data error probabilities and the inability to distinguish between different values of μ within an observed CI; Mayo and Spanos (2006). This is accomplished by making two key changes:
(a) Replace the factual reasoning underlying CIs, which becomes degenerate post-data, with the hypothetical reasoning underlying the post-data severity evaluation.
(b) Replace any claims pertaining to overlaying the true μ* with severity-based inferential claims of the form:
   μ > μ₁ = μ₀ + γ, or μ ≤ μ₁ = μ₀ + γ, for some γ ≥ 0.   (57)
This can be achieved by relating the observed bounds:
   x̄ₙ ± c_{α/2}(σ/√n)   (58)
to particular values of μ₁ associated with (57), say μ₁ = x̄ₙ − c_{α/2}(σ/√n), and evaluating the post-data severity of the claim:
   μ > μ₁ = x̄ₙ − c_{α/2}(σ/√n).
A moment's reflection, however, suggests that the connection between establishing the warranted discrepancy γ from the null value μ₀ and the observed CI (58) is more apparent than real, because the severity evaluation probability is attached to the inferential claim (57), and not to the particular value μ₁. Hence, at best one would be creating the illusion that the severity assessment on the boundary of a 1-sided observed CI is related to the confidence level; it is not!
The truth of the matter is that the inferential claim in (57) does not pertain directly to μ*, but to the warranted discrepancy γ in light of x₀.
Fallacious arguments for using Confidence Intervals
A. Confidence Intervals are more reliable than p-values
It is often argued in the social science statistical literature that CIs are more reliable than p-values because the latter are vulnerable to the large-n problem: as the sample size n → ∞, the p-value becomes smaller and smaller, and hence one would reject any null hypothesis given enough observations. What these critics do not seem to realize is that a CI like:
   P(X̄ₙ − c_{α/2}(σ/√n) ≤ μ ≤ X̄ₙ + c_{α/2}(σ/√n); μ=μ*) = 1−α
is equally vulnerable to the large-n problem, because the length of the interval:
   [x̄ₙ + c_{α/2}(σ/√n)] − [x̄ₙ − c_{α/2}(σ/√n)] = 2c_{α/2}(σ/√n)
shrinks to zero as n → ∞.
B. Is the middle of a Confidence Interval more probable?
Despite the obvious fact that post-data the probability of coverage is either zero or one, there have been numerous recurring attempts in the literature to discriminate among the different values of μ within an observed interval:
   "... the difference between population means is much more likely to be near the middle of the confidence interval than towards the extremes." (Gardner and Altman, 2000, p. 22)
This claim is clearly fallacious. A moment's reflection suggests that viewing all possible observed CIs (56) as realizations from the single distribution in (54) leads one to the conclusion that, unless the particular x̄ₙ happens (by accident) to coincide with the true value μ*, values to the right or the left of x̄ₙ will be more likely, depending on whether x̄ₙ lies to the left or the right of μ*. Given that μ* is unknown, however, such an evaluation cannot be made in practice with a single realization. This is illustrated below with 20 realized CIs for μ in the case of the simple (one parameter) Normal model.
   [Figure: twenty observed CIs, numbered 1-20, each of the form x̄ₙ ± c_{α/2}(σ/√n) and centered at its own realization x̄ₙ, displayed against a vertical line marking the true value μ*.]
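The point of the figure can be mimicked by simulation; a sketch generating twenty observed 95% CIs, where the true value μ* = 0, σ = 1 and n = 25 are illustrative assumptions; on average about one interval in twenty misses μ*.

```python
# Sketch: twenty simulated 95% CIs for mu; each either covers mu* or misses it,
# and about one in twenty misses on average.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu_star, sigma, n, alpha = 0.0, 1.0, 25, 0.05
c = norm.ppf(1 - alpha / 2)
for i in range(20):
    xbar = rng.normal(mu_star, sigma / np.sqrt(n))  # draw of the sample mean
    lo, hi = xbar - c * sigma / np.sqrt(n), xbar + c * sigma / np.sqrt(n)
    tag = "covers" if lo <= mu_star <= hi else "MISSES"
    print(f"{i + 1:2d}: [{lo:+.3f}, {hi:+.3f}] {tag} mu*")
```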