Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
A conversation I had with Sir David Cox (“Statistical Science and Philosophy of
Science: Where Do/Should They Meet (in 2011 and Beyond)?”) began with his
asking:
Cox: Deborah, in some fields foundations do not seem very important, but we both
think that foundations of statistical inference are important; why do you think that is?
Mayo: I think it’s because they ask fundamental questions about evidence,
inference, and probability. …In statistics we’re so intimately connected to the
scientific interest in learning about the world that we invariably cross into
philosophical questions about empirical knowledge and inductive inference.
When Sir David Cox was unable to make it to give my seminar at the LSE last
December, I presented my construal of his philosophy of “frequentist” statistical
methods (based on scads of papers over 60 years, 2 recent books of his on
“Principles”, our 2 joint papers)

1
 Moving away from caricatures of the standard methods (e.g., significance tests,
confidence intervals) on which critics often focus, the tools emerge as satisfying
piecemeal learning goals within a multiplicity of methods and interconnected
checks and reports of error.
 We deny the assumption that the overall unifying principle for these methods
must merely be the control of long-run error probabilities of methods (the
behavioristic rationale).
 We advance an epistemological principle that renders hypothetical error
probabilities relevant for learning about particular phenomena.
 We contrast it with current Bayesian attempts at unifications of frequentist and
Bayesian methods.

2
Decision and Inference: Cox also distinguishes uses of statistical methods for
inference and for decision: the former is open-ended (inferences are not fixed in
advance), without explicit quantification of costs.
Background Knowledge: In the statistical analysis of scientific and technological
data, there is virtually always external information that should enter in reaching
conclusions about what the data indicate with respect to the primary question of
interest.
Typically, these background considerations enter not by a probability assignment (as
in the Bayesian paradigm), but by identifying the question to be asked, designing the
study, specifying an appropriate statistical model, interpreting the statistical results,
and relating those inferences to primary scientific ones, and using them to extend and
support underlying theory.
--a very large issue in current foundations

3
Roles of Probability in Induction
Inductive inference: the premises (background and evidence statements) can be true
while the conclusion inferred may be false without a logical contradiction:
the conclusion is "evidence transcending".
probability naturally arises in capturing induction, but there is more than one way this
can occur: To measure:
1. how much belief or support there is for a hypothesis
2. how reliable or how well tested a hypothesis is
 Too often it is presupposed that it can only be the former (probabilism), leading
to well-known difficulties with using frequentist ideas (universes are not as plenty as
blackberries, as Peirce put it)
 our interest is in how frequentist probability may be used for the latter (for
measuring corroboration, trustworthiness, severity of tests)…

4
For instance, significance testing uses probability to characterize the proportion of
cases in which a null hypothesis H0 would be erroneously rejected in a hypothetical
long-run of sampling, an error probability.
We may even refer to frequentist statistical inference as error probability statistics, or
just error statistics.
Cox’s analogy with calibrating instruments:
"Arguments involving probability only via its (hypothetical) long-run frequency
interpretation are called frequentist. That is, we define procedures for assessing
evidence that are calibrated by how they would perform were they used repeatedly.
In that sense they do not differ from other measuring instrument." (Cox 2006, p. 8)
As with the use of measuring instruments, we employ the performance features to
make inferences about aspects of the particular thing that is measured, (aspects that
the measuring tool is appropriately capable of revealing).

5
It’s not hard to see how we do so, if we remember the central problem of inductive
inference.
The argument from:
H entails data y, (H “fits” y)
y is observed,
therefore H is correct,
is, of course, deductively invalid.
A central problem is to be able nevertheless to warrant inferring H, or regarding y as
good evidence for H.
Although y accords with H, even in deterministic cases, many rival hypotheses (some
would say infinitely many) would also fit or predict y, and would pass as well as H.

6
In order for a test to be probative, one wants the prediction from H to be something
that would be very difficult to achieve, and not easily accounted for, were H false and
rivals to H correct (whatever they may be)
i.e., the test should have high probability of falsifying H, if H is false….

7
Statistical Significance Tests
While taking elements from Fisherian and Neyman-Pearson approaches, our
conception differs from both
1. We have empirical data y treated as observed values of a random variable (vector)
Y.
y is of interest to provide information about the probability distribution of Y as
defined by the relevant statistical model—often an idealized representation of
the underlying data generating process.
2. We have a hypothesis framed in terms of parameters of the distribution, the
hypothesis under test or the null hypothesis H0.
An elementary example:
Data Y consists of n Independent and Identically Distributed (IID) components
(r.v.’s), Normally distributed with unknown mean µ and possibly unknown standard
deviation σ:
Yi ~ NIID(µ, σ²), i = 1, 2, …, n.

8
 A simple hypothesis is obtained if the value of σ is known and H0 asserts that
µ = µ0, a given constant.
We find a function T = t(Y) of the sample, the test statistic, such that:
(i) The larger the value of t the more inconsistent or discordant the data are with
H0 in the respect tested
(ii) The statistic T = t(Y) has a (numerically) known probability distribution when
H0 is true.
The p-value (observed statistical significance level) corresponding to t(y) is:
P(t(Y) ≥ t(y); H0 ):=P(T ≥ t; H0 )
regarded as a measure of accordance with H0 in the respect tested.
So we have: (1) data (and corresponding sample space), (2) hypotheses, (3) a test
statistic, (4) a p-value (or other error probability) as a measure of discordance or
accordance with H0 in the respect tested.
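These elements can be sketched in code for the Normal example above (σ known, one-sided test of µ = µ0; the function names are illustrative, not from the slides):

```python
from math import erf, sqrt

def phi(z):
    # standard Normal CDF, via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

def p_value(ybar, mu0, sigma, n):
    """P(t(Y) >= t(y); H0) for the test statistic
    t(Y) = sqrt(n) * (Ybar - mu0) / sigma."""
    t_obs = sqrt(n) * (ybar - mu0) / sigma
    return 1 - phi(t_obs)

# e.g. n = 25, sigma = 1, H0: mu = 0, observed mean 0.4:
# t = 2.0, so the p-value is 1 - phi(2.0), about .023
```

The larger t(y) is, the more discordant the data are with H0 in the respect tested, and the smaller the tail area P(T ≥ t; H0).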

9
Low p-values, if properly computed, count as evidence against H0 (evidence of a
specified discrepancy or inconsistency with H0).

10
Our treatment contrasts with what is often taken as the strict Neyman-Pearson (N-P)
formulation:
 Rejects the recipe-like view of tests, e.g., set a preassigned threshold value α and
"reject" H0 if and only if p ≤ α.
However, such “behavioristic” construals have roles in this conception: While a
relatively mechanical use of p-values is widely lampooned, there are contexts where
it serves as a screening device, decreasing the rates of publishing misleading results
(e.g., microarrays)
Virtues: impartiality, relative independence from manipulation, and the protection
afforded by known and desirable long-run properties.
But the main interest and novelty here is developing an inferential or evidential
rationale.

11
For example: low p-values, if properly computed, count as evidence against H0
(evidence of a specified discrepancy or inconsistency with H0).
Why?

12
It’s true that such a rule provides low error rates (i.e., erroneous rejections) in the
long run when H0 is true, a behavioristic argument:
Suppose that we were to treat the data as just decisive evidence against H0, then, in
hypothetical repetitions, H0 would be rejected in a long-run proportion p of the
cases in which it is actually true.
(so a p-value is an error probability, contrary to what many allege)
But what matters for us, is that such a rule also provides a way to determine whether
a specific data set provides evidence of inconsistency with or discrepancy from H0.
…along the lines of the probative demand for tests…
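The long-run claim can be checked by a small simulation (a hedged sketch; names and values are illustrative, with the one-sided cutoff 1.96 corresponding to α = .025 in the Normal model above):

```python
import random
from math import sqrt

def rejection_rate(n=25, cutoff=1.96, trials=10_000, seed=1):
    # proportion of hypothetical repetitions in which H0: mu = 0 is
    # "rejected" (t >= cutoff) when H0 is in fact true
    rng = random.Random(seed)
    rejects = 0
    for _ in range(trials):
        ybar = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        t = sqrt(n) * ybar  # sigma = 1 under the assumed model
        if t >= cutoff:
            rejects += 1
    return rejects / trials

# the observed long-run rate settles near .025, the error probability
```

This exhibits only the behavioristic reading; the inferential question, pursued next, is what such hypothetical error rates license about the specific data set in hand.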

13
FEV: Frequentist Inductive Inference Principle
The reasoning is based on the following frequentist inference principle (with respect
to H0):
(FEV) y fails to count as evidence against H0 if such discordant results are fairly
frequent even when H0 is correct. Stated positively, FEV(i)
(Mayo and Cox 2006/2010, p. 254):
y counts as evidence of a discrepancy only if (and to the extent that) a less discordant
result would probably have occurred, if H0 correctly described the distribution
generating y.
So if the p-value is not small, it is fairly easy (frequent) to generate such discordant
results even if the null is true, so this is not good evidence of a discrepancy from the
null…
Where “such discordant results” are those as or even further from H0 than the
outcome observed.

14
Weight Gain Example.
To distinguish between this "evidential" use of significance test reasoning, and the
familiar appeal to “low long-run erroneous behavior” (N-P), consider a very informal
example:
Suppose that weight gain is measured by a multiplicity of well-calibrated and stable
methods, and the results show negligible change over a test period (e.g., before and
after a trip to England).
This is grounds for inferring that my weight gain is negligible within limits set by the
sensitivity of the scales.
Why?
While it is true that by following such a procedure in the long run one would rarely
report weight gains erroneously, that is not the rationale for the particular inference.

15
The justification is rather that the error probabilistic properties of the weighing
procedure reflect what is actually the case in the specific instance.
(It informs about the specific cause of the lack of a recorded increase in weight).
Low long-run error, while necessary, is not sufficient for warranted evidence in a
particular case.
General Severity Account of Evidence for statistical and non-statistical cases:
FEV falls under a more general account of evidence or inductive inference, extending
beyond statistical hypotheses:
Data y provide evidence for a claim H, just to the extent that H passes a severe test
with y

16
The intuition behind requiring severity is that:
Data y0 in test T provide good evidence for inferring H (just) to the extent that H
passes severely with y0, i.e., to the extent that H would (very probably) not have
survived the test so well were H false.
This may be a quantitative or a qualitative assessment.
A core question that permeates "inductive philosophy", both in statistics and
philosophy, is:
What is the nature and role of probabilistic concepts, methods, and models in making
inferences in the face of limited data, uncertainty, and error?

17
Mayo and Cox (2010) identify a key principle, the Frequentist Evidential Principle
(FEV), by which hypothetical error probabilities may be used for inductive
inference from specific data, and consider how FEV may direct and justify:
(a) Different uses and interpretations of statistical significance levels in testing a
variety of different types of null hypotheses,
and
(b) when and why "selection effects" need to be taken account of in data dependent
statistical testing.

18
The Role of Probability in Frequentist Induction
The defining feature of an inductive inference is that the premises (background
and evidence statements) can be true while the conclusion inferred may be false
without a logical contradiction:
the conclusion is "evidence transcending".
Probability naturally arises but there is more than one way this can occur.
Two distinct philosophical traditions for using probability in inference are summed
up by Pearson (1950, p. 228) (probabilism and performance):
“For one school, the degree of confidence in a proposition, a quantity varying with
the nature and extent of the evidence, provides the basic notion to which the
numerical scale should be adjusted.
The other school notes the relevance in ordinary life of a knowledge of the relative
frequency of occurrence of a particular class of events in a series of repetitions, and
suggests that "it is through its link with relative frequency that probability has the
most direct meaning for the human mind"

19
Frequentist induction employs probability in the second manner:
For instance, significance testing appeals to probability to characterize the proportion
of cases in which a null hypothesis H0 would be rejected in a hypothetical long-run of
repeated sampling, an error probability.
We may even refer to frequentist statistical inference as error probability statistics, or
just error statistics.
Some, following Neyman, describe the role of probability in frequentist error
statistics as “inductive behavior”:
 Here the inductive reasoner "decides to infer" the conclusion, and probability
quantifies the associated risk of error.
The idea that one role of probability arises in science to characterize the "riskiness"
or probativeness or severity of the tests to which hypotheses are put is reminiscent of
the philosophy of Karl Popper.

20
Popper and Neyman:
 Popper and Neyman have a broadly analogous approach based on the idea that
we can speak of a hypothesis having been well-tested in some sense, quite
different from its being accorded a degree of probability, belief or confirmation;
 Both also broadly shared the view that in order for data to "confirm" or
"corroborate" a hypothesis H, that hypothesis would have to have been subjected
to a test with high probability or power to have rejected it if false.
But the similarities between Neyman and Popper do not go as deeply as one might
suppose.
 Despite the close connection of the ideas, there appears to be no reference to
Popper in the writings of Neyman (Lehmann, 1995, p. 3), and the references by
Popper to Neyman are scarce (e.g., in The Logic of Scientific Discovery).
 Moreover, because Popper denied that any inductive claims were justifiable, his
philosophy forced him to deny that even the method he espoused (conjecture
and refutations) was reliable (not that it’s known to be unreliable)

21
 Although H might be true, Popper made it clear that he regarded corroboration
at most as a report of the past performance of H: it warranted no claims about
its reliability in future applications.

22
Popper and Fisher:
Popper’s focus on falsification as the goal of tests, and indeed of science, is
clearly strongly redolent of Fisher.
While, again, evidence of direct influence is virtually absent, Popper would have
concurred with Fisher (1935a, p. 16) that ‘every experiment may be said to exist only
in order to give the facts the chance of disproving the null hypothesis’.
However, Popper's position denies ever having grounds for inference about
reliability, or for inferring reproducible deviations.
 By contrast, a central feature of frequentist statistics is to actually assess and
control the probability that a test would have rejected a hypothesis, if false.
The advantage in the modern statistical framework is that the probabilities arise from
defining a probability (statistical) model to represent the phenomenon of interest.
Neyman describes frequentist statistics as modelling the phenomenon of the stability
of relative frequencies of results of repeated "trials", granting there are other
possibilities concerned with modelling psychological phenomena connected with
intensities of belief, or with readiness to bet specified sums, etc.
Frequentist statistical methods can (and often do) supply tools for inductive
inference by providing methods for evaluating the severity or probativeness of tests
—although they don’t directly do so, the severity rationale gives guidance in using
them in this way.

24
A hypothesis H passes a severe test T with data y0 if,
(S-1) y0 agrees with H, and
(S-2) with very high probability, test T would have produced a result that
accords less well with H than y0 does, if H were false.
In statistical contexts, the test statistic t(Y) must define an appropriate measure of
accordance or distance (as required by severity condition (S-1)).
Note: when t(Y) is statistically significantly different from H0, it is agreeing with not-H0.
‘The severity with which inference H passes test T with outcome y’ may be
abbreviated by:
SEV(Test T, outcome y, claim H)

25
Just as one makes inferences about changes in body mass based on performance
characteristics of various scales, error probabilities of tests indicate the capacity
of the particular test to have revealed inconsistencies in the respects probed, and
this in turn allows relating p-values to statements about the underlying process.
Peirce: shall we assume instead that the scales read my mind and mislead me just
when I don't know the weight, but do fine with items of known weight (e.g., 5 lb
potatoes)? That would itself be a highly unreliable way to proceed in reasoning from
instruments…
Note: Clue to showing counterinduction is unreliable.

26
A Multiplicity of Types of Null Hypotheses (Cox’s taxonomy)
Considerable criticisms and abuses of significance tests would have been avoided had
Cox’s (1958) taxonomy been an integral part of significance testing.
Types of Null Hypothesis (multiple uses in piecemeal steps involved in
linking data and models to learn about modeled phenomena)
1. Embedded null hypotheses
2. Dividing null hypotheses:
3. Null hypotheses of absence of structure
4. Null hypotheses of model adequacy
5. Substantively-based null hypotheses.
Evaluating such local hypotheses for these piecemeal learning goals demands its
own criteria

27
1. Embedded null hypotheses
Consider a parametric family of distributions f(y; θ), indexed by an unknown (vector
of) parameter θ = (ψ, λ),
where λ denotes unknown nuisance parameter(s).
The null and alternative hypotheses of interest are:
H0: ψ = ψ0 vs. H1: ψ ≠ ψ0.
Central tasks, often ignored, take center stage in Cox’s discussion and indeed are an
important part of the justification of this approach (though I must be skimpy here):
(a) —need to choose appropriate test statistic T(Y)
(b) —need to be able to compute the probability distribution of T under the null (and
perhaps alternative) hypotheses (it should not depend on nuisance parameters)

28
(c) —need to collect data y so that they adequately satisfy the assumptions of the
relevant probability model.
If we succeed with all that….
p-value not small: evidence the data are consistent with the null…
But no evidence against is not automatically evidence for the null…
but we can infer the absence of a discrepancy from (or concordance with) H0.
To infer the absence of a discrepancy from H0 as large as δ, examine the probability
β(δ):
We have a result and it differs from H0 by some observed amount d.
β(δ) = Prob[a result that differs as much as or more from H0 (than d); evaluated under
the assumption that µ = µ0 + δ],
β(δ) = P(t(Y) > t(y); µ = µ0 + δ).
If β(δ) is near 1, then, following FEV, the data are good evidence that µ < µ0 + δ.
Terms: "discrepancy" refers to a difference in parameters; "discord" and "fit" refer to
observed differences.


29
Interpreting β(δ):
β(δ) may be regarded as the stringency or severity with which the test has probed the
discrepancy δ;
that is, µ ≤ µ0 + δ has passed a severe test.
(other terms: precision, sensitivity)
 Avoids unwarranted interpretations of “consistency with H0” with insensitive
tests (“fallacies of acceptance”).
 A fallacy of acceptance takes an insignificant difference as evidence that the null
is true; but tests have limited discernment power.
 More relevant to specific data than is a test’s power, which is calculated relative
to a predesignated critical value c beyond which the test "rejects" the null.
Power at δ:
POW(δ) = P(t(Y) > c; µ = µ0 + δ)
In contrast:
β(δ) = P(t(Y) > t(y); µ = µ0 + δ)
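The two quantities can be sketched side by side for the one-sided Normal test with σ known (illustrative names; under µ = µ0 + δ the standardized statistic t(Y) is Normal with mean δ√n/σ and unit variance, so each is a Normal tail area):

```python
from math import erf, sqrt

def phi(z):
    # standard Normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(delta, c, sigma, n):
    # POW(delta) = P(t(Y) > c; mu0 + delta), with c a preset cutoff
    return 1 - phi(c - delta * sqrt(n) / sigma)

def beta(delta, t_obs, sigma, n):
    # beta(delta) = P(t(Y) > t(y); mu0 + delta): uses the observed t,
    # not the cutoff, so it is tailored to the specific data
    return 1 - phi(t_obs - delta * sqrt(n) / sigma)

# an insignificant result t_obs = 0.5 with n = 100, sigma = 1:
# beta(0.2) = 1 - phi(0.5 - 2.0) = phi(1.5), about .93, so the data
# are good evidence that mu < mu0 + 0.2
```

Because β(δ) conditions on the observed t(y) rather than the predesignated cutoff c, it is more relevant to the specific data than the test's power at δ.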

30
An important passage (Cox 2008, p. 25):
“In the Neyman-Pearson theory of tests, sensitivity of a test is assessed by the notion
of power….. In the approach adopted here the assessment is via the distribution of
the random variable P, again considered for various alternatives”
P here is the significance probability regarded as a random variable—which it is…
Cox makes an equivalent move referring to confidence intervals—I'll come back to…
Now consider small p-values (evidence of some discrepancy in the direction of the
alternative).
Critics correctly note that the p-value alone doesn’t tell you the size of the
discrepancy indicated (the “effect size”), but it, together with the sample size, can be
used to determine this:
 If a test is so sensitive that so small a p-value is probable even when µ ≤ µ0 + δ,
then a small p-value is not evidence of a discrepancy from H0 in excess of δ.
So by this reasoning we also avoid “fallacies of rejection”

31
5. Substantively-based null hypotheses.
A theory T predicts that H0 is at least a very close approximation to the true situation
(perhaps T has already passed several theoretical and empirical tests).
Rival theory T* predicts a specified discrepancy from H0 and the test is designed to
discriminate between T and T* in a thus far untested domain.
Focus on interpreting: p-value not small.
H0 may be formed deliberately to let T “stick its neck out”
Discrepancies from T in the direction of T* are given a very good chance to be
detected, so if no significant departure is found, this constitutes evidence for T in the
respect tested
Famous “null results” take this form (e.g., set the GTR predicted values at 0)
rivals to the GTR predicted a breakdown of the Weak Equivalence Principle (WEP)
for massive self-gravitating bodies, e.g., the earth-moon system: this effect, the
Nordtvedt effect, would be 0 for GTR (identified with the null hypothesis) and non-0
for rivals.

32
Measurements of the round trip travel times between the earth and moon (between
1969 and 1975) set upper bounds to the possible violation of the WEP, and because
the tests were sufficiently sensitive, these measurements provided good evidence that
the Nordtvedt effect is absent, i.e., evidence for the null hypothesis.
Such a negative result does not provide evidence for all of GTR (in all its areas of
prediction), but it does provide evidence for its correctness with respect to this effect.
Some argue I should allow inferring all of GTR (Chalmers, Laudan, Musgrave) —but
I think it is this piecemeal way that inquiries enable progress (realizing that NOT all
of GTR has been warranted provoked developing rivals…)
Unlike the large-scale theory testers, we are not after an account of “theory choice”
but rather learning about aspects of theoretical phenomena by local probes…
Statistical reasoning is at the heart of inquiries which are broken down into questions
about parameters in statistical distributions intermediate between the full theory and
the actual data.
My conception is to view them as probing for piecemeal errors—and note, an error
can be any mistaken claim or flawed understanding—both empirical and theoretical!

33
 A fundamental tenet of the conception of inductive learning most at home with
the frequentist philosophy is that inductive inference requires building up
incisive arguments and inferences by putting together several different piecemeal results.
 Although the complexity of the story makes it more difficult to set out neatly, as,
for example, if a single algorithm is the whole of inductive inference, the payoff
is an account that approaches the kind of full-bodied arguments that scientists
build up in order to obtain reliable knowledge.
 The goal is not representing beliefs or opinions (as in personalist Bayesian
accounts) but avoiding being misled by beliefs and opinions

34
Series of Confidence Intervals (goal: to distinguish inferential and
behaviorist justifications while avoiding an infamous fallacy of R.A. Fisher)
Consider our Normal mean “embedded” example, rather than running a
significance test we may form the
1 − α upper confidence bound, CIU(Y; α)
for estimating mean µ; let σ be known, and write σy = σ/√n.
The upper 1 − α limit is: Ȳ + k(α)σy,
with k(α) the upper α-value of the standard Normal distribution. Pick a small α, say .025.
The upper .975 limit is: Ȳ + 1.96σy.
Consider the inference:
Infer (or regard the data as evidence for)
µ ≤ ȳ + 1.96σy,
i.e., µ ≤ CIU(Y; .025).

35
One rationale is that it instantiates an inference rule that yields true claims with high
probability (.975) because
P(µ ≤ Ȳ + 1.96σy) = .975,
whatever the true value of µ.

36
The procedure has high long-run “coverage probabilities,” which are held to
"rub off" on the particular case.
Instead we might view µ ≤ ȳ + 1.96σy as an inference from a type of reductio ad
absurdum argument:
suppose in fact that this inference is false and the true mean is µ*, where
µ* > ȳ + 1.96σy.
Then it is very probable that we would have observed a larger sample mean:
P(Ȳ > ȳ; µ*) > .975.
Therefore, one can reason, ȳ is inconsistent at level .975 (or .025) with having been
generated from a population with µ in excess of the upper limit.
This reasoning is captured in FEV
-------------------------------------------------

37
Aside: This statistic is directly related to a test of µ ≤ µ0 against µ > µ0.
In particular, Ȳ is statistically significantly smaller than values of µ in excess of
CIU(Y; α) at level α.
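This duality can be illustrated numerically (a sketch with illustrative names, assuming the known-σ Normal setup of the earlier slides): for µ* equal to the upper .975 limit, the probability of so small a sample mean is exactly .025.

```python
from math import erf, sqrt

def phi(z):
    # standard Normal CDF
    return 0.5 * (1 + erf(z / sqrt(2)))

def upper_limit(ybar, sigma, n, k=1.96):
    # upper 1 - alpha confidence bound: ybar + k(alpha) * sigma / sqrt(n)
    return ybar + k * sigma / sqrt(n)

def prob_smaller_mean(ybar, mu_star, sigma, n):
    # P(Ybar <= ybar; mu = mu_star)
    return phi(sqrt(n) * (ybar - mu_star) / sigma)

# with ybar = 0.2, sigma = 1, n = 100: the upper limit is 0.396, and
# P(Ybar <= 0.2; mu = 0.396) = phi(-1.96), about .025
```

That is, the observed mean would be statistically significantly smaller, at level α, than any µ beyond the bound, which is the reductio reasoning above.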

38
Fisher's fiducial error
This may make µ look like a random variable, but it is not; these claims do not
hold once a specific ȳ is plugged in for Ȳ:
P(µ < Ȳ + 0σx̄) = .5
P(µ < Ȳ + .5σx̄) = .7
P(µ < Ȳ + 1σx̄) = .84
P(µ < Ȳ + 1.5σx̄) = .93
P(µ < Ȳ + 1.96σx̄) = .975
“Our (Cox and Hinkley) attitude toward the statement [µ < Ȳ + 1.96σx̄] might then
be taken to be the same as that to other uncertain statements for which the probability
is [.975] of their being true, and hence…is virtually indistinguishable from a
probability statement about [µ]. However, this attitude is in general incorrect, in our
view, because the confidence statement can be known to be true or untrue…… The
system of confidence limits simply summarizes what the data tell us about µ, given
the model….It is wrong to combine confidence limit statements about different
parameters as though the parameters were random variables.” (Cox and Hinkley
1974, p. 227)
But what exactly is it telling us about µ?

40
P(µ < Ȳ + 0σx̄) = .5
P(µ < Ȳ + .5σx̄) = .7
P(µ < Ȳ + 1σx̄) = .84
P(µ < Ȳ + 1.5σx̄) = .93
P(µ < Ȳ + 1.96σx̄) = .975
Answer (Mayo): it tells you which discrepancies are and are not indicated (at various
degrees of severity)
If we replace P with SEV (the degree of severity or even “corroboration”) the claims
are true even after substituting the observed sample mean
(the severity with which the inference has passed the test with the data)

41
SEV(µ < ȳ0 + kσx̄) abbreviates:
the severity with which a test T with an observed result ȳ0 passes the claim
(µ < ȳ0 + kσx̄).
SEV(µ < ȳ0 + 0σx̄) = .5
SEV(µ < ȳ0 + .5σx̄) = .7
SEV(µ < ȳ0 + 1σx̄) = .84
SEV(µ < ȳ0 + 1.5σx̄) = .93
SEV(µ < ȳ0 + 1.96σx̄) = .975
But aren’t I just using this as another way to say how probable each claim is?
No. This would lead to inconsistencies (if we mean mathematical probability); but
the main thing is, or so I argue, that probability gives the wrong logic for how
well-tested (or "corroborated") a claim is.

42
What Would a Logic of Severe Testing be?
If H has passed a severe test T with x, there is no problem in saying x warrants H, or,
if one likes, x warrants believing in H….
(H could be the confidence interval estimate)
If SEV(H ) is high, its denial is low, i.e., SEV(~H ) is low
But it does not follow a severity assessment should obey the probability calculus, or
be a posterior probability….
For example, if SEV(H ) is low, SEV(~H ) could be high or low
……in conflict with a probability assignment
(there may be a confusion of ordinary language use of “probability”)

43
Other points lead me to deny that probability yields the logic we want for
well-testedness (e.g., the problem of irrelevant conjunctions, the tacking paradox for
Bayesians and hypothetico-deductivists).
Contrast this again with the "rubbing-off" construal: the procedure is rarely wrong;
therefore, the probability it is wrong in this case is low.
The long-run reliability of the rule is a necessary but not a sufficient condition to
infer H (with severity)
The reasoning instead is counterfactual:
H: µ < ȳ0 + 1.96σx̄ (i.e., µ < CIu)
passes severely because, were this inference false and the true mean µ > CIu,
then, very probably, we would have observed a larger sample mean.

44
A very well-known criticism of frequentist confidence intervals (many Bayesians say
it is a reductio) stems from assuming that confidence levels must give degrees of
belief.
E.g., critics note that a 95% confidence interval might be known to be correct (given
some additional constraints about the parameter): the so-called trivial intervals.
In our construal, this just says that no parameter values are ruled out with severity…..
“Viewed as a single statement [the trivial interval] is trivially true, but, on the other hand,
viewed as a statement that all parameter values are consistent with the data at a particular
level is a strong statement about the limitations of the data.” (Cox and Hinkley 1974, p.
226)

45
Mayo and Spanos: Severity Interpretations for T+ of “accept” and “reject”

(SEV) for non-statistically significant results:
For “Accept” H0 (a statistically insignificant difference from µ0):
µ ≤ ȳ0 + kε(σ/√n) passes with severity (1 − ε),
where P(d(X) > kε) = ε (for any 0 < ε < 1).

(SEV) for statistically significant results:
For “Reject” H0 (a statistically significant difference from µ0):
µ > ȳ0 − kε(σ/√n) passes with severity (1 − ε).

46
FEV(ii): A moderate p-value is evidence of the absence of a discrepancy δ from H0
only if there is a high probability the test would have given a worse fit with H0 (i.e., a
smaller p-value) were a discrepancy δ to exist.
1

47

  • 1. Frequentist Statistics as a Theory of Inductive Inference (2/27/14) A conversation I had with Sir David Cox, (“Statistical Science and Philosophy of Science: Where Do/Should They Meet (in 2011 and beyond)?” began with his asking: Cox: Deborah, in some fields foundations do not seem very important, but we both think that foundations of statistical inference are important; why do you think that is? Mayo: I think because they ask about fundamental questions about evidence, inference, and probability. …In statistics we’re so intimately connected to the scientific interest in learning about the world, we invariably cross into philosophical questions about empirical knowledge and inductive inference. When Sir David Cox was unable to make it to give my seminar at the LSE last December, I presented my construal of his philosophy of “frequentist” statistical methods (based on scads of papers over 60 years, 2 recent books of his on “Principles”, our 2 joint papers) 1
  • 2.
- Moving away from caricatures of the standard methods (e.g., significance tests, confidence intervals) on which critics often focus, the tools emerge as satisfying piecemeal learning goals within a multiplicity of methods and interconnected checks, and reports, of error.
- We deny the assumption that the overall unifying principle for these methods must be merely controlling the low long-run error probabilities of methods (the behavioristic rationale).
- We advance an epistemological principle that renders hypothetical error probabilities relevant for learning about particular phenomena.
- We contrast it with current Bayesian attempts at unifications of frequentist and Bayesian methods.
  • 3. Decision and Inference: Cox also distinguishes uses of statistical methods for inference and for decision: the former is open-ended (inferences are not fixed in advance), without explicit quantification of costs.

Background Knowledge: In the statistical analysis of scientific and technological data, there is virtually always external information that should enter in reaching conclusions about what the data indicate with respect to the primary question of interest. Typically, these background considerations enter not by a probability assignment (as in the Bayesian paradigm), but by identifying the question to be asked, designing the study, specifying an appropriate statistical model, interpreting the statistical results, relating those inferences to primary scientific ones, and using them to extend and support underlying theory (a very large issue in current foundations).
  • 4. Roles of Probability in Induction

Inductive inference: the premises (background and evidence statements) can be true while the conclusion inferred may be false without a logical contradiction: the conclusion is "evidence transcending".

Probability naturally arises in capturing induction, but there is more than one way this can occur. Probability may measure:
1. how much belief or support there is for a hypothesis
2. how reliable or how well tested a hypothesis is

- It is too often presupposed that it can only be the former (probabilism), leading to well-known difficulties with using frequentist ideas ("universes are not as plentiful as blackberries", as Peirce put it).
- Our interest is in how frequentist probability may be used for the latter (for measuring corroboration, trustworthiness, severity of tests)…
  • 5. For instance, significance testing uses probability to characterize the proportion of cases in which a null hypothesis H0 would be erroneously rejected in a hypothetical long run of sampling, an error probability. We may even refer to frequentist statistical inference as error probability statistics, or just error statistics.

Cox’s analogy with calibrating instruments: "Arguments involving probability only via its (hypothetical) long-run frequency interpretation are called frequentist. That is, we define procedures for assessing evidence that are calibrated by how they would perform were they used repeatedly. In that sense they do not differ from other measuring instruments." (Cox 2006, p. 8)

As with the use of measuring instruments, we employ the performance features to make inferences about aspects of the particular thing that is measured (aspects that the measuring tool is appropriately capable of revealing).
  • 6. It’s not hard to see how we do so, if we remember the central problem of inductive inference. The argument from:

H entails data y (H “fits” y),
y is observed,
therefore H is correct

is, of course, deductively invalid. A central problem is to be able nevertheless to warrant inferring H, or regarding y as good evidence for H. Although y accords with H, even in deterministic cases many rival hypotheses (some would say infinitely many) would also fit or predict y, and would pass as well as H.
  • 7. In order for a test to be probative, one wants the prediction from H to be something that would be very difficult to achieve, and not easily accounted for, were H false and rivals to H correct (whatever they may be); i.e., the test should have a high probability of falsifying H, if H is false….
  • 8. Statistical Significance Tests

While taking elements from Fisherian and Neyman-Pearson approaches, our conception differs from both.

1. We have empirical data y treated as observed values of a random variable (vector) Y. y is of interest to provide information about the probability distribution of Y as defined by the relevant statistical model, often an idealized representation of the underlying data generating process.

2. We have a hypothesis framed in terms of parameters of the distribution, the hypothesis under test or the null hypothesis H0.

An elementary example: data Y consist of n Independent and Identically Distributed (IID) components (r.v.’s), Normally distributed with unknown mean µ and possibly unknown standard deviation σ: Yi ~ NIID(µ, σ), i = 1, 2, …, n.
  • 9. A simple hypothesis is obtained if the value of σ is known and H0 asserts that µ = µ0, a given constant.

We find a function T = t(Y) of the sample, the test statistic, such that:
(i) the larger the value of t, the more inconsistent or discordant the data are with H0 in the respect tested;
(ii) the statistic T = t(Y) has a (numerically) known probability distribution when H0 is true.

The p-value (observed statistical significance level) corresponding to t(y) is:
P(t(Y) ≥ t(y); H0) := P(T ≥ t; H0),
regarded as a measure of accordance with H0 in the respect tested.

So we have: (1) data (and the corresponding sample space), (2) hypotheses, (3) a test statistic, (4) a p-value (or other error probability) as a measure of discordance or accordance with H0 in the respect tested.
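As a concrete illustration of the p-value just defined, here is a minimal sketch in Python for the Normal-mean example; the particular numbers (n = 100, σ = 1, observed mean 0.2) are hypothetical, chosen only for illustration:

```python
from math import sqrt
from statistics import NormalDist

def p_value_upper(ybar, mu0, sigma, n):
    """One-sided p-value P(t(Y) >= t(y); H0) for H0: mu = mu0,
    probing discrepancies in the positive direction, where
    t(Y) = sqrt(n) * (Ybar - mu0) / sigma ~ N(0, 1) under H0."""
    t = sqrt(n) * (ybar - mu0) / sigma
    return 1.0 - NormalDist().cdf(t)

# Hypothetical data: 100 observations with sigma = 1 and observed
# mean 0.2 give t = 2.0, so the p-value is about 0.023.
p = p_value_upper(ybar=0.2, mu0=0.0, sigma=1.0, n=100)
```

The smaller p is, the more discordant the data are with H0 in the respect tested, per clause (i) of the definition above.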
  • 10. Low p-values, if properly computed, count as evidence against H0 (evidence of a specified discrepancy or inconsistency with H0).
  • 11. Our treatment contrasts with what is often taken as the strict Neyman-Pearson (N-P) formulation:
- It rejects the recipe-like view of tests, e.g., set a preassigned threshold value α and "reject" H0 if and only if p ≤ α.

However, such “behavioristic” construals have roles in this conception: while a relatively mechanical use of p-values is widely lampooned, there are contexts where it serves as a screening device, decreasing the rates of publishing misleading results (e.g., microarrays).

Virtues: impartiality, relative independence from manipulation, protection in the form of known and desirable long-run properties. But the main interest and novelty here is developing an inferential or evidential rationale.

P(t(Y) ≥ t(y); H0) := P(T ≥ t; H0)
  • 12. For example: low p-values, if properly computed, count as evidence against H0 (evidence of a specified discrepancy or inconsistency with H0). Why?
  • 13. It’s true that such a rule provides low error rates (i.e., rates of erroneous rejections) in the long run when H0 is true, a behavioristic argument: suppose that we were to treat the data as just decisive evidence against H0; then, in hypothetical repetitions, H0 would be rejected in a long-run proportion p of the cases in which it is actually true. (So a p-value is an error probability, unlike what many allege.)

But what matters for us is that such a rule also provides a way to determine whether a specific data set provides evidence of inconsistency with, or discrepancy from, H0 …along the lines of the probative demand for tests…
  • 14. FEV: Frequentist Inductive Inference Principle

The reasoning is based on the following frequentist inference principle (with respect to H0):

(FEV) y fails to count as evidence against H0 if such discordant results are fairly frequent even if H0 is correct.

Actually, FEV(i) (Mayo and Cox 2006/2010, 254): y counts as evidence of a discrepancy only if (and to the extent that) a less discordant result would probably have occurred, if H0 correctly described the distribution generating y.

Here “such discordant results” are those as far from, or even further from, H0 than the outcome observed. So if the p-value is not small, it is fairly easy (frequent) to generate such discordant results even if the null is true, so this is not good evidence of a discrepancy from the null…
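The "fairly frequent even if H0 is correct" clause can be checked directly by simulation. The sketch below (hypothetical numbers, not drawn from the text) estimates, under H0, how often a result at least as discordant as a given observed statistic would arise; the relative frequency recovers the p-value:

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)
n, mu0, sigma = 25, 0.0, 1.0
t_obs = 1.0  # a mildly discordant observed statistic (p is not small)

# Under H0, repeatedly draw samples and count results as or more
# discordant than the observed one, i.e. t(Y) >= t_obs.
reps = 40_000
count = 0
for _ in range(reps):
    ybar = sum(random.gauss(mu0, sigma) for _ in range(n)) / n
    t = sqrt(n) * (ybar - mu0) / sigma
    if t >= t_obs:
        count += 1

freq = count / reps                 # simulated frequency under H0
p = 1.0 - NormalDist().cdf(t_obs)   # exact p-value, about 0.16
```

With a p-value near 0.16, such discordant results occur in roughly one of every six repetitions even when the null is true, so by FEV this outcome fails to count as evidence against H0.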
  • 15. Weight Gain Example. To distinguish this "evidential" use of significance test reasoning from the familiar appeal to “low long-run erroneous behavior” (N-P), consider a very informal example.

Suppose that weight gain is measured by a multiplicity of well-calibrated and stable methods, and the results show negligible change over a test period (e.g., before and after a trip to England). This is grounds for inferring that my weight gain is negligible within limits set by the sensitivity of the scales. Why?

While it is true that by following such a procedure in the long run one would rarely report weight gains erroneously, that is not the rationale for the particular inference.
  • 16. The justification is rather that the error-probabilistic properties of the weighing procedure reflect what is actually the case in the specific instance. (It informs about the specific cause of the lack of a recorded increase in weight.) Low long-run error, while necessary, is not sufficient for warranted evidence in a particular case.

General Severity Account of Evidence (for statistical and non-statistical cases): FEV falls under a more general account of evidence or inductive inference, extending beyond statistical hypotheses:

Data y provide evidence for a claim H just to the extent that H passes a severe test with y.
  • 17. The intuition behind requiring severity is that data y0 in test T provide good evidence for inferring H (just) to the extent that H passes severely with y0, i.e., to the extent that H would (very probably) not have survived the test so well were H false. This may be a quantitative or a qualitative assessment.

(A core question that permeates "inductive philosophy", both in statistics and philosophy, is: what is the nature and role of probabilistic concepts, methods, and models in making inferences in the face of limited data, uncertainty and error?)
  • 18. Mayo and Cox (2010) identify a key principle, the Frequentist Evidential Principle (FEV), by which hypothetical error probabilities may be used for inductive inference from specific data, and consider how FEV may direct and justify:
(a) different uses and interpretations of statistical significance levels in testing a variety of different types of null hypotheses, and
(b) when and why "selection effects" need to be taken account of in data-dependent statistical testing.
  • 19. The Role of Probability in Frequentist Induction

The defining feature of an inductive inference is that the premises (background and evidence statements) can be true while the conclusion inferred may be false without a logical contradiction: the conclusion is "evidence transcending". Probability naturally arises, but there is more than one way this can occur.

Two distinct philosophical traditions for using probability in inference (probabilism and performance) are summed up by Pearson (1950, p. 228):

“For one school, the degree of confidence in a proposition, a quantity varying with the nature and extent of the evidence, provides the basic notion to which the numerical scale should be adjusted. The other school notes the relevance in ordinary life of a knowledge of the relative frequency of occurrence of a particular class of events in a series of repetitions, and suggests that ‘it is through its link with relative frequency that probability has the most direct meaning for the human mind’.”
  • 20. Frequentist induction employs probability in the second manner. For instance, significance testing appeals to probability to characterize the proportion of cases in which a null hypothesis H0 would be rejected in a hypothetical long run of repeated sampling, an error probability. We may even refer to frequentist statistical inference as error probability statistics, or just error statistics.

Some, following Neyman, describe the role of probability in frequentist error statistics as “inductive behavior”: here the inductive reasoner "decides to infer" the conclusion, and probability quantifies the associated risk of error.

The idea that one role of probability in science is to characterize the "riskiness" or probativeness or severity of the tests to which hypotheses are put is reminiscent of the philosophy of Karl Popper.
  • 21. Popper and Neyman:
- Popper and Neyman have a broadly analogous approach based on the idea that we can speak of a hypothesis having been well-tested in some sense, quite different from its being accorded a degree of probability, belief or confirmation.
- Both also broadly shared the view that in order for data to "confirm" or "corroborate" a hypothesis H, that hypothesis would have to have been subjected to a test with high probability or power to have rejected it if false.

But the similarities between Neyman and Popper do not go as deeply as one might suppose.
- Despite the close connection of the ideas, there appears to be no reference to Popper in the writings of Neyman (Lehmann, 1995, p. 3), and the references by Popper to Neyman are scarce (e.g., in LSD).
- Moreover, because Popper denied that any inductive claims were justifiable, his philosophy forced him to deny that even the method he espoused (conjectures and refutations) was reliable (not that it’s known to be unreliable).
  • 22. Although H might be true, Popper made it clear that he regarded corroboration at most as a report of the past performance of H: it warranted no claims about its reliability in future applications.
  • 23. Popper and Fisher: Popper’s focus on falsification as the goal of tests, and indeed of science, is clearly strongly redolent of Fisher. While, again, evidence of direct influence is virtually absent, Popper would have concurred with Fisher (1935a, p. 16) that ‘every experiment may be said to exist only in order to give the facts the chance of disproving the null hypothesis’.

However, Popper's position denies ever having grounds for inference about reliability, or for inferring reproducible deviations.
- By contrast, a central feature of frequentist statistics is to actually assess and control the probability that a test would have rejected a hypothesis, if false.

The advantage in the modern statistical framework is that the probabilities arise from defining a probability (statistical) model to represent the phenomenon of interest. Neyman describes frequentist statistics as modelling the phenomenon of the stability of relative frequencies of results of repeated "trials", granting there are other
  • 24. possibilities concerned with modelling psychological phenomena connected with intensities of belief, or with readiness to bet specified sums, etc.

Frequentist statistical methods can (and often do) supply tools for inductive inference by providing methods for evaluating the severity or probativeness of tests. Although they don’t directly do so, the severity rationale gives guidance in using them in this way.
  • 25. A hypothesis H passes a severe test T with data y0 if:
(S-1) y0 agrees with H, and
(S-2) with very high probability, test T would have produced a result that accords less well with H than y0 does, if H were false.

In statistical contexts, the test statistic t(Y) must define an appropriate measure of accordance or distance (as required by severity condition (S-1)). Note: when t(Y) is statistically significantly different from H0, it is agreeing with not-H0.

‘The severity with which inference H passes test T with outcome y’ may be abbreviated by: SEV(Test T, outcome y, claim H).
  • 26. Just as one makes inferences about changes in body mass based on performance characteristics of various scales, error probabilities of tests indicate the capacity of the particular test to have revealed inconsistencies in the respects probed, and this in turn allows relating p-values to statements about the underlying process.

Peirce: shall we assume instead that the scales read my mind and mislead me just when I don't know the weight, but do fine with items of known weight, e.g., 5 lb. potatoes? That would itself be a highly unreliable way to proceed in reasoning from instruments…

Note: this is a clue to showing counterinduction is unreliable.
  • 27. A Multiplicity of Types of Null Hypotheses (Cox’s Taxonomy)

Considerable criticisms and abuses of significance tests would have been avoided had Cox’s (1958) taxonomy been an integral part of significance testing.

Types of null hypothesis (multiple uses in piecemeal steps involved in linking data and models to learn about modeled phenomena):
1. Embedded null hypotheses
2. Dividing null hypotheses
3. Null hypotheses of absence of structure
4. Null hypotheses of model adequacy
5. Substantively-based null hypotheses

Evaluating such local hypotheses for these piecemeal learning goals demands its own criteria.
  • 28. 1. Embedded null hypotheses

Consider a parametric family of distributions f(y; θ) indexed by an unknown (vector of) parameters θ = (ψ, λ), where λ denotes unknown nuisance parameter(s). The null and alternative hypotheses of interest are:

H0: ψ = ψ0 vs. H1: ψ ≠ ψ0

Central tasks, often ignored, take center stage in Cox’s discussion and indeed are an important part of the justification of this approach (though I must be skimpy here):
(a) the need to choose an appropriate test statistic T(Y);
(b) the need to be able to compute the probability distribution of T under the null (and perhaps alternative) hypotheses (it should not depend on nuisance parameters);
  • 29. (c) the need to collect data y so that they adequately satisfy the assumptions of the relevant probability model.

If we succeed with all that…. A p-value that is not small is evidence that the data are consistent with the null… But "no evidence against" is not automatically evidence for the null… We can, however, infer the absence of a discrepancy from (or concordance with) H0.

To infer the absence of a discrepancy from H0 as large as γ, examine the probability β(γ). We have a result, and it differs from H0 by some observed amount d:

β(γ) = Prob[a result that differs as much as or more from H0 (than d); evaluated under the assumption that µ = µ0 + γ]
     = P(t(Y) > t(y); µ0 + γ)

If β(γ) is near 1 then, following FEV, the data are good evidence that µ < µ0 + γ.

Terms: "discrepancy" refers to parameter differences; "discord" and "fit" refer to observed differences.
  • 30. Interpreting β(γ)
- β(γ) may be regarded as the stringency or severity with which the test has probed the discrepancy γ; that is, µ < µ0 + γ has passed a severe test. (Other terms: precision, sensitivity.)
- This avoids unwarranted interpretations of “consistency with H0” with insensitive tests (“fallacies of acceptance”). A fallacy of acceptance takes an insignificant difference as evidence that the null is true; but tests have limited discernment power.
- β(γ) is more relevant to the specific data than is a test’s power, which is calculated relative to a predesignated critical value c beyond which the test "rejects" the null:

Power at γ: POW(γ) = P(t(Y) > c; µ0 + γ)
In contrast: β(γ) = P(t(Y) > t(y); µ0 + γ)
  • 31. An important passage (Cox 2008, p. 25):

“In the Neyman-Pearson theory of tests, sensitivity of a test is assessed by the notion of power….. In the approach adopted here the assessment is via the distribution of the random variable P, again considered for various alternatives.”

P here is the significance probability regarded as a random variable (which it is)… Cox makes an equivalent move referring to confidence intervals; I'll come back to that…

Small p-values (evidence of some discrepancy in the direction of the alternative): critics correctly note that the p-value alone doesn’t tell you the size of the discrepancy indicated (the “effect size”), but it, together with the sample size, can be used to determine this:
- If a test is so sensitive that so small a p-value is probable even when µ ≤ µ0 + γ, then a small p-value is not evidence of a discrepancy from H0 in excess of γ.

By this reasoning we also avoid “fallacies of rejection”.
  • 32. 5. Substantively-based null hypotheses

A theory T predicts that H0 is at least a very close approximation to the true situation (perhaps T has already passed several theoretical and empirical tests). A rival theory T* predicts a specified discrepancy from H0, and the test is designed to discriminate between T and T* in a thus far untested domain. The focus is on interpreting: p-value not small.

H0 may be formed deliberately to let T “stick its neck out”: discrepancies from T in the direction of T* are given a very good chance to be detected, so if no significant departure is found, this constitutes evidence for T in the respect tested.

Famous “null results” take this form (e.g., set the GTR-predicted values at 0). Rivals to GTR predicted a breakdown of the Weak Equivalence Principle (WEP) for massive self-gravitating bodies, e.g., the earth-moon system: this effect, the Nordtvedt effect, would be 0 for GTR (identified with the null hypothesis) and non-0 for rivals.
  • 33. Measurements of the round-trip travel times between the earth and moon (between 1969 and 1975) set upper bounds to the possible violation of the WEP, and because the tests were sufficiently sensitive, these measurements provided good evidence that the Nordtvedt effect is absent, i.e., evidence for the null hypothesis.

Such a negative result does not provide evidence for all of GTR (in all its areas of prediction), but it does provide evidence for its correctness with respect to this effect. Some argue I should allow inferring all of GTR (Chalmers, Laudan, Musgrave), but I think it is in this piecemeal way that inquiries enable progress (realizing that NOT all of GTR has been warranted provoked developing rivals…).

Unlike the large-scale theory testers, we are not after an account of “theory choice” but rather learning about aspects of theoretical phenomena by local probes… Statistical reasoning is at the heart of inquiries that are broken down into questions about parameters in statistical distributions intermediate between the full theory and the actual data. My conception is to view them as probing for piecemeal errors. And note: an error can be any mistaken claim or flawed understanding, both empirical and theoretical!
  • 34.
- A fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piecemeal results.
- Although the complexity of the story makes it more difficult to set out neatly than, for example, if a single algorithm were the whole of inductive inference, the payoff is an account that approaches the kind of full-bodied arguments that scientists build up in order to obtain reliable knowledge.
- The goal is not representing beliefs or opinions (as in personalist Bayesian accounts) but avoiding being misled by beliefs and opinions.
  • 35. Series of Confidence Intervals (goal: to distinguish inferential and behaviorist justifications while avoiding an infamous fallacy of R. A. Fisher)

Consider our Normal-mean “embedded” example. Rather than running a significance test, we may form the 1 − α upper confidence bound CIU(Y; α) for estimating the mean µ. Let σ be known, and write σx = σn^(−1/2) (i.e., σ/√n).

The upper 1 − α limit is Ȳ + k(α)σx, with k(α) the upper α point of the standard Normal distribution. Pick a small α, say .025. The upper .975 limit is Ȳ + 1.96σx.

Consider the inference: infer (or regard the data as evidence for) µ ≤ ȳ + 1.96σx = CIU(y; .025).
  • 36. One rationale is that it instantiates an inference rule that yields true claims with high probability (.975), because

P(µ ≤ Ȳ + 1.96σx) = .975

whatever the true value of µ.
  • 37. The procedure has high long-run “coverage probabilities,” which "rub off" on the particular case.

Instead we might view µ ≤ ȳ + 1.96σx as an inference from a type of reductio ad absurdum argument: suppose in fact that this inference is false and the true mean is µ*, where µ* > ȳ + 1.96σx. Then it is very probable that we would have observed a larger sample mean:

P(Ȳ > ȳ; µ*) > .975

Therefore, one can reason, ȳ is inconsistent at level .975 (or .025) with having been generated from a population with µ in excess of the upper limit. This reasoning is captured in FEV.
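The .975 coverage claim can be checked by simulation. The sketch below (hypothetical values µ = 0.7, σ = 1, n = 25) repeatedly draws samples and records how often the upper bound Ȳ + 1.96·σ/√n covers the true mean:

```python
import random
from math import sqrt

random.seed(1)
mu_true, sigma, n = 0.7, 1.0, 25   # hypothetical "true" state of affairs
k = 1.96                           # multiplier for the upper .975 limit

reps = 40_000
covered = 0
for _ in range(reps):
    ybar = sum(random.gauss(mu_true, sigma) for _ in range(n)) / n
    upper = ybar + k * sigma / sqrt(n)   # the upper confidence bound
    if mu_true <= upper:
        covered += 1

coverage = covered / reps   # close to .975, whatever mu_true is
```

Changing `mu_true` leaves the long-run coverage essentially unchanged, which is exactly the behavioristic rationale; the inferential question is what this licenses about the particular case.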
  • 38. Aside: this statistic is directly related to a test of µ ≤ µ0 against µ > µ0. In particular, Ȳ is statistically significantly smaller than values of µ in excess of CIU(Y; α), at level α.
  • 39. Fisher's fiducial error

This may make µ look like a random variable, but it is not; these claims do not hold once a specific ȳ is plugged in for Ȳ:

P(µ < Ȳ + 0σx) = .5
P(µ < Ȳ + .5σx) = .7
P(µ < Ȳ + 1σx) = .84
P(µ < Ȳ + 1.5σx) = .93
P(µ < Ȳ + 1.96σx) = .975

“Our (Cox and Hinkley) attitude toward the statement [µ < Ȳ + 1.96σx] might then be taken to be the same as that to other uncertain statements for which the probability is [.975] of their being true, and hence…is virtually indistinguishable from a probability statement about [µ]. However, this attitude is in general incorrect, in our view because the confidence statement can be known to be true or untrue…… The system of confidence limits simply summarizes what the data tell us about µ, given the model…. It is wrong to combine confidence limit statements about different
  • 40. parameters as though the parameters were random variables.” (Cox and Hinkley 1974, p. 227)

But what exactly is it telling us about µ?
  • 41.
P(µ < Ȳ + 0σx) = .5
P(µ < Ȳ + .5σx) = .7
P(µ < Ȳ + 1σx) = .84
P(µ < Ȳ + 1.5σx) = .93
P(µ < Ȳ + 1.96σx) = .975

Answer (Mayo): it tells you which discrepancies are and are not indicated (at various degrees of severity). If we replace P with SEV (the degree of severity, or even “corroboration”), the claims are true even after substituting the observed sample mean (the severity with which the inference has passed the test with the data).
  • 42. SEV(µ < ȳ0 + kσx) abbreviates: the severity with which a test T with an observed result ȳ0 passes (µ < ȳ0 + kσx).

SEV(µ < ȳ0 + 0σx) = .5
SEV(µ < ȳ0 + .5σx) = .7
SEV(µ < ȳ0 + 1σx) = .84
SEV(µ < ȳ0 + 1.5σx) = .93
SEV(µ < ȳ0 + 1.96σx) = .975

But aren’t I just using this as another way to say how probable each claim is? No. This would lead to inconsistencies (if we mean mathematical probability), but the main thing is, or so I argue, that probability gives the wrong logic for “how well-tested” (or “corroborated”) a claim is.
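Each severity value in this list is a standard Normal tail area: were the inference µ < ȳ0 + kσx just false (µ = ȳ0 + kσx), a sample mean larger than the one observed would occur with probability Φ(k). A short sketch reproducing the values:

```python
from statistics import NormalDist

def sev_upper(k):
    """Severity of the inference mu < ybar0 + k*sigma_x:
    the probability of observing a larger sample mean than ybar0,
    were mu at the boundary value ybar0 + k*sigma_x, which reduces
    to the standard Normal CDF evaluated at k."""
    return NormalDist().cdf(k)

# k = 0, .5, 1, 1.5, 1.96 give severities of roughly
# .5, .69, .84, .93, .975, matching the list above.
sevs = [sev_upper(k) for k in (0.0, 0.5, 1.0, 1.5, 1.96)]
```

Larger k yields a claim that rules out less (a weaker inference) and is therefore harder to have passed poorly, hence the higher severity.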
  • 43. What Would a Logic of Severe Testing Be?

If H has passed a severe test T with x, there is no problem in saying x warrants H, or, if one likes, x warrants believing in H…. (H could be the confidence-interval estimate.)

If SEV(H) is high, its denial is low, i.e., SEV(~H) is low. But it does not follow that a severity assessment should obey the probability calculus, or be a posterior probability…. For example, if SEV(H) is low, SEV(~H) could be high or low, in conflict with a probability assignment. (There may be a confusion with the ordinary-language use of “probability.”)
  • 44. Other points lead me to deny that probability yields the logic we want for well-testedness (e.g., the problem of irrelevant conjunctions, the tacking paradox for Bayesians and hypothetico-deductivists).

Contrast again with the "rubbing off" construal: the procedure is rarely wrong; therefore, the probability it is wrong in this case is low. The long-run reliability of the rule is a necessary but not a sufficient condition to infer H (with severity).

The reasoning instead is counterfactual: H: µ < ȳ0 + 1.96σx (i.e., µ < CIU) passes severely because, were this inference false and the true mean µ > CIU, then, very probably, we would have observed a larger sample mean.
  • 45. A very well-known criticism of frequentist confidence intervals (many Bayesians say it is a reductio) stems from assuming that confidence levels must give degrees of belief. Critics mount, e.g., the case of a 95% confidence interval that might be known to be correct (given some additional constraints about the parameter): trivial intervals.

On our construal, this is just saying that no parameter values are ruled out with severity…..

“Viewed as a single statement [the trivial interval] is trivially true, but, on the other hand, viewed as a statement that all parameter values are consistent with the data at a particular level is a strong statement about the limitations of the data.” (Cox and Hinkley 1974, p. 226)
  • 46. Mayo and Spanos: Severity Interpretations of “accept” and “reject” for test T+

(SEV) for non-statistically-significant results: for “Accept” H0 (a statistically insignificant difference from µ0),
µ ≤ ȳ0 + kε(σ/√n) passes with severity (1 − ε), where P(d(X) > kε) = ε (for any 0 < ε < 1).

(SEV) for statistically-significant results: for “Reject” H0 (a statistically significant difference from µ0),
µ > ȳ0 − kε(σ/√n) passes with severity (1 − ε).
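A sketch of these two severity bounds for the Normal-mean case; the values (ȳ0 = 0.2, σ = 1, n = 100, ε = .05) are hypothetical, and `NormalDist.inv_cdf` supplies the cutoff kε:

```python
from math import sqrt
from statistics import NormalDist

ybar0, sigma, n = 0.2, 1.0, 100    # hypothetical observed mean, known sigma
sigma_x = sigma / sqrt(n)          # sigma / sqrt(n) = 0.1

eps = 0.05
k_eps = NormalDist().inv_cdf(1 - eps)   # upper-eps point, about 1.645

# "Accept" H0: the inference mu <= ybar0 + k_eps*sigma_x
# passes with severity 1 - eps.
accept_bound = ybar0 + k_eps * sigma_x
# "Reject" H0: the inference mu > ybar0 - k_eps*sigma_x
# passes with severity 1 - eps.
reject_bound = ybar0 - k_eps * sigma_x
```

Varying ε traces out the whole family of severity assessments for the same data, in the same spirit as the series of confidence limits above.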
  • 47. FEV(ii): A moderate p-value is evidence of the absence of a discrepancy δ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller p-value) were a discrepancy δ to exist.