2. Course Overview
1. What is evaluation?
2. Measuring impacts (outcomes, indicators)
3. Why randomize?
4. How to randomize?
5. Sampling and sample size
6. Threats and Analysis
7. Cost-Effectiveness Analysis
8. Project from Start to Finish
3. Our Goal in This Lecture: From Sample to Population
1. To understand how samples and populations are related
1. Population: all people who meet certain criteria. Ex: the
population of all 3rd graders in India who take a certain exam
2. Sample: a subset of the population. Ex: 1,000 3rd graders in
India who take a certain exam
We want the sample to tell us something about the overall
population
Specifically, we want a sample from the treatment and a sample
from the control to tell us something about the true effect size of an
intervention in a population
2. To build intuition for setting the optimal sample size for your
study
This will help us confidently detect a difference between treatment
and control
4. Lecture Outline
1. Basic Statistics Terms
2. Sampling variation
3. Law of large numbers
4. Central limit theorem
5. Hypothesis testing
6. Statistical inference
7. Power
5. Lesson 1: Basic Statistics
To understand how to interpret data, we need to understand three basic
concepts:
What is a distribution?
What’s an average result?
What is a standard deviation?
6. What is a Distribution?
A distribution graph or table shows each possible outcome and the
frequency that we observe that outcome
A probability distribution is the same as a distribution, but it converts
frequency to probability
8. What’s the Average Result?
What is the “expected result” (i.e. the average)?
Expected Result=the sum of all possible values each multiplied by
the probability of its occurrence
11. What’s a Standard Deviation?
Standard deviation: Measure of dispersion in the population
A weighted average distance to the mean that gives more weight to
those points furthest from the mean
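A minimal Python sketch of these three ideas, using an invented probability distribution of test scores (all numbers are purely illustrative):

```python
# Minimal sketch: expected result and standard deviation of a small,
# invented probability distribution of test scores.
import numpy as np

scores = np.array([0, 25, 50, 75, 100])        # possible outcomes
probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])    # probability of each outcome (sums to 1)

expected = np.sum(scores * probs)              # sum of each value times its probability
std_dev = np.sqrt(np.sum(probs * (scores - expected) ** 2))  # dispersion around the mean

print(f"Expected result: {expected:.1f}")      # 50.0
print(f"Standard deviation: {std_dev:.1f}")    # about 27.4
```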
13. Lecture Outline
1. Basic Statistics Terms
2. Sampling variation
3. Law of large numbers
4. Central limit theorem
5. Hypothesis testing
6. Statistical inference
7. Power
14. Our Goal in This Lecture: From Sample to Population
1. To understand how samples and populations are related
1. Population: all people who meet certain criteria. Ex: the
population of all 3rd graders in India who take a certain exam
2. Sample: a subset of the population. Ex: 1,000 3rd graders in
India who take a certain exam
We want the sample to tell us something about the overall
population
Specifically, we want a sample from the treatment and a sample
from the control to tell us something about the true effect size of an
intervention in a population
2. To build intuition for setting the optimal sample size for your
study
This will help us confidently detect a difference between treatment
and control
15. Sampling Variation: Example
We want to know the average test score of grade 3 children in
Springfield
How many children would we need to sample to get an accurate
picture of the average test score?
22. Sampling Variation: Definition
Sampling variation is the variation we get between different
estimates (e.g. mean of test scores) due to the fact that we do not
test everyone but only a sample
Sampling variation depends on:
• The variation in test scores in the underlying population
• The number of people we sample
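A small, illustrative simulation of sampling variation; the population of scores is invented, purely to show that different samples give different means:

```python
# Illustrative simulation of sampling variation: different random samples of the
# same size give different sample means. The population here is invented.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=15, size=100_000)  # hypothetical grade 3 test scores

for i in range(5):
    sample = rng.choice(population, size=50, replace=False)
    print(f"Sample {i + 1} of 50: mean = {sample.mean():.2f}")

print(f"Population mean = {population.mean():.2f}")
```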
23. What if our Population Instead of Looking Like This…
[Figure: a population distribution with the population mean marked]
25. Standard Deviation: Population 1
Measure of dispersion in the population
[Figure: Population 1 distribution with the population mean and one-standard-deviation bands on either side]
38. A Third Sample of 50 Students
[Figure: a third sample of 50 students, with the sample mean plotted against the population mean]
39. Let's Pick a Sample of 100 Students
[Figure: a sample of 100 students, with the sample mean plotted against the population mean]
40. Let's Pick a Different 100 Students
[Figure: a different sample of 100 students, with the sample mean plotted against the population mean]
41. Let's Pick a Different 100 Students: What Do We Notice?
[Figure: another sample of 100 students, with the sample mean plotted against the population mean]
42. Law of Large Numbers
The more students you sample (so long as the sample is drawn at random), the
closer most sample averages are to the true average (the distribution of
sample averages gets “tighter”)
When we conduct an experiment, we can feel confident that on
average, our treatment and control groups would have the same
average outcomes in the absence of the intervention
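An illustrative simulation of the law of large numbers, again with an invented population of test scores:

```python
# Illustrative simulation of the law of large numbers: as the sample size grows,
# the sample means cluster more tightly around the population mean.
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(loc=50, scale=15, size=100_000)  # invented population of scores

for n in (10, 100, 1_000, 10_000):
    # Draw 500 random samples of size n and measure how spread out their means are.
    means = [rng.choice(population, size=n, replace=False).mean() for _ in range(500)]
    print(f"n = {n:>6}: std of sample means = {np.std(means):.3f}")
```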
43. Lecture Outline
1. Basic Statistics Terms
2. Sampling variation
3. Law of large numbers
4. Central limit theorem
5. Hypothesis testing
6. Statistical inference
7. Power
44. Central Limit Theorem
If we take many samples and estimate the mean many times, the
frequency plot of our estimates (the sampling distribution) will
resemble the normal distribution
This is true even if the underlying population distribution is not
normal
57. Central Limit Theorem
The more samples you take, the more the distribution of possible
averages (the sampling distribution) looks like a bell curve (a
normal distribution)
This result is INDEPENDENT of the underlying distribution
The mean of the distribution of the means will be the same as the
mean of the population
The standard deviation of the sampling distribution will be the
standard error (SE)
SE = SD / √n
58. Central Limit Theorem
The central limit theorem is crucial for statistical inference
Even if the underlying distribution is not normal, IF THE SAMPLE
SIZE IS LARGE ENOUGH, we can treat it as being normally
distributed
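A sketch of the central limit theorem using an invented, clearly non-normal population; it also checks the SE = SD / √n relationship from the previous slide:

```python
# Sketch of the central limit theorem: even for a skewed (exponential) population,
# the sampling distribution of the mean is roughly normal, centred on the population
# mean, with standard deviation close to sd / sqrt(n) (the standard error).
import numpy as np

rng = np.random.default_rng(2)
population = rng.exponential(scale=10, size=200_000)  # clearly non-normal population

n = 100
means = np.array([rng.choice(population, size=n).mean() for _ in range(2_000)])

print(f"Population mean: {population.mean():.2f}, mean of sample means: {means.mean():.2f}")
print(f"Theoretical SE (sd/sqrt(n)): {population.std() / np.sqrt(n):.3f}")
print(f"Observed std of sample means: {means.std():.3f}")
```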
59. THE Basic Question in Statistics
How big does your sample need to be?
Why is this the ultimate question?
• How confident can you be in your results?
We need it to be large enough that both the law of large numbers
and the central limit theorem can be applied
We need it to be large enough that we could detect a difference in
outcome of interest between the treatment and control samples
60. Samples vs Populations
We have two different populations: treatment and comparison
We only see the samples: sample from the treatment population
and sample from the comparison population
We will want to know if the populations are different from each other
We will compare sample means of treatment and comparison
We must take into account that different samples will give us
different means (sampling variation)
63. What if we Ran a Second Experiment?
[Figure: comparison mean and treatment mean, with the estimated effect as the distance between them]
64. Many Experiments Give Distribution of Estimates
[Figure, built up across slides 64-69: histogram of the estimated differences between treatment and comparison means from many experiments; x-axis: difference, y-axis: frequency]
70. What Does This Remind You Of?
[Figure: the same histogram of estimated differences, which resembles a bell curve]
71. Hypothesis Testing
When we do impact evaluations we compare means from two
different groups (the treatment and comparison groups)
Null hypothesis: the two means are the same and any observed
difference is due to chance
• H0: treatment effect = 0
Research hypothesis: the true means are different from each other
• H1: treatment effect ≠ 0
Other possible tests
• H2: treatment effect > 0
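An illustrative two-sample t-test on simulated treatment and comparison outcomes; the true effect (3 points) and all other numbers are invented:

```python
# Illustrative two-sample t-test comparing simulated treatment and comparison
# outcomes; the true effect (3 points) and all other numbers are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
comparison = rng.normal(loc=50, scale=15, size=500)
treatment = rng.normal(loc=53, scale=15, size=500)

t_stat, p_value = stats.ttest_ind(treatment, comparison)

print(f"Estimated effect: {treatment.mean() - comparison.mean():.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print("Reject H0 at 5%" if p_value < 0.05 else "Fail to reject H0")
```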
84. Critical Value
There is always a chance the true effect is zero, however large our
estimated effect
Recall that, traditionally, if the probability that we would get β when
the true effect is 0 is less than 5%, we say we can reject that the
true effect is zero
Definition: the critical value is the value of the estimated effect which
exactly corresponds to the significance level
If testing whether the effect is bigger than 0 at the 95% level, it is the
value of the estimate where exactly 95% of the area under the curve lies
to the left
β is significant at the 95% level if it is further out in the tail than the
critical value
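A small sketch of the critical-value idea for a one-sided test at the 5% significance level; the standard error and the estimate are assumed, purely for illustration:

```python
# Sketch of the critical value for a one-sided test at the 5% significance level:
# the point on the null (effect = 0) distribution with 95% of the area to its left.
from scipy import stats

se = 1.5                                                 # assumed standard error of the estimate
critical_value = stats.norm.ppf(0.95, loc=0, scale=se)   # 95th percentile under H0

beta_hat = 3.0                                           # hypothetical estimated effect
print(f"Critical value: {critical_value:.2f}")
print(f"Estimate {beta_hat} beyond critical value (significant): {beta_hat > critical_value}")
```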
93. Recap Hypothesis Testing: Power
Underlying truth: the intervention is effective (H0 false) or there is no effect (H0 true)
Statistical test result: significant (reject H0) or not significant (fail to reject H0)
• Significant and H0 false: true positive, probability = (1 − κ)
• Significant and H0 true: false positive, Type I error, probability = α
• Not significant and H0 false: false zero, Type II error (low power), probability = κ
• Not significant and H0 true: true zero, probability = (1 − α)
94. Definition of Power
Power: if there is a measurable effect of our intervention (the null
hypothesis is false), the probability that we will detect the effect
(reject the null hypothesis)
Higher power reduces the Type II error: failing to reject the null hypothesis
(concluding there is no difference) when in fact the null hypothesis
is false
Traditionally, we aim for 80% power. Some people aim for 90%
power
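An illustrative way to see what 80% power means: simulate many experiments with a known true effect and count how often a 5%-level test rejects the null (all parameters are invented):

```python
# Illustrative meaning of power: simulate many experiments with a known true effect
# and count how often a 5%-level t-test rejects the null. Parameters are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_effect, sd, n = 3.0, 15.0, 400      # per-group sample size

n_sims, rejections = 2_000, 0
for _ in range(n_sims):
    comparison = rng.normal(50, sd, n)
    treatment = rng.normal(50 + true_effect, sd, n)
    _, p = stats.ttest_ind(treatment, comparison)
    rejections += p < 0.05

print(f"Estimated power: {rejections / n_sims:.2f}")  # roughly 0.8 with these inputs
```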
95. The More Overlap Between the H0 Curve and the Hβ* Curve, the Lower the
Power. Q: What Affects Overlap?
100. Why Does Significance Change Power?
Q: what trade-off are we making when we change the significance level
to increase power?
Remember: 10% significance means we'll make Type I (false
positive) errors 10% of the time
So moving from 5% to 10% significance gives us more power, but at
the cost of more false positives
It's like widening the gap between the goal posts and saying "now
we have a higher chance of scoring a goal"
101. Allocation Ratio and Power
Definition of allocation ratio: the fraction of the total sample that is
allocated to the treatment group
Usually, for a given sample size, power is maximized when half the
sample is allocated to treatment and half to control
102. Why Does Equal Allocation Maximize Power?
Treatment effect is the difference between two means (mean of
treatment and control)
Adding sample to treatment group increases accuracy of treatment
mean, same for control
But diminishing returns to adding sample size
If treatment group is much bigger than control group, the marginal
person adds little to accuracy of treatment group mean, but more to
the control group mean
Thus we improve accuracy of the estimated difference when we
have equal numbers in treatment and control groups
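A short sketch of this intuition: for a fixed total sample N, the standard error of the estimated difference in means, σ√(1/n_T + 1/n_C), is smallest when the two groups are the same size (all numbers invented):

```python
# Sketch: for a fixed total sample N, the standard error of the estimated difference,
# sigma * sqrt(1/n_T + 1/n_C), is smallest with a 50/50 split. Numbers are invented.
import numpy as np

sigma, N = 15.0, 1_000
for share_treated in (0.5, 0.6, 0.7, 0.8, 0.9):
    n_t = int(N * share_treated)
    n_c = N - n_t
    se_diff = sigma * np.sqrt(1 / n_t + 1 / n_c)  # SE of (treatment mean - comparison mean)
    print(f"{share_treated:.0%} treated: SE of difference = {se_diff:.3f}")
```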
103. Summary of Power Factors
Hypothesized effect size
• Q: A larger effect size makes power increase/decrease?
Variance
• Q: greater residual variance makes power increase/decrease?
Sample size
• Q: Larger sample size makes power increase/decrease?
Critical value
• Q: A looser critical value makes power increase/decrease
Unequal allocation ratio
• Q: an unequal allocation ratio makes power increase/decrease?
104. Power Equation: MDE
Effect size (MDE) = (t_(1−κ) + t_α) × √(1 / (P(1−P))) × √(σ² / N)
where t_(1−κ) reflects the desired power, t_α the significance level, σ² the
variance of the outcome, N the sample size, and P the proportion of the
sample allocated to treatment
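An illustrative calculation of the MDE formula above, using normal critical values, a two-sided 5% test, 80% power, and invented values for P, N, and σ:

```python
# Illustrative calculation of the MDE formula above, using normal critical values,
# a two-sided 5% test, 80% power, and invented values for P, N, and sigma.
import numpy as np
from scipy import stats

alpha, power = 0.05, 0.80
P, N, sigma = 0.5, 1_000, 15.0           # share treated, total sample size, outcome SD

t_alpha = stats.norm.ppf(1 - alpha / 2)  # significance-level term
t_power = stats.norm.ppf(power)          # power term (t_(1-kappa))

mde = (t_alpha + t_power) * np.sqrt(1 / (P * (1 - P))) * np.sqrt(sigma**2 / N)
print(f"Minimum detectable effect: {mde:.2f}")
```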
105. Clustered RCT Experiments
Cluster randomized trials are experiments in which social units or
clusters rather than individuals are randomly allocated to
intervention groups
The unit of randomization (e.g. the village) is broader than the unit of
analysis (e.g. farmers)
That is: randomize at the village level, but use farmer-level surveys
as our unit of analysis
106. Clustered Design: Intuition
We want to know how much rice the average farmer in Sierra Leone
grew last year
Method 1: Randomly select 9,000 farmers from around the country
Method 2: Randomly select 9,000 farmers from one district
107. Clustered Design: Intuition II
Some parts of the country may grow more rice than others in
general; what if one district had a drought? Or a flood?
• i.e., we worry both about long-term correlations and correlations of
shocks within groups
Method 1 gives the most accurate estimate
Method 2 is much cheaper, so for a given budget we could sample more
farmers
What combination of 1 and 2 gives the highest power for a given
budget constraint?
Depends on the level of intracluster correlation, ρ (rho)
110. Intracluster Correlation
Total variance can be divided into within-cluster variance (σ²) and
between-cluster variance (τ²)
When the variance within clusters is small and the variance between
clusters is large, the intracluster correlation is high (previous slide)
Definition of intracluster correlation (ICC): the proportion of total
variation explained by between-cluster variance
• Note: when within-cluster variance is high relative to between-cluster
variance, the intracluster correlation is low
ICC = ρ = τ² / (σ² + τ²)
113. Power with Clustering
Effect size (MDE) = (t_(1−κ) + t_α) × √(1 + ρ(m − 1)) × √(1 / (P(1−P))) × √(σ² / N)
where ρ is the intracluster correlation (ICC) and m is the average cluster
size; the significance level, power, proportion in treatment, variance, and
sample size enter as in the non-clustered power equation
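The same illustrative MDE calculation with the clustering design effect √(1 + ρ(m − 1)) added; the number of clusters, cluster size, and ρ are invented:

```python
# The same illustrative MDE calculation with the design effect sqrt(1 + rho*(m - 1))
# for a clustered design; the number of clusters, cluster size, and rho are invented.
import numpy as np
from scipy import stats

alpha, power = 0.05, 0.80
P, sigma = 0.5, 15.0
n_clusters, m, rho = 100, 10, 0.1        # clusters, average cluster size, intracluster correlation
N = n_clusters * m                       # total sample size

t_alpha = stats.norm.ppf(1 - alpha / 2)
t_power = stats.norm.ppf(power)

design_effect = np.sqrt(1 + rho * (m - 1))
mde = (t_alpha + t_power) * design_effect * np.sqrt(1 / (P * (1 - P))) * np.sqrt(sigma**2 / N)
print(f"Design effect: {design_effect:.2f}")
print(f"Clustered minimum detectable effect: {mde:.2f}")
```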
114. Clustered RCTs vs. Clustered Sampling
Must cluster at the level at which you randomize
• Many reasons to randomize at group level
Could randomize by farmer group, village, district
If we randomize one district to treatment and one to comparison, we have
too little power, however many farmers we interview
• We can never distinguish the treatment effect from possible district-wide
shocks
If we randomize at the individual level, we don't need to worry about
within-village correlation or village-level shocks, as they affect both
treatment and comparison groups
115. Bottom Line for Clustering
If experimental design is clustered, we now need to consider ρ when
choosing a sample size (as well as the other effects)
Must cluster at level of randomization
It is extremely important to randomize an adequate number of
groups
Often the number of individuals within groups matter less than
the total number of groups
117. Common Tradeoffs
Answer one question really well? Or many questions with less
accuracy?
Large sample size with possible attrition? Or small sample size
that we track very closely?
Few clusters with many observations? Or many clusters with few
observations?
How do we allocate our sample to each group?
118. Rules of Thumb
A larger sample is needed to detect differences between two
variants of a program than between the program and the
comparison group.
For a given sample size, the highest power is achieved when half
the sample is allocated to treatment and half to comparison.
The more measurements are taken, the higher the power. In
particular, if there is a baseline and endline rather than just an
endline, you have more power
The lower the compliance, the lower the power. The higher the attrition,
the lower the power.
For a given sample size, we have less power if randomization is at
the group level than at the individual level.