# Sampling, Statistics and Sample Size

4 Sep 2014

• 1. Sampling, Statistics, Sample Size, Power
• 2. Course Overview 1. What is evaluation? 2. Measuring impacts (outcomes, indicators) 3. Why randomize? 4. How to randomize? 5. Sampling and sample size 6. Threats and Analysis 7. Cost-Effectiveness Analysis 8. Project from Start to Finish
• 3. Our Goal in This Lecture: From Sample to Population
  1. To understand how samples and populations are related
     - Population: all people who meet a certain criterion. Ex: the population of all 3rd graders in India who take a certain exam
     - Sample: a subset of the population. Ex: 1,000 3rd graders in India who take a certain exam
     - We want the sample to tell us something about the overall population
     - Specifically, we want a sample from the treatment and a sample from the control to tell us something about the true effect size of an intervention in a population
  2. To build intuition for setting the optimal sample size for your study
     - This will help us confidently detect a difference between treatment and control
• 4. Lecture Outline 1. Basic Statistics Terms 2. Sampling variation 3. Law of large numbers 4. Central limit theorem 5. Hypothesis testing 6. Statistical inference 7. Power
• 5. Lesson 1: Basic Statistics. To understand how to interpret data, we need three basic concepts:
  - What is a distribution?
  - What is an average result?
  - What is a standard deviation?
• 6. What is a Distribution?
  - A distribution (graph or table) shows each possible outcome and the frequency with which we observe that outcome
  - A probability distribution is the same, but converts frequencies to probabilities
• 7. Baseline Test Scores [Figure: histogram of test scores (0-98) vs. frequency]
• 8. What's the Average Result?
  - What is the "expected result" (i.e. the average)?
  - Expected result = the sum of all possible values, each multiplied by the probability of its occurrence
• 9. [Figure: test score histogram with the mean marked; mean = 26]
• 10. [Figure: population distribution; population mean = 26]
• 11. What's a Standard Deviation?
  - Standard deviation: a measure of dispersion in the population
  - It is roughly the average distance of observations from the mean, with more weight given to points furthest from the mean (deviations are squared before averaging)
• 12. [Figure: test score distribution with mean = 26 and standard deviation = 20; one standard deviation marked on each side of the mean]
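The mean and standard deviation in this example can be checked numerically. Below is a minimal sketch in Python, assuming a synthetic population drawn from a normal distribution with mean 26 and SD 20, clipped to the 0-100 score range (all numbers here are illustrative, not the slides' actual data):

```python
import random
import statistics

random.seed(1)
# Hypothetical population of 10,000 test scores: normal(26, 20),
# clipped to the 0-100 range as in the slides' histogram.
population = [min(100.0, max(0.0, random.gauss(26, 20))) for _ in range(10_000)]

mean = statistics.mean(population)
sd = statistics.pstdev(population)   # population (not sample) standard deviation
print(f"mean = {mean:.1f}, sd = {sd:.1f}")
```

Because of the clipping at 0, the simulated mean and SD land near, not exactly at, 26 and 20.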
• 13. Lecture Outline 1. Basic Statistics Terms 2. Sampling variation 3. Law of large numbers 4. Central limit theorem 5. Hypothesis testing 6. Statistical inference 7. Power
• 14. Our Goal in This Lecture: From Sample to Population
  1. To understand how samples and populations are related
     - Population: all people who meet a certain criterion. Ex: the population of all 3rd graders in India who take a certain exam
     - Sample: a subset of the population. Ex: 1,000 3rd graders in India who take a certain exam
     - We want the sample to tell us something about the overall population
     - Specifically, we want a sample from the treatment and a sample from the control to tell us something about the true effect size of an intervention in a population
  2. To build intuition for setting the optimal sample size for your study
     - This will help us confidently detect a difference between treatment and control
• 15. Sampling Variation: Example
  - We want to know the average test score of grade 3 children in Springfield
  - How many children would we need to sample to get an accurate picture of the average test score?
• 16. Population: Test Scores of All 3rd Graders [Figure]
• 17. Mean of Population is 26 (True Mean) [Figure: population mean marked]
• 18. Pick a Sample of 20 Students: Plot Frequency [Figure: sample vs. population, with sample mean and population mean]
• 19. Zooming in on the Sample of 20 Students [Figure: sample mean vs. population mean]
• 20. Pick a Different Sample of 20 Students [Figure: a new sample mean vs. the population mean]
• 21. Another Sample of 20 Students [Figure: yet another sample mean vs. the population mean]
• 22. Sampling Variation: Definition
  - Sampling variation is the variation between estimates (e.g. mean test scores) from different samples, which arises because we test only a sample rather than everyone
  - Sampling variation depends on:
    - The variation in test scores in the underlying population
    - The number of people we sample
• 23. What if our Population, Instead of Looking Like This… [Figure: population distribution with mean]
• 24. …Looked Like This [Figure: a more dispersed population distribution]
• 25. Standard Deviation: Population I. Measure of dispersion in the population [Figure: one standard deviation on either side of the population mean]
• 26. Standard Deviation: Population II [Figure: one standard deviation on either side of the mean of the wider population]
• 27. Different Samples of 20 Give Similar Estimates [Figure: sample mean vs. population mean]
• 28. Different Samples of 20 Give Similar Estimates [Figure: another sample]
• 29. Different Samples of 20 Give Similar Estimates [Figure: a third sample]
• 30. Lecture Outline 1. Basic Statistics Terms 2. Sampling variation 3. Law of large numbers 4. Central limit theorem 5. Hypothesis testing 6. Statistical inference 7. Power
• 31. [Figure: population of test scores]
• 32. Pick a Sample of 20 Students: Plot Frequency [Figure: sample mean vs. population mean]
• 33. Zooming in on the Sample of 20 Students [Figure]
• 34. Pick a Different Sample of 20 Students [Figure]
• 35. Another Sample of 20 Students [Figure]
• 36. Let's Pick a Sample of 50 Students [Figure]
• 37. A Different Sample of 50 Students [Figure]
• 38. A Third Sample of 50 Students [Figure]
• 39. Let's Pick a Sample of 100 Students [Figure]
• 40. Let's Pick a Different 100 Students [Figure]
• 41. Let's Pick a Different 100 Students: What Do We Notice? [Figure: sample means cluster closer to the population mean as the sample grows]
• 42. Law of Large Numbers
  - The more students you sample (so long as sampling is random), the closer most sample averages are to the true average (the distribution of estimates gets "tighter")
  - When we conduct an experiment, we can be confident that, on average, our treatment and control groups would have the same average outcomes in the absence of the intervention
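The "tightening" described above can be demonstrated with a small simulation. This is a sketch: the population of scores is synthetic (uniform on 0-100 for illustration), not the lecture's data.

```python
import random
import statistics

random.seed(0)
# Synthetic population of 10,000 test scores, uniform on 0-100 for illustration.
population = [random.uniform(0, 100) for _ in range(10_000)]

def spread_of_sample_means(n, draws=500):
    """Standard deviation of the sample mean across many random samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(draws)]
    return statistics.stdev(means)

# Larger samples -> sample means cluster more tightly around the true mean.
for n in (20, 50, 100):
    print(n, round(spread_of_sample_means(n), 2))
```

The printed spread shrinks as n grows, which is the law of large numbers at work.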
• 43. Lecture Outline 1. Basic Statistics Terms 2. Sampling variation 3. Law of large numbers 4. Central limit theorem 5. Hypothesis testing 6. Statistical inference 7. Power
• 44. Central Limit Theorem
  - If we take many samples and estimate the mean each time, the frequency plot of our estimates (the sampling distribution) will resemble the normal distribution
  - This is true even if the underlying population distribution is not normal
• 45. Population of Test Scores is Not Normal [Figure]
• 46. Take the Mean of One Sample [Figure: sample mean vs. population mean]
• 47. Plot That One Mean [Figure]
• 48. Take Another Sample and Plot That Mean [Figure]
• 49–54. Repeat Many Times [Figures: the plot of sample means fills in as samples accumulate]
• 55. [Figure: the distribution of sample means]
• 56. Normal Distribution
• 57. Central Limit Theorem
  - The more samples you take, the more the distribution of possible averages (the sampling distribution) looks like a bell curve (a normal distribution)
  - This result is INDEPENDENT of the underlying distribution
  - The mean of the distribution of sample means equals the mean of the population
  - The standard deviation of the sampling distribution is the standard error (SE): SE = sd / √n
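The SE formula can be checked by simulation. The sketch below draws many samples from a deliberately non-normal (exponential) synthetic population and compares the spread of the sample means to sd/√n:

```python
import random
import statistics

random.seed(42)
# A clearly non-normal population: exponential scores with mean 26.
population = [random.expovariate(1 / 26) for _ in range(20_000)]
sd = statistics.pstdev(population)

n = 50
sample_means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]

empirical_se = statistics.stdev(sample_means)   # spread of the sampling distribution
theoretical_se = sd / n ** 0.5                  # the CLT prediction: sd / sqrt(n)
print(round(empirical_se, 2), round(theoretical_se, 2))
```

The two numbers agree closely even though the population itself is skewed, which is the point of the theorem.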
• 58. Central Limit Theorem
  - The central limit theorem is crucial for statistical inference
  - Even if the underlying distribution is not normal, IF THE SAMPLE SIZE IS LARGE ENOUGH, we can treat the sample mean as normally distributed
• 59. The Basic Question in Statistics: How big does your sample need to be?
  - Why is this the ultimate question? Because it determines how confident you can be in your results
  - The sample must be large enough that the law of large numbers and the central limit theorem apply
  - It must also be large enough that we can detect a difference in the outcome of interest between the treatment and control samples
• 60. Samples vs. Populations
  - We have two different populations: treatment and comparison
  - We only see the samples: a sample from the treatment population and a sample from the comparison population
  - We want to know whether the populations are different from each other
  - We compare the sample means of treatment and comparison
  - We must take into account that different samples give different means (sampling variation)
• 61. One Experiment, 2 Samples, 2 Means [Figure: comparison and treatment distributions with their means]
• 62. Difference Between the Sample Means [Figure: estimated effect = treatment mean − comparison mean]
• 63. What if We Ran a Second Experiment? [Figure: a new estimated effect]
• 64–69. Many Experiments Give a Distribution of Estimates [Figures: histogram of estimated differences (−3 to 10) vs. frequency, filling in as experiments accumulate]
• 70. What Does This Remind You Of? [Figure: the completed histogram of estimated differences; it looks like a bell curve]
• 71. Hypothesis Testing
  - When we do impact evaluations we compare means from two different groups (the treatment and comparison groups)
  - Null hypothesis: the two means are the same and any observed difference is due to chance (H0: treatment effect = 0)
  - Research hypothesis: the true means are different from each other (H1: treatment effect ≠ 0)
  - Other possible tests: H2: treatment effect > 0
• 72. Distribution of Estimates if True Effect is Zero
• 73. Distributions Under Two Alternatives
• 74. We Don't See These Distributions, Just our Estimate β̂
• 75. Is Our Estimate β̂ Consistent With the True Effect Being β*?
• 76. If the True Effect is β*, we would get β̂ with Frequency A
• 77. Is it also Consistent with the True Effect Being 0?
• 78. If the True Effect is 0, we would get β̂ with Frequency A′
• 79. Q: Which is More Likely, True Effect=β* or True Effect=0?
• 80. A is Bigger than A′, so True Effect=β* is More Likely than True Effect=0
• 81. But Can we Rule Out that True Effect=0?
• 82. Is A’ so Small That True Effect=0 is Unlikely?
• 83. The probability of getting an estimate at least as large as β̂ if the true effect were 0 is the area under the H0 curve to the right of β̂, as a share of the total area under the curve
• 84. Critical Value
  - There is always a chance that the true effect is zero, however large our estimated effect
  - Recall that, by convention, if the probability of getting β̂ when the true effect is 0 is less than 5%, we say we can reject that the true effect is zero
  - Definition: the critical value is the value of the estimated effect that exactly corresponds to the significance level
  - If testing whether the effect is bigger than 0 at the 95% level, it is the value of the estimate where exactly 95% of the area under the H0 curve lies to the left
  - β̂ is significant at the 95% level if it is further out in the tail than the critical value
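A sketch of the critical-value logic, assuming the estimate is normally distributed with a known standard error (the numbers below are hypothetical):

```python
from statistics import NormalDist

se = 2.0            # hypothetical standard error of the estimated effect
beta_hat = 4.1      # hypothetical estimated effect

# One-sided 5% test of H0 (true effect = 0): the critical value is the point
# below which 95% of the H0 sampling distribution's area lies.
critical_value = NormalDist(0, se).inv_cdf(0.95)

# p-value: area under the H0 curve to the right of the estimate.
p_value = 1 - NormalDist(0, se).cdf(beta_hat)

print(round(critical_value, 2), round(p_value, 3), beta_hat > critical_value)
```

Here β̂ lands beyond the critical value (the p-value is below 5%), so we would reject that the true effect is zero.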
• 85. 95% Critical Value for True Effect>0
• 86. In this Case β̂ is > the Critical Value, So…
• 87. …We Can Reject that True Effect=0 with 95% Confidence
• 88. What if the True Effect=β*?
• 89. How Often Would we get Estimates that we Could Not Distinguish from 0? (if true effect=β*)
• 90. How Often Would we get Estimates that we Could Distinguish from 0? (if true effect=β*)
• 91. The Chance of Getting Estimates we can Distinguish from 0 is the Area Under Hβ* that Lies Above the Critical Value for H0
• 92. The Proportion of the Area Under Hβ* that Lies Above the Critical Value is the Power
• 93. Recap of Hypothesis Testing: Power

| Statistical test | Effective (H0 false) | No effect (H0 true) |
| --- | --- | --- |
| Significant (reject H0) | True positive; probability = 1 − κ (power) | False positive (Type I error); probability = α |
| Not significant (fail to reject H0) | False zero (Type II error); probability = κ | True zero; probability = 1 − α |
• 94. Definition of Power
  - Power: if our intervention has a measurable effect (the null hypothesis is false), the probability that we will detect it (reject the null hypothesis)
  - Higher power reduces Type II error: failing to reject the null hypothesis (concluding there is no difference) when the null hypothesis is in fact false
  - Traditionally we aim for 80% power; some people aim for 90%
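Power can be estimated by simulating the experiment many times and counting how often the estimated effect is significant. A minimal sketch, with hypothetical numbers and a normal approximation to the sampling distribution:

```python
import random
from statistics import NormalDist

random.seed(7)

def simulated_power(effect, n_per_arm, sd=20.0, sims=2_000, alpha=0.05):
    """Share of simulated experiments in which the estimated T-C difference
    is significant at level alpha (two-sided, normal approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * (2 / n_per_arm) ** 0.5   # SE of a difference in means, equal arms
    significant = sum(
        abs(random.gauss(effect, se)) > z_crit * se for _ in range(sims)
    )
    return significant / sims

# With effect = 10 and sd = 20 (a 0.5-sd effect), about 64 per arm
# gives roughly 80% power at the 5% level.
print(simulated_power(effect=10, n_per_arm=64))
```

Setting `effect=0` instead recovers the false-positive rate, which hovers around α.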
• 95. The More Overlap Between the H0 Curve and the Hβ* Curve, the Lower the Power. Q: What Affects Overlap?
• 96. Larger Hypothesized Effect, Further Apart the Curves, Higher the Power
• 97. Greater Variance in Population, Increases Spread of Possible Estimates, Reduces Power
• 98. Power Also Depends on the Critical Value, i.e. the Level of Significance we are Looking For…
• 99. 10% Significance Gives Higher Power than 5% Significance
• 100. Why Does Significance Change Power?
  - Q: What trade-off are we making when we change the significance level to increase power?
  - Remember: 10% significance means we'll make Type I (false positive) errors 10% of the time
  - So moving from 5% to 10% significance gives more power, but at the cost of more false positives
  - It's like widening the goal posts and saying "now we have a higher chance of scoring a goal"
• 101. Allocation Ratio and Power
  - Definition: the allocation ratio is the fraction of the total sample that is allocated to the treatment group
  - Usually, for a given sample size, power is maximized when half the sample is allocated to treatment and half to control
• 102. Why Does Equal Allocation Maximize Power?
  - The treatment effect is the difference between two means (the treatment mean and the control mean)
  - Adding sample to the treatment group increases the accuracy of the treatment mean; the same holds for the control group
  - But there are diminishing returns to adding sample
  - If the treatment group is much bigger than the control group, the marginal person adds little to the accuracy of the treatment mean, but would add more to the control mean
  - Thus we maximize the accuracy of the estimated difference with equal numbers in treatment and control
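The diminishing-returns argument can be seen directly: the variance of a difference in means is proportional to 1/n_T + 1/n_C, so for a fixed total sample the 50/50 split minimizes it. A quick check:

```python
# Variance of the estimated difference is proportional to 1/n_T + 1/n_C.
# For a fixed total sample N, find the treatment-group size that minimizes it.
N = 100
variance = {n_t: 1 / n_t + 1 / (N - n_t) for n_t in range(10, N, 10)}
best_split = min(variance, key=variance.get)
print(best_split, round(variance[best_split], 3))
```

The minimum is at n_T = 50, i.e. equal allocation.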
• 103. Summary of Power Factors
  - Hypothesized effect size. Q: does a larger effect size make power increase or decrease?
  - Variance. Q: does greater residual variance make power increase or decrease?
  - Sample size. Q: does a larger sample size make power increase or decrease?
  - Critical value. Q: does a looser critical value make power increase or decrease?
  - Allocation ratio. Q: does an unequal allocation ratio make power increase or decrease?
• 104. Power Equation (MDE):

  MDE = (t_(1−κ) + t_α) · √(1 / (P(1 − P))) · √(σ² / N)

  where MDE is the minimum detectable effect size, σ² the variance, N the sample size, P the proportion allocated to treatment, t_α the critical value for the chosen significance level, and t_(1−κ) the t-value for the desired power (1 − κ).
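The equation above can be evaluated directly. The sketch below uses the normal approximation to the t statistics (an assumption; with small samples you would use the t distribution), with hypothetical inputs:

```python
from statistics import NormalDist

def mde(n, sd, p=0.5, alpha=0.05, power=0.80):
    """Minimum detectable effect for a simple two-arm RCT, per the slide's
    formula, using normal approximations to the t statistics."""
    z = NormalDist()
    t_alpha = z.inv_cdf(1 - alpha / 2)   # significance level (two-sided)
    t_power = z.inv_cdf(power)           # 1 - kappa
    return (t_alpha + t_power) * (1 / (p * (1 - p))) ** 0.5 * sd / n ** 0.5

# Hypothetical study: 500 students, sd = 20, half allocated to treatment.
print(round(mde(n=500, sd=20), 2))
```

With these inputs the smallest effect detectable at 80% power is about a quarter of a standard deviation.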
• 105. Clustered RCT Experiments
  - Cluster randomized trials are experiments in which social units or clusters, rather than individuals, are randomly allocated to intervention groups
  - The unit of randomization (e.g. the village) is broader than the unit of analysis (e.g. farmers)
  - That is: randomize at the village level, but use farmer-level surveys as the unit of analysis
• 106. Clustered Design: Intuition
  - We want to know how much rice the average farmer in Sierra Leone grew last year
  - Method 1: randomly select 9,000 farmers from around the country
  - Method 2: randomly select 9,000 farmers from one district
• 107. Clustered Design: Intuition II
  - Some parts of the country may grow more rice than others in general; what if one district had a drought, or a flood? (i.e. we worry about both long-term correlations and correlated shocks within groups)
  - Method 1 gives the most accurate estimate
  - Method 2 is much cheaper, so for a given budget we could sample more farmers
  - What combination of 1 and 2 gives the highest power for a given budget constraint?
  - It depends on the level of intracluster correlation, ρ (rho)
• 108. Low Intracluster Correlation [Figure: variation in the population, clusters, and sampled clusters]
• 109. HIGH Intracluster Correlation [Figure]
• 110. Intracluster Correlation
  - Total variance can be divided into between-cluster variance (τ²) and within-cluster variance (σ²)
  - When the variance within clusters is small and the variance between clusters is large, the intracluster correlation is high (previous slide)
  - Definition: the intracluster correlation (ICC) is the proportion of total variation explained by between-cluster variance
  - Note: when within-cluster variance is high relative to between-cluster variance, the ICC is low
  - icc = ρ = τ² / (σ² + τ²)
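A crude ANOVA-style estimate of ρ from made-up clustered data (the villages and scores below are hypothetical; real ICC estimators also adjust for cluster sizes and degrees of freedom):

```python
from statistics import mean, pvariance

# Hypothetical test scores grouped by village: similar within, different between.
clusters = [
    [45, 50, 48, 52],   # village A
    [20, 25, 22, 23],   # village B
    [70, 72, 69, 75],   # village C
]

tau2 = pvariance([mean(c) for c in clusters])    # between-cluster variance
sigma2 = mean(pvariance(c) for c in clusters)    # average within-cluster variance
icc = tau2 / (tau2 + sigma2)
print(round(icc, 3))   # close to 1: scores are driven by which village you are in
```

Flattening the between-village differences (e.g. shifting every village toward the same mean) would push the ICC toward 0.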
• 111. HIGH Intracluster Correlation [Figure]
• 112. Low Intracluster Correlation [Figure]
• 113. Power with Clustering:

  MDE = (t_(1−κ) + t_α) · √(1 / (P(1 − P))) · √((σ² / N) · (1 + (m − 1)ρ))

  where m is the average cluster size and ρ the intracluster correlation; σ², N, P, t_α and t_(1−κ) (variance, sample size, proportion in treatment, significance level, power) are as in the simple power equation.
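Relative to the simple power equation, clustering multiplies the MDE by the design effect √(1 + (m − 1)ρ). A sketch with hypothetical numbers, using the same normal approximation as before:

```python
from statistics import NormalDist

def clustered_mde(n, m, rho, sd, p=0.5, alpha=0.05, power=0.80):
    """MDE with clustering: the simple MDE inflated by the design effect
    sqrt(1 + (m - 1) * rho). Normal approximation to the t statistics."""
    z = NormalDist()
    t = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    deff = (1 + (m - 1) * rho) ** 0.5
    return t * (1 / (p * (1 - p))) ** 0.5 * (sd / n ** 0.5) * deff

# Same total sample of 500, but grouping it into clusters of 20 with rho = 0.1
# inflates the MDE by sqrt(1 + 19 * 0.1), roughly a factor of 1.7.
print(round(clustered_mde(n=500, m=1, rho=0.1, sd=20), 2),
      round(clustered_mde(n=500, m=20, rho=0.1, sd=20), 2))
```

With m = 1 (no clustering) the design effect is 1 and the simple MDE is recovered.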
• 114. Clustered RCTs vs. Clustered Sampling
  - You must cluster at the level at which you randomize (there are many reasons to randomize at the group level)
  - You could randomize by farmer group, village, or district
  - If you randomize one district to treatment and one to control, you have too little power no matter how many farmers you interview: you can never distinguish the treatment effect from possible district-wide shocks
  - If you randomize at the individual level, you need not worry about within-village correlation or village-level shocks, as these affect both T and C
• 115. Bottom Line for Clustering
  - If the experimental design is clustered, we must consider ρ when choosing a sample size (along with the other factors)
  - Cluster at the level of randomization
  - It is extremely important to randomize an adequate number of groups
  - Often the number of individuals within groups matters less than the total number of groups
• 116. COMMON TRADEOFFS AND RULES OF THUMB
• 117. Common Tradeoffs
  - Answer one question really well, or many questions with less accuracy?
  - A large sample with possible attrition, or a small sample that we track very closely?
  - Few clusters with many observations, or many clusters with few observations?
  - How do we allocate our sample to each group?
• 118. Rules of Thumb
  - A larger sample is needed to detect differences between two variants of a program than between the program and the comparison group
  - For a given sample size, the highest power is achieved when half the sample is allocated to treatment and half to comparison
  - The more measurements are taken, the higher the power; in particular, a baseline and an endline give more power than an endline alone
  - The lower the compliance, the lower the power; the higher the attrition, the lower the power
  - For a given sample size, we have less power when randomization is at the group level than at the individual level