Presentation for the fourth meeting of the EARLI SIG 18 Educational Effectiveness.
Abstract: Comparative international large-scale assessments (LSAs) have long attracted attention from policy makers and educational researchers, and with it criticism. One criticism is that the complex sampling design of LSAs is not always taken into account. This paper demonstrates the consequences of ignoring the sampling design of one such assessment, TIMSS 2011. Three features (weights, proficiency estimation with plausible values, and variance estimation with the jackknife) are examined in single-level (students) and multilevel (students and schools) cases. The results show that the consequences can be considerable, but that they are not completely in line with previous literature.
Demonstrating the consequences of not taking into account sampling designs with TIMSS 2011 data
1. Demonstrating the
consequences of not taking into
account sampling designs with
TIMSS 2011 data
Dr. Christian Bokhove
Lecturer in Mathematics Education
University of Southampton
EARLI SIG
August 28th 2014
2. OUTLINE
• International studies
– IEA & OECD
– PISA, TIMSS, …
• Some aspects of their sampling design
– Two-stage sampling
– Weights
– Rotated test design
• What if you don’t take this into account?
– Simulation with TIMSS 2011 data
– Single level model
– Multilevel models
4. IEA & OECD
The International Association for
the Evaluation of Educational
Achievement (IEA) is an
independent, international
cooperative of national research
institutions and governmental
research agencies. It conducts
large-scale comparative studies of
educational achievement and
other aspects of education.
The mission of the Organisation
for Economic Co-operation and
Development (OECD) is to
promote policies that will
improve the economic and social
well-being of people around the
world.
5. PISA
http://www.oecd.org/pisa/
“The Programme for International Student Assessment (PISA) is a
triennial international survey which aims to evaluate education
systems worldwide by testing the skills and knowledge of 15-year-old
students. To date, students representing more than 70 economies have
participated in the assessment.”
• The most recent results appeared in 2013, based on 2012 data
6. TIMSS
http://timssandpirls.bc.edu/timss2011/
“TIMSS 2011 is the fifth in IEA’s series of international assessments of
student achievement dedicated to improving teaching and learning in
mathematics and science. First conducted in 1995, TIMSS reports every
four years on the achievement of fourth and eighth grade students.”
8. Two-stage sampling in educational studies
● Simple random sampling of students is rarely used in educational surveys:
– Too expensive (e.g., training test administrators and travel costs)
● Randomly selected students would attend many different schools
– It is not practical to contact that many schools
– A link with class, teacher and school variables is sought
● Sampling is usually conducted in two stages
● First stage
– Schools are selected
● Second stage
– Students (PISA) or classes (TIMSS/PIRLS) are selected
● 35 students selected randomly (PISA)
● One or two intact classes (TIMSS/PIRLS)
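The two stages above can be illustrated with a minimal sketch. It assumes simple random selection at both stages and a made-up population; the function name, school identifiers and sizes are hypothetical, and operational designs are more elaborate (for example, they use stratification and unequal selection probabilities for schools).

```python
import random

def two_stage_sample(schools, n_schools, n_students, seed=0):
    """Select schools first, then students within each selected school.

    `schools` maps a school id to its list of student ids. Both stages use
    simple random selection, which is a simplifying assumption.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(schools), n_schools)   # stage 1: schools
    sample = {}
    for school in chosen:
        students = schools[school]
        k = min(n_students, len(students))            # small schools: take everyone
        sample[school] = rng.sample(students, k)      # stage 2: students
    return sample

# Hypothetical population: 50 schools with 20 to 200 students each.
pop_rng = random.Random(1)
population = {
    f"S{i:02d}": [f"S{i:02d}-{j}" for j in range(pop_rng.randint(20, 200))]
    for i in range(50)
}

sample = two_stage_sample(population, n_schools=10, n_students=35)
print({school: len(students) for school, students in sample.items()})
```

Selecting intact classes instead of individual students (as in TIMSS/PIRLS) would replace the second stage with a draw of one or two class lists per school.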
9. Replicate weights
● Replicate weights or resampling techniques are used to calculate
correct standard errors in two-stage sampling designs
● The idea behind them:
– There are many possible samples of schools and not all of them yield the
same estimates
– Use different samples of schools to calculate estimates
– Take into account the error from selecting one school and not another
(sampling error)
● Each replicate weight represents one sample
● Variability between estimates reflects the sampling error
10. Two replication methods
● Jackknife
– TIMSS and PIRLS
– Schools are paired with other similar schools within zones
– A replicate is created for each zone or pair of schools
– For each replicate, one school in the corresponding zone is randomly
removed and the weight of the other school is doubled
● Balanced repeated replication (BRR)
– Select one school at random within each stratum
– Set its weight to 0
– Double the weight of the other school
– PISA uses a variant of BRR (Fay's method) that down-weights schools
instead of removing them, so the replicate samples do not shrink
Source: OECD (2009). PISA Data Analysis Manual: SPSS (2nd ed.). Paris: OECD Publishing.
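The paired jackknife described above can be sketched in a few lines. This is an illustration under stated assumptions, not the official procedure: the field names `score`, `weight`, `zone` and `rep` are hypothetical, `rep` flags which school of a pair is kept (1) or dropped (0), and the variance is taken as the sum over zones of the squared difference between each replicate estimate and the full-sample estimate.

```python
import math

def weighted_mean(rows, weights):
    """Weighted mean of the `score` field."""
    return sum(w * r["score"] for r, w in zip(rows, weights)) / sum(weights)

def jackknife_se(rows):
    """Paired jackknife standard error of a weighted mean.

    Each row holds a score, a sampling weight, a zone and a flag `rep`
    (0 or 1) identifying the school of the pair it belongs to.
    """
    base_w = [r["weight"] for r in rows]
    full = weighted_mean(rows, base_w)
    variance = 0.0
    for zone in sorted({r["zone"] for r in rows}):
        # Replicate for this zone: the school with rep == 0 gets weight 0,
        # the school with rep == 1 gets double weight, other zones are unchanged.
        rep_w = [
            w * 2 * r["rep"] if r["zone"] == zone else w
            for r, w in zip(rows, base_w)
        ]
        variance += (weighted_mean(rows, rep_w) - full) ** 2
    return full, math.sqrt(variance)

# Toy data: two zones, one student per school (illustrative values only).
rows = [
    {"score": 500, "weight": 1.0, "zone": 1, "rep": 0},
    {"score": 520, "weight": 1.0, "zone": 1, "rep": 1},
    {"score": 480, "weight": 1.0, "zone": 2, "rep": 0},
    {"score": 540, "weight": 1.0, "zone": 2, "rep": 1},
]
estimate, se = jackknife_se(rows)
print(f"mean = {estimate:.2f}, jackknife SE = {se:.2f}")
```

The point estimate uses only the full-sample weights; the replicates serve solely to estimate how much that estimate would vary across alternative school samples.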
12. Weights
• In theory, the sampling design provides student samples with equal
selection probabilities.
• In practice, variation in the number of classes selected and differential
patterns of nonresponse can result in varying selection probabilities,
requiring a unique sampling weight for the students in each participating
class in the study.
• Total weight (TOTWGT)
• Sums to the student population size in each country
• The overall student sampling weight is the product of the final weight
components for schools, classes, and students
• Important in multilevel analyses
• School level: final school weight
• Student level: final student weight multiplied with final class weight
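The weight arithmetic on this slide can be written out as a small sketch. The function and argument names are hypothetical and the numbers are invented; the only relationships taken from the slide are that the total weight is the product of the final school, class and student components, and that a multilevel analysis splits this product into a school-level and a student-level weight.

```python
def weight_components(school_wgt, class_wgt, student_wgt):
    """Split a total sampling weight into multilevel components.

    Arguments are the final (adjusted) weight components for the school,
    the class within the school, and the student within the class.
    """
    return {
        "total": school_wgt * class_wgt * student_wgt,   # overall student weight
        "level2": school_wgt,                            # school-level weight
        "level1": class_wgt * student_wgt,               # student-level weight
    }

# Invented components for one student.
w = weight_components(school_wgt=12.0, class_wgt=2.0, student_wgt=1.1)
print(w)
```

By construction the two multilevel weights multiply back to the total weight, which gives a quick consistency check when weights are prepared by hand.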
14. Rotated test design
● The item pool should include a large number of items for domain
validity (e.g., mathematical literacy)
● At the same time:
– Fatigue biases results of long tests
– Schools refuse to participate in lengthy studies
● Rotated test forms
– Students are assigned a subset of the item pool
– This minimizes testing time
15. Plausible values
● Rotated booklets introduce challenges for estimating academic
achievement
– Students have missing data on the items they were not given
● Plausible values methods are employed to obtain population
estimates with rotated booklet designs
● Students do not answer all items, but plausible scores are produced
as if they had responded to all items. These scores are based on
– Responses to test items
– Background characteristics
16. Plausible values
● Plausible values are random draws from the distribution of a
student's ability
– Instead of obtaining a point estimate, a range of values is estimated for
each student
● A single score cannot be calculated because data are missing for a
number of items
● Plausible values account for imputation error
– Inferences about ability are made from a small number of items
● Estimation should be conducted separately for each plausible value
– Typically five plausible values are considered
– The variability between estimates reflects the imputation error
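Combining the separate estimates can be sketched with the standard multiple-imputation formulas (Rubin, 1987): the point estimate is the mean of the per-value estimates, and the total variance is the average sampling variance plus (1 + 1/M) times the variance between the M estimates. The function name and the input numbers are hypothetical, and operational procedures may differ in detail, for example in how the sampling variance term is obtained.

```python
import math
from statistics import mean, variance

def combine_plausible_values(estimates, sampling_variances):
    """Combine per-plausible-value results with Rubin's rules.

    estimates:          one point estimate per plausible value
    sampling_variances: the sampling variance of each of those estimates
    Returns the combined point estimate and its standard error.
    """
    m = len(estimates)
    point = mean(estimates)
    within = mean(sampling_variances)     # average sampling variance
    between = variance(estimates)         # imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return point, math.sqrt(total)

# Invented results for five plausible values.
pv_means = [506.0, 507.5, 506.5, 507.0, 506.8]
pv_vars = [5.45 ** 2, 5.50 ** 2, 5.48 ** 2, 5.52 ** 2, 5.46 ** 2]
point, se = combine_plausible_values(pv_means, pv_vars)
print(f"combined mean = {point:.2f}, SE = {se:.2f}")
```

Averaging the plausible values first, or using only one of them, drops the between-value term, which is why those shortcuts tend to understate the standard error.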
17. Challenge
● Ignoring the complex design can lead to wrong conclusions, such as
different point estimates and/or underestimated standard errors (see
Rutkowski et al., 2010)
– Variance estimation: not using jackknife or BRR
– Weights: not taking them into account (e.g. Rutkowski et al., 2010: in the
Bulgarian TIMSS 2007 sample, students from vocational and profiled schools had a
higher probability of selection), or choosing the wrong composite weights in a
multilevel analysis
– Plausible values: averaging the (five) plausible values or choosing only one
plausible value, instead of applying Rubin’s rules
● Drent et al. (2013) formulated quality criteria (low, satisfactory, high)
● Standard software cannot handle replicate weights and plausible values
18. Available software
● IDB Analyzer (SPSS)
● NAEP Data Explorer (web tool)
● PISA SPSS macros
● R package ‘intsvy’ (Daniel Caro, Oxford)
– Free
– Does not rely on commercial software like SPSS or SAS
– Open source
– Can be extended to perform other analyses
19. Available software
Multilevel software
● R
– Has multilevel package but no weights
– Can link to MLwiN
● MLwiN
– Plausible values have to be
combined manually
– No resampling
– Does handle weights
● HLM
– Combines plausible values
– Weights
– No resampling
21. Simulation with TIMSS 2011 data
• TIMSS 2011
• Three aspects: jackknife, weights, plausible values
• Five countries:
England is chosen as the baseline, using the grade 8 ranking in TIMSS
2011. One arbitrary country significantly above England in the
rankings (Singapore) is chosen, as well as one country significantly
below England in the rankings (Norway). In addition, the countries
one place higher and one place lower than England are chosen
(United States and Hungary, respectively).
22. Simulation with TIMSS 2011 data
• Data preparation:
• Publicly available TIMSS 2011 grade 8 data files are used.
• Additional columns calculated: average of the five plausible values and
different weighting columns.
• Two experiments:
A. single level analyses, and
B. multilevel analyses with students nested in schools.
• Experiment A uses the open source R package intsvy (Caro, 2014).
• Experiment B looks at multilevel models by constructing null models
in HLM 6.08 for the five countries, with student and school levels.
23. Single level
Different scenarios:
• Two conditions concern variance estimation with jackknife (JK):
either jackknife is applied or isn’t applied.
• Two conditions concern weights (Wgt): either weights are applied or
are not applied.
• Finally, three conditions concern how the plausible values are used to
obtain the maths achievement scores.
• PVR denotes the correct approach using ‘plausible values with Rubin’s rules’.
• PVA denotes the ‘mean of the plausible values’.
• PV1 only uses ‘the first plausible value’.
A total of 2×2×3=12 cases are calculated, as shown in the table on the
next slide. Case 1 replicates the values from the international report
(Mullis, Martin, Foy, & Arora, 2012).
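The 2×2×3 design can be enumerated mechanically. The sketch below reproduces the case numbering used in the results table (Case 1 is PVR with jackknife and weights, Case 12 is PV1 without either); the variable names are illustrative only.

```python
from itertools import product

pv_methods = ["PVR", "PVA", "PV1"]        # treatment of plausible values
weights = ["With Wgt", "No Wgt"]          # sampling weights applied or not
jackknife = ["With JK", "No JK"]          # jackknife variance estimation or not

# The last factor varies fastest, so cases 1-4 are PVR, 5-8 PVA and 9-12 PV1.
cases = dict(enumerate(product(pv_methods, weights, jackknife), start=1))

for number, (pv, wgt, jk) in cases.items():
    print(f"Case {number:2d}: {pv}, {jk}, {wgt}")
```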
24. Maths achievement scores and standard errors for five countries for twelve different cases with weights, jackknife and plausible values.

PV1        Case 9                Case 10               Case 11               Case 12
           With JK, With Wgt     No JK, With Wgt       With JK, No Wgt       No JK, No Wgt
Country    Score    SE    #      Score    SE    #      Score    SE    #      Score    SE    #
Singapore  609.71   3.68  1      609.71   1.08  1      606.22   3.63  1      606.22   1.08  1
USA        508.75   2.58  2      508.75   0.75  2      508.92   2.52  4      508.92   0.74  4
England    506.03   5.45  3      506.03   1.36  3      509.44   5.59  3      509.44   1.37  3
Hungary    504.75   3.44  4      504.75   1.22  4      513.38   2.96  2      513.38   1.16  2
Norway     475.24   2.38  5      475.24   1.03  5      477.04   2.62  5      477.04   1.03  5

PVA        Case 5                Case 6                Case 7                Case 8
           With JK, With Wgt     No JK, With Wgt       With JK, No Wgt       No JK, No Wgt
Country    Score    SE    #      Score    SE    #      Score    SE    #      Score    SE    #
Singapore  610.99   3.73  1      610.99   1.06  1      607.54   3.68  1      607.54   1.06  1
USA        509.48   2.59  2      509.48   0.73  2      509.68   2.53  4      509.68   0.72  4
England    506.76   5.48  3      506.76   1.34  3      509.99   5.64  3      509.99   1.35  3
Hungary    504.81   3.48  4      504.81   1.21  4      513.47   2.98  2      513.47   1.15  2
Norway     474.64   2.37  5      474.64   0.99  5      476.55   2.64  5      476.55   1.00  5

PVR        Case 1                Case 2                Case 3                Case 4
           With JK, With Wgt     No JK, With Wgt       With JK, No Wgt       No JK, No Wgt
Country    Score    SE    #      Score    SE    #      Score    SE    #      Score    SE    #
Singapore  610.99   3.77  1      610.99   0.83  1      607.54   3.74  1      607.54   0.87  1
USA        509.48   2.63  2      509.48   0.55  2      509.68   2.58  4      509.68   0.57  4
England    506.76   5.53  3      506.76   0.89  3      509.99   5.63  3      509.99   0.70  3
Hungary    504.81   3.48  4      504.81   0.47  4      513.47   2.98  2      513.47   0.40  2
Norway     474.64   2.44  5      474.64   0.55  5      476.55   2.66  5      476.55   0.50  5
25. Observations
Differences in achievement results and standard errors:
• Not taking into account jackknife (example in yellow)
• The average score stays the same.
• The standard error is underestimated.
• So the relative ranking is unchanged, but significance testing is affected.
• Not taking into account weights (example in orange)
• Influences achievement scores: USA, England, Hungary and Norway scoring
higher, and Singapore scoring lower.
• Impact on relative rankings.
• Standard errors different, some higher some lower.
• Plausible values (example in green)
• PVA and PVR give the same achievement score; PV1 differs.
• PVA and PV1 underestimate the standard error.
• But there is no clear pattern between PVA and PV1 (which contradicts previous literature).
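The effect on significance testing can be made concrete with two rows of the results table. The sketch below compares the USA and England under PVR with weights, once with jackknife standard errors (Case 1) and once without (Case 2). It assumes the two country samples are independent and uses a normal approximation with a 1.96 cut-off; the function name is hypothetical.

```python
import math

def z_for_difference(score_a, se_a, score_b, se_b):
    """z statistic for the difference between two independent means."""
    return (score_a - score_b) / math.sqrt(se_a ** 2 + se_b ** 2)

# USA versus England, PVR with weights (scores and SEs from the results table).
z_with_jk = z_for_difference(509.48, 2.63, 506.76, 5.53)   # Case 1
z_no_jk = z_for_difference(509.48, 0.55, 506.76, 0.89)     # Case 2

for label, z in [("With JK", z_with_jk), ("No JK", z_no_jk)]:
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"{label}: z = {z:.2f} ({verdict} at the 5% level)")
```

The score difference is identical in both cases (2.72 points); only the standard errors change, and with them the conclusion of the test.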
26. Multilevel
HLM is used; it does not offer jackknife variance estimation
• Note that with MLwiN you need to
combine plausible values manually.
• Three conditions concern weights:
no weights, weights only at student
level (see Willms & Smith, 2005)
and final weights (Rutkowski et al.,
2010).
• Three conditions for the maths
achievement scores are used for
Plausible Values. PVR denotes the
correct approach using ‘plausible
values with Rubin’s rules’. PVA
denotes the ‘mean of the plausible
values’. PV1 only uses ‘the first
plausible value’.
• The 3×3 scenarios are reported in
table 3.
27. Maths achievement scores and standard errors of five countries for multilevel null models in three different
weighting scenarios S1, S4 and S6 and plausible values.
28. Observations
Differences in achievement results and standard errors:
• The different weighting methods greatly influence achievement scores and
standard errors. This also has an impact on the relative rankings. There does
not seem to be a pattern in over- or underestimation of scores and standard
errors.
• For plausible values, the PV1 cases yield a different average than PVA and
PVR: lower in three cases, the exceptions being Hungary and Norway. For PVA
and PV1, the standard error is underestimated with respect to PVR. However,
the underestimation of SEs differs only slightly between PVA and PV1, with PVA
in most cases being closer to or just as close to PVR as PV1.
Singapore      PV1  PVA  PVR
United States  PVA  PVR  PV1
England        PV1  PVA  PVR
Hungary        PV1  PVA  PVR
Norway         PVA  PV1  PVR
29. Final thoughts
• Not taking into account three features of complex sample designs for
LSAs can have a big influence on achievement scores, standard errors
and rankings.
• Confirms findings by Rutkowski et al. (2010).
• Not all ‘rules of thumb’ from previous literature (Drent et al., 2013;
Rutkowski et al., 2010) seem to hold.
• Therefore, caution should always be taken when analysing LSA data,
hopefully improving future LSA analyses by educational researchers.
• Need transparent methodology
THANK YOU
C.Bokhove@soton.ac.uk
QUESTIONS/DISCUSSION
30. Relevant references
Beaton, A.E., & Gonzalez, E.J. (1995). NAEP Primer. Chestnut Hill, MA: Center for the Study of Testing,
Evaluation and Educational Policy, Boston College.
Caro, D. (2014). intsvy: International Assessment Data Manager. R package version 1.3.
http://CRAN.R-project.org/package=intsvy
Drent, M., Meelissen, M.R.M., & van der Kleij, F.M. (2013). The contribution of TIMSS to the link between
school and classroom factors and student achievement. Journal of Curriculum Studies, 45(2), 198-224.
Goldstein, H. (2004). International comparisons of student attainment: some issues arising from the PISA
study. Assessment in Education, 11(3), 319-330.
Kreiner, S., & Christensen, K. B. (2014). Analyses of model fit and robustness. A new look at the PISA
scaling model underlying ranking of countries according to reading literacy. Psychometrika, 79(2), 210-231.
Martin, M.O. & Mullis, I.V.S. (Eds.). (2012). Methods and procedures in TIMSS and PIRLS 2011. Chestnut
Hill, MA: TIMSS & PIRLS International Study Center, Boston College.
Mullis, I.V.S., Martin, M.O., Foy, P., & Arora, A. (2012). TIMSS 2011 International results in mathematics.
Lynch School of Education, Boston College.
Rubin, D. (1987). Multiple imputation for nonresponse in sample surveys. New York: John Wiley.
Rutkowski, L., Gonzalez, E., Joncas, M., & von Davier, M. (2010). International large-scale assessment data:
Issues in secondary analysis and reporting. Educational Researcher, 39(2), 142-151.
Von Davier, M., Gonzalez, E., & Mislevy, R.J. (2009). Plausible values: What are they and why do we need
them? IERI Monograph Series: Issues and Methodologies in Large-Scale Assessments, 2, 9-36.
Willms, J.D., & Smith, T. (2005). A manual for conducting analyses with data from TIMSS and PISA. Report
prepared for UNESCO Institute for Statistics.