1) Both summative and formative evaluations are important to measure program effectiveness, with summative evaluating overall impact and formative providing feedback to improve the program.
2) Measuring student growth over time, such as with NWEA assessments, and comparing results to internal or national benchmarks allows evaluation of a program's impact.
3) More rigorous evaluation with techniques like randomized experiments or longitudinal cohort analysis is warranted for larger programs requiring more resources. The closer a program is to the classroom, the larger its likely impact.
1. Using Growth to Measure Program Effectiveness
Andy Hegedus, Ed. D.
Kingsbury Center at NWEA
April 2013
2. Summative vs. Formative Evaluations
• Summative
–Was the program effective?
• Formative
–How can we improve the program?
Good evaluations include both elements
3. What defines effective?
• The new program produced better results than . . .
–The previous program
–A comparison group
4. Measuring Growth
[Chart: Grade 5 Math scores (y-axis 185–215) at Spring, Fall, Winter, Spring test events; series: 2011 Grade 5 Math]
5. Measuring Improvement: Two Different Cohorts
[Chart: Grade 5 Math scores (y-axis 185–220) at Spring, Fall, Winter, Spring test events; series: 2011 Grade 5 Math, 2012 Grade 5 Math]
6. Measuring Improvement: Comparing to a Benchmark
[Chart: Grade 5 Math scores (y-axis 185–220) at Spring, Fall, Winter, Spring test events; series: 2011 Grade 5 Math, 2012 Grade 5 Math, Comparison Group]
7. What are some Comparison Groups?
• Internal Group
• National Group
–NWEA 2011 Growth Norms
• Matched Group
–NWEA Virtual Comparison Groups (VCGs)
8. How much rigor?
• The more ineffective the program, the greater the bias toward action
• Changes that are more resource-intensive require more careful and rigorous evaluation
– Disciplined screening including research review
– Structured pilot
– Rigorous evaluation of pilot
• For larger changes, consider piloting multiple alternatives to improve your odds
23. Quality of resources and support
• Quality of professional development
• Quality of support materials, texts, etc.
• Availability and quality of implementation support
24. Some common evaluation designs
• Randomized Experiment
• Quasi-experiment
• Time-series
27. Considerations
• Minimum size for a good study
• Grouping by schools? By classrooms? By students?
• Risks associated with non-random selection
– Not equivalent groups
– Volunteer effect
28. Time-Series
[Diagram: Target Population → Selection, then two arms:
– Pretest → Pretest → Intervention → Post-test
– Pretest → Pretest → Business as Usual → Post-test]
29. Historical information can help
[Chart: quarterly results for Intervention vs. Control, Quarter 1 through Quarter 4 (y-axis 0–120), with a 95% error band]
30. Closing points
• This is not rocket science
–You can do this stuff
• Good measures properly used are instrumental
• Evaluate with the rigor that is proportionate to the stakes
–Get the expertise to help if needed
You saw my teacher-evaluation hat earlier; this is my research hat. Day to day I manage a research project with a large Idaho district on the impact of a new PD program, Keeping Learning on Track, on student performance. It is all about in-the-moment evidence gathering and adjustment by students, peers, and teachers. The team includes two people from the hard statistical research side (evaluation modeling), two from the Kingsbury Center (survey construction and analysis), plus me and a project manager. I'm not an expert at this, but I do have some experience to share. Things change when you move from status to growth. What's the story line within the data, and what's the impact of the change? Should we act, or should we gather more evidence? This is program evaluation rather than teacher evaluation or school evaluation. A program can be curriculum, special education, RtI, and so on.
Summative asks: did it work, yes or no? Formative gathers information along the way to understand more of the "why" underneath the results. A good evaluation is both.
Implement something new and see if it does better than before; if so, it is effective (a time-series design). Or: if it does better than a comparison group, it is effective (a control design). Which you choose depends on context and available data.
This is a learning trajectory, assuming the group of kids stays intact. All kids grow; they improve because there is instruction. Knowing growth by itself doesn't tell you that the program is effective. There is no reference point.
One approach is to measure two different cohorts. Here the new cohort shows a little less summer loss and a slightly higher growth trajectory, so the new cohort did better than the old cohort. Distinguish between growth of a group and improvement: improvement means this year's group did better than last year's group.
Now look at growth relative to a benchmark. Under the past math program, the summer loss may or may not be the program. We are seeing growth; the 2012 group did better than the 2011 group, and 2012 also did better than the comparison group. That is fairly compelling evidence that the program is doing well.
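The cohort-to-cohort and benchmark comparisons described here can be sketched in a few lines of code. A minimal Python example with invented mean scores; the cohort and benchmark numbers are illustrative only, not NWEA data:

```python
# Hypothetical mean Grade 5 Math scores at four test events.
# All numbers are fabricated for illustration, not actual NWEA results.
events = ["Spring (prior)", "Fall", "Winter", "Spring"]
cohort_2011 = [204.0, 198.0, 203.0, 208.0]   # last year's Grade 5
cohort_2012 = [204.0, 200.0, 206.0, 212.0]   # this year's Grade 5
benchmark   = [204.0, 199.0, 204.0, 209.0]   # comparison group

def growth(scores):
    """Fall-to-spring growth for one group."""
    return scores[-1] - scores[1]

# Improvement: this year's group vs. last year's group.
improvement = growth(cohort_2012) - growth(cohort_2011)
# Effectiveness signal: this year's growth vs. the comparison group's growth.
vs_benchmark = growth(cohort_2012) - growth(benchmark)

print(f"2011 growth: {growth(cohort_2011):.1f}")          # 10.0
print(f"2012 growth: {growth(cohort_2012):.1f}")          # 12.0
print(f"Improvement over prior cohort: {improvement:.1f}") # 2.0
print(f"Growth above comparison group: {vs_benchmark:.1f}") # 2.0
```

With these made-up numbers, the 2012 cohort both improved on the 2011 cohort and outgrew the benchmark, the "fairly compelling" pattern the slide describes.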
How much effort should you put into the evaluation? Balance it against your resources and the nature of the problem. Guidelines: the greater the dissatisfaction, the more the bias toward action rather than evaluation; the more resources involved, the more careful you should be (think of a new mathematics program in a very large district). Before you begin, take three steps: disciplined screening including a research review, a structured pilot, and rigorous evaluation of the pilot. It is rare for a program to produce great results, and often the cause can be attributed to low implementation fidelity. For a large change, try a few alternatives on a small scale, evaluate them, and see which gives you the best results.
The current program shows no growth. For "Hypersonic" the evidence is not conclusive that it is better: the portion of the error band that overlaps the current program's result tells you the probability that it is not actually better. Since about 90% of that band is above current math, there is a very strong likelihood that Hypersonic math will yield better results. A pure research scientist may not call it conclusive because of the confidence interval, but practically it is likely a solid investment.
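The idea that "the portion of the error band above the current program tells you how likely the new program is to be better" can be made concrete with a normal approximation. A sketch using Python's standard library; the means and standard error below are invented for illustration:

```python
from statistics import NormalDist

# Illustrative numbers only: estimated mean growth under the pilot program
# (with a standard error) and observed growth under the current program.
pilot_mean, pilot_se = 8.0, 1.5
current_mean = 6.0

# Probability the pilot's true effect exceeds the current program's result,
# treating the pilot estimate as Normal(pilot_mean, pilot_se).
p_better = 1 - NormalDist(pilot_mean, pilot_se).cdf(current_mean)
print(f"P(pilot > current) = {p_better:.2f}")

# The 95% error band around the pilot estimate, as drawn in the chart.
z = NormalDist().inv_cdf(0.975)
band = (pilot_mean - z * pilot_se, pilot_mean + z * pilot_se)
print(f"95% band: {band[0]:.1f} to {band[1]:.1f}")
```

With these numbers roughly 91% of the pilot's distribution sits above the current program, yet the 95% band still overlaps it, exactly the "practically compelling but not formally conclusive" situation described.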
Second example: here the current program shows some gain. In this case the decision may depend on the cost of implementation. I would be reluctant to invest based on this evidence alone, given the possibility of error; a small investment to expand the pilot may be worth considering.
What's wrong with 5th grade math? The result is slightly below average. Why, if all the other grades are above average, is fifth grade below? There are two ways to look at it. The first is a cross-sectional analysis.
Cross-sectional analysis: look at the growth of successive fifth grades to see whether the pattern is the same as prior groups in the same program (you could use more than two years). If fifth grade did not do well in two successive years, the program is suspect. Another possibility is that a particular cohort of students is the issue.
Longitudinal analysis: the same group was doing well in the prior year, yet fifth grade is low for both cohorts. The result then lies with the program, not the kids. If there is a potential issue with the kids, you want to follow a group from 3rd through 6th grade; for a program issue, use the cross-sectional view.
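The difference between the two analyses is just how you slice the same table of scores. A minimal Python sketch with fabricated grade-level means (none of these numbers are real data):

```python
# scores[year][grade] = mean score for that grade in that year (fabricated).
scores = {
    2011: {3: 198, 4: 206, 5: 209},
    2012: {3: 199, 4: 207, 5: 210},
}

# Cross-sectional: compare successive Grade 5 groups (different kids,
# same program). A persistent dip points at the program.
cross_sectional = [scores[y][5] for y in sorted(scores)]

# Longitudinal: follow one cohort, Grade 4 in 2011 into Grade 5 in 2012
# (same kids, different program year). A dip here despite a strong prior
# year also points at the program rather than the cohort.
longitudinal = [scores[2011][4], scores[2012][5]]

print("Grade 5 across years:", cross_sectional)
print("Same cohort, Grade 4 to 5:", longitudinal)
```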
How likely is the program to have an effect, and how do you weigh that when you are designing an evaluation? A program delivered by teachers in the classroom, through curriculum or resources, is likely to have a larger impact than one further removed from the school.
Hypersonic changes what is taught and how it's taught; every kid will see different material because of the new resources. Power of Inquiry is PD to help teachers use more inquiry-based learning, so fidelity of implementation becomes the issue; it changes just what is taught. PLCs are teachers sitting with data, student performance, and student work, evaluating it against the school improvement plan and deciding what to work on: one more step away from the classroom, though a PLC can perhaps sustain something over time. Giving laptop computers to seniors to improve reading skills? Maybe they go on the web and read The Atlantic Monthly, but I don't see the connection; an impact is unlikely. Judge the likelihood of an impact. That doesn't mean interventions that are more removed have no impact, but the further away an intervention is, the more important it is to have implementation-fidelity measures: observations, high-quality PD, measure it all. Otherwise, if there is no result, you can't tell whether it is the program or what you did to implement it.
Outcomes: how much the kids learned. Fidelity: how reliably schools and teachers took what they learned in PD and applied it in the classroom. Quality: how good was the PD? The PD materials? The presenter? The time given for practice and learning?
Measured over two years: 8 points of growth in 2011 and slightly less in 2012. On its face, the program slipped in effectiveness.
Let's look at fractions specifically. The overall score went down because other strands declined; something in the way the intervention was delivered caused other areas to slip. "This year we will work on computation skills and measure by that goal area": no improvement in overall mathematics, possibly because so much focus went to the intervention rather than other areas. If you work on only one piece of the domain, the remainder of the curriculum may suffer. You want to confirm that the intended impact appears and also look more broadly for unintended consequences across the domain.
Power of Inquiry: how many teachers implemented it, and how do you know? Surveys are self-report, good formative information but not necessarily the most reliable. Survey multiple people on the same topic, teachers and students. Ask both hard and perceptual questions, and compare the answers to standards for implementation fidelity. Principals can observe as part of the formative evaluation: a friendly visit and a conference afterwards. Drop-in visits a couple of times a week, getting into all classrooms, give you a sampling to gauge the level of implementation.
This is the district-level view.
Let's dig under the district-wide data. For example, you can look at schools: two did really well. You could also look at grades, or at sub-groups.
We need to understand more about those schools. Is it replicable? What did they do that others didn't? Which of those differences is likely to have caused these results? How do you know if you don't gather fidelity data?
Know your outcomes, know fidelity of implementation, and know why there might be implementation differences: How doable is this? Is the PD good? Does it supply everything people need?
Study designs: three are typically used. Let's discuss each one separately.
Randomized experiments fit high stakes and high resource investment. Randomization gets you equivalence between two groups. Randomize by school in a big district; randomizing by classroom, where some get the program and some don't, risks bleed-over. If it looks good, implement everywhere quickly.
Quasi-experiments don't randomly assign to treatment, so the groups may not be equivalent. They are fine when the investment isn't huge or when random assignment is not possible. The more people who participate, and the more diverse they are, the more likely you are to get good data.
Time series looks at the current program over time. MAP results beforehand can give you the baseline; then do the intervention and look at the change. Same kids: get the baseline in the first semester, run the intervention in the second.
For long-time MAP users, historical baseline data can also provide good context for pre- and post-intervention comparisons. In the KLT evaluation we are looking at three years of student data to get growth trajectories for students as a group and for each participating and control teacher. Once we know those trajectories, we look for deflections as the intervention occurs.
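The deflection idea can be sketched as: fit a trend to the pre-intervention scores, project it forward, and compare the observed post-intervention scores to that projection. A minimal least-squares example in Python (all scores fabricated):

```python
# Pre-intervention test events (term index) and mean scores (fabricated).
pre_x = [0, 1, 2, 3]
pre_y = [200.0, 202.5, 204.8, 207.1]

# Ordinary least-squares slope and intercept for the baseline trend.
n = len(pre_x)
mx, my = sum(pre_x) / n, sum(pre_y) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(pre_x, pre_y))
         / sum((x - mx) ** 2 for x in pre_x))
intercept = my - slope * mx

# Post-intervention observations compared to the projected baseline:
# a positive deflection suggests the intervention bent the trajectory up.
post_x = [4, 5]
post_y = [211.0, 214.2]
deflections = [y - (intercept + slope * x) for x, y in zip(post_x, post_y)]
print("Deflections from baseline:", [round(d, 2) for d in deflections])
```

In a real evaluation you would do this per teacher or per group and ask whether deflections are consistently positive, rather than trusting any single point.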
You can do this. Good measures are not always MAP; they can be what teachers are already doing in the classroom. When millions are at stake, be careful and rigorous.
What if different cohorts are coming in and they aren't equivalent, a notably good or bad 4th grade, say? Look at whether growth exceeds expectations consistently; a one-time result may not be the program. The larger the study, the smaller the cohort effect: 100 classrooms across a school system versus 3 in one school. More numbers mean less cohort effect.