Over the last few years I have grown more skeptical about relying on test scores for accountability purposes. I think tests have very limited potential to help distant policymakers, regulators, portfolio managers, foundation officials, and other policy elites identify with confidence which schools are good or bad, which ought to be opened, expanded, or closed, and which programs are working or failing. The problem, as I’ve pointed out in several pieces now, is that in using tests for these purposes we are assuming that if we can change test scores, we will change later outcomes in life. We don’t really care about test scores per se; we care about them because we think they are near-term proxies for later life outcomes that we really do care about, like graduating from high school, going to college, getting a job, earning a good living, and staying out of jail.
But what if changing test scores does not regularly correspond with changing life outcomes? What if schools can do things to change scores without actually changing lives? What evidence do we actually have to support the assumption that changing test scores is a reliable indicator of changing later life outcomes?
This concern is similar to issues that have arisen in other fields about the reliability of near-term indicators as proxies for later life outcomes. For example, as one of my colleagues noted to me, there are medicines that lower cholesterol levels but do not reduce, and may even increase, mortality from heart disease. It’s important that we think carefully about whether we are making the same type of mistake in education.
If increasing test scores is a good indicator of improving later life outcomes, we should see changes in scores and changes in later outcomes of roughly the same direction and magnitude in most rigorously identified studies. We do not. I’m not saying we never see a connection between changing test scores and changing later life outcomes (e.g., Chetty et al.); I’m just saying that we do not regularly see that relationship. For an indicator to be reliable, it should yield accurate predictions nearly all, or at least most, of the time.
To illustrate the unreliability of test score changes, I’m going to focus on rigorously identified research on school choice programs where we have later life outcomes. We could find plenty of examples of a disconnect in research on other policy interventions, such as pre-school programs, but I am focusing on school choice because I know this literature best. The fact that we can find a disconnect between test score changes and later life outcomes in any literature, let alone in several, should undermine our confidence in test scores as a reliable indicator.
I should also emphasize that by looking at rigorous research I am rigging things in favor of test scores. If we explored the most common use of test scores, examining the level of proficiency, we would find no credible researchers who believe that it is a reliable indicator of school or program quality. Even measures of growth in test scores, or value-added models (VAM), are not rigorously identified indicators of school or program quality, as they do not reveal what the growth would have been in the absence of that school or program. So, I think almost every credible researcher would agree that the vast majority of ways in which test scores are used by policymakers, regulators, portfolio managers, foundation officials, and other policy elites cannot be reliable indicators of the ability of schools or programs to improve later life outcomes.
With the evidence below I am exploring the largely imaginary scenario in which test score changes can be attributed to schools or programs with confidence. Even then, the direction and magnitude of changes in test scores do not regularly correspond with changes in later life outcomes. I’ve identified 10 rigorously designed studies of charter and private school choice programs with later life outcomes. I’ve listed them below with a brief description of their findings and hyperlinks so you can read the results for yourself.
Notice any patterns? Other than the general disconnect between test scores and later life outcomes (in both directions), I notice that the No Excuses charter model, which is currently the darling of the ed reform movement and which New York Times columnists have declared to be the only type of “Schools that Work,” tends not to fare nearly as well on later outcomes as it does on test scores. Meanwhile, the unfashionable private choice schools and Mom and Pop charters seem to do much better on later life outcomes than on test scores. I don’t highlight this pattern as proof that we should shy away from No Excuses charters. I only mention it to suggest ways in which over-relying on test scores and declaring with confidence that we know what works and what doesn’t can lead to big policy mistakes.
Here are the 10 studies:
1. Boston charters (Angrist et al., 2014) – Huge test score gains, no increase in HS grad rate or postsecondary attendance. Shift from 2-year to 4-year colleges
2. Harlem Promise Academy (Dobbie and Fryer, 2014) – Same as Boston charters
3. KIPP (Tuttle et al., 2015) – Large test score gains, no or small effect on HS grad rate, depending on analysis used
4. High Tech High (Beauregard, 2015) – Widely praised for improving test scores, no increase in college enrollment
5. SEED Boarding Charter (Unterman et al., 2016) – Same as Boston charters
6. TX No Excuses charters (Dobbie and Fryer, 2016) – Increased test scores and college enrollment, but no effect on earnings
7. Florida charters (Booker et al., 2014) – No test score gains but large increases in HS grad rate, college attendance, and earnings
8. DC vouchers (Wolf et al., 2013) – Little or no test score gain but large increase in HS grad rate
9. Milwaukee vouchers (Cowen et al., 2013) – Same as DC vouchers
10. New York vouchers (Chingos and Peterson, 2013) – Modest test score gain, larger college enrollment improvement
– Jay P. Greene
Jay P. Greene is endowed chair and head of the Department of Education Reform at the University of Arkansas.