A Realistic Perspective on High-Stakes Testing
In a recent Education Next review (“A Gloomy Perspective on High-Stakes Testing”, vol. 18, no. 2), Dan Goldhaber takes aim at my new book, The Testing Charade: Pretending to Make Schools Better, in which I argue that test-based accountability has failed on balance and that it is time to explore alternatives. Goldhaber asserts that Charade is incomplete and slanted. However, his review mischaracterizes the book’s arguments and fails to rebut many of its core contentions.
Goldhaber writes that I characterize test-based accountability (TBA) as “an unmitigated disaster.” I do not. Rather, I line up the accumulated evidence about the effects of test-based accountability in the U.S., both positive and negative. I argue that the negative effects greatly outweigh the positive and that it is therefore reasonable to deem the approach a failure.
In Charade, I discuss five principal effects of TBA that lead to this conclusion. I’ll briefly list them here, along with Goldhaber’s response.
Positive effects on achievement. I write that the most substantial positive effect of TBA has been improved mathematics performance among younger students. I describe the impressive gains in scores on the National Assessment of Educational Progress (NAEP) and explain that the most credible evidence attributes part of these gains to TBA. Goldhaber’s response, oddly enough, is to summarize precisely the same evidence. This only appears to “contradict” my conclusions, as he wrote, because he mischaracterizes Charade as a one-sided attack on TBA.
While the news about mathematics is good, Goldhaber overstates how good it is. Charade offers three reasons why these gains are not as positive an indication about TBA as they might seem. First, both NAEP and the international PISA studies suggest that these gains don’t persist until high-school graduation. Second, the conventional estimates of the gains rely on the main NAEP, but the other two relevant data sources (the NAEP long-term trend assessment and the TIMSS) show much smaller improvements. Third, we have 30 years of evidence that many teachers responded to TBA by shifting instructional time from other subjects into math, so some unknown portion of the gain in mathematics likely reflects a shift of achievement among subjects rather than an overall increase in learning. Goldhaber notes the first of these caveats, albeit without acknowledging its implications for his positive portrayal of TBA’s effects. He ignores the other two.
I also argue that improved performance in math is the only large positive impact of TBA. Goldhaber’s response is to describe gains in reading in somewhat more positive terms than I do. I present the relevant data in Charade, so readers can decide for themselves which adjectives are closer to the mark. However, it seems clear that the reading trends represent a failure, given that reading has been one of TBA’s two primary targets for decades.
Widespread inappropriate test preparation. Three decades of research documents several types of pernicious test prep that have become pervasive in our schools. I provide concrete examples to illustrate how inappropriate some of this is. Goldhaber offers no rebuttal.
Score inflation. I explain that 25 years of research has shown that score inflation is common, that it is often very large, and that the limited research on its distribution suggests that both inflation and bad test preparation affect disadvantaged students more than others. Again, Goldhaber offers no rebuttal.
Corruption of the ideal of teaching. Increasingly, new teachers have been taught not only that they should engage in test prep—even forms of test prep that clearly produce bogus gains—but that doing so is good instruction. I make clear in Charade that this is the least well substantiated conclusion about the effects of TBA, but it is apparent in some of the most widely used teacher-preparation books, and it is a common complaint among young teachers. Goldhaber does not address this.
Widespread cheating. Everyone knows about the cheating scandal in Atlanta, but that was hardly an isolated occurrence. In Charade, a coauthor and I enumerate media reports of roughly 20 districts in which cheating has been confirmed and roughly 200 in which suspicious score patterns have been documented. This list is not exhaustive, of course, as there are no relevant national data.
I also argue that these documented accounts should be seen as the tip of the iceberg. There is no serious monitoring for cheating in most of the 13,000 school districts in the U.S., and some of the best-known cases came to light only because of the determined efforts of a few individuals outside the school system. Moreover, there are common forms of test prep that blend into cheating in that they can only produce fraudulent gains in performance.
Goldhaber does attempt to counter this argument, but not successfully. First, he notes that the District of Columbia, a district that I indicate may have experienced cheating, has shown substantial increases on NAEP. It has, but that has no bearing on the ongoing dispute about possible cheating. Some teachers could be cheating even while others are doing real teaching. Moreover, I present the D.C. case only as possible cheating, and one could simply delete D.C. from the long list of cases documented in Charade without weakening the argument. Second, he points to a well-known study by Brian Jacob and Steven Levitt (which I also cite in Charade) in which they estimated that 4 to 5 percent of classrooms in Chicago experienced cheating. Goldhaber ignores the fact that Jacob and Levitt themselves wrote that their method will understate the extent of cheating, and he doesn’t consider that the Chicago data were collected before NCLB ratcheted up pressure on schools. More important, the focus in Charade is the national prevalence of cheating, and the experience in one district or state tells us next to nothing about this. Chicago tells us no more about this than does Atlanta, where cheating was systematic, or Kentucky, where another study that I cite in Charade found that 9 to 36 percent of teachers reported various forms of cheating in their own schools.
In sum, Goldhaber does not successfully rebut any of the core arguments that undergird my conclusion that TBA has on balance failed.
Rather than rebut the evidence I present in Charade, Goldhaber repeatedly cites the District of Columbia to argue that TBA can work. I won’t wade into the ongoing debate about what has happened in DCPS, but I certainly don’t argue in Charade that there has been no variation in the effects of TBA across districts. Some have undoubtedly done better than the aggregate suggests, but that necessarily means that others have done worse than it suggests. My argument, which Goldhaber leaves intact, is that in the aggregate, despite decades of effort and refinements, TBA has done more harm than good.
While the evidence about the past effects of TBA is generally clear, I stress in Charade that we have far less evidence to guide the development of alternatives, and I also emphasize that many of the suggestions I make in the book may prove wrong or too difficult. I stress that there is ample room for debate about how to move forward and that regardless of who wins those debates, we will make mistakes.
That is the debate I hope Charade will promote. We need to face up to the findings of three decades of research on the effects of TBA and engage in a vigorous debate about how best to move forward, including discussion about how best to use standardized testing. It is disappointing that Goldhaber’s review does little to advance this necessary debate.
— Daniel Koretz
Daniel Koretz is the Henry Lee Shattuck Professor of Education at the Harvard Graduate School of Education. He is an expert on educational assessment and testing policy.