The Accountability Plateau
In Texas and across the nation, high-stakes testing regimes produced real gains for a few years, then flat-lined
Many educators and elected officials, including more than a few members of Congress, regard “No Child Left Behind,” the well-known moniker of George W. Bush’s 2001 education act, as a discredited “brand.” Indeed, the very acronym NCLB is about to be tossed into the dustbin of history in favor of its progenitor, ESEA (the Elementary and Secondary Education Act), or perhaps some new title yet to be devised on Capitol Hill. There are many reasons why NCLB has been discredited, including, to quote Kevin Carey, the “apocalyptic language out there, that standards and tests have ruined American public education, driven the best teachers out of the classroom, etc., etc.”
Yet, as the data presented below demonstrate, NCLB—and the accountability movement it embodied, codified, and symbolized—contributed to a major change in the performance level of American students in math. The data also suggest, however, that the accountability movement has likely reached a point of diminishing (or perhaps even no) returns. While moving on from NCLB is probably essential to produce further growth in student performance, “consequential accountability” was an important and meaningful education reform and ought not be dismissed as a failed initiative.
Debates over the effects and effectiveness of NCLB almost always revolve around national and state scores on the National Assessment of Educational Progress (NAEP). Not surprisingly, the release in November 2011 of the newest NAEP Mathematics and Reading Report Cards set off a new round of discussion about the impact of NCLB and accountability more generally. Given the ongoing fights surrounding the overdue reauthorization of ESEA/NCLB, the debate over the effects of accountability is more important now than ever.
Remember that NCLB’s system of consequential accountability (in which schools face cascading penalties for failure, e.g., replacement of the school’s principal, reconstitution, closure, etc.) was built upon the experience of many states that had already developed such systems before 2001. There is considerable agreement that states adopting consequential accountability before NCLB experienced more rapid growth in their test scores relative to non-adopting states. However, as Hanushek and Raymond note, as NCLB took hold, all states became “effectively consequential accountability states.” Perhaps not surprisingly, after NCLB, states that were new to the accountability regime experienced faster growth on NAEP assessments than states that had introduced their own accountability regimes before 2001.
The Case of Texas
Texas was one of the first states in the nation to adopt strict and consequential accountability. The Texas experience was fundamental to the framing of NCLB, as George W. Bush took the lessons and practices of Texas along with him when he moved from Austin to Washington. Thus, looking at the growth in NAEP scores in Texas relative to changes in the nation as a whole allows us to tease out some lessons about the effects of accountability on student performance and to speculate about the effectiveness of accountability past, present, and future.
As we look at these data, we should remember that, while NAEP is rightfully viewed as the “gold standard” of assessments, it is not the ideal instrument for detailed statements of cause and effect. We should further keep in mind one of the prime maxims of statistics: Correlation is not causation.
The Remarkable Growth in NAEP Math Scores
It is well known that, as measured by NAEP, American students have improved substantially in math (more in fourth grade than in eighth) and little in reading over the last two decades. Separate and apart from overall averages, there has been continuing concern for the level of skills among racial/ethnic minorities as well as concern for the effects of accountability on low- versus high-performing students (specifically, whether or not NCLB placed so much attention on low-performing students that high-performing students were neglected and suffered as a result). Looking at trends in Texas versus the nation presents some insights into these issues.
Consider Figure 1, which graphs the average scale scores on NAEP’s math assessment for fourth-grade students in Texas and in the United States as a whole. The growth in the performance of these students is nothing short of remarkable. Using the very rough rule of thumb that a 10-point change in NAEP scores equals about one year of learning, in 2011 our fourth graders are about two years ahead of where they were in 1992. But, as the figure shows, Texas and the nation marked their peaks of achievement at two distinct points in time.
In 1992, students in Texas were performing at the same level as the students in the nation. In the 1993-94 school year, Texas introduced its system of consequential accountability and, by the time of the next NAEP assessment in 1996, Texas fourth-graders had surpassed their peers nationwide. Between 1992 and 2000, math scores across the nation began to creep up; during the same period, a growing number of states began to adopt accountability systems.
By 2003, NCLB had turned every state into a consequential accountability state, and the rate of increase nationwide in math scores between 2000 and 2007 was remarkable. While Texas students continued to outperform the nation as a whole through 2007, the sharp uptick in national performance after 2000 narrowed the Texas lead substantially. Indeed, the last two assessments, in 2009 and 2011, show no significant difference between fourth graders in Texas and fourth graders nationwide.
We return to these overall patterns later, but first we turn to the performance of three groups of students who served as particular focal points of NCLB and the accountability movement more generally: blacks (Figure 2), Hispanics (Figure 3), and low-performing students (Figure 4), defined here by the cut score identifying those students performing at NAEP’s 10th percentile.
At the beginning of the series in 1992, black and Hispanic fourth-grade students in Texas scored slightly higher than their nationwide peers, while those low-performing students at the 10th percentile in Texas achieved at the same level as those at the 10th percentile nationally.
Between 1992 and 2000, the scores of Texas students in all three groups increased faster than those of their peers nationwide, with the size of the gap between students in Texas and the nation widening to well over 10 points for each group. Between 2000 and 2003, nationwide, the gains for students in each group increased dramatically but then slowed substantially. Gains among Texas fourth graders were sustained over a longer period of time, but also show evidence of little growth since 2005, with Hispanic and the lowest-performing students actually scoring lower in the latest assessments than in 2007.
The growth in fourth-grade math achievement represents one of the most significant success stories in contemporary American education. Again, the reader is reminded that, while correlation is not causation, the introduction of consequential accountability in Texas and then across the nation coincided with impressive spikes in the performance of students in fourth-grade math, and in particular among the students of most concern to NCLB and the accountability movement more generally.
NAEP test results for eighth-grade math represent a somewhat weaker reflection of this striking pattern (Figure 5). The first NAEP eighth-grade math assessment was in 1990, at which time Texas eighth graders lagged the nation by 5 points. That gap disappeared by 2000. By 2005, as the strong fourth-grade performers moved into the eighth grade and as the Texas system of consequential accountability continued to gain traction, Texas eighth graders moved past their national peers, producing a gap of 6 points. Whether eighth-grade test scores can continue to grow, given the flattening scores at the fourth grade, is something that remains to be seen.
Among black and Hispanic eighth graders, Texas students started at about the same place as their national peers in 1990. Over time, however, they experienced steady growth in performance, producing a widening gap with the nation. Indeed, the size of the gap for black students (in favor of Texas) has increased from 6 or 7 points before 2000 to 10 points in the last three assessments (Figure 6). The size of the gap in favor of Hispanic students in Texas has been somewhat more variable, and was not statistically significant before 2000 (Figure 7). But this gap has grown to over 10 points in the last three assessments. Similarly, the cut score defining the 10th percentile has risen more rapidly in Texas than in the nation as a whole (Figure 8), with the difference becoming statistically significant in 2000 and almost doubling in size from 2000 (7 points) to the latest assessment in 2011 (13 points).
A frequent criticism of the accountability movement and NCLB was that the focus on racial and ethnic minorities and on the lowest-performing students led to a neglect of the nation’s highest-performing youngsters.
Here we define high-performing students as those performing at NAEP’s 90th percentile. Fourth-grade math scores for these students both in Texas and in the nation display sharp increases since 1992 (Figure 9). The cut score for the top performers nationwide stood at 259 in 1992 and steadily rose to 276 in 2011, a gain of 17 points. The highest-performing fourth graders in Texas saw a correspondingly large jump in cut scores from 256 in 1992 to 273 in 2005. (Interestingly, half of that gain occurred between the assessments immediately preceding and following implementation of the state’s accountability system in 1993-94.) Since 2005, however, there has been no statistically significant change in cut score for those Texas youngsters, although the national cut score for high performers has continued to rise, producing a statistically significant difference (to the disadvantage of Texas) in the two most recent administrations of NAEP.
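The arithmetic behind these 90th-percentile comparisons can be checked with a minimal Python sketch. The cut scores are the four values quoted in the paragraph above; the dictionary layout and the `gain` helper are illustrative assumptions, not part of any NAEP tool.

```python
# 90th-percentile cut scores for fourth-grade math, as quoted in the text.
# The data structure and the helper name `gain` are hypothetical.
cut_scores = {
    "nation": {1992: 259, 2011: 276},
    "Texas": {1992: 256, 2005: 273},
}


def gain(series):
    """Return the change in cut score from the earliest to the latest year."""
    first_year, last_year = min(series), max(series)
    return series[last_year] - series[first_year]


if __name__ == "__main__":
    for place, series in cut_scores.items():
        years = sorted(series)
        print(f"{place}: {gain(series)} points ({years[0]} to {years[-1]})")
```

Both series show the same 17-point rise, though Texas reached it by 2005 while the national figure kept climbing through 2011.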
Eighth-grade math scores among the highest performers also improved substantially over the period, gaining 14 points nationally and 17 points in Texas (Figure 10). The sharpest gains for these high-performing eighth graders in Texas were between 2000 and 2005, building on the improvement made in math by Texas fourth graders four years earlier. Gains continued thereafter at somewhat slower rates, likely reflecting the slower growth in fourth-grade math skills.
The growth in NAEP scores of the highest-performing students in Texas and the nation essentially mirrors the gains made by student groups that were focal to the policy goals of NCLB. Whatever changes were aimed most directly at specific target populations apparently spilled over to lift the performance of high performers as well. And just as we saw evidence of diminishing effectiveness in recent years for average, minority, and low-performing students, there is evidence that the spillover effects of accountability on high-performing students are also wearing thin. The recent absence of growth in Texas fourth-grade math skills among these high-performing students may portend the end of a remarkable period of growth among the highest performers in the second-largest state in the union.
The Disappointing Case of Reading Scores
The improvements in NAEP math scores were an unquestionable success for America’s fourth and eighth graders and even more so for students in Texas. However, neither the nation as a whole nor Texas has done nearly as well improving students’ reading skills. Figure 11 shows no significant difference between the reading scores of fourth-grade students in Texas and in the nation as a whole, except in 2003, and minimal improvement across the board. And Texas’s eighth graders have significantly lagged the nation since 2003: by 2 points in 2007 and by 4 to 5 points in every other assessment between 2003 and 2011 (Figure 12).
Accountability and NCLB Were a Success, But…
In 1972, Stephen Jay Gould and Niles Eldredge proposed a theory of evolutionary change that emphasized what they termed “punctuated equilibrium.” Their core insight was that complex systems will exist in long periods of stasis. Rather than coming in small incremental steps, change is often characterized by abrupt radical transformations caused by events external to the existing system. Perhaps the most dramatic example is the relatively sudden disappearance of dinosaurs associated with a meteor crashing into the Earth and changing the climate. As a result, the dinosaurs’ long reign was replaced by a new equilibrium dominated by mammals.
In 1993, political scientists Frank Baumgartner and Bryan Jones introduced this theory to the study of public policy, and it has since become a common lens through which to view change in social systems. Baumgartner and Jones argued that policy generally changes only incrementally, until some event, such as a change in party control of government or a sizable shift in public opinion, leads to large policy alterations. In their approach, large changes in external conditions (what Baumgartner and Jones term an “exogenous shock”) are often needed to produce change in complex social and political systems.
The pattern of test scores in Texas and the nation suggests that consequential accountability (adopted early by Texas, then by more states, and finally by the nation as a whole) was a shock to the U.S. school system that altered the ecosystem and led to a different outcome than had existed before. Over a relatively short period, math performance in fourth and eighth grade abruptly shifted to higher levels. For example, between 2000 and 2005, the five years spanning the introduction of accountability via NCLB, the average math scale score nationwide at the fourth grade rose by 12 points, roughly a year of learning. In the same period, the average scale score for black fourth graders rose by 18 points, for Hispanic students by 17 points, and the cut score defining the 10th percentile of performance increased by 16 points. The corresponding changes among eighth-grade math scores are small only in comparison: 6 points nationwide, 11 points for black students, 10 points for Hispanic students, and 8 points for those students at the 10th percentile.
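To put these gains on a common footing, the following Python sketch converts each reported point change into approximate years of learning, using the very rough rule of thumb cited earlier (about 10 NAEP points per year). The point gains are those quoted in the paragraph above; the function and variable names are illustrative assumptions, and the conversion is only as reliable as the rule of thumb itself.

```python
# Very rough rule of thumb from the article: ~10 NAEP points per year of learning.
POINTS_PER_YEAR = 10.0

# Math gains between 2000 and 2005, in NAEP scale points, as quoted in the text.
# The dictionary layout and names are hypothetical.
gains_2000_2005 = {
    "grade 4": {"all students": 12, "black": 18, "Hispanic": 17, "10th percentile": 16},
    "grade 8": {"all students": 6, "black": 11, "Hispanic": 10, "10th percentile": 8},
}


def points_to_years(points, points_per_year=POINTS_PER_YEAR):
    """Convert a NAEP point gain into approximate years of learning."""
    return points / points_per_year


def years_table(gains):
    """Apply the conversion to every group in every grade."""
    return {
        grade: {group: points_to_years(pts) for group, pts in groups.items()}
        for grade, groups in gains.items()
    }


if __name__ == "__main__":
    for grade, groups in years_table(gains_2000_2005).items():
        for group, years in groups.items():
            print(f"{grade}, {group}: about {years:.1f} years of learning")
```

Read this way, the fourth-grade gains for black, Hispanic, and 10th-percentile students each amount to well over a year and a half of learning in five years, while the eighth-grade gains range from roughly half a year to just over one year.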
To be sure, an important lingering issue is the absence of growth in reading scores in Texas and in the nation as a whole. Many have argued that the foundation for reading, compared to math, is far more dependent on what happens early in children’s lives—before they enroll in school—and that improving reading skills is therefore much harder to accomplish. Whatever the explanation, clearly the absence of growth reflects a failure of the accountability “meteor” to affect reading levels in a fundamental way.
There is one final pattern to note: As would be expected when viewed through the punctuated-equilibrium lens, once the disruption of consequential accountability has wrung all changes out of the system, a new stasis should take hold. Indeed, Texas, an early adopter, led the nation to higher scores and seems to be ahead of the nation in reaching a new plateau where changes are minimal compared to what came in response to the introduction of an accountability system. The nation, which lagged Texas in adopting accountability, now seems to be entering a period of little change in test scores.
In the 1990s and early 2000s, accountability was an exogenous shock that produced radical gains in math if not in reading. But we now need a new shock to prevent a prolonged period of stasis and stagnation. Scanning the heavens for the next meteor, the most likely candidates to come crashing into the school ecosystem are the Common Core and the better measurement of teacher performance. If the United States is lucky, one or both of these shocks will produce yet another major uptick in math scores. If we are really lucky, these shocks will produce upticks in reading and other subject areas as well.
Mark Schneider, a former commissioner of the National Center for Education Statistics, is a vice president at American Institutes for Research and visiting scholar at the American Enterprise Institute. This article was commissioned and also published by the Fordham Institute.