Accountability Incentives

“Take out your classes’ latest benchmark scores,” the consultant told them, “and divide your students into three groups. Color the ‘safe cases,’ or kids who will definitely pass, green. Now, here’s the most important part: identify the kids who are ‘suitable cases for treatment.’ Those are the ones who can pass with a little extra help. Color them yellow. Then, color the kids who have no chance of passing this year and the kids that don’t count—the ‘hopeless cases’—red. You should focus your attention on the yellow kids, the bubble kids. They’ll give you the biggest return on your investment.”
—Jennifer Booher-Jennings, “Rationing Education in an Era of Accountability,” Phi Delta Kappan International (June 2006)

Increasingly frequent journalistic accounts report that schools are responding to No Child Left Behind (NCLB) by engaging in what has come to be known as “educational triage.” Although these accounts rely almost entirely on anecdotal evidence, the prospect is of real concern. The NCLB accountability system divides schools into those in which a sufficient number of students score at the proficient level or above on state tests to meet Adequate Yearly Progress (AYP) benchmarks (“make AYP”) and those that fail to make AYP. The system gives no credit to schools for moving students closer to proficiency or for advancing already-proficient students. If schools intent on meeting minimum competency benchmarks practice educational triage, they dedicate a disproportionate amount of their limited resources to “bubble kids,” students who might otherwise perform just below the proficiency threshold. While these marginally performing students are likely to benefit from increased attention, reallocation of instructional attention leads to a tradeoff whereby the achievement gains of the marginally performing students come at the expense of both the lowest- and highest-performing students.

With congressional proceedings on NCLB’s reauthorization under way, the time is opportune to take a hard look at educational triage claims. If the current law’s minimum competency standard produces gains among students near the proficiency threshold but disadvantages others, the rules of the accountability system need to be modified, perhaps to reward improvements across the entire achievement distribution.

To search for evidence of educational triage, I analyzed three years of test-score and other data on 300,000 students in public schools in a western state. I found none. I concluded that these schools were not responding to NCLB by trading off achievement among students with different baseline levels. Rather, they were successful at raising the performance of students who were otherwise at risk of failing the state test without sacrificing the performance of lower- and higher-performing students. Even in failing schools, students above the proficiency threshold made gains that were greater than one would expect if schools were concentrating resources on students near the threshold. When academic achievement is measured with test-score performance in this state, the much-politicized argument that NCLB compromises the educational needs and opportunities of high-performing, academically accelerated students holds no water.

THE STATE’S ACCOUNTABILITY PROGRAM

The U.S. Department of Education in 2003 approved the state’s accountability plan, which was designed to meet federal guidelines and regulations associated with NCLB. The plan requires all public schools in the state to meet proficiency standards in math and reading for all students and for each of 10 student subgroups, and to test a minimum of 95 percent of students in each subgroup to avoid sanctions. The accountability program measures students’ content knowledge and skills using an Internet-enabled testing system developed by the Northwest Evaluation Association (NWEA), a national nonprofit organization that provides assessment products and related services to school districts. NWEA compares spring assessment results to grade-specific benchmarks to gauge whether individual students, subgroups of students, and schools meet the state’s proficiency standards.

The particular demographic characteristics of this state limit the generalizability of my findings to other, more demographically diverse states. The state is disproportionately white and rural, with much smaller than typical schools and districts. Approximately 83 percent of students are white, 12 percent are Hispanic, and the remaining 5 percent are black, Asian, Pacific Islander, American Indian, or Native Alaskan. Roughly 40 percent of students were identified as economically disadvantaged based on their eligibility for free and reduced-price lunch. Schools that did not make AYP have a higher percentage of Hispanic students than schools that did (18 percent vs. 10 percent) and a higher percentage of students eligible for free and reduced-price lunch (48 percent vs. 39 percent). Despite the disadvantage of atypical demographics, this state offered the unique advantage of being able to measure achievement gains within the same school year. The state tests each student twice per year, permitting measurement of individual students’ fall-to-spring test-score gains.

DATA

Data in this study are from the NWEA Growth Research Database. Starting with the 2002–03 school year, NWEA administered tests in mathematics, reading, and language arts to more than 90 percent of the state’s students. NWEA furnished fall and spring test scores for the first three years after enactment of the state’s accountability program (2002–03 through 2004–05 school years) for students in grades 3 through 8. My analysis focuses on math scores. The statewide percentage of students scoring in the proficient and advanced categories in math has ranged from a low of 53 percent for 8th graders in 2003 to a high of 90 percent for 4th graders in 2005.

The NWEA data set also provides demographic information about students, including the student’s grade in school, gender, race, ethnicity, and eligibility for free or reduced-price lunch. School-level characteristics include school type and school size. Although NWEA gathers data for students in traditional public schools, charter schools, and private schools, I limited the study to students enrolled in traditional public schools or public charter schools because private schools are not included in the state’s accountability program. I removed from the study sample very small schools, those with fewer than 34 students being tested, given their systematically different treatment under the state’s accountability system.

IDENTIFYING EDUCATIONAL TRIAGE

My objective was to detect shifts in how schools committed resources to different students: not only textbooks and teachers, but also inputs such as teacher attention and the choice of curriculum and instructional strategies. Obviously, no formal accounting system tracks the distribution of resources directed at individual students. So I turned to an indirect measure of resource allocation. I inferred the priorities of administrators and teachers from educational outcomes, as measured by student performance on the state’s math test. If there was a greater-than-expected increase in the achievement of students just below the state’s proficiency standard, and this occurred in tandem with a less-than-expected increase in the achievement of high- and low-performing students, then I could conclude that educational triage had transpired.
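The inference rule described above can be written down as a simple check. This is only a minimal sketch of the logic, not the statistical model estimated in the study: the function name `shows_triage`, the three-group split, and the sample values are all hypothetical, and gains are assumed to be expressed relative to an expected gain of zero.

```python
def shows_triage(low_gain, bubble_gain, high_gain, expected=0.0):
    """Return True if the pattern of gains matches educational triage.

    All arguments are average gains for a group of students, measured
    relative to `expected` (assumed here to be 0.0 by default):
      low_gain    - students far below the proficiency threshold
      bubble_gain - students just below the threshold ("bubble kids")
      high_gain   - students well above the threshold
    """
    # Triage requires a greater-than-expected gain near the threshold...
    bubble_above = bubble_gain > expected
    # ...in tandem with less-than-expected gains in both tails.
    tails_below = low_gain < expected and high_gain < expected
    return bubble_above and tails_below


# Illustrative values only; these are not figures from the study.
print(shows_triage(low_gain=-0.10, bubble_gain=0.30, high_gain=-0.10))
print(shows_triage(low_gain=0.20, bubble_gain=0.10, high_gain=0.00))
```

Both conditions must hold at once: a large gain near the threshold does not by itself indicate triage if students in the tails also meet or exceed expectations.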

Is it reasonable to expect administrators and teachers to be able to identify students who are likely to be on the cusp of the proficiency threshold at the spring test administration? In states that use NWEA assessments, the answer is yes. NWEA furnishes classroom teachers and building principals with proficiency reports for each student within days of the fall test administration. The reports include a projection of each student’s performance on the spring test.

Consider a hypothetical example of the distribution of changes in test scores under educational triage. In Figure 1, the y-axis is the amount of growth in a student’s test score from the fall to the spring test administration. The x-axis identifies a student’s distance from the state-defined proficiency threshold. The vertical line in the middle of the graph is the threshold a student needs to cross to be considered proficient. The farther a student lies below the performance threshold in the fall, the more likely that student is to fail the spring assessment. The inverted “V” depicts the simplified pattern of gains one would expect to see if a school disproportionately targets resources, such as instructional time and teacher focus, to students particularly important to its accountability rating, that is, to students hovering around the state-defined proficiency threshold. If this practice were the case, the greatest fall-to-spring achievement gains would occur among students around the threshold, while other students would struggle to match expected test-score gains.
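The inverted “V” can be sketched as a function of a student’s distance from the threshold. This is a stylized illustration only: the function name `triage_expected_gain` and the `peak` and `slope` values are assumptions chosen for the example, not quantities taken from Figure 1.

```python
def triage_expected_gain(distance, peak=0.2, slope=0.01):
    """Stylized gain under educational triage (hypothetical parameters).

    distance: a student's fall distance from the proficiency threshold
              (negative = below the threshold, positive = above it).
    The gain is largest at the threshold and falls off linearly, and
    symmetrically, as the student moves away from it in either direction.
    """
    return peak - slope * abs(distance)


# Gains shrink toward both tails of the achievement distribution.
for d in (-30, -10, 0, 10, 30):
    print(d, triage_expected_gain(d))
```

A real pattern need not be symmetric or linear; the point is only that, under triage, gains peak at the threshold and decline toward both tails.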

My basic strategy, then, was to compare fall-to-spring test-score changes among students who were expected to be either nearer or farther from the state-defined proficiency threshold following spring testing. A question of particular interest was whether schools that failed to make AYP in the previous school year responded strongly to the incentive to target instruction to students at risk of falling just short of the proficiency threshold.

STATISTICAL CONTROLS

The first step in my statistical analysis was to rank all the students within the same grade and year by their performance on the fall exam. I then divided the students into 20 groups of equal size and calculated a standardized test-score change for each student that measured their respective performance relative to students within the same performance group. Use of standardized test scores helps address what statisticians call reversion to the mean. When one takes repeated measures of some event or behavior, such as test-score performance among students, the measurements at the low and high ends of the resultant distribution tend over time to converge toward the average value for the population under study. With respect to students and test scores, reversion to the mean suggests that students with scores in the upper or lower tail of the test-score distribution are likely to perform closer to the average when tested more than once. This effect may mask a school’s actual response to the threat of failing AYP by producing the illusion that schools are helping low-performing students while neglecting high-performing students. By comparing each student’s gain to gains among students who performed at a similar level and would have experienced a similar, natural shift toward the average score, I can better separate legitimate test-score gains and losses from change associated with mean reversion.
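The grouping and standardization step might look roughly like the following. This is a sketch under stated assumptions, not the code used in the study: the function name `standardized_gains`, the input layout (one `(fall, spring)` pair per student within a single grade and year), and the use of the population standard deviation are all choices made for illustration.

```python
from statistics import mean, pstdev


def standardized_gains(students, n_groups=20):
    """Standardize each student's fall-to-spring gain within a fall-score group.

    students: list of (fall_score, spring_score) pairs for one grade and year.
    Returns a list of z-scores in the same order as the input.
    """
    n = len(students)
    # Rank students by fall performance.
    order = sorted(range(n), key=lambda i: students[i][0])
    z_scores = [0.0] * n
    for g in range(n_groups):
        # Slice the ranking into n_groups groups of (nearly) equal size.
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not members:
            continue
        gains = [students[i][1] - students[i][0] for i in members]
        mu, sd = mean(gains), pstdev(gains)
        for i, gain in zip(members, gains):
            # Compare each gain only with gains of similar fall performers.
            z_scores[i] = (gain - mu) / sd if sd > 0 else 0.0
    return z_scores


# Made-up scores: 40 students, so each of the 20 groups holds two students.
sample = [(200 + i, 200 + i + (i % 4)) for i in range(40)]
print(standardized_gains(sample)[:4])
```

Because each gain is compared only with the gains of students who started at a similar level, any drift toward the average that is shared by the whole group is subtracted out with the group mean.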

I also made my best effort in the statistical analysis to isolate the change in test scores that could be attributed reasonably to the resources schools dedicate to teaching students. Specifically, I separated out the effects of a student’s race and ethnicity on test-score gains, and I accounted for the influence of a student’s peers by evaluating the demographic characteristics of the student body, including average income level and percentage of minority students. Given a large data set, I also was able to account for characteristics of schools that I could not directly measure but that might influence student achievement over the school year. Finally, I took precautions to keep changes in test difficulty from year to year, for each grade and for students in a given school, from shaping the results.

RESULTS

Despite many media claims of educational triage, I found no evidence of failing schools engaging in coordinated targeting of students near the state-defined proficiency threshold. In the state under study, public schools that had failed to make AYP focused instruction on the entire range of low-performing students in the subsequent school year, and did so without negative impact on high-performing students.

Figure 1b shows the changes in standardized test scores, across the full range of student performance, that can be attributed reasonably to teacher and school performance and to decisions about how the school allocates resources among students. In schools that failed to make AYP in the previous year, students who were expected to fall well below proficiency gained more than students nearest the proficiency threshold. The lowest-performing students gained about 0.20 standard deviations, roughly twice the improvement of those students whose expected gains were to leave them just below proficiency. Students expected to be proficient did not lose ground; the most advanced students performed comparably to other already-proficient students.

In schools that did make AYP, lower-performing students met expectations, with the largest of those gains coming from students expected to perform the weakest. It is interesting to note, in contrast, that higher-performing students in these schools lost ground from one year to the next. Remarkably, proficient students enrolled in failing schools experienced larger test-score gains than proficient students in non-failing schools. Remember that these patterns of gains and losses would look much different if schools did indeed engage in educational triage. Under educational triage, students near the proficiency threshold would attain the largest gains, while students dispersed away from this threshold and toward the tails of the achievement distribution would suffer diminished performance.

CONCLUSION

Although there is no evidence that schools in the study sample targeted resources to particular students, they may have allocated resources toward outcomes measured by the accountability system. For instance, schools may have taught to the tests in math and reading while neglecting science, social studies, the arts, and physical education. The apparent absence of educational triage in one state does not invalidate documented accounts of the practice in particular schools, nor encroach upon other arguments to modify NCLB’s proficiency-based school rating system. Nonetheless, as the reauthorization debate continues, policymakers should take note that educational triage was not evident in the first statewide analysis of the issue.

Matthew G. Springer is research assistant professor of public policy and education at Vanderbilt University’s Peabody College and director of the federally funded National Center on Performance Incentives.
