In 25 years as an education researcher, I have never witnessed an outpouring of new research in education as rapid as the one we’ve seen in recent years. In what may be the most important byproduct of the No Child Left Behind Act and the Institute of Education Sciences’ grants to states, school agencies have been linking students to teachers and schools and tracking their achievement over time. Researchers across the country have been using those data to study the value of traditional teacher certification, the degree of on-the-job learning among teachers, the impact of charter schools, the effectiveness of teacher preparation programs, and more. Yet much of that work depends on a simple, often unstated, assumption: that the short list of control variables captured in educational data systems (prior achievement, student demographics, English language learner status, eligibility for federally subsidized meals or programs for gifted and special education students) includes the relevant factors by which students are sorted to teachers and schools. If they do, then researchers can effectively control for differences in the readiness to learn of students assigned to different teachers or attending different schools.
But do they? The answer carries huge stakes—not just for teachers. Where the assumption appears justified, our understanding of the effects of various education interventions will continue to expand rapidly, since the existence of longitudinal data systems dramatically lowers the cost of new research and allows for widespread replication in disparate sites. However, if the assumption frequently appears unjustified, then the pace of progress will necessarily be much slower, as researchers are forced to shift to expensive and time-consuming randomized trials. When should we believe program impacts based on statistical controls in the absence of a randomized trial? Even though it might seem like an innocuous statistical debate, nothing less than the pace and scope of U.S. education renewal is at stake.
In 1986, Robert J. LaLonde compared non-experimental estimates of a job training program’s impact on welfare recipients against the program impact measured by a randomly assigned comparison group (LaLonde 1986). The earnings impacts he estimated after using statistical methods to control for observed differences between participants and non-participants were quite different from those based on the randomized control group.[1] LaLonde’s findings and related replications have led to a generalized skepticism of non-experimental methods in the study of education and job training.
However, it’s possible that LaLonde’s findings have been generalized too widely. For instance, the process by which students are sorted to teachers (or schools) and the process whereby welfare recipients choose a training program are quite different. While the reasons underlying a welfare recipient’s choice remain largely hidden to a researcher, it is possible that school data systems contain the very data that teachers or principals are using to assign students to teachers. Of course, there are many other unmeasured factors influencing student achievement—such as student motivation or parental engagement. But as long as those factors are also invisible to those making teacher and program assignment decisions, our inability to control for them would lead to imprecision, not bias (since different groups of students would not differ systematically on these unmeasured traits).
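The distinction between imprecision and bias can be illustrated with a small simulation. What follows is only a sketch under stated assumptions: two hypothetical, equally effective teachers, made-up coefficients (0.7 on prior achievement, 0.5 on motivation), and a function name, `estimated_gap`, invented for this illustration. It also adjusts for prior achievement using the known coefficient instead of estimating it by regression.

```python
import random


def estimated_gap(sorted_on_unobserved, n_students=20000, seed=0):
    """Estimated effect of teacher B minus teacher A when the true gap is zero."""
    rng = random.Random(seed)
    adjusted = {"A": [], "B": []}
    for _ in range(n_students):
        prior = rng.gauss(0, 1)        # recorded in the data system
        motivation = rng.gauss(0, 1)   # never recorded
        # The person assigning students always sees prior scores; in the
        # second scenario that person also sees (and sorts on) motivation.
        signal = prior + (motivation if sorted_on_unobserved else 0.0)
        teacher = "B" if signal + rng.gauss(0, 1) > 0 else "A"
        # Both teachers are equally effective, so neither adds to the score.
        score = 0.7 * prior + 0.5 * motivation + rng.gauss(0, 1)
        # Simplification: adjust for prior achievement with the known 0.7
        # coefficient rather than estimating it by regression.
        adjusted[teacher].append(score - 0.7 * prior)

    def mean(values):
        return sum(values) / len(values)

    return mean(adjusted["B"]) - mean(adjusted["A"])


print(estimated_gap(sorted_on_unobserved=False))  # noise around zero
print(estimated_gap(sorted_on_unobserved=True))   # systematically positive
```

When motivation plays no role in assignment, the adjusted gap between the two teachers is just sampling noise around zero. When assignment depends on motivation, the same adjustment attributes a spurious advantage to teacher B.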
Given the practical difficulty of randomly assigning students to teachers or to schools, opportunities to replicate LaLonde’s benchmarking of non-experimental estimates against experimental estimates were rare in education until recently. For instance, several recent papers have exploited school admission lotteries to estimate the impact of attending a particular school in two ways: with lottery-based comparison groups, and with statistical controls applied to students attending different schools. Abdulkadiroglu et al. (2011) and Angrist, Pathak, and Walters (2013) found similar estimates of the impact of a year in a Boston-area charter school whether they compared charter school admission lottery winners and losers or compared charter attendees to regular public school students with similar observed characteristics. Deutsch (2012) also found that the estimated effect of winning an admission lottery in Chicago was similar to that predicted by non-experimental methods. Deming (2014) found that non-experimental estimates of school impacts were unbiased predictors of lottery-based impacts of individual schools in a public school choice system in Charlotte, North Carolina. Bifulco (2012) compared the impact of two magnet schools using a lottery-based control group and several non-experimental control groups. When students’ prior achievement was added as a control variable, the non-experimental methods generated impact estimates quite similar to the estimates based on random assignment.
To date, there have been five studies that have tested for bias in individual teacher effect estimates. Four of those (Kane and Staiger (2008), Kane, McCaffrey, Miller, and Staiger (2013), Chetty, Friedman, and Rockoff (2014a), and Rothstein (2014)) estimate value-added for a given teacher in one period and then form predictions of their students’ expected achievement in a second period. The primary distinction among the four studies is the source of the teacher assignments during the second period. In Kane and Staiger (2008), 78 pairs of teachers in Los Angeles working in the same grades and schools were randomly assigned to different rosters of students, which had been drawn up by principals in those schools. The teachers’ value-added from prior years, in which they had been assigned students based on the principals’ predilections, provided unbiased forecasts of student achievement during the randomized year. A limitation of the study is the small sample size.
A similar but much larger study was carried out by Kane, McCaffrey, Miller, and Staiger (2013). They measured teachers’ effectiveness using data from 2009-10 and then randomly assigned rosters to 1,591 teachers during the 2010-11 school year. The 2009-10 data covered a range of measures, including value-added, classroom observations, and student surveys. The teachers were drawn from six different school districts: New York City (NY), Charlotte-Mecklenburg (NC), Hillsborough County (FL), Memphis (TN), Dallas (TX), and Denver (CO). As with the smaller study, they found that teachers’ value-added in the year in which they were randomly assigned students was well predicted by the same teachers’ value-added scores in the previous year.
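The logic of these random-assignment forecast tests can be sketched in a toy simulation. This is not the procedure used in either study. Every parameter value, the function name `forecast_coefficient`, and the use of a known reliability to shrink the year-one estimates are assumptions made for illustration. In year one, stronger teachers receive better-prepared students; in year two, rosters are assigned at random. A slope near 1 means the prior-year measure forecasts randomized-year outcomes without bias.

```python
import random


def forecast_coefficient(control_for_prior, n_teachers=2000, class_size=25, seed=1):
    """Slope of randomized-year class outcomes on prior-year value-added."""
    rng = random.Random(seed)
    sd_effect, beta = 0.2, 0.7   # hypothetical values
    # Assumed known reliability of a class-mean residual: signal / (signal + noise).
    reliability = sd_effect**2 / (sd_effect**2 + 1.0 / class_size)
    xs, ys = [], []
    for _ in range(n_teachers):
        effect = rng.gauss(0, sd_effect)   # true teacher effect
        # Year 1: stronger teachers are given better-prepared students.
        class_prior_mean = 2.0 * effect + rng.gauss(0, 0.3)
        total = 0.0
        for _ in range(class_size):
            prior = rng.gauss(class_prior_mean, 1)
            score = beta * prior + effect + rng.gauss(0, 1)
            total += score - (beta * prior if control_for_prior else 0.0)
        xs.append(reliability * total / class_size)   # shrunken value-added
        # Year 2: random rosters, so prior-adjusted scores are effect plus noise.
        total = sum(effect + rng.gauss(0, 1) for _ in range(class_size))
        ys.append(total / class_size)
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


print(forecast_coefficient(control_for_prior=True))    # close to 1
print(forecast_coefficient(control_for_prior=False))   # well below 1
```

In this sketch, controlling for prior achievement removes the sorting that principals introduced in year one, so the forecast coefficient is close to 1. Without that control, the year-one measure overstates differences among teachers and the coefficient falls well below 1.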
Rather than use random assignment, Raj Chetty, John Friedman, and Jonah Rockoff watched what happened when teachers moved from school to school and from grade to grade (Chetty et al., 2014a). Using value-added estimates from the years before and after the change, they predicted changes in scores in a given grade and school based on changes in teacher assignments over the same time period. They asked, “What happens if a teacher with high value-added in grade 5 in school A moves to grade 4 in school B?” If the teacher’s high value-added in school A reflects her teaching ability, then the performance of students in grade 4 in school B should go up by the difference in effectiveness between her and the teacher she is replacing. But if the teacher’s value-added in school A reflects some factor unrelated to her teaching ability, such as the quality of students she was assigned, then her transfer to school B should not have an impact. In fact, Chetty and his colleagues were able to predict performance changes based on teacher transfers, demonstrating that teachers’ value-added scores reflect their true impact on student learning.
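The teacher-switching test can be written compactly. The notation below is introduced here only for illustration and is not taken from the paper: let $\bar{A}_{sgt}$ be mean achievement in school $s$, grade $g$, and year $t$, and let $\bar{Q}_{sgt}$ be the mean value-added of the teachers assigned to that school, grade, and year, estimated from other years’ data.

```latex
\Delta \bar{A}_{sgt} = \alpha + \lambda \, \Delta \bar{Q}_{sgt} + \varepsilon_{sgt},
\qquad
\Delta \bar{A}_{sgt} = \bar{A}_{sgt} - \bar{A}_{sg,t-1},
\quad
\Delta \bar{Q}_{sgt} = \bar{Q}_{sgt} - \bar{Q}_{sg,t-1}
```

Under this sketch, $\lambda \approx 1$ is what one would expect if value-added reflects teaching ability, since scores then move one-for-one with the change in staff effectiveness. A value of $\lambda \approx 0$ is what one would expect if value-added only reflected the students a teacher had been assigned.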
Rothstein (2014) recently replicated the Chetty et al. (2014a) findings using data from North Carolina. Using the same methodology, Rothstein also found that he could predict changes in student achievement based on changes in teacher assignments. However, Rothstein (2014) questioned the validity of the Chetty et al. approach, citing a relationship between changes in the value-added of teachers and changes in students’ scores in the year before the students are in the teacher’s class. He interprets such a correlation as evidence that teachers’ value-added merely reflects the preparedness of the students they are assigned.
In a recent paper, my colleagues and I find a similar relationship in data from Los Angeles, but we offer a different interpretation (Bacher-Hicks, Kane, and Staiger, 2014). Since teachers do switch grades from year to year, students’ baseline test scores in a given year and the value-added estimates from prior academic years are sometimes based on the same data. (That is, suppose Teacher A taught 4th grade last year and is teaching 5th grade this year. Suppose also that he or she is an unusually good teacher. Teacher A’s students could enter grade 5 this year with high scores partially because they had Teacher A last year when they were in 4th grade.) Moreover, there are statistical shocks to achievement in a given school and subject, such as a dog barking in the parking lot on the day of the test, which could also introduce a mechanical relationship between the value-added estimates from prior years and students’ baseline achievement this year (Kane and Staiger 2002). For these reasons, Chetty et al. left out a teacher’s data from year t and year t-1 when generating their value-added estimates. Rothstein reintroduced the problem by using prior-year scores as the dependent variable. When we account for such factors, we find that teachers’ value-added is not related to prior achievement, but continues to predict end-of-year achievement.
Glazerman et al. (2013) is the only team so far to use random assignment to validate the predictive power of teacher value-added effects between schools. To do so, they identified a group of teachers with estimated value-added in the top quintile in their state and district. After offering substantial financial incentives, they identified a subset of the high value-added teachers willing to move between schools and recruited a larger number of low-income schools willing to hire the high-value-added teachers. After randomly assigning the high value-added teachers to a subset of the volunteer schools, they found that student achievement rose in elementary schools, but not in middle schools.
In sum, there is now substantial evidence that value-added estimates capture important information about the causal effects of teachers and schools. Rarely in social science have we seen such a large number of replications in such a short period of time. Even more rarely have we seen such convergence in the findings. The application of statistical controls using longitudinal data systems often provides meaningful information regarding program impacts even without random assignment.
The latest findings have broad implications for education research. For instance, in many circumstances, we should ask researchers to provide promising evidence from non-experimental methods first before proceeding to more expensive clinical trials. We also need to gain a better understanding of the conditions under which such statistical methods can be expected to yield unbiased predictions of educational impacts. Therefore, even when random assignment has been deemed necessary and feasible, the Institute of Education Sciences should ask researchers to produce a comparison analogous to LaLonde’s: identify a non-experimental control group, identify the randomized control group, and compare the estimates yielded by the two approaches. (Researchers should also be encouraged to retrospectively construct non-experimental comparison groups for past randomized trials.)
In closing, I would offer two caveats:
First, we must narrow the range of statistical specifications which can be expected to yield valid predictions of teacher and school effects. For instance, although most of the studies cited above do include statistical controls for peer effects (the mean characteristics of other students in a class or school), very few states or districts do so when generating value-added estimates. We need more studies specifically designed to test for the importance of peer controls and other specification decisions.
Second, we know very little about how the validity of the value-added estimates may change when they are put to high stakes use. All of the available studies have relied primarily on data drawn from periods when there were no stakes attached to the teacher value-added measures. In the coming years, it will be important to track whether or not the measures maintain their predictive validity as they are used for tenure decisions, teacher evaluations and merit pay.
– Tom Kane
This first appeared on the Brown Center Chalkboard.
[1] Dehejia and Wahba (1999) later demonstrated that non-experimental methods performed better when using propensity score methods to choose a more closely matched comparison group.