I remember my high school English teacher, Mrs. Askew, and band director, Jimmy Larkin—both were extraordinarily good at their jobs. A couple of my teachers were so bad I recall their names too, but on the slim chance they’re still alive or have relatives who might read this, they’ll remain anonymous. The rest have faded from my memory as thoroughly as the inscriptions on the colonial headstones in the cemetery at the end of my street. All of my public school teachers—the good, the bad, and the easily forgettable—were fully credentialed and would have been deemed highly qualified under federal law had they lasted in the profession until the onset of No Child Left Behind (NCLB).
Last week, my colleagues and I released a report on how new teacher evaluation systems deployed in four urban school districts are performing in their goal of reliably and validly differentiating teachers on the basis of measurable differences in classroom performance, the kind of differences I imagine would have distinguished Mrs. Askew and Mr. Larkin from those whose names I can’t forget but won’t divulge.
These new systems depend primarily on two types of measurements: student test score gains on statewide assessments in math and reading in grades 4-8 that can be uniquely associated with individual teachers; and systematic classroom observations of teachers by school leaders and central staff. All teachers are subjected to evaluation by classroom observations, whereas those who teach reading and math in the tested grades are also subjected to evaluation through test score gains (so-called value-added).
Our report concluded that, in general, the evaluation systems we examined do a decent job of distinguishing teachers based on characteristics of classroom performance that predict how teachers will perform in subsequent years. And these evaluation systems are strikingly better than what they replaced: slapdash approaches involving a couple of classroom visits by a building principal for some teachers in some years that resulted in virtually all teachers being classified as high performing.
At the same time, we identified flaws in the evaluation systems that need correction. The most troublesome of these is a strong bias in classroom observations that leads to teachers who are assigned more able students receiving better observation scores. The classroom observation systems capture not only what the teacher is doing, but also how students are responding. This makes the teacher’s classroom performance look better to an observer when the teacher has academically well-prepared students than when she doesn’t. To illustrate, when we divided teachers into five equal-sized groups based on the average prior academic achievement of their incoming students, we found that roughly three times as many of the teachers with the least prepared incoming students (29%) were identified as low performing based on classroom observations as teachers with the most prepared incoming students (11%).
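For readers who want to see the mechanics of that comparison, here is a minimal sketch. It is not the code behind our report: the ten teachers and their ratings are invented, and the function name `low_rating_share_by_quintile` is hypothetical. It sorts teachers by the average prior achievement of their incoming students, splits them into five equal-sized groups, and reports the share of each group rated low performing.

```python
def low_rating_share_by_quintile(teachers):
    """Share of teachers rated low performing in each of five equal-sized groups.

    teachers: list of (avg_prior_achievement, rated_low) pairs.
    Groups are ordered from least to most prepared incoming students.
    """
    ordered = sorted(teachers, key=lambda t: t[0])
    n = len(ordered)
    shares = []
    for q in range(5):
        group = ordered[q * n // 5:(q + 1) * n // 5]
        shares.append(sum(1 for _, low in group if low) / len(group))
    return shares


# Invented data: (average prior achievement of incoming students, rated low?)
teachers = [
    (0.8, False), (-1.2, True), (0.3, False), (-0.4, True), (0.0, False),
    (-0.9, False), (1.1, False), (-0.6, False), (0.5, True), (-0.1, False),
]

shares = low_rating_share_by_quintile(teachers)
for label, share in zip(["lowest", "second", "middle", "fourth", "highest"], shares):
    print(f"{label} quintile: {share:.0%} rated low performing")
```

With real data, the figures reported above correspond to the first and last entries of the returned list.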
Since the release of our report, my co-authors and I have received inquiries about how our findings fit into the accelerating spate of legal actions aimed at teacher evaluation systems similar to the ones we examined. Unions in Florida, New York, New Mexico, Colorado, and Tennessee have filed suits, as has a group of individual teachers in Houston, backed by a union. All of the plaintiffs claim that teachers are subject to dismissals and employment actions based on seriously flawed evaluation methods.
Interestingly, all of these legal actions target the value-added component of teacher evaluation rather than classroom observations. Yet our study found no glaring defects in the value-added component beyond the sizeable gap between idealized systems that are perfectly reliable and valid and real systems for judging human performance in complex settings, which invariably have appreciable margins of error. In other words, whereas consequential decisions to retain, promote, and compensate teachers based on their value-added to student achievement surely involve error, the districts we studied are using state-of-the-art value-added systems that minimize such error as much as is presently possible. Other than the inclusion of school-wide value-added in the evaluation scores of individual teachers, we found nothing obvious that needs to be fixed in the districts’ use of value-added.
In contrast, we found several serious and fixable problems in our districts’ classroom observation systems, including the one described above. Should the unions that are suing over new teacher evaluation systems redirect their complaints toward classroom observation? After all, we’ve shown that teachers who are assigned poorly prepared students get lower classroom observation ratings than teachers who are assigned high-achieving students, and we’ve said that is unfair and needs to be corrected.
Making classroom observations the bull’s-eye of legal action isn’t going to help the current crop of plaintiffs very much. The reason is that unjustified dismissal actions are the foundation of their claims. But in the districts we examined, only teachers at the very tail end of the distribution are dismissed because of their evaluation scores, and it turns out that teachers who get the very worst evaluation scores remain at the tail end of the distribution regardless of whether their classroom observation ratings are biased.
In our report, we introduced a method for adjusting for the bias in classroom observation scores by taking into account the demographic make-up of teachers’ classrooms. We demonstrated that a regression-based statistical correction for the proportion of the students in each teacher’s class who are English-language learners, have educational disabilities, are from low-income families, and so forth, wrings most of the bias out of classroom observations. It does so by boosting the ranking of teachers who are assigned more students whose family backgrounds and language and disability statuses are associated with lower academic achievement, much like the standard practice for scoring competitive diving, in which the raw score of the judges is multiplied by the degree of difficulty of the dive.
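The logic of such a correction can be sketched in a few lines. What follows is a minimal illustration, not the model used in our report: the scores and classroom proportions are made up, two covariates stand in for the longer list above, and the function names `solve` and `adjust_scores` are hypothetical. It fits an ordinary least squares regression of observation scores on classroom composition, then keeps the part of each score the regression cannot explain, re-centered on the average raw score.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def adjust_scores(scores, shares):
    """Remove the part of each observation score predicted by classroom composition.

    scores: raw observation score for each teacher
    shares: for each teacher, a list of classroom proportions
    Returns regression residuals re-centered on the mean raw score.
    """
    rows = [[1.0] + list(s) for s in shares]  # intercept plus covariates
    k = len(rows[0])
    # Normal equations for ordinary least squares: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, scores)) for i in range(k)]
    beta = solve(xtx, xty)
    mean = sum(scores) / len(scores)
    return [y - sum(b * v for b, v in zip(beta, r)) + mean
            for r, y in zip(rows, scores)]


# Invented data: six teachers, with the proportion of each class that is
# English-language learners and the proportion from low-income families.
scores = [3.4, 3.1, 2.9, 2.6, 2.4, 2.2]
shares = [[0.05, 0.10], [0.10, 0.30], [0.20, 0.35],
          [0.30, 0.60], [0.40, 0.55], [0.50, 0.80]]

adjusted = adjust_scores(scores, shares)
for raw, adj in zip(scores, adjusted):
    print(f"raw {raw:.2f} -> adjusted {adj:.2f}")
```

By construction, the adjusted scores have the same average as the raw scores and are uncorrelated with the classroom proportions used in the regression, so a teacher is no longer rewarded or penalized for the make-up of the class she was assigned.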
The question is whether teachers who were dismissed for low evaluation scores in the districts we studied would have received substantively different evaluation scores if their classroom observation scores had been adjusted as we recommend.
Over the years for which we have data, about four percent of the total teacher workforce was dismissed each year for low evaluation scores. Had the districts applied our statistical adjustment to the observation scores of these dismissed teachers, the fate of 15 percent of that four percent would have changed (about 0.6 percent, well under one percent, of the total teacher workforce).
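The arithmetic is simple enough to check directly. The snippet below only restates the two percentages given above; the workforce of 1,000 teachers is an assumed round number for illustration, and the function name is hypothetical.

```python
def share_of_workforce_affected(dismissal_rate, changed_share):
    """Fraction of the whole workforce whose dismissal outcome would change."""
    return dismissal_rate * changed_share


dismissal_rate = 0.04  # about four percent dismissed per year (from the text)
changed_share = 0.15   # 15 percent of those dismissals would change (from the text)

affected = share_of_workforce_affected(dismissal_rate, changed_share)
print(f"{affected:.1%} of the total teacher workforce")

# Assumed workforce of 1,000 teachers, for illustration only.
workforce = 1000
print(f"{workforce * dismissal_rate:.0f} dismissed, "
      f"{workforce * affected:.0f} with a changed outcome")
```

In a district of that assumed size, roughly 40 teachers would be dismissed in a year, and the adjustment would change the outcome for about 6 of them.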
Of course, if you were one of the dismissed teachers who would have survived for another year if your observation scores had been adjusted, this is a big deal. But from the point of view of the system as a whole and the interests of students, it is not. This is because the dismissed teachers who would have been retained using corrected observation scores would have just squeaked by. None of the upward movers gained more than about one percent in terms of total points available in their district’s evaluation system.
The bias in classroom observation systems that derives from some teachers being assigned much more able students than other teachers is very important to the overall performance of the teacher evaluation system. One of the consequences of it not being addressed is that teachers who understand how the system works and value high evaluation scores will do their best to be assigned to schools with high-ability students, and within schools will do their best to get assigned the best students. This is the exact opposite of the equitable distribution of teacher talent that the U.S. Department of Education (the driving force behind state adoption of the type of teacher evaluation systems we have studied) intends for these systems to accomplish.
But the bias in classroom observation is not a serious problem with respect to teacher dismissal. It introduces a small margin of error into personnel actions of a very small percentage of teachers, but it does not result in mischaracterization of their classroom performance. These are very weak teachers whose names their students may want to forget, but can’t.
-Grover J. “Russ” Whitehurst and Katharine Lindquist
This first appeared on the Brown Center Chalkboard.