Teacher evaluation systems show a stubborn tendency to rate nearly all teachers as effective. This pattern has persisted even as most states have invested in rigorous new "multiple measures" teacher evaluation systems. The primary reason for the lack of variation in teacher evaluation scores in most systems is that teachers tend to score overwhelmingly positively on the instructional or professional practice portion of the evaluation, which usually makes up half or more of the overall evaluation score. Much as in the olden days when teacher evaluations were simple checklists, the principals who assign these subjective ratings just don't differentiate much.
But why do principals give their teachers such high ratings? Do they really think all of their teachers are above the bar? Or do they know that some of their teachers are less effective, but they don’t reflect this assessment in their personnel ratings?
In a new study just out in Education Finance and Policy, Susanna Loeb and I provide some empirical evidence on the answers to these questions. Like some prior studies of principals’ ratings, we asked principals to rate a few of their teachers on different dimensions of their practice in a research setting—in our case, during late March 2012 interviews with about 100 principals in Miami-Dade County Public Schools, the fourth-largest school district in the United States. These ratings were “low-stakes” in the sense that no one except the researchers would know what scores the principals gave. The hope is that, with no consequences attached, principals would provide us with something close to their true assessments of their teachers’ performance.
Then we took another step. Sometime later, the district provided us with the “high-stakes” personnel ratings those principals gave those same teachers just a few weeks after the interviews. These data were collected as part of a broader school leadership study funded by the Institute of Education Sciences. So we were able to compare the two sets of ratings.
In both cases, the ratings were very positive; principals gave high marks, on average, to their teachers' performance, regardless of who was asking. However, they made use of the lower ends of the rating scales much more often when rating only for the researchers. For example, when we asked principals in the interviews to give teachers an overall instructional rating, they gave about 15 percent of teachers a score in the "ineffective" range. Yet on the high-stakes personnel evaluation, virtually every teacher was rated a 3 ("effective") or 4 ("highly effective"), even if the principal had expressed reservations about the teacher in the interview. In fact, fewer than 3 percent of teachers received a score below "effective" on even one of the seven evaluation standards on the district's rating form.
But here's something that was really interesting: nearly every teacher received a 3 or a 4 on every standard, yet which of those two ratings a teacher received seemed to contain information about the teacher's performance. Not only were those essentially binary ratings positively correlated with the interview ratings (r = 0.55), but they predicted teachers' value-added just about as well as the interview ratings did. In other words, principals may skew their evaluation ratings high, but they are not rating teachers at random. Teachers receiving the highest scores appear to be more effective than teachers merely receiving high scores, according both to the principals' "true" assessments and to student achievement growth.
We also examined what factors led a teacher to receive a higher personnel rating than the low-stakes rating would predict. We found that principals appeared to systematically inflate ratings for beginning teachers, which makes sense if they think those teachers deserve a little time to get the hang of teaching, or if they worry that those teachers have fewer job protections than other teachers. Principals assigned worse-than-predicted personnel ratings to teachers who were absent more often, which may suggest that principals place a premium on work effort in formal evaluations. Disturbingly, we also found that teachers of color—particularly Black teachers—scored worse on the personnel rating than predicted by the interview rating.
So what is the punch line here? Principals do differentiate their teachers’ performance, but formal personnel ratings don’t reflect this differentiation. If evaluation systems are going to be a source of formative feedback for teachers, or if we are going to rely on evaluation systems to identify low performers for remediation or dismissal, policymakers presumably have an interest in bringing principals’ formal assessments closer to their internal ones. If that’s the goal, there is much more we need to understand about the cognitive and relational processes surrounding principals’ evaluation decisions.
— Jason A. Grissom
Jason A. Grissom is Associate Professor of Public Policy and Education at Vanderbilt University.