Why Reforming Teacher Evaluation Has — and Hasn’t — Succeeded

Matthew Kraft, an assistant professor of education and economics at Brown University, has researched teacher evaluation reform extensively, through surveys of principals and multiple studies of state teacher-rating systems. Reform critics have used Kraft’s research in support of their position. FutureEd Director Thomas Toch spoke with Kraft to get his unfiltered perspectives on the teacher evaluation landscape.

How rigorous were teacher evaluations before the advent of new evaluation systems about a decade ago?

Prior to teacher evaluation reforms, most districts engaged in a largely perfunctory exercise of what I think teachers would often call the dog-and-pony show, where administrators would stop by the room with binary checklists that assessed whether certain kinds of classroom organization, general management-type functions of being a teacher were being met or not. In general, those ratings were completed and submitted for teachers’ files with no kind of exchange about the teachers’ performance in the classroom or feedback of any note.

They were rarely linked to any consequential decisions, whether that is pay or employment or just thinking about the best way to utilize teachers’ talents.

So we were losing the ability to take advantage of knowing who was doing a good job in the classroom and who wasn’t even though we were spending upwards of $400 billion a year on teacher compensation.

Principals informally judged teachers to be more or less effective and made personnel decisions informed by their informal assessments. But it was not systematized in any way, or necessarily based on what actually matters for learning.

What did you learn in your surveys of principals?

They recognized the old system was broken, that it needed to be reformed. The ideas of the evaluation reforms rang true to them, but the actual implementation on the ground proved to be quite problematic for principals.

In what ways?

We’ve tasked principals with dramatically increasing the frequency of evaluations and doing so in a much more rigorous way, without any changes to their other responsibilities. So the work has been overwhelming to many of them. Many principals lack the content or grade-level expertise to provide specific suggestions to many teachers in their building.

They also spoke about the challenge of helping teachers to transition to a new process, veteran teachers who have been told for 15 years that they were meeting standards when, in fact, they may not have been.

Based on your survey of state systems, what do you make of the charge that these reforms have had little impact because many teachers continue to be rated proficient under the reformed systems?

It’s completely true that the percentage of teachers who are rated unsatisfactory or ineffective has barely budged after the reforms, with a few notable exceptions.

That said, these new reforms created multiple rating categories. One of them falls between a proficient rating and an unsatisfactory rating: a so-called “developing” or “needs improvement” rating. And a non-trivial number of teachers receive that rating, up to 4 to 5 percent. We’re talking about one out of 20 teachers whom we’re identifying as someone who has been formally signaled [via evaluation ratings] that they need to continue to improve their practice to meet standards.

Focusing on the total percentage of teachers who are rated above unsatisfactory also masks the value of differentiating at the high end. Under the reforms, we now talk in many states about teachers being “exemplary” and “distinguished,” above and beyond some of their colleagues who are also succeeding at their jobs and serving kids well. Just having that conversation around differences in effectiveness and performance opens the door for continued improvement beyond just being proficient.

So, to suggest that there has not been any significant change in the distribution of teachers would be inaccurate. In the past, evaluation systems were mostly binary. You were either proficient or not proficient.

That’s correct. On the other hand, what does a “needs improvement” rating mean?

If it doesn’t imply that teachers are going to receive additional targeted support, or that there are potential consequences for performance, it’s just de facto differentiation. We have to act on the new information or we lose the value of changing the system from binary to multiple categories.

How much constructive feedback is taking place under the new systems? How much targeted support is being provided?

We don’t have great answers nationally on whether these reforms have substantially improved the feedback teachers are getting, or substantially changed the ways in which school districts are identifying and addressing low-performing teachers.

One challenge is that principals are likely to exit teachers informally rather than under formal evaluation processes. It’s hard to track that. But some of the best evidence we do have, from a handful of studies that have analyzed the effect of implementing high-stakes teacher evaluation systems in districts, shows that lower-performing teachers, as judged by contributions to student performance on state standardized tests, are more likely to exit the profession under these new systems than they were in the past, and more of them leave than their higher-performing peers.

That said, most states, when they wrote their Race to the Top plans and rolled out new evaluation systems in districts, didn’t emphasize that the purpose was to identify low-performing teachers and move them out of the profession. They framed it as a system to help good teachers become great. They talked about [improved evaluations] being a development tool. And there’s where I think the new systems have fallen short of their potential.

We have not invested the time or money it would take to do that well. Districts have been trying to do professional development for decades, and it’s been one of the least successful elements of U.S. public education systems.

It’s not to say that we can’t fold improved professional development into the new evaluation systems. I’ve certainly talked to principals that perceived it to be quite meaningful for their teachers and also teachers who have said as much.

But we have not systematically built on teachers’ strengths and helped them improve on weaknesses in a non-punitive way, one that gives teachers the sense that they can be transparent about their struggles rather than defensive out of fear that what they say might later be used against them in a high-stakes decision.

On the whole, I don’t think the new evaluation systems have improved the quality of professional development teachers are getting very much. We have created new information [on teacher performance] that we didn’t have before that could be used to that end—observations of instructional practices, student surveys. It’s just unfortunate that it’s not the norm to use the information more systematically for targeted feedback.

It sounds like it’s not a failure of teacher evaluation reform, but rather the failure of school districts to act on the results of evaluations, to build the infrastructure required to use the new information effectively.

It’s not obvious to me that we can completely separate program design from program implementation. Many teachers were troubled by the design of the new systems.

What features of the new evaluation systems turned teachers against them?

If the federal government, philanthropic organizations, and reform-oriented policy shops had not demanded that test-based measures of teacher performance play a central role in these new evaluation reforms, and if they had provided more flexibility for districts and states to design their metrics, the level of opposition might have been reduced substantially.

It’s not obvious that these test-based measures contributed a huge amount to the theory of action around improving teachers. They were really more focused on differentiation [of teacher performance]. And only for one out of every five teachers at best, because the measures didn’t exist for the vast majority of teachers.

Teacher evaluation was weaponized by the test-based measures.

That was the line in the sand for reformers. Student achievement added one more data point that I think is important. But by being inflexible around that hot-button issue, we really undercut the potential value of the teacher evaluation reforms. We missed the opportunity to focus on what desperately needed to be improved, which was understanding what high-quality instruction looks like in the classroom, then having a conversation with teachers about where their practice sits on that spectrum, using multiple measures. Teacher effectiveness is multi-dimensional and evaluation systems should reflect this.

Also, reformers tried to take on too much at once, with little input from teachers on the ground. At the same time the new evaluation systems were coming on line, there were changes in curriculum, the Common Core, and new testing systems.

Education reforms take time to succeed—even more time than the number of years states have been working to implement new evaluation systems. But if only a minority of districts are successful at implementing a new system, then the solution is to design a system that can be implemented with greater fidelity, not to blame poor implementation.

What grade do you give the teacher unions in the teacher evaluation reform conversation over the past decade?

I often wonder why teacher unions haven’t been more proactive about guaranteeing the effectiveness of their members. Other professions establish standards of professional practice and regulate themselves. And the AFT, at least, was engaged in conversations around teacher evaluation and reform early on.

The AFT and the NEA became focused on test-based measures. I don’t fault them for that. But if unions were more about taking ownership of the evaluation process—through, say, a peer evaluation and review system where master teachers are the primary evaluators under a system that actually differentiated teacher performance—they would gain a level of credibility they currently lack. Instead, they invested time and resources fighting evaluation reform and defending those very few teachers who aren’t serving kids well.

On the other hand, reformers and the federal government did not do nearly as much to give unions a seat at the table early on to help think through what new evaluation designs could look like. Had they been more inclusive, the unions might have taken a different approach. There’s fault to be found on both sides.

Should we use outside evaluators as a check against the ratings of building principals?

There’s clear evidence that principals rate their own teachers substantially higher than an outside rater would. But there are substantial costs to bringing in external evaluators. It’s not clear paying the salaries of outside folks is the best investment, compared to giving teachers in buildings time to observe each other and provide peer feedback, as well as bringing in outside content area experts to provide teachers with one-on-one feedback.

What’s next for teacher evaluation reform?

My hope is that we don’t abandon the idea of talking about differences in teacher effectiveness and focusing on ways in which we can identify specific areas of practice that teachers need to improve on, and providing them the support that they need to improve. And, in those cases in which teachers are not able to meet professional standards, exiting them from the profession.

That said, we would be well-served by providing more flexibility. Maybe we continue to use multiple measures [for rating teachers]. But what if we provided principals with ratings on three or four different measures and they were able to use that to make personnel decisions, without being constrained to take X approach with a teacher who gets a rating of Y?

There’s a lot of value in providing principals with more information about teacher performance and then equipping them with the ability to allocate resources and support systems for those teachers that need to improve their practice. There’s value in having leaders develop a school culture in which the focus is on open doors and continual improvement.

At the same time, having a system that rewards excellence is a key part of attracting a young and dynamic pool of talent that wants to work hard and serve kids, but also to be recognized for their performance through public recognition, compensation, and career ladder opportunities.

Can we recognize excellence absent a legitimate system for recognizing excellence, one that is deemed credible by teachers when their colleague next door gets a $10,000 bonus at the end of the year and they don’t?

Teachers’ perceptions of the accuracy and fairness of the rating process are key. Teachers care about kids and they want to make a difference in the world. But they also care about their practice and feeling a sense of success. So having that rated formally, if it’s accurate, is a good thing.

What do we need to do to incentivize school leaders to embrace the continuous improvement model that you just described?

I’m optimistic about approaches that try to rethink the leadership model in schools, with a leader that’s focused on instruction and another that’s focused on operations. That would free principals who are instructional leaders to do that work and not worry about whether the buses run on time.

There’s also no doubt that principals would benefit from greater flexibility in personnel decisions. Internal transfer rules [that give senior teachers rights to open positions, regardless of their teaching ability] or restrictions on open posting constrain the decisions principals make because they don’t think it’s possible to fill the position with the best possible teachers.

And principals need resources—whether financial or technical support—when they lack the expertise to support teachers’ instructional improvement.

What’s your take-away from a decade’s worth of teacher evaluation reform?

In the end, evaluating out low-performing teachers is somewhat of a red herring. Even if that went as well as some reformers wished, it’s not going to dramatically move the needle [on a school district’s teacher performance]. What we have to do is to take the vast majority of teachers who are good and help them become great. That’s an incredibly hard thing to do because teaching is a very complex task.

But there’s no doubt that knowing what teachers are struggling with and what they excel in is a part of the process. The evaluation reforms have moved us in the direction of developing rigorous observational instruments and better data systems to track these things. There’s a lot of ancillary data infrastructure that we now have, which is a great thing. We need to be able to build on that with an emphasis on supporting teacher improvement.

— Thomas Toch

Thomas Toch is a researcher at Georgetown University’s McCourt School of Public Policy.

This piece originally appeared on the FutureEd website. FutureEd is an independent, solution-oriented think tank at Georgetown’s McCourt School of Public Policy. Follow on Twitter at @futureedGU
