Develop and Validate — Then Scale
Lessons from the Gates Foundation’s Effective Teaching Strategy
Standards-based reform. School-to-work. No Child Left Behind. Small schools. Teacher effectiveness. College- and career-ready standards. Grit. The U.S. education policy community has a long history of picking up and later dropping reform ideas.
Still, it is hard to believe that, just a few years ago, the mayor of Washington, D.C., was unseated, teachers were on strike in Chicago, and the public forum nationwide was alight over teacher evaluation reforms. Before we move to the next reform idea of the day, we owe it to those who worked so hard to implement those teacher evaluation reforms to take a step back and reflect on the many lessons to be learned. I feel that obligation more keenly than most, having asked thousands of teachers and district leaders to work on the Measures of Effective Teaching project, which I led as a deputy director within the K-12 team of the Bill & Melinda Gates Foundation.
Speaking only for myself, I describe four high-level lessons I learned about the role of philanthropy in U.S. education reform.
Opening the Classroom Door
But, first, let’s remember why teacher evaluation reform was and remains so essential.
For more than four decades, researchers have documented large differences in student achievement gains when similar students are assigned to different teachers in the same schools. We also know that effective teachers can be identified based on their past student achievement gains and classroom practices. Those assessments have been confirmed by randomly assigning teachers to a different set of classrooms and subsequently tracking student outcomes. Moreover, we have learned that it is difficult to distinguish between effective and ineffective teachers before they enter the classroom. The indicators that districts currently use when hiring teachers—teaching credentials and content knowledge—explain little of the variation in teachers’ subsequent effectiveness in the classroom.
In other words, differences in teaching practice are a primary driver of children’s outcomes, both in the short and long term. But the first chance to meaningfully assess a teacher’s abilities comes after she is on the job. That’s one of the reasons why reforming teacher performance evaluations—in which nearly every teacher receives the same “satisfactory” rating—was so critical.
However, that’s not the only reason. In 1977, John Meyer and Brian Rowan observed that classroom activities are “de-coupled” from district and school oversight, that teachers enjoy tremendous autonomy once they close the classroom door. They hypothesized that the closed-door classroom was an organizational survival mechanism to protect classrooms from external interference. To keep up appearances, schools go through the motions of managing instruction, declaring 98 percent or more of teachers’ instruction as “satisfactory” and student achievement as “proficient.” However, they do so without any meaningful effort to manage what happens inside classrooms.
Whether or not Meyer and Rowan were right about the cause, instruction has not been viewed as a collective, organizational responsibility in most public schools in the United States. Policymakers have increased pressure on superintendents and principals to improve student outcomes, but they do not seem to realize that school-level leaders often have little visibility into classrooms and little leverage to change what teachers do. Before the recent reforms, there were no formal rubrics for conducting classroom observations and no common vocabulary for discussing teaching. Indeed, as we saw in Atlanta and elsewhere, administrators were willing to do almost anything to avoid interfering in teachers’ instructional practices—including tampering with students’ answer sheets and going to jail!
Some mistakenly see teachers’ autonomy as the signature perquisite of a profession. Much of the controversy surrounding teacher evaluation reform rests on the unstated norm that the classroom door should remain closed. But when there is no collective responsibility for instructional quality, when each new generation of teachers must invent their own practice (drawing largely on the instructional methods they experienced as children), teaching cannot improve from one generation to the next. The closed classroom door may shield the flame of individual creativity early in a teacher’s career, but, without the oxygen of external observation and feedback, the flame goes out.
In 2009, the Bill & Melinda Gates Foundation offered to help a set of districts to build the essential components of a new personnel system, to “re-couple” school management with classroom practice. Seven sites signed on. That included three large districts: Pittsburgh Public Schools, Hillsborough County Public Schools in Florida, and Memphis Public Schools in Tennessee (which merged with Shelby County in 2013). In addition, four California-based charter networks took part: Alliance College-Ready Public Schools, Aspire Public Schools, Green Dot Public Schools, and Partnership to Uplift Communities Schools. The Gates Foundation would ultimately invest $215 million in the initiative. (The $575 million figure that is often quoted included funds from federal, state, and local sources, which may or may not have represented a net increase in spending.)
Teacher evaluation reform was certainly controversial. However, it was never simply about firing low-performing teachers. We hoped that improving evaluation systems would lead to a cascade of positive organizational changes inside school agencies. For example, new formal observation rubrics could provide teachers and supervisors with a common vocabulary for discussing instruction, which is a necessary ingredient for collective improvement. New data from value-added measures and formal rubrics could give managers more confidence in their subjective assessments of probationary teachers and help them set a high standard for promotion. New student surveys could engage the entire school community in the mission of instructional improvement. Finally, rather than seeing accountability in conflict with the goal of teacher growth, we saw some level of accountability as necessary for improving professional development. If teachers and supervisors were going to invest their time in improvement, we believed that it should “count” in some official way.
Looking Back: Lessons Learned
We chose as our metric of success improved academic outcomes, particularly for low-income students of color. However, as reported in a recent evaluation by RAND, student outcomes in the seven sites did not improve faster than at the comparison districts in the same states.
The media coverage of the RAND report has described the intensive partnership investment as a failure. However, the truth is even sadder: after spending $215 million, it’s impossible to say whether the foundation-funded reforms “worked” or not, because most of the comparison schools were doing the same things! Three out of the four states (Florida, Tennessee, and Pennsylvania) received federal Race to the Top grants during the study period in support of similar statewide efforts, and several large districts in California were pursuing similar reform efforts with federal waivers from the No Child Left Behind Act. Granted, these other sites had not received $215 million from the Gates Foundation.
Although that is a lot of money, it represented just 1 percent of the funding in the sites over the relevant period. The tragedy is that by funding districtwide initiatives at the same time other districts were doing similar things, we had no chance of later learning what worked. That’s one of the lessons I discuss below: philanthropies should not be behaving like federal and state agencies, funding broad initiatives by local education agencies. Rather, they ought to be funding smaller-scale pilots, with comparison groups, developing the evidence base, so that the public agencies can decide what to scale up later.
Below, I describe four valuable lessons to be learned from the intensive partnerships initiative.
Lesson No. 1: Better teacher evaluations did lead to improved outcomes, but much of that evidence was found outside the partnership sites.
The intensive partnerships work occurred during an unprecedented era of teaching reforms nationwide, and many of the changes we funded were underway elsewhere as states responded to incentives in the federal Race to the Top and ESEA waivers. As a result, although the Gates-led research did not find positive impacts on student outcomes, research outside of the partnerships project has shown the impact of many of its key features elsewhere.
For example, providing teachers with feedback based on the Framework for Teaching rubric led to subsequent improvements in students’ math and reading scores in Cincinnati (see “Can Teacher Evaluation Improve Teaching?” research, Fall 2012) and in Chicago (see “Does Better Observation Make Better Teachers?” research, Winter 2015). A 2017 U.S. Department of Education study of eight districts by Garet, Wayne et al. found schools that began providing feedback to teachers outperformed comparison schools in math, although not in reading, and in a 2011 study of 78 high-school teachers, Allen, Pianta et al. reported that coaching based on the Classroom Assessment Scoring System–Secondary (CLASS) observation rubric led to improvement in student outcomes.
Other interventions using teacher evaluation data had positive effects on student learning. In a 2016 study of Tennessee teachers by Papay et al., achievement among students of low-rated teachers improved after their teacher was paired with a higher-performing colleague and encouraged to work together throughout the year. Multiple studies across the country have found that providing evaluation data to teachers and principals led to higher exit rates for ineffective teachers, including in New York City (Loeb et al. in 2015), Houston (Cullen et al. in 2017), and Chicago (Sartain and Steinberg in 2014). The public schools in Washington, DC—which implemented many of the same reforms that the intensive partnerships sites proposed—saw the largest gain in student test scores of any state in the history of the National Assessment of Educational Progress (NAEP) between 2007 and 2015 (see “A Lasting Impact” research, Fall 2017).
Therefore, it would be incorrect to conclude from the intensive partnerships site evaluation that better teacher evaluation “does not work.” On the contrary, we know that the theory of action did work in some instances when there were appropriate comparison groups. Unfortunately, because the intensive partnerships initiative was not designed to answer such questions, we learned little about the conditions required for success.
Lesson No. 2: Instead of providing large, multi-year grants for districtwide scale-up, the foundation should have invited applications to conduct pilot programs on a smaller scale first.
By inviting districts to apply for large, multi-year grants, the foundation hoped to empower local change-makers to rally support for ambitious proposals. That aspect of the initiative was a spectacular success. Like the federal Race to the Top grant program, the foundation’s application process—which required applicants to demonstrate support for the reforms from multiple stakeholders—was quite successful at inspiring bold proposals.
However, the same leaders who successfully rallied their colleagues behind complex and ambitious proposals lost that leverage the moment the grants were made and implementation faltered. In the words of Mike Tyson, “Everyone has a plan until they get punched in the mouth.” Once the grants were approved, many aspects of the plans did not survive the first punch.
Of course, the foundation tried to preserve some accountability for actually carrying out the plans by including milestones in its grant agreements and consequences for missing them. Yet the potential public embarrassment of terminating the grants—for both sides—made it unlikely that the contingencies would ever be enforced. (The federal government faced the same dilemma with the Race to the Top Initiative.) For example, all of the sites proposed to establish higher standards for the promotion of beginning teachers at the end of the probationary period. That did not happen. As reported by RAND, roughly 1 percent of probationary teachers were involuntarily denied tenure.
In addition, all of the sites had pledged to use the results of an associated Gates-funded research initiative that I led as principal investigator, the Measures of Effective Teaching (MET) project, to inform their teacher evaluation systems. MET aimed to determine how to identify excellent teaching, and we studied various performance measures, including classroom observations, student surveys, and student achievement gains (including through value-added), based on the videotaped lessons of 3,000 volunteer teachers from seven public school districts. These findings, we hoped, would inform their evaluation system designs.
While the intensive partnerships sites did make substantial changes to their evaluation systems—adopting formal rubrics, training and certifying raters, adding student achievement growth and student surveys—they also ignored some key findings from MET. For example, we had reported that, even when there were no stakes involved, principals inflated the scores of their own teachers while reliably scoring videos of teachers from other schools. Moreover, given the magnitude of rating error for individual raters, we recommended that districts use more than one observer per teacher to reduce the risk of score inflation and to improve reliability. But only one site, Hillsborough County, had someone other than an administrator at the teacher’s school observe her performance.
In the second year of the MET project (the 2010-11 school year), participating districts agreed to randomly assign teachers to classrooms within grades and subjects. The highest compliance rates were in Dallas (65 percent) and Charlotte-Mecklenburg (63 percent), which had not received intensive partnerships grants. Meanwhile, the site with the lowest compliance rate by far—27 percent—was Memphis, a grant recipient. This was an early signal that districts were going to struggle to implement their plans.
Lesson No. 3: The foundation should have asked districts to solve a more specific, tractable problem.
Challenging the closed-door norm was arguably the most ambitious organizational change that U.S. schools have ever undertaken. Unlike school desegregation and test-based accountability, teacher evaluation reform affected the day-to-day relationships of adults in schools. Making teacher evaluation more than a perfunctory exercise meant tampering with the basic relationship between principal and teacher. Few organizations could undertake such a major change without substantial risk of failure, and the intensive partnerships sites were no exception.
Take just one aspect of their plans: in the years before the intensive partnerships grants, principals regularly identified 98 percent or more of teachers as “satisfactory.” Yet the sites were expecting the same managers who had rated their teachers “satisfactory” to suddenly rate many of those same colleagues as “developing” or “needing improvement.” Extensive training on the new rubrics was not going to override managers’ basic psychological need to be consistent with their prior ratings of continuing teachers.
As RAND documented, ratings started out high and became increasingly compressed over time. Eventually, “proficient” or “exemplary” became the new “satisfactory.” Successfully implementing that single component of their plans would have required problem solving, iteration, and intense management focus. Even the highest-performing organization would have struggled to implement this one change successfully. There are several questions and adjustments that could—and should—have been explored. Should principals have evaluated teachers from other schools? Would teacher-collected video have made it easier to involve external observers? Did principals have sufficient content expertise to earn legitimacy?
In retrospect, I believe districts would have been more successful if we had invited them to focus on a more achievable goal: reinventing the way they supported, evaluated and promoted probationary teachers. Not all teachers, just those in their first two or three years of teaching.
Most collective-bargaining agreements give districts considerable leeway during the probationary period to “non-renew” early-career teachers. Probationary teachers represent less than 15 percent of the teaching force in most districts, so principals and district officials could have focused their efforts. The higher standard for tenure would have made the job protections that teachers enjoy post-tenure more legitimate, not less—something that teachers’ unions should have valued. And as long as the evaluation of probationary teachers was explicitly separated from the evaluation process for continuing teachers—perhaps it could have been administered by a separate office, with separate rules and weights on the measures—it need not have generated as much anxiety among continuing teachers, who are the vast majority of union members.
Moreover, beginning teachers have not yet learned the “closed-door” norm. Districts might have replaced that norm over time by setting a different expectation for new teachers, rather than having to confront the closed-door expectations of continuing teachers.
Lesson No. 4: The foundation should have extended its research and development strategy beyond simply developing measures of effective teaching to developing and testing solutions for other implementation challenges as well.
The MET project was designed to shed light on some specific questions surrounding the teacher evaluation process: Were teachers’ classroom observation scores related to their students’ achievement gains? Were “value-added” measures—which control for measured student baseline characteristics to infer teacher impacts—valid predictors of teachers’ causal effects following random assignment? How many raters and how many observations were required to achieve reliability? Did the teachers who produced gains on high-stakes assessments also help students score better on no-stakes tests of conceptual understanding? What were the advantages and disadvantages of different weighting methods? Could student surveys and tests of teachers’ pedagogical content knowledge take some of the weight off of classroom observations and value-added measures? When a single measure—such as a classroom observation or student test scores—is subject to gamesmanship, the right response is to find ways to spread the risk, rather than to abandon accountability entirely.
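One of the questions above—how many raters and observations are required for reliability—comes down to classical measurement arithmetic. The sketch below uses the standard Spearman-Brown prophecy formula to show how averaging additional independent observations raises reliability; the single-observation reliability of 0.35 is purely an illustrative value, not a MET estimate.

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Reliability of the average of k parallel observations,
    each with single-observation reliability r_single
    (Spearman-Brown prophecy formula)."""
    return k * r_single / (1 + (k - 1) * r_single)

# Illustrative single-lesson, single-rater reliability (hypothetical):
r1 = 0.35
for k in (1, 2, 4, 8):
    print(k, round(spearman_brown(r1, k), 2))
# prints: 1 0.35 / 2 0.52 / 4 0.68 / 8 0.81
```

The diminishing returns are the point: doubling from one observation to two buys far more reliability than doubling from four to eight, which is why the number of observations (and whether raters are independent) is a design decision, not an afterthought.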
By the standards of social science, the project substantially shifted the field’s understanding on several of these questions. Unfortunately, when the project concluded in 2013, we left many hard problems unresolved.
We should have been working alongside the partnering districts to design and test solutions to the implementation challenges that were emerging. For example, could we have changed the observation process by using teacher-collected videos or external observers so that principals would be more willing to differentiate their feedback to their own teachers? And how could districts have tweaked the promotion process so that principals would apply higher standards? Rather than have promotion be the default, for example, districts might have asked principals to jump through more hoops to tenure a teacher with low student achievement scores, or they might have introduced peer review panels or required principals to interview other teaching candidates before each promotion.
In addition, there were other critical components to a successful personnel system which needed to be developed. Is there an approach to feedback and coaching that will reliably produce improvements in student outcomes? Is there a better way to assess teaching candidates during the recruitment process, before they are on the payroll?
Looking Forward: A New Role for Philanthropy
For the 13 years between the passage of the No Child Left Behind Act in 2002 and the Every Student Succeeds Act in 2015, presidents from both parties and a bipartisan alliance of moderates in Congress provided political cover for education reformers at the state and local levels. For better or worse, that bipartisan alliance definitively collapsed in 2015, leaving teacher-quality reforms in limbo. At the moment, there is no national education reform debate to speak of, and it may be some years before new agendas and political alliances can take form in each state. Although there is a lively exchange of ideas regarding early childcare and college financing, there is still much work to be done in K-12 education, and a shortage of ideas for doing it.
Given the lack of education policy expertise and infrastructure at the state level, philanthropy should be supporting local think tanks, business alliances, and civil-rights groups to formulate an education reform agenda with local roots. That means supporting the publication of policy proposals, hosting public debates, and supporting research using local education system data (see “Making Evidence Locally” features, Spring 2017). An example of such an effort is Pathway 2 Tomorrow, a new nonprofit led by former New Mexico education chief Hanna Skandera that aims to build capacity at the state and local level. There is evidence that this ground-up approach can work: state organizations with a history of engaging local actors—such as the Prichard Committee for Academic Excellence in Kentucky or Tennessee SCORE—have played a role in nurturing state-level policy agendas. Of course, philanthropies including the Gates Foundation have funded such organizations in the past, but those activities are more important than ever now that the federal government has abdicated and many state agendas must be built from scratch.
But the most important lesson that national philanthropies should learn from the Gates Foundation’s intensive partnership initiative would be to adopt a more systematic approach to innovation. I lay out the three major phases below.
Phase I (Development):
The people closest to the work (teachers, principals, district leaders) are in the best position to identify the challenges to be solved. However, they lack the skills and expertise to find the solutions on their own. Funders should support teams of practitioners, designers, and researchers to develop solutions to knotty practical problems, such as correcting principals’ score inflation or using video to facilitate teacher evaluation and feedback. Although the initial design work should be carried out with a manageable number of volunteers, ideas should “graduate” to the next stage if they can be reliably replicated in a second set of schools or teachers or students. (Efficacy testing can wait for Phase II.)
Phase II (Validation):
Funders should then systematically test the subset of ideas that succeed in Phase I with a comparison group. The sample size for the validation stage should be large enough to discern the intended impact—but no larger. For example, to detect an improvement of 0.10 standard deviations in math achievement—which is about the same as the difference in achievement gains between a novice teacher and one in her fourth year on the job—requires roughly 1,000 students in all: 500 in the treatment group, and 500 in the comparison group (when interventions are delivered to individual students, and comparisons can be made between students within the same classrooms and schools). For interventions delivered at the teacher or school level, roughly 100 teachers or 26 schools are required for the treatment group (depending on whether the comparisons were being made within or between schools). We all know of philanthropic initiatives that have been scaled beyond these numbers without any efficacy testing.
The sample-size requirements will vary by outcome, the expected magnitude of impact, and the statistical controls available. However, we know enough about the relevant sources of variance for many outcomes—such as math and reading achievement, absences, high-school graduation, and college-going—that we should be able to establish validation sample-size requirements for each of them. Foundations that support the scale-up of interventions larger than these benchmarks—without first showing evidence of efficacy—should be called out for doing so.
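The arithmetic behind these benchmarks is a standard two-arm power calculation. The sketch below assumes 80 percent power and a 5 percent two-sided significance level; the r2 value of 0.68 (the share of outcome variance explained by baseline controls such as prior achievement) is an illustrative assumption, chosen to show how within-classroom comparisons and covariate adjustment bring the requirement down to roughly 500 students per arm.

```python
import math
from statistics import NormalDist

def n_per_arm(delta: float, r2: float = 0.0,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Students per arm needed to detect an effect of `delta` student-level
    standard deviations, when covariates explain fraction `r2` of outcome
    variance (individual random assignment, no clustering)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (z / delta) ** 2 * (1 - r2))

print(n_per_arm(0.10))           # no covariate adjustment -> 1570 per arm
print(n_per_arm(0.10, r2=0.68))  # strong baseline controls -> 503 per arm
```

For interventions delivered to whole teachers or schools, clustering inflates these requirements by a design effect, which is why the teacher-level and school-level benchmarks are larger than the student-level ones.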
Phase III (Scale-up):
Once an intervention has demonstrated efficacy, foundations should experiment with a variety of alternative approaches to scaling it up. Posting a report on a website will not suffice. For instance, funders could support events in which development teams present their results to superintendents, school boards, and state legislators. There are too few venues for such audiences to interact. Moreover, to cut through the noise, funders should help train decision-makers to interpret quantitative impact studies, so that they can be more discerning consumers. For instance, doctors are in a position to interpret the results from clinical trials because they were trained in the scientific method in medical school. Education decision-makers at the local level should be, too.
Districts and state agencies have a long history of shifting priorities as leadership changes and political winds shift. It’s an inevitable result of public governance. But it also means that public school districts cannot maintain the disciplined focus that solving any hard problem requires—or at least not on their own. Unless private foundations are willing to persist in solving the difficult, controversial problems like developing tools to improve classroom teaching, no one will.
During this post-ESSA limbo, the national philanthropies should be in tool-building mode, stocking the shelves with proven, implementable solutions for when state and local leaders are ready to resume their work.
Thomas J. Kane is the Walter H. Gale Professor of Education and faculty director of the Center for Education Policy Research at Harvard University.