Do Merit Pay Systems Work? What Can We Learn from International Data?
Recently, Education Next released a path-breaking, peer-reviewed study by Ludger Woessmann, which used international data from the Programme for International Student Assessment (PISA) to estimate the long-term impacts of merit pay arrangements for teachers on student performance in math, science and reading at age 15. The study has the great advantage of providing an estimate of long-term impacts of merit pay that cannot be identified by looking at the impact of policy innovations after two or three years. However, the study is necessarily limited by the fact that it is based on observations from only the countries for which relevant data are available.
Even though Woessmann’s innovative study was executed with great care and sophistication (a version is now available in the Economics of Education Review), a group that calls itself the National Education Policy Center, and that receives substantial funding from teacher unions, has persuaded a reviewer to write a misleading critique of the paper. Such critiques are standard practice for the NEPC. It critically reviews many studies, no matter how well executed, if their findings do not lend support to positions the unions have taken. Fortunately, Woessmann has agreed to take the time to reply to a review more disingenuous than thoughtful. His response is highly technical, but for those interested in the methodological specifics, it is worth a careful read.
Ludger Woessmann replies:
The NEPC review, which makes a number of critical and partly strident claims about my paper, “Cross-Country Evidence on Teacher Performance Pay,” is a perfect example of a case where there is a lot of new and correct material in the text, but alas, what is correct is not new and what is new is not correct. Let’s start with the “not so new” statements.
The reviewer states: “The primary claim of this Harvard Program on Education Policy and Governance report and the abridged Education Next version is that nations ‘that pay teachers on their performance score higher on PISA tests.’ After statistically controlling for several variables, the author concludes that nations with some form of merit pay system have, on average, higher reading and math scores on this international test of 15-year-old students.” This is not a “claim,” but simply a statement of descriptive fact. Not even the reviewer can deny that.
The bottom-line criticism of the reviewer is that “drawing policy conclusions about teacher performance pay on the basis of this analysis is not warranted.” That statement is hardly new. Compare it to my own conclusion in my abridged version in Education Next: “Although these are impressive results, before drawing strong policy conclusions it is important to confirm the results through experimental or quasi-experimental studies carried out in advanced industrialized countries.” Where’s the substantive difference that would justify the strident review?
Next, the reviewer states repeatedly that “attributing causality is problematic” in such a study based on observational data. Right: this is exactly what my paper states very clearly a number of times, and addresses with a number of additional analyses. Even in the abridged version of the study, I take substantial care to highlight the cautions that remain with the study. It is seriously misleading for a reviewer to repeat the caveats highlighted in the study itself as though they were new objections.
Additional limitations of the analysis highlighted in the paper and simply repeated by the reviewer are that the number of country observations is limited to 28 OECD countries and that the available measure of teacher performance pay is imperfect. In particular, the measure does not distinguish different forms and intensities of the performance-pay scheme. The value added by such repetition is unclear to me. However, what the reviewer ignores (and what starts to move the case from “not so new” to “not so correct”) is that all these factors work against the findings of the paper. They limit statistical power and possibly bias the coefficient estimate downwards and, in this sense, only make the finding stronger.
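The downward-bias argument can be illustrated with a small simulation. This is only a sketch on synthetic data, not the study's data or code: it assumes the imperfection in the measure behaves like random misclassification of a yes/no indicator, and the names (has_scheme, recorded), the 20 percent error rate and the effect size of 25 points are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000                # hypothetical number of observations
true_effect = 25.0         # assumed score gain where a scheme really exists

# True (unobserved) indicator and a recorded version that is wrong 20% of the time.
has_scheme = rng.integers(0, 2, n)
flip = rng.random(n) < 0.2
recorded = np.where(flip, 1 - has_scheme, has_scheme)

# Synthetic test scores driven by the TRUE indicator plus noise.
score = 500 + true_effect * has_scheme + rng.normal(0, 50, n)

# Simple regression slopes (np.polyfit returns the slope first for degree 1).
slope_true = np.polyfit(has_scheme, score, 1)[0]
slope_recorded = np.polyfit(recorded, score, 1)[0]

print(f"slope using the true indicator:     {slope_true:.1f}")
print(f"slope using the recorded indicator: {slope_recorded:.1f}")
```

Under these assumptions the slope on the mismeasured indicator is pulled toward zero (in expectation to 25 × (1 − 2 × 0.2) = 15), so an estimate based on the noisy measure would understate the true association. Whether the actual PISA measure errs in this way is not something the sketch can establish.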
Now for the directly “not correct” statements. The review claims that dropping a single country can overturn the results. This is not correct. As stated in the study, qualitative results are robust to dropping any individual country, as well as to dropping obvious groups of countries. (Of course, the point estimates vary somewhat, albeit not in a statistically significant way – what else should be expected?) The review also claims that the “geographical distance between countries, or clusters of countries,” may drive the results. But the study reports specifications with continental fixed effects and specifications that drop different clusters of countries, both of which speak against this being a “serious concern.”
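A drop-one-country check of the kind described here can be sketched as follows. The data are synthetic and every number (28 countries, 200 students each, an effect of 25 points) is an assumption chosen for illustration; the variable names are hypothetical and this is not the specification used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_countries, n_students = 28, 200          # illustrative sizes only
country = np.repeat(np.arange(n_countries), n_students)

# Hypothetical system-level indicator: every second country allows performance pay.
merit_pay = (np.arange(n_countries) % 2)[country]

# Synthetic scores: a country-wide shock plus student-level noise.
country_shock = rng.normal(0, 5, n_countries)[country]
score = 500 + 25 * merit_pay + country_shock + rng.normal(0, 50, country.size)

X = np.column_stack([np.ones(country.size), merit_pay])

def merit_pay_coef(rows):
    """OLS coefficient on the merit-pay indicator for the selected rows."""
    beta = np.linalg.lstsq(X[rows], score[rows], rcond=None)[0]
    return beta[1]

full_sample = merit_pay_coef(np.ones(country.size, dtype=bool))

# Re-estimate 28 times, each time leaving one country out.
leave_one_out = {c: merit_pay_coef(country != c) for c in range(n_countries)}

print(f"full sample: {full_sample:.1f}")
print(f"range when dropping one country: "
      f"{min(leave_one_out.values()):.1f} to {max(leave_one_out.values()):.1f}")
```

If the sign and rough size of the coefficient survive every one of the 28 re-estimations, no single country is driving the result. The point estimates will still move somewhat from run to run, which is what one would expect.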
The press release for the review (although not the review itself) claims that “The data are analyzed at the country level.” In fact, all regressions are performed at the level of over 180,000 students, controlling for a large set of background factors at the student and school level. The information on the possibility of performance pay, though, is at the system level.
The press release also highlights the point raised above about heterogeneity in the performance-pay schemes by stating that “Perhaps one type of approach is beneficial, while another is detrimental.” Right – but the whole point is that on average they are positively related to achievement.
The method used in the paper, clustering-robust linear regression, may not be well known to the reviewer, but contrary to the reviewer’s claim, it does in fact take care of the hierarchical structure of the error terms. Monte Carlo analyses have even shown that it does so in a way that is usually more robust than the method suggested by the reviewer (multilevel modeling).
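For readers unfamiliar with the technique, the following is a minimal sketch of a student-level regression with standard errors clustered by country. It uses synthetic data and a textbook sandwich estimator with a common small-sample correction; the function name ols_cluster_robust, the variable names and all numbers are assumptions made for illustration, not the paper's code or specification.

```python
import numpy as np

def ols_cluster_robust(X, y, clusters):
    """OLS coefficients with conventional and cluster-robust standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Conventional standard errors, which treat all students as independent.
    se_iid = np.sqrt(np.diag(resid @ resid / (n - k) * XtX_inv))

    # Cluster-robust "sandwich": sum the score vectors within each cluster,
    # so errors may be arbitrarily correlated inside a country.
    labels = np.unique(clusters)
    meat = np.zeros((k, k))
    for g in labels:
        rows = clusters == g
        s = X[rows].T @ resid[rows]
        meat += np.outer(s, s)
    G = len(labels)
    correction = (G / (G - 1)) * ((n - 1) / (n - k))  # common finite-sample factor
    se_cluster = np.sqrt(np.diag(correction * XtX_inv @ meat @ XtX_inv))
    return beta, se_iid, se_cluster

# Synthetic example: students nested in 28 hypothetical countries.
rng = np.random.default_rng(0)
n_countries, n_students = 28, 200
country = np.repeat(np.arange(n_countries), n_students)
merit_pay = (np.arange(n_countries) % 2)[country]       # system-level indicator
background = rng.normal(0, 1, country.size)             # student-level control
country_shock = rng.normal(0, 20, n_countries)[country]
score = (500 + 25 * merit_pay + 15 * background
         + country_shock + rng.normal(0, 80, country.size))

X = np.column_stack([np.ones(country.size), merit_pay, background])
beta, se_iid, se_cluster = ols_cluster_robust(X, score, country)
print("merit-pay coefficient:", round(beta[1], 1))
print("conventional SE:", round(se_iid[1], 1), "clustered SE:", round(se_cluster[1], 1))
```

Because the performance-pay indicator varies only across countries, the clustered standard error on it is typically much larger than the conventional one. That is the sense in which clustering accounts for the hierarchical error structure even though the regression is run on individual students.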
The reviewer wrongly claims that my “report concludes … that the threat of omitted variable bias is … proven to be negligible.” I am not aware how any empirical study could prove such a thing – “proving” that omitted variable bias is negligible is clearly scientific nonsense.
Learning from International Data
The bottom line is whether, despite the caveats that my study itself mentions, we can learn anything from the cross-country analysis. Of course we can. The paper presents new empirical evidence that complements existing studies on performance pay, not least because the cross-country design goes some way to capture general-equilibrium effects of teacher sorting that have eluded existing experimental studies. Some evidence, combined with extensive robustness checks, is clearly better than no evidence, also as a basis for policy discussion.
The reviewer did not present a single attempt to test whether his claims have any validity. By contrast, the reviewed study has put clear evidence on the table, and shown that it is robust to a forceful set of validity checks (the more demanding of which are not even discussed in the review). It is up to the reader which approach – the one of the original study or the one of the reviewer – is more convincing. But it even seems that the reviewer, despite the strident language contained in the press release that summarizes his analysis, in the end agrees with my assessment: “The study presented in the Harvard Program on Education and Governance report, and abridged in Education Next, is a step in the right direction.”
– Paul E. Peterson