Value-Added Evaluation & Those Pesky Collateralized Debt Obligations
Last week, while I was away, Brookings released another of its occasional “consensus” documents, this one titled “Passing Muster: Evaluating Teacher Evaluation Systems.” The effort was once again led by Brookings’ savvy Russ Whitehurst. The aim, more or less, is to tell state and federal officials how to “achieve a uniform standard for dispensing funds to school districts for the recognition of exceptional teachers without imposing a uniform evaluation system.”
The report offers an impressive seven-step model to help policymakers figure out how many teachers will be misidentified by different evaluation strategies under different sets of assumptions. “Misidentification” is meant conceptually but, practically speaking, is discussed in terms of how the teachers in question fare on value-added calculations. The report also features new jargon: “tolerance” characterizes “how willing policymakers are to risk an error of over-inclusion,” and “exceptionality” characterizes “the cutoff in a teacher rank distribution that is used for decision-making.” The paper is clever, and fine as far as it goes, but it leaves me concerned about the direction of teacher evaluation policy.
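To make the idea of misidentification concrete, here is a minimal simulation sketch. It is not the report's seven-step model. The function name, the normal distributions, and the `correlation` parameter are all illustrative assumptions. Note that the sketch presumes a single "true" quality score exists to be approximated, which is exactly the kind of assumption worth questioning.

```python
import random


def misidentification_rate(correlation, exceptionality, n_teachers=20000, seed=0):
    """Share of teachers flagged as exceptional who are not truly in the top group.

    Hypothetical setup (not taken from the Brookings report):
    - each teacher has a "true" quality drawn from a standard normal;
    - the evaluation score equals correlation * true quality plus independent
      noise, scaled so the score also has unit variance;
    - exceptionality is the top fraction of the ranking that gets rewarded.
    """
    rng = random.Random(seed)
    noise_weight = (1 - correlation ** 2) ** 0.5

    teachers = []
    for _ in range(n_teachers):
        true_quality = rng.gauss(0, 1)
        observed = correlation * true_quality + noise_weight * rng.gauss(0, 1)
        teachers.append((true_quality, observed))

    n_top = int(n_teachers * exceptionality)
    indices = range(n_teachers)

    # Teachers who really belong in the top group.
    truly_top = set(
        sorted(indices, key=lambda i: teachers[i][0], reverse=True)[:n_top]
    )
    # Teachers the evaluation system would actually flag.
    flagged = sorted(indices, key=lambda i: teachers[i][1], reverse=True)[:n_top]

    wrongly_flagged = sum(1 for i in flagged if i not in truly_top)
    return wrongly_flagged / n_top


if __name__ == "__main__":
    for r in (0.3, 0.6, 0.9):
        rate = misidentification_rate(correlation=r, exceptionality=0.25)
        print(f"correlation={r:.1f}  misidentified share of flagged={rate:.2f}")
```

Under these assumptions, a weaker correlation between the evaluation score and the presumed true quality means more of the rewarded teachers are the "wrong" ones. The whole calculation, though, is only as meaningful as the benchmark it treats as truth.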
The exercise aims to inform efforts to evaluate teachers for whom districts can’t do value-added analysis, but the underlying thread seems to be the casual, implicit assumption that reading and math value-added are the “true” measure of teacher quality. This is hardly a unique take; it’s become the norm. The same stance characterized the Gates Foundation’s Measures of Effective Teaching report last winter, with its effort to gauge the utility of various teacher evaluation strategies (student feedback, observation, etc.) based upon how closely they approximated value-added measures.
The whole thing brings to my mind the collateralized debt bubble, in which incredibly complex models were built atop a pretty narrow set of assumptions and the simple conviction that assumptions could be taken as givens. In 2004, questioning underlying assumptions about real estate valuation would get an analyst dismissed as unsophisticated.
Edu-econometricians are eagerly building intricate models stacked atop value-added scores. Yet today’s value-added measures are, at best, a pale measure of teacher quality. There are legitimate concerns about test quality; the noisiness and variability of calculations; the fact that metrics don’t account for the impact of specialists, support staff, or shared instruction; and the degree to which value-added calculations rest upon a narrow, truncated conception of good teaching. Value-added does tell us something useful, and I’m accordingly in favor of integrating it into evaluation and pay decisions, but I worry when it becomes the foundation upon which everything else is constructed.
When well-run public or private sector firms evaluate employees, they incorporate managerial judgment, peer feedback, and so forth, without assuming that these will or should reflect project completion, sales, assembly line performance, or what-have-you. The whole point of these other measures is to get a fuller picture of performance, and that would be self-defeating if they were all supposed to measure the same underlying thing.
The one downside to having a slew of first-rate econometricians engaged in edu-research nowadays is that, in their eagerness for outcomes to analyze, they tend to care less about the caliber of the numbers than about whether they can count them. In the housing bubble, rocket scientists crunched decades of housing data to build complex models. Their job wasn’t to sweat the quality of the data, its appropriateness, or the real-world utility of their assumptions; it was to build dazzling models. The problem is that even the cleverest of models is only as good as the data. And it turned out that the data and assumptions were rife with overlooked problems.
Edu-econometricians love test scores because they can find increasingly sophisticated ways to model them. But if the scores are flawed, biased, or incomplete measures of learning or teacher effectiveness, the models won’t pick that up. Yet those raising such questions are at risk of being dismissed as unsophisticated and retrograde. (To be fair, sensible skepticism isn’t helped by the rush of union mouthpieces and carnival barkers eager to spout conspiracy theories, excuses, and ad hominem attacks.) So, the Brookings exercise is interesting and useful on its own terms, but I’m growing more than a little concerned about those terms.
– Frederick Hess