The idea that we should “fire bad teachers” has become the mantra of the day, as though anyone were seriously arguing that bad teachers should be kept. No one is. Instead, the real issue is, and has always been, identification.
Those of us who follow the literature about value-added models (VAM) – the statistical models designed to isolate the unique effect of teachers on their students’ test scores – hear a lot about their imprecision. But anyone listening to the public discourse on these methods, or, more frighteningly, making decisions on how to use them, might be completely unaware of the magnitude of that error.
A new report from the National Center for Education Evaluation (a division of the U.S. Department of Education), done by researchers from Mathematica, provides a thorough assessment of error in VAM estimates under different circumstances. In a sense, it represents a guide for using these models in decisions about teachers and schools.
Error stems from many potential sources, including lower-quality assessments, testing conditions (e.g., a noisy radiator), and differences in ability, background, etc., between students in different classrooms (the models assume this sorting is random, though it is not).
The authors of this report gauge the extent of error by asking: how likely are “average” teachers to be wrongly classified as ineffective (or effective)?
Because there is variation between states and districts in the models’ specification, data quality, and the “thresholds” that are chosen to determine effective and ineffective teachers, the report presents a variety of different error rates for the different scenarios. But even under the best of circumstances, the results paint a fairly grim picture.
Using the best-performing model with a moderate threshold and five years of available data (more than many districts using VAM actually have), the error rate is roughly 20 percent. This means that, for example, one in five of the teachers deemed ineffective is actually “average.”
So, even under relatively favorable methodological circumstances, value-added models’ identification of ineffective teachers “fails” about one in five times. With fewer years of data, the error rate increases to between roughly 25 percent (three years of data) and 35 percent (one year); see Table 4.2 (page 23) in the report.
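To see why additional years of data reduce these error rates, here is a minimal sketch in Python. It is not the report's model. It simply assumes that each year's estimate for a truly average teacher is the true effect (zero) plus independent, normally distributed noise, and that a teacher is flagged when the multi-year average falls below a cutoff. The function name, the noise level, and the cutoff are all hypothetical values chosen for illustration, so the printed rates should not be expected to match the report's figures.

```python
from math import sqrt
from statistics import NormalDist


def false_flag_rate(years, noise_sd=1.0, cutoff=-0.5):
    """Chance that a truly average teacher (true effect = 0) is flagged.

    Assumptions (hypothetical, for illustration only):
    - each year's estimate is 0 plus independent normal noise with
      standard deviation `noise_sd`;
    - the teacher is flagged when the average of `years` annual
      estimates falls below `cutoff`.
    """
    # Averaging independent years shrinks the standard error by sqrt(years).
    standard_error = noise_sd / sqrt(years)
    return NormalDist(mu=0.0, sigma=standard_error).cdf(cutoff)


if __name__ == "__main__":
    for years in (1, 3, 5):
        print(years, "year(s):", round(false_flag_rate(years), 3))
```

Under these assumptions the rate falls as years are added, but it never reaches zero, which is the qualitative pattern the report describes.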
These findings are backed up by a large body of previous research, such as the studies demonstrating the relatively high degree of instability in value-added scores between years (e.g., McCaffrey et al. 2009; Goldhaber and Hansen 2008; or this CALDER policy brief).
While it’s true that any performance measure is subject to error, these rates make it clear that value-added models are not ready to shoulder the identification burden that many people think, or simply assume, they can. Even though these methods are sophisticated, and getting better, no knowledgeable researcher will mince words about the imprecision of VAM.
Value-added models have a great deal of potential in the quest to identify ineffective teachers (and also for instructional purposes), but when we start firing teachers based on evaluation systems in which VAM scores are heavily weighted (or start paying teachers more based on them), the bar is high. These error rates ensure that states and districts that choose to rely disproportionately on value-added will inevitably be dismissing good teachers every year, possibly replacing them with novices who are considerably less skilled. In addition, the teaching profession will come to be one in which you might get fired for reasons that are, quite literally, random. Over time, both students and teachers, as a whole, will suffer. After all, what talented person will want to pursue a career like that?
Often lost in the debate over VAM is the fact that the most important decisions in using these models are political, not methodological. The estimates are what they are. How we use them is what matters most. If these models have a role to play in evaluation and pay systems, it must be commensurate with their limitations.
For instance, great care must be exercised in determining the cutoff points beyond which teachers are “deemed” ineffective, and under no circumstances should any system ignore margins of error in assigning these designations. There must also be other valid measures to gauge teacher effectiveness, so that VAM scores become just another corroborating indicator. And the importance of data availability (i.e., how many years) cannot be overstated. Sample size matters in the same way when comparing schools and teachers: because school-level estimates rest on more students, the error rates of schoolwide value-added are roughly half those for individual teachers.
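As one illustration of what respecting margins of error could look like in practice, here is a minimal sketch in Python. It assumes each teacher's estimate comes with a standard error and that the estimate is approximately normally distributed. The function name, the labels, the cutoff, and the 95 percent confidence level are hypothetical choices, not anything prescribed by the report. A teacher is placed below (or above) the cutoff only when the entire confidence interval falls on that side.

```python
from statistics import NormalDist


def designate(estimate, std_error, cutoff=0.0, confidence=0.95):
    """Label a value-added estimate only when its margin of error allows it.

    Hypothetical rule for illustration: build a two-sided confidence
    interval around the estimate (normal approximation) and assign a
    designation only if the whole interval lies on one side of `cutoff`.
    """
    # Critical value for the requested two-sided confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = estimate - z * std_error
    upper = estimate + z * std_error

    if upper < cutoff:
        return "below cutoff"
    if lower > cutoff:
        return "above cutoff"
    # The interval contains the cutoff, so the data cannot separate the two.
    return "indistinguishable from cutoff"


if __name__ == "__main__":
    print(designate(-1.0, 0.2))
    print(designate(-0.3, 0.2))
```

A rule like this trades fewer wrongful designations for more teachers left unclassified, which is one reason other measures of effectiveness are needed alongside it.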
In many respects, Race to the Top and similar programs have incentivized the opposite of this caution. They claim to encourage “bold reforms,” but there is a thin line between boldness and recklessness. And right now, that line is straddling us.