Value-Added In Teacher Evaluations: Built To Fail

With all the controversy and acrimonious debate surrounding the use of value-added models in teacher evaluation, few seem to be paying much attention to the implementation details in those states and districts that are already moving ahead. This is unfortunate, because most new evaluation systems that use value-added estimates are literally being designed to fail.

Much of the criticism of value-added (VA) focuses on systematic bias, such as that stemming from non-random classroom assignment. But the truth is that most of the imprecision of value-added estimates stems from random error. Months ago, I lamented the fact that most states and districts incorporating value-added estimates into their teacher evaluations were not making any effort to account for this error. Everyone knows that there is a great deal of imprecision in value-added ratings, but few policymakers seem to realize that there are relatively easy ways to mitigate the problem.

This is the height of foolishness. Policy is details. The manner in which one uses value-added estimates is at least as important as the properties of the models themselves, and perhaps even more so. By ignoring error when incorporating these estimates into evaluation systems, policymakers virtually guarantee that most teachers will receive incorrect ratings. Let me explain.

Each teacher’s value-added estimate has an error margin (e.g., plus or minus X points). As with a political poll, this error margin tells us the range within which that teacher’s “real” effect (which we cannot know for certain) is likely to fall. Unlike the margins in political polls, which rely on large random samples to get accurate estimates, VA error margins tend to be gigantic. One concrete example is from New York City, where the average margin of error was plus or minus 30 percentile points. This means that a New York City teacher with a rating at the 60th percentile might “actually” be anywhere between the 30th and 90th percentiles. We cannot even say with confidence whether this teacher is above or below average.
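The arithmetic behind the New York City example can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes that the 50th percentile stands for "average" and that percentiles are capped at 0 and 100.

```python
def percentile_interval(rating, margin):
    """Range of percentiles implied by a rating plus or minus its margin of error.

    Assumption: results are capped to the 0-100 percentile scale.
    """
    low = max(0, rating - margin)
    high = min(100, rating + margin)
    return low, high


def distinguishable_from_average(rating, margin, average=50):
    """True only if the entire interval sits above or below the average.

    Assumption: the 50th percentile is treated as "average".
    """
    low, high = percentile_interval(rating, margin)
    return low > average or high < average


# The example from the text: 60th percentile, plus or minus 30 points.
low, high = percentile_interval(60, 30)
print(low, high)                             # 30 90
print(distinguishable_from_average(60, 30))  # False
```

Because the interval from 30 to 90 straddles the 50th percentile, the check returns False: the rating cannot be told apart from average.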

In fact, we should expect that most teachers in the nation will also fit this description – the margin of error “surrounding” their value-added scores will be too wide to determine whether they are above, below, or at average. Granted, a teacher at the 60th percentile is more likely to be 60th than 90th or 30th, but the proper interpretation of the score is to regard it as average.

For a much smaller group of teachers, there will be more “conclusive” results. Although the share will vary by sample size and other factors, it is typically 10-30 percent of teachers (and this is a generous estimate; it may be much lower), roughly half of whom will be rated above and half below average. In other words, the “top” and “bottom” 5-15 percent of teachers will get scores that are large and/or consistent enough to warrant confidence that they are above or below average, i.e., that they are different from the average by a “statistically significant” margin.

So, in short, most teachers’ VA scores (typically, at least three-quarters) will be statistically indistinguishable from the average; 5-15 percent will receive scores that are significantly above average; and 5-15 percent will receive scores that are significantly below average. This kind of distribution is a staple of statistics.

Now, here’s the problem: In virtually every new evaluation system that incorporates a value-added model, the teachers whose scores are not significantly different from the average are being treated as if they are. For example, some new systems sort teachers by their value-added scores, and place them into categories – e.g., the top 25 percent are “highly effective,” the next 25 percent are “effective,” the next 25 percent are “needs improvement,” and the bottom 25 percent are “ineffective.”

In this example, virtually every single teacher in the middle two categories (“effective” and “needs improvement”) is statistically no different from average, but is placed into a category that implies otherwise. Furthermore, many of the teachers in the top and bottom categories (“highly effective” and “ineffective”) also have estimates that are “within the margin of error,” and are therefore also properly regarded as average. These teachers are not only being misclassified, but are actually being given superlative labels like “highly effective.”

So, again – if we ignore error margins, the majority of teachers are pretty much guaranteed to be misclassified. A more rigorous approach – the one that ensures the greatest accuracy – would be to group teachers according to whether or not their estimated score is different from the norm by a statistically significant margin. For example, a three-category scheme: above average, average, below average (an alternative, though less preferable, would be to set a minimum sample size for teachers’ estimates to “count”).
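A three-category scheme of this kind might look like the following minimal sketch. Everything specific here is an assumption made for illustration: the function name, the use of a standard error with a 1.96 multiplier (roughly a 95 percent margin), a norm of zero, and the four made-up teachers. Actual systems would choose their own models and thresholds.

```python
def classify(estimate, std_error, norm=0.0, z=1.96):
    """Place one teacher in a category based on statistical significance.

    Assumptions: `norm` is the average effect (0.0 here), and z=1.96
    gives an approximate 95 percent margin of error.
    """
    margin = z * std_error
    if estimate - margin > norm:
        return "above average"   # whole interval lies above the norm
    if estimate + margin < norm:
        return "below average"   # whole interval lies below the norm
    return "average"             # interval includes the norm


# Hypothetical (estimate, standard error) pairs, not real data.
teachers = {
    "A": (0.30, 0.10),
    "B": (0.10, 0.12),
    "C": (-0.05, 0.15),
    "D": (-0.40, 0.15),
}

labels = {name: classify(est, se) for name, (est, se) in teachers.items()}
for name, label in labels.items():
    print(name, label)
```

In this made-up example, teacher B has a positive estimate but lands in the "average" group, because the margin of error around that estimate still includes the norm. Only A and D differ from the norm by a significant margin.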

It’s true that a scheme like this would not produce as much “spread” across categories, and that imposing a statistical significance constraint would represent a “loss of data” (i.e., many teachers with varying estimates grouped into the “average” category). But teacher evaluations – like student evaluations – above all else, need to be accurate, which means that estimates must be interpreted properly.

Finally, keep in mind that the issues I deal with here do not speak to any of the oft-discussed weaknesses of value-added estimates, such as the non-random nature of classroom assignments, the fact that different tests and models yield different results, and the fact that even statistically significant estimates may be incorrect. These problems are far more difficult to address, which makes our failure to account for random error even more exasperating. It is one of the few problems with value-added that has a readily available partial solution, one that researchers agree is advisable. Our failure to use this solution is inexplicable, and extremely risky.

- Matt Di Carlo

It would be good if teachers spoke up during these discussions so that VAMs are used responsibly. Our response so far has been to plug our ears and refuse to engage. Now where has that gotten us?