Although most new teacher evaluations are still in various phases of pre-implementation, it’s safe to say that classroom observations and/or value-added (VA) scores will be the most heavily-weighted components toward teachers’ final scores, depending on whether teachers are in tested grades and subjects. One gets the general sense that many – perhaps most – teachers strongly prefer the former (observations, especially peer observations) over the latter (VA).
One of the most common arguments against VA is that the scores are error-prone and unstable over time – i.e., that they are unreliable. And it’s true that the scores fluctuate between years (also see here), with much of this instability due to measurement error, rather than “real” performance changes. On a related note, different model specifications and different tests can yield very different results for the same teacher/class.
These findings are very important, and often too casually dismissed by VA supporters, but the issue of reliability is, to varying degrees, endemic to all performance measurement. Actually, many of the standard reliability-based criticisms of value-added could also be leveled against observations. Since we cannot observe “true” teacher performance, it’s tough to say which is “better” or “worse,” despite the certainty with which both “sides” often present their respective cases. And, the fact that both entail some level of measurement error doesn’t by itself speak to whether they should be part of evaluations.*
Nevertheless, many states and districts have already made the choice to use both measures, and in these places, the existence of imprecision is less important than how to deal with it. Viewed from this perspective, VA and observations are in many respects more alike than different. Read More »
Late last week and over the weekend, New York City newspapers, including the New York Times and Wall Street Journal, published the value-added scores (teacher data reports) for thousands of the city’s teachers. Prior to this release, I and others argued that the newspapers should present margins of error along with the estimates. To their credit, both papers did so.
In the Times’ version, for example, each individual teacher’s value-added score (converted to a percentile rank) is presented graphically, for math and reading, in both 2010 and over a teacher’s “career” (averaged across previous years), along with the margins of error. In addition, both papers provided descriptions and warnings about the imprecision in the results. So, while the decision to publish was still, in my personal view, a terrible mistake, the papers at least make a good faith attempt to highlight the imprecision.
That said, they also published data from the city that use teachers’ value-added scores to label them as one of five categories: low, below average, average, above average or high. The Times did this only at the school level (i.e., the percent of each school’s teachers that are “above average” or “high”), while the Journal actually labeled each individual teacher. Presumably, most people who view the databases, particularly the Journal’s, will rely heavily on these categorical ratings, as they are easier to understand than percentile ranks surrounded by error margins. The inherent problems with these ratings are what I’d like to discuss, as they illustrate important concepts about estimation error and what can be done about it. Read More »
Using value-added and other types of growth model estimates in teacher evaluations is probably the most controversial and oft-discussed issue in education policy over the past few years.
Many people (including a large proportion of teachers) are opposed to using student test scores in their evaluations, as they feel that the measures are not valid or reliable, and that they will incentivize perverse behavior, such as cheating or competition between teachers. Advocates, on the other hand, argue that student performance is a vital part of teachers’ performance evaluations, and that the growth model estimates, while imperfect, represent the best available option.
I am sympathetic to both views. In fact, in my opinion, there are only two unsupportable positions in this debate: Certainty that using these measures in evaluations will work; and certainty that it won’t. Unfortunately, that’s often how the debate has proceeded – two deeply-entrenched sides convinced of their absolutist positions, and resolved that any nuance in or compromise of their views will only preclude the success of their efforts. You’re with them or against them. The problem is that it’s the nuance – the details – that determine policy effects.
Let’s be clear about something: I’m not aware of a shred of evidence – not a shred – that the use of growth model estimates in teacher evaluations improves performance of either teachers or students. Read More »
With all the controversy and acrimonious debate surrounding the use of value-added models in teacher evaluation, few seem to be paying much attention to the implementation details in those states and districts that are already moving ahead. This is unfortunate, because most new evaluation systems that use value-added estimates are literally being designed to fail.
Much of the criticism of value-added (VA) focuses on systematic bias, such as that stemming from non-random classroom assignment (also here). But the truth is that most of the imprecision of value-added estimates stems from random error. Months ago, I lamented the fact that most states and districts incorporating value-added estimates into their teacher evaluations were not making any effort to account for this error. Everyone knows that there is a great deal of imprecision in value-added ratings, but few policymakers seem to realize that there are relatively easy ways to mitigate the problem.
This is the height of foolishness. Policy is details. The manner in which one uses value-added estimates is just as important – perhaps even more so – than the properties of the models themselves. By ignoring error when incorporating these estimates into evaluation systems, policymakers virtually guarantee that most teachers will receive incorrect ratings. Let me explain. Read More »
The debate on the use of value-added models (VAM) in teacher evaluations has reached an impasse of sorts. Opponents of VAM use contend that the imprecision is too high for the measures to be used in evaluation; supporters argue that current systems are inadequate, that all measures entail error but this doesn’t preclude using the estimates.
This back-and-forth may be missing the mark, and it is not particularly useful in the states and districts that are already moving ahead. The more salient issue, in my view, is less about the amount of error than about how it is dealt with when the estimates are used (along with other measures) in evaluation systems.
Teachers certainly understand that some level of imprecision is inherent in any evaluation method—indeed, many will tell you about colleagues who shouldn’t be in the classroom, but receive good evaluation ratings from principals year after year. Proponents of VAM often point to this tendency of current evaluation systems to give “false positive” ratings as a reason to push forward quickly. But moving so carelessly that we disregard the error in current VAM estimates—and possible methods to reduce its negative impacts—is no different than ignoring false positives in existing systems. Read More »
The idea that we should “fire bad teachers” has become the mantra of the day, as though anyone was seriously arguing that bad teachers should be kept. No one is. Instead, the real issue is, and has always been, identification.
Those of us who follow the literature about value-added models (VAM) – the statistical models designed to isolate the unique effect of teachers on their students’ test scores – hear a lot about their imprecision. But anyone listening to the public discourse on these methods, or, more frighteningly, making decisions on how to use them, might be completely unaware of the magnitude of that error. Read More »