Controversial proposals for new teacher evaluation systems have generated a tremendous amount of misinformation. It has come from both “sides,” ranging from minor misunderstandings to gross inaccuracies. Ostensibly to address some of these misconceptions, the advocacy group Students First (SF) recently released a “myth/fact sheet” on evaluations.
Despite the need for oversimplification inherent in “myth/fact” sheets, the genre can be useful, especially for topics such as evaluation, where there is much confusion. When advocacy groups produce them, however, the myths and facts sometimes take the form of “arguments we don’t like versus arguments we do like.”
This SF document falls into that trap. In fact, several of its claims are a little shocking. I would still like to discuss the sheet, not because I enjoy picking apart the work of others (I don’t), but rather because I think elements of both the “myths” and “facts” in this sheet could be recast as “dual myths” in a new sheet. That is, this document helps to illustrate how, in many of our most heated education debates, the polar opposite viewpoints that receive the most attention are often both incorrect, or at least severely overstated, and usually serve to preclude more productive, nuanced discussions.
Let’s take all four of SF’s “myth/fact” combinations in turn.
I’m almost tempted to leave this one alone. Putting aside the opinion-dressed-as-fact that “meaningful” evaluations must include “objective measures of student academic growth,” it’s fair to argue, with caution, that the value-added components of new teacher evaluations would not necessarily “penalize” teachers whose students begin the year far behind, as the models attempt to control for prior achievement (though the same cannot be said as confidently of other components, such as observations).
However, whether the models account for lower-scoring students covers only a small slice of the real issue here. Rather, as I’ve noted before, the better guiding question is whether they can account, to an acceptable degree, for all the factors that are outside of teachers’ control.
So, by altering SF’s “myths” and “facts” to cover this wider range, we have my proposal for a new “combined myth”:
MYTHS: Value-added models will penalize any teacher who teaches disadvantaged or lower-performing children AND value-added models fully account for factors outside of teachers’ control
Of course, neither claim is necessarily true or false.
Well-designed value-added models can, on the whole, go a long way toward controlling for the many test-influencing factors outside teachers’ control, to no small extent because prior achievement helps pick up on these factors. But it is inevitable that even the best models will penalize some teachers and reward others unfairly (this is true of almost any measure). In addition, the estimates from some types of models are, at the aggregate level, associated with student characteristics such as subsidized lunch eligibility.
The key here is to avoid black and white statements and acknowledge that there will be mistakes.
For now, the better approach requires: recognizing that there are different types of errors (e.g., false negatives and false positives); assessing the risk of misclassification versus that of alternative measures (or alternative specifications of the same measure); and constantly checking estimates from all these measures for evidence of bias. These are serious, complicated challenges, and neither alarmist rhetoric nor casual dismissal reflects the reality of the situation.
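To make the two error types concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration rather than a description of any real evaluation system: the function name `misclassification_rates`, the reliability of 0.4, and the rule of flagging the bottom 10 percent of teachers are all hypothetical. It simply counts how often a noisy estimate flags a teacher who is not truly in the bottom group (false positive) and how often it misses one who is (false negative).

```python
import random


def misclassification_rates(n_teachers=5000, reliability=0.4,
                            flag_share=0.10, seed=1):
    """Simulate flagging the lowest-rated teachers from noisy estimates.

    All parameter values are illustrative assumptions, not empirical figures.
    `reliability` is the share of the estimate's variance that reflects the
    teacher's true effect; the remainder is random error.
    """
    rng = random.Random(seed)
    true_sd = reliability ** 0.5
    noise_sd = (1 - reliability) ** 0.5

    # Hypothetical true effects and the noisy estimates an evaluator would see.
    true = [rng.gauss(0, true_sd) for _ in range(n_teachers)]
    est = [t + rng.gauss(0, noise_sd) for t in true]

    k = int(n_teachers * flag_share)
    truly_low = set(sorted(range(n_teachers), key=lambda i: true[i])[:k])
    flagged = set(sorted(range(n_teachers), key=lambda i: est[i])[:k])

    # False positive rate: flagged teachers among those NOT truly in the bottom group.
    false_pos = len(flagged - truly_low) / (n_teachers - k)
    # False negative rate: truly bottom-group teachers who were NOT flagged.
    false_neg = len(truly_low - flagged) / k
    return false_pos, false_neg


if __name__ == "__main__":
    for r in (0.4, 0.9):
        fp, fn = misclassification_rates(reliability=r)
        print(f"reliability={r}: false positive rate={fp:.3f}, "
              f"false negative rate={fn:.3f}")
```

Running it with different assumed reliabilities shows how both error rates shrink as the measure gets less noisy. The same comparison could, in principle, be made between alternative measures or alternative model specifications.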
Right off the bat, seeing the phrase “value-added measures fluctuate from year to year” in the “myths” column is pretty amazing.
To say the least, this “myth” is a fact. Moreover, the second part – “basing evaluations on this measure…is unfair and unreliable” – is not a “myth” at all. It is, at best, a judgment call.
In the other column, some of the “facts” are nothing more than opinions, and, in a couple of cases, questionable ones at that (e.g., value-added is “accurate,” and that using it in evaluations is “imperative”).
That said, let’s modify the second part of SF’s “myth” (unfair/unreliable) and combine it with their misleading “fact” (“high predictive power”) to construct a dual myth:
MYTHS: Value-added is too unreliable to be useful AND value-added is a very strong predictor of who will be a “good teacher” the next year
The “predictive power” of value-added – for instance, its stability over time – really cannot be called “high.” On the whole, it tends to be quite modest. Also, just to be clear, the claim that value-added can predict “teacher effects on student achievement” is better phrased as “value-added can predict itself.” Let’s not make it sound more grand than it is.
At the same time, however, the fact that value-added estimates are not particularly stable over time doesn’t preclude their usefulness. First, as discussed elsewhere, even a perfect model – one that fully captured teachers’ causal effects on student testing progress – would be somewhat unstable due to simple random error (e.g., from small samples). Second, all measures of any quality fluctuate between years (and that includes classroom observations). Third, some of the fluctuation is “real” – performance does actually vary between years.
So, we cannot expect perfect stability from any measure, and more stable does not always mean better, but the tendency of these estimates to fluctuate between (and within) years is most definitely an important consideration. The design of evaluations can help – e.g., using more years of data can improve stability, as can accounting for error directly and using alternative measures with which growth model estimates can be compared and combined. And we should calibrate the stakes with the precision of the information – e.g., the bar for dismissing teachers is much higher than that for, say, targeting professional development.
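The points about random error and pooling years of data can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names `pearson` and `stability`, the class size of 25, and the standard deviations are hypothetical values chosen for illustration, and the sketch assumes each teacher's true effect is fixed across years (which, as noted above, is not quite how real performance behaves). Even so, the "perfect" estimates bounce around from year to year simply because each class is a small sample.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def stability(n_teachers=2000, class_size=25, effect_sd=0.15,
              student_sd=1.0, seed=0):
    """Compare single-year and three-year-average stability.

    All numbers are illustrative assumptions. Each yearly estimate equals the
    teacher's (fixed) true effect plus sampling error from one class.
    """
    rng = random.Random(seed)
    # Sampling error of a class mean shrinks with the square root of class size.
    noise_sd = student_sd / class_size ** 0.5
    true_effects = [rng.gauss(0, effect_sd) for _ in range(n_teachers)]

    # Six simulated years of estimates, one list per year.
    years = [[t + rng.gauss(0, noise_sd) for t in true_effects]
             for _ in range(6)]

    # Stability of a single year: correlate year 1 with year 2.
    single = pearson(years[0], years[1])

    # Stability of pooled data: average years 1-3, correlate with average of 4-6.
    early = [statistics.fmean(v) for v in zip(*years[:3])]
    late = [statistics.fmean(v) for v in zip(*years[3:])]
    pooled = pearson(early, late)
    return single, pooled


if __name__ == "__main__":
    single, pooled = stability()
    print(f"year-to-year correlation, single years: {single:.2f}")
    print(f"correlation between three-year averages: {pooled:.2f}")
```

Under these assumed values, the expected correlation is 0.0225 / (0.0225 + 0.04) = 0.36 for single years and about 0.63 for three-year averages. The simulated figures should land near those values, though the specific numbers mean nothing beyond the assumptions fed in.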
But, overall, there is a lot of simple human judgment involved here, and not much room for blanket statements.
Again, it’s rather surprising to see the very real concern that using evaluations in high-stakes decisions may impede collaboration in the “myths” column. No matter what you think of these new systems, their impact on teacher teamwork and other types of behavioral outcomes is entirely uncertain, and very important. Relegating these concerns to the “myths” column is just odd.
Also, the “facts” here consist of a series of statements about what SF thinks new teacher evaluation systems should accomplish. For example, I have no doubt that “the goal [of new evaluations] is not to create unhealthy competition,” but that doesn’t mean it won’t happen. Facts are supposed to be facts.
Anyway, there’s a good “combined myth” in here, one that addresses the overly certain predictions on both ends of the spectrum:
MYTHS: New teacher evaluations will destroy teacher collaboration and create unhealthy competition AND the new evaluations will provide teachers with useful feedback and create a “professional culture”
You can hear both these arguments rather frequently. Yet the mere existence of new evaluations will do absolutely nothing. The systems might be successful in providing feedback and encouraging collaboration, or they might end up pitting teachers against each other and poisoning the professional environment in schools.
It all depends on how carefully they are designed and how well they are implemented. Both “myths” are likely to be realized, as these outcomes will vary within and between states and districts.
And, given the sheer speed at which many states have required these new systems to be up and running, there is serious cause for concern. In a few places, the time and resources allotted have been so insufficient that any successful implementation will only be attributable to the adaptability and skill of principals and teachers.
One more thing (not at all specific to Students First or this document): we need to be careful not to overdo it with all these “feel-good” statements about new evaluations providing feedback and encouraging strong culture. I don’t mean to imply that these are not among the goals. What I’m saying instead is that statements by many policy makers and advocates, not to mention the very design of many of the systems themselves, make it very clear that one big purpose of the new evaluations (in some cases, perhaps the primary stated purpose) is to make high-stakes decisions, including dismissal and pay.
We are all well aware of this, teachers most of all, and it serves nobody to bury it in flowery language. Let’s have an honest conversation.
One last time: It is surreal to see the possibility that personal issues or incompetence “could subject [teachers] to an unfair rating” on their observations in the “myths” column. How anyone could think this a “myth” is beyond me.
And the primary “fact” – that there “should be little room for unfair bias” – can only be called naïve.
Let’s change the “could subject [teachers] to an unfair rating” in SF’s “myth” to a “would subject,” and recast it, along with their “fact,” as a combined myth:
MYTHS: Classroom observations will be an easy way for petty or incompetent principals to rate teachers unfairly AND observations will be relatively unbiased tools for assessing classroom performance
As we all know, there is plenty of room for unfairness, bias or inaccuracy, even in well-designed, well-implemented teacher observations. This is among the most important concerns about these measures. But there’s also the potential for observations, at least on the whole, to be useful tools.
What the available research suggests is that observers, whether principals or peers, must be thoroughly trained, must observe teachers multiple times throughout the year, and must be subject to challenge and validation. Put simply, observations must be executed with care (and there is serious doubt about whether some states are putting forth the time and resources to ensure this).
Like evaluations in general, observations by themselves are neither bad nor good. It’s how you design, implement and use them that matters. Downplaying this – e.g., characterizing the possibility of inaccuracy/bias as a “myth” – is precisely the opposite of the appropriate attitude.
In summary, the design and implementation of teacher evaluations are complicated, and there are few if any cut-and-dried conclusions that can be drawn at this point. Suggesting otherwise stifles desperately needed discussion, and it is by far the biggest myth of all.
- Matt Di Carlo