New Teacher Evaluations Are A Long-Term Investment, Not Test Score Arbitrage

One of the most important things in education policy to keep an eye on is the first round of changes to new teacher evaluation systems. Given all the moving parts and the lack of evidence on how these systems should be designed and their impact, course adjustments along the way are not just inevitable, but absolutely essential.

Changes might be guided by different types of evidence, such as feedback from teachers and administrators or analysis of ratings data. And, of course, human judgment will play a big role. One thing that states and districts should not be doing, however, is assessing their new systems – or making changes to them – based on whether or not raw overall test scores go up or down within the first few years.

Here’s a little reality check: Even the best-designed, best-implemented new evaluations are unlikely to have an immediate measurable impact on aggregate student performance. Evaluations are an investment, not a quick fix. And they are not risk-free. Their effects will depend on the quality of systems, how current teachers and administrators react to them and how all of this shapes and plays out in the teacher labor market. As I’ve said before, the realistic expectation for overall performance – and this is no guarantee – is that there will be some very small, gradual improvements, unfolding over a period of years and decades.

States and districts that expect anything more risk making poor decisions during these crucial, early phases.

For example, the District of Columbia Public Schools (DCPS) recently announced important changes (discussed here) to their nationally-watched evaluation system (IMPACT). Many of the changes seem sensible, but arguably the most substantial were to the scoring rubric: The "ineffective" category was expanded, and DCPS established a brand new "middle" category ("developing"), which comes with stakes attached (teachers receiving this rating for three consecutive years are subject to dismissal).*

I have no doubt these changes were thoroughly considered, and I cannot say whether they make the system "better" or "worse" by some absolute standard.**

Yet their justification remains unclear. For instance, regarding the latter change (establishing a new category), every piece of evidence we have suggests that teacher evaluation measures – both value-added and observation – are limited in their ability to differentiate among teachers in the middle of the distribution. Yet DCPS is establishing a new mid-distribution category, and, more importantly, imposing consequences for receiving it.

For their part, DCPS identified their rationale as follows:

The data and feedback indicated that DCPS’s definition of teacher effectiveness needed to be more rigorous if the district is to dramatically accelerate student achievement.

Not only does this imply that DCPS expects the new evaluation system to "dramatically accelerate" achievement a few short years after its implementation, but it also suggests that the district is making major changes to the system motivated in part by that assumption (and, perhaps, by the fact that proficiency rates have been basically flat since the new evaluations went into place).

It’s one thing to have high hopes. It’s quite another to let them influence major policy decisions, especially so early in the game.

Another "early adopter" state – Tennessee – recently completed its first year with a new teacher evaluation system, and is also contemplating modifications (laid out in this state report). I won’t review these proposed changes now: some make good sense; some I cannot judge without more information; and a few are in my view highly questionable. To their credit, like DCPS, Tennessee seems to have engaged in a systematic effort to gather feedback from various stakeholders, and the authors of the report do lay out justifications for many of their proposals.

But I am worried, perhaps prematurely, that they too may be influenced by unrealistic expectations. In the report, the state devotes considerable time to reviewing their achievement "gains" over the past year (they’re not really gains, but that's a different story). Although they’re careful to speculate about multiple possible causes (most of which, predictably, are recent policy changes), they note: "We believe teacher evaluation has also played an important role in our student achievement gains." They did the same thing in the press release.

This is, putting it mildly, completely inappropriate, and it's unbecoming a state education agency. For one thing, as I’ve discussed countless times, changes in cross-sectional proficiency rates, especially tiny changes such as Tennessee's, can’t even tell you whether average student achievement improved, to say nothing of the causes of that change. The latter must be assessed with thorough, multi-year policy evaluations, not subtraction.***
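To make the point about cross-sectional proficiency rates concrete, here is a minimal sketch in Python. Every number in it is invented for illustration: the cut score, the two cohorts and their scores are hypothetical and have nothing to do with Tennessee's actual data. It simply shows that when two different groups of students are compared, the share above a cut score can rise even as the average score falls.

```python
from statistics import mean

CUT_SCORE = 50  # hypothetical proficiency threshold, not a real state cut score


def proficiency_rate(scores, cut=CUT_SCORE):
    """Share of students scoring at or above the cut score."""
    return sum(s >= cut for s in scores) / len(scores)


# Hypothetical scale scores for two DIFFERENT cohorts of ten students each
# (cross-sectional data: this year's test-takers are not last year's).
year1 = [20, 30, 40, 48, 49, 55, 60, 70, 80, 90]
year2 = [10, 15, 20, 25, 50, 51, 52, 53, 54, 55]

# The "subtraction" approach: compare this year's rate to last year's.
rate_change = proficiency_rate(year2) - proficiency_rate(year1)

# What happened to the average score over the same period.
mean_change = mean(year2) - mean(year1)

print(f"Change in proficiency rate: {rate_change:+.1%}")
print(f"Change in average score:    {mean_change:+.1f}")
```

In this made-up example the proficiency rate goes up while the average score goes down, because the rate only registers which side of the line each student falls on. That is an extreme case chosen to make the mechanics visible, but it is the reason a small rate change, by itself, says little about average achievement and nothing about causes.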

But this is not just a methodological issue: By implying that their rate changes are partially attributable to the evaluations, even in its very first year, the state seems to be sending a signal: This system is having an immediate, measurable impact on aggregate test scores, and we can judge its utility, and perhaps even make changes, based on these outcomes.

Tennessee has not, to my knowledge, made any decisions yet, and one can only hope that their report's use of testing data was just a clumsy stab at some good publicity. Even that is a risky path, though: One problem with improper causal arguments is that the same evidence can come back to bite you. If aggregate rates don't increase next year, how will the state explain that? Will they conclude publicly that the system isn't working anymore? Will they undo the changes they make this year? Of course they won't, and they shouldn't. So don't dig that hole in the first place.

Overall, then, course adjustments in D.C., Tennessee, and elsewhere are a crucial phase for these new teacher evaluation designs. States and districts aren’t going to get these things right on the first or even the second try. The modifications should be based primarily on practitioner feedback and careful analysis, not on pre-existing beliefs or political pressure for immediate test-based gratification. This is going to be a long, difficult haul, and it's past time we stopped pretending otherwise.

- Matt Di Carlo

*****

* I characterize the new category as "middle" insofar as it is the third out of five, but it’s tough to say how the distribution of "developing" teachers will play out vis-à-vis the overall distribution of IMPACT ratings. However, it's certainly safe to say it will be toward the middle.

** Actually, given that DCPS also made significant changes to the components of final scores, including the incorporation of locally-designed assessments (weighted at 15 percent), it is impossible to say how the final scores will turn out, to say nothing of whether the new scoring rubric is appropriate.

*** The report justifies its causal argument based on the fact that "administrators consistently expressed the opinion that instruction improved this year as a result [of the new evaluations]." They don’t present any actual data, so we have to take their word on this. In any case, connecting a (presumably) majority opinion among administrators to state-level test results is pure speculation, especially given that the change in proficiency is too small to permit confidence that it's not just error.
