Incentives And Behavior In DC’s Teacher Evaluation System

Posted by Matt Di Carlo on October 17, 2013

A new working paper, published by the National Bureau of Economic Research, is the first high quality assessment of one of the new teacher evaluation systems sweeping across the nation. The study, by Thomas Dee and James Wyckoff, both highly respected economists, focuses on the first three years of IMPACT, the evaluation system put into place in the District of Columbia Public Schools in 2009.

Under IMPACT, each teacher receives a point total based on a combination of test-based and non-test-based measures (the formula varies between teachers who are and are not in tested grades/subjects). These point totals are then sorted into one of four categories – highly effective, effective, minimally effective and ineffective. Teachers who receive a highly effective (HE) rating are eligible for salary increases, whereas teachers rated ineffective are dismissed immediately, and those rated minimally effective (ME) for two consecutive years can also be terminated. The design of this study exploits that incentive structure by, put very simply, comparing teachers who scored directly above the ME and HE thresholds with those who scored directly below them, to see whether the two groups differed in terms of retention and subsequent performance. The basic idea is that these teachers are all very similar in terms of their measured performance, so any differences in outcomes can be (cautiously) attributed to the system’s incentives.

The short answer is that there were meaningful differences.

Teachers with scores just below the ME cutoff were substantially more likely to quit, while those who remained exhibited modest but meaningful relative performance gains the following year (equivalent to about five percentile points). Similarly, teachers who earned an HE rating, and were therefore eligible for a large, permanent salary increase if they did so again, also seemed to improve their performance (though the retention “effects” were not statistically significant).

This is very promising, albeit tentative evidence that teachers responded to the incentives embedded in the IMPACT system. There are a few additional points that bear brief mention.

This is not an overall assessment of IMPACT. This is an excellent paper by talented, capable researchers. But there are several (mostly standard) caveats, all of which are discussed thoroughly by Dee and Wyckoff. The most important of these is that the estimated effects pertain only to the group of teachers who were near the ME and HE thresholds. Whether IMPACT influenced the performance and retention of all other teachers is not addressed directly by this study. Of course, this is not to say that the findings are unimportant – after all, it is teachers at the low and high ends of the spectrum who are of particular policy significance. Nevertheless, it is important to avoid drawing sweeping conclusions about the “average effect” of this system.

The fact that teachers seem to have responded conflicts with some of the prior research on incentives… It is well known that “traditional” merit pay systems – essentially, paying teachers cash for improving test scores – have a poor track record in the U.S. Several recent studies, including experimental evaluations, have found little or no effect of these programs on performance or retention. A few possible factors, including the dismissal threat and/or the fact that IMPACT is a multi-measure evaluation system, might help explain the discrepancy (the former possibility is particularly compelling given the stronger results for teachers near the ME threshold compared with those near the HE threshold).

… but it is not particularly surprising. It is not really shocking that the promise of a huge, permanent raise or especially the threat of dismissal would influence teachers’ labor market choices and behavior. Of course, when it comes to the estimated performance effects, it is impossible to know what kinds of behavioral changes actually led to the improvement, but the study’s results suggest it wasn’t manipulation (e.g., principal favoritism), as the improvements were not concentrated in just one of IMPACT’s components (though this is somewhat less true among teachers near the HE cutoff). Moreover, it is tough to say whether teacher labor markets in other cities (or, perhaps, in D.C. over the long term) could withstand the cycle of forced and voluntary attrition that stems from systems like IMPACT, or whether and how this kind of system might influence the type of people who pursue a teaching career. In any case, it will be important, going forward, to see whether these findings are confirmed by strong research on other systems.

The estimated effects were mostly concentrated in the second year. It seems that the performance and retention effects didn’t really show up after the first year of IMPACT. Dee and Wyckoff speculate that this may be due to the system having gained “credibility” after its first year. There’s something to this, especially since the second year was the first in which ME teachers might be dismissed (as they had to receive that rating twice in a row). Nevertheless, replicating this study using additional years of data would seem to be an important next step (especially given that the design of IMPACT was changed this year).

What can we conclude from this study? This is just the first round in what will hopefully be a large body of strong evidence on the effects of these new evaluation systems, and the incentives attached to them. The fair (albeit tentative) conclusion from this particular analysis is that IMPACT seems to be having the intended effect of changing behavior, at least in the second year of the data, and among teachers close to the thresholds. That is promising and should not be diminished. On the other hand, there are still many open questions here, and it’s a good idea to keep a level head going forward.

- Matt Di Carlo


2 Comments posted so far

  • The only valid conclusion that can be drawn from the study’s methodology was reported by the Washington Post’s Emma Brown. “Rewards and punishments embedded in the District’s controversial teacher evaluation program have shaped the school system’s workforce, affecting both retention and performance,” Brown explained, but the report is “silent about whether the incentives have translated into improved student achievement.”

    Wyckoff and Dee compared teachers whose evaluation scores were close to the dividing lines for being considered a high performer or a low performer. This method could have been the first step in a valid social science study.

Under the D.C. IMPACT system, the lower-rated teacher’s job is at risk, and he thus has a strong incentive to change his behavior. Such a teacher has more motivation for increasing his evaluation scores. To my knowledge, nobody has ever doubted that such a teacher would make changes to avoid termination.

    The question of whether that teacher becomes more effective in teaching, however, is completely separate. Wyckoff and Dee don’t even attempt to address that, more important, question.

In the real world, it should be obvious that under-the-gun teachers have more motivation to precisely follow instructions and teach more directly to the test. In such a situation, test scores and the teacher’s value-added scores are likely to increase. Similarly, threatened teachers are more likely to be more obedient when writing lesson plans, articulating the objectives and standards in precisely the right manner. When a teacher’s job is at risk, he will work harder to do what evaluators think is important. More often than not, he will put more care into his “data walls” and “word walls,” and conform to whatever the evaluator sees as the ideal presentation of those silly little details. In other words, at-risk teachers will bite their tongues, toe the line, and be much more compliant regarding the trappings of the observation process. Regardless of whether that effort improves teaching and learning, it should result in higher scores on evaluations.

Because of IMPACT, the behavior of principals, other evaluators and teachers has changed enough to increase teachers’ evaluation scores by about 10 points on a 400-point scale. Perhaps such a change requires more than just stepping up effort on busy-work, or perhaps not. The more interesting question would be whether teachers increased their “value-added.” Perhaps the most interesting question is why Wyckoff and Dee do not focus on that issue …

Wyckoff and Dee make a big deal about teachers who are ranked lower leaving the system, because “less effective teachers under the threat of dismissal are more likely to voluntarily leave.” That would be a big deal if they had evidence that those who left were actually lower-performing. But effective teachers who are wrongly rated as “Minimally Effective” are also more likely to say “take this job and shove it,” and leave.

    It is safe to assume that some low-rated teachers were fairly evaluated and are actually less effective in the classroom and some are good teachers who were misidentified. Wyckoff and Dee have no clue who was correctly or incorrectly identified as low-performing.

IMPACT and other systems that use value-added are systematically biased against classrooms with larger numbers of English Language Learners, students on special education IEPs, and low-income students. It stands to reason that effective teachers who are “false positives,” meaning that they were inaccurately categorized, will behave like their colleagues who are not effective. Both will leave the D.C. schools.

For argument’s sake, however, let’s say that all of the 14% of the district’s teachers who were judged to be “Minimally Effective” were accurately placed in that category. Wyckoff and Dee proclaim IMPACT a success because about 20% of teachers just above the threshold for “Effective” left the school system at the end of a year, while about 30% of teachers just below that threshold quit. Was it a good bargain for the D.C. schools to impose all of the stress and the other negative byproducts of IMPACT in order to speed the exit of such a small number?

The only metric that seems able to differentiate very well is the Core Professionalism one. Had the district merely focused on the behavior of teachers, and held them accountable for good professional conduct, would such an evaluation system have produced just as much good while minimizing the downsides of the controversial new system? Fire bad teachers for what they do and don’t do, and won’t you also get rid of most ineffective ones?

    Comment by John Thompson
    October 18, 2013 at 4:40 PM
  • Here’s an interesting narrative on teacher evaluations from a blogger in Los Angeles, which in my opinion describes in more concrete terms what all these studies are talking about. Sometimes it’s easy to forget how things impact real life people:

    http://gatsbyinla.wordpress.com/2013/10/18/building-an-ecosystem/

    Comment by Educator
    October 21, 2013 at 1:27 PM

