In 2009, The New Teacher Project (TNTP) released a report called “The Widget Effect.” You would be hard-pressed to find many recent publications from an advocacy group that have had a larger influence on education policy and the debate surrounding it. To this day, the report is mentioned regularly by advocates and policy makers.
The primary argument of the report was that teacher performance “is not measured, recorded, or used to inform decision making in any meaningful way.” More specifically, the report showed that most teachers received “satisfactory” or equivalent ratings, and that evaluations were not tied to most personnel decisions (e.g., compensation and layoffs). From these findings and arguments comes the catchy title: a “widget” is a fictional product commonly used in situations (e.g., economics classes) where the product doesn’t matter. Thus, treating teachers like widgets means that we treat them all as if they’re the same.
Given the influence of “The Widget Effect,” as well as how different the teacher evaluation landscape is now compared to when it was released, I decided to read it closely. Having done so, I think it’s worth discussing a few points about the report.
My first reaction was that the report’s primary empirical contribution was important and warranted much of the attention it received, even though, as is often the case, it could have been expressed with a single table and a few paragraphs of text. Namely, in the 12 districts included in the study, only a tiny minority (about 1-5 percent) of evaluated teachers received an “unsatisfactory” or equivalent rating (and thus, predictably, very few tenured teachers were dismissed during this time).
That was a bombshell of sorts. Even though these findings are sometimes portrayed inappropriately as national estimates (it’s just 12 districts), and even though the report puts forth the somewhat misleading argument that low proficiency rates in these districts mean the evaluation results must be wrong, a distribution in which just a tiny minority of teachers received low ratings is implausible as a reflection of actual performance.
The inadequacy of teacher evaluation regimes had been an issue for decades, but this was one of the first times that the public, outside of the education field at least, was made aware of the simple fact that teacher evaluations in these 12 districts, at least insofar as final ratings are the gauge, seemed to be more of a formality than anything else.
Such a conclusion was among the big catalysts for current efforts to redesign evaluations. It also shows how even the simplest descriptive statistics (in this case, a percentage) can be more powerful than the most complex statistical approaches. The report is also cited frequently in academic journal articles.
Of course, these results immediately raise the question: Why were so few teachers rated poorly? There’s a simple, if obvious, starting point for an explanation: Principals didn’t assign low ratings, and so most ratings were not low.
The report tends to imply that this was a systemic or a design failure. For example, the authors note that half of the 12 systems for which they had data offered only two categories, that many of them required only infrequent evaluations, and that principals were not properly trained to conduct them. In addition, they note that most personnel decisions, such as compensation, were not tied to evaluation ratings, and that dismissal procedures can be burdensome, which could mean that principals lacked the incentive to assign low ratings to their teachers.
These are all very important points, and they probably played a role in producing the implausibly high ratings. Still, it bears mentioning that, in at least a few of the states that have released results of their new teacher evaluations, the ratings have not exhibited a whole lot more differentiation. These systems have attempted to address many of the concerns discussed above, yet in some cases continue to reward the vast majority of teachers with the highest ratings.
Unfortunately, I haven’t seen much analysis of which components of these systems are driving the results, but it stands to reason that classroom observations are one of the big factors. If so, it may simply be the case that principals think their teachers are doing a good job, or are unwilling to give them unsatisfactory ratings for some other reason unconnected to personnel policies, such as their estimation of the prospects for finding better replacements. It is therefore important to examine which design features, whether in the composition of ratings or the incentives attached to them, are associated with different outcomes, and to balance the need for differentiation with sound measurement.*
Beyond these important descriptive findings, the rest of the report consists mostly of two elements. The first, mentioned briefly above, is advocacy for policies that might help address the issue of implausibly low differentiation (and a summary of whether these policies were in force in the 12 districts included in the study). Some of them, such as new, better evaluations and administrator training, receive relatively wide support (at least in general, without regard to specifics), whereas others, such as performance-based pay, are more controversial.
The second element that makes up most of the rest of the report is the results of a survey of teachers and administrators in the 12 districts included in the study. The authors use these survey results liberally throughout the entire report, in some cases to make rather bold statements such as “teachers and administrators broadly agree about the existence and scope of the problem and about what steps need to be taken to address poor performance in schools.”
The problem is that the survey samples were entirely non-random (it was a voluntary online survey), and the report makes no effort to compare the characteristics of its survey respondents with those of the teacher workforces in the 12 districts. The data, therefore, could have been (and were) used to offer useful insights, but they cannot be used to draw generalized conclusions about “what teachers and administrators think” even in these 12 districts, to say nothing of nationally.**
Overall, however, “The Widget Effect” did have a considerable impact on the debate about education policy, and probably even on policymaking. It remains to be seen how the issue on which it shone a light, the results of teacher evaluations, will shake out going forward.
- Matt Di Carlo
* This is one of the reasons why value-added estimates, which essentially impose a distribution on teacher performance – i.e., some teachers must be poorly rated by design – are valued by many of those who are focused on the spread of ratings across categories.