In recent discussions about teacher evaluation, some people try to distinguish between “subjective” measures (such as principal and peer observations) and “objective” measures (usually referring to value-added estimates of teachers’ effects on student test scores).
In practical usage, objectivity refers to the relative absence of bias from human judgment (“pure” objectivity being unattainable). Value-added models are called “objective” because they use standardized testing data and a single tool for analyzing them: All students in a given grade/subject take the same test and all teachers’ “effects” in a given district or state are estimated by the same model. Put differently, all teachers are treated the same (at least those 25 percent or so who teach grades and subjects that are tested), and human judgment is relatively absent.
By this standard, are value-added models objective? No. And it is somewhat misleading to suggest that they are.
There is a general tendency to view all quantitative measures as “objective.” And there is a solid case that variables such as gender and age are free of subjective judgment. Even test scores probably meet the practical definition (though the interpretation of such scores does not). Moreover, one might argue that simple forms of descriptive analysis across an entire population (such as the average test score for a school or district) are generally free of human bias (though once again, interpretations of these data often are not).
Value-added models (VAMs), however, are far more complex than averages of test scores. In their real-world uses, VAMs entail causal inferences – this teacher has this unique effect on her students’ scores. Moreover, her effect is supposed to be independent of students’ backgrounds, other schooling influences, and anything else that might be a factor in student performance.
In order to make this claim, VAMs rely on consequential assumptions and judgment calls, which in many respects violate our practical definition of objectivity. It is true that the same model is typically used for all teachers, but all teachers are not treated the same way, and the decision to tolerate this imprecision is a very human judgment call.
It is, for instance, human judgment that chooses the type of VAM and which variables to include (e.g., race, gender, student attendance, students’ prior testing history) and which to exclude. The configuration of models varies a great deal between states and districts.
These design choices can have dramatic effects on the results, and they affect some teachers more than others. For example, excluding student attendance from the model might “penalize” teachers with more chronically absent students, or those who get a lot of mid-year transfers. These decisions about model design are human choices and, in many instances, they are political decisions.
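The attendance example above can be made concrete with a small synthetic sketch. All the numbers, variable names, and the data-generating assumptions here are hypothetical – this is not any real district’s model – but it illustrates the mechanism: if absences genuinely depress score gains and one teacher is assigned more chronically absent students, then leaving attendance out of the regression shifts that disadvantage into her estimated “effect.”

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-dataset: 200 students split between two teachers.
n = 200
teacher_b = rng.integers(0, 2, n)  # 0 = Teacher A, 1 = Teacher B
# Assumed non-random sorting: Teacher B's class has more chronic absences.
absences = rng.poisson(lam=3 + 6 * teacher_b, size=n)
prior_score = rng.normal(50, 10, n)
# Assumed "true" process: both teachers are equally effective,
# but each absence lowers a student's score gain.
gain = 5 + 0.4 * prior_score - 0.5 * absences + rng.normal(0, 2, n)

def teacher_b_effect(include_absences):
    """Estimate Teacher B's 'effect' via least squares,
    with or without attendance as a control variable."""
    cols = [np.ones(n), teacher_b, prior_score]
    if include_absences:
        cols.append(absences)
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, gain, rcond=None)
    return beta[1]  # coefficient on the Teacher B indicator

with_att = teacher_b_effect(True)      # close to zero: no real difference
without_att = teacher_b_effect(False)  # clearly negative: Teacher B "penalized"
print(f"Teacher B effect, attendance included:  {with_att:+.2f}")
print(f"Teacher B effect, attendance excluded:  {without_att:+.2f}")
```

With attendance in the model, Teacher B’s estimated effect is near zero (correct, by construction); with it excluded, she appears substantially less effective – the same data, two different “objective” answers, chosen by a human modeler.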
In addition, the models themselves rely on a powerful set of assumptions, and these assumptions and their implications also entail human judgments. The most widely discussed assumption is that students are randomly assigned to their classrooms, even though in reality this is rarely the case. Theoretically, however, it’s more accurate to say that the models assume that students are not assigned to classes based on any unobserved/unmeasured characteristics that are associated with achievement gains. There is some tentative evidence that, with the best models and enough years of data, the mistakes from this “flaw” in the VAMs can be reduced to a “tolerable” overall level.
Even so, there are still many individual teachers whose value-added scores are severely biased by the non-random nature of their classroom assignments. In other words, even though the models “treat” all teachers the same way (as if their classes were randomly assembled), the results do not. For example, teachers assigned more challenging students are often at a disadvantage, since the models cannot account for personal crises, disruptive behavior, or other unmeasurable issues that affect performance. Making things worse, it’s usually impossible to say just how many teachers are affected, much less whom, and by how much. The decisions about how many of these mistakes are “acceptable” are also subjective and, frankly, political.
The same argument applies to other core assumptions of VAMs, such as the absence of “peer effects” (the assumption that neither students’ peers nor teachers’ colleagues affect performance) and the “interval scaling” of test scores (e.g., the difference between a score of 20 and 30 is the same as that between 70 and 80). In practice, both assumptions are frequently violated, and, again, the violations affect some teachers’ value-added scores more than others’. Tolerating this set of flaws is also a human decision.
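The interval-scaling point can be shown with simple arithmetic. In this hypothetical example (the scores and the square-root rescaling are illustrative, standing in for any equally defensible monotone transformation of a test scale), two teachers produce identical 10-point gains on the raw scale, yet re-expressing the same test on a different scale makes one teacher look more effective than the other.

```python
import math

# Hypothetical classes: both gain 10 points on the raw scale.
teacher_a = (20, 30)  # lower-scoring class
teacher_b = (70, 80)  # higher-scoring class

def score_gain(pre, post, scale=lambda x: x):
    """Gain in scores after applying a (possibly nonlinear) rescaling."""
    return scale(post) - scale(pre)

raw_a = score_gain(*teacher_a)
raw_b = score_gain(*teacher_b)

# Same test, re-expressed on a concave (square-root) scale,
# standing in for any alternative nonlinear scaling of the test.
sqrt_a = score_gain(*teacher_a, scale=math.sqrt)
sqrt_b = score_gain(*teacher_b, scale=math.sqrt)

print(raw_a == raw_b)   # True: identical gains on the raw scale
print(sqrt_a > sqrt_b)  # True: Teacher A now appears more effective
```

Unless the test’s scale is genuinely interval (a property that is very hard to establish), the ranking of teachers depends partly on a scaling choice – another judgment call embedded in an ostensibly objective measure.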
And finally, we must also grapple with the issue of how VAMs are actually used in evaluation and/or pay systems, a decision that relies almost entirely on human judgments about what is effective and reasonable, given the models’ bias, error rates, and other limitations.
Now, I am not suggesting that we should regard VAM estimates as subjective. They are not, and it is certainly fair to say that they are considerably more objective than observations. But calling VAMs “objective measures” without qualification, and contrasting them with “subjective” observations, is misleading, at least to a non-technical audience. The models – and the way they are used – are laced with human judgments at almost every level.
Furthermore, the objective/subjective distinction might imply to some people that observations are somehow inherently inferior, even though – by almost universal agreement – a well-designed observation protocol is an essential component of effective teacher evaluation. In this sense, the objective/subjective distinction is not only misleading, but fundamentally useless. After all, if objectivity were the only goal, there could be no criticism of seniority, which is about as objective as a measure gets.
VAMs have the potential to become a very useful tool in the struggle to improve U.S. education, but they must be used responsibly – with an eye toward their limitations as well as their strengths. People must be aware of these shortcomings. Uncritically placing VAMs on an objectivity pedestal just impedes their potential usefulness on the ground.