Are Value-Added Models Objective?

In recent discussions about teacher evaluation, some people try to distinguish between "subjective" measures (such as principal and peer observations) and "objective" measures (usually referring to value-added estimates of teachers’ effects on student test scores).

In practical usage, objectivity refers to the relative absence of bias from human judgment ("pure" objectivity being unattainable). Value-added models are called "objective" because they use standardized testing data and a single tool for analyzing them: All students in a given grade/subject take the same test and all teachers’ "effects" in a given district or state are estimated by the same model. Put differently, all teachers are treated the same (at least those 25 percent or so who teach grades and subjects that are tested), and human judgment is relatively absent.

By this standard, are value-added models objective? No. And it is somewhat misleading to suggest that they are.

There is a general tendency to view all quantitative measures as "objective." And there is a solid case that variables such as gender and age are free of subjective judgment. Even test scores probably meet the practical definition (though the interpretation of such scores does not). Moreover, one might argue that simple forms of descriptive analysis across an entire population (such as the average test score for a school or district) are generally free of human bias (though once again, interpretations of these data often are not).

Value-added models (VAMs), however, are far more complex than an average of test scores. In their real-world uses, VAMs entail causal inferences – this teacher has this unique effect on her students’ scores. Moreover, her effect is supposed to be independent of students’ backgrounds, other schooling influences, and anything else that might be a factor in student performance.

In order to make this claim, VAMs rely on huge assumptions and judgments, which in many respects break the seal of our practical definition of objectivity. It is true that the same model is typically used for all teachers, but all teachers are not treated the same way, and the decision to tolerate this imprecision is a very human judgment call.

It is, for instance, human judgment that chooses the type of VAM and which variables to include (e.g., race, gender, student attendance, students’ prior testing history) and which to exclude. The configuration of models varies a great deal between states and districts.

These design choices can have dramatic effects on the results, and they affect some teachers more than others. For example, excluding student attendance from the model might "penalize" teachers with more chronically-absent students, or those who get a lot of mid-year transfers. These decisions about model design are human choices and, in many instances, they are political decisions.
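To make this concrete, here is a minimal simulated sketch – not any state’s actual model, with all data and parameters invented for illustration – comparing the teacher estimates produced by two otherwise identical regressions, one of which omits student attendance:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_teachers, n_per_class = 50, 25
teacher = np.repeat(np.arange(n_teachers), n_per_class)
prior = rng.normal(0, 1, teacher.size)              # prior-year score

# Attendance differs systematically across classrooms: some teachers
# serve far more chronically absent students than others.
class_attend = rng.uniform(0.75, 0.98, n_teachers)
attend = np.clip(class_attend[teacher] + rng.normal(0, 0.03, teacher.size), 0, 1)

true_effect = rng.normal(0, 0.2, n_teachers)        # simulated "true" effects
score = (prior + 3 * (attend - 0.85) + true_effect[teacher]
         + rng.normal(0, 0.5, teacher.size))

df = pd.DataFrame({"score": score, "prior": prior,
                   "attend": attend, "teacher": teacher})

# Specification A controls for prior score only; B adds attendance.
fit_a = smf.ols("score ~ prior + C(teacher)", data=df).fit()
fit_b = smf.ols("score ~ prior + attend + C(teacher)", data=df).fit()

# The estimated teacher "effects" shift when the control set changes,
# even though the data (and the simulated true effects) are identical.
shift = (fit_a.params.filter(like="C(teacher)")
         - fit_b.params.filter(like="C(teacher)")).abs()
print(shift.describe())
```

Neither specification is "the" correct one; choosing between them is exactly the kind of judgment call the models cannot make for themselves.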

In addition, the models themselves rely on a powerful set of assumptions, and these assumptions and their implications also entail human judgments. The most widely-discussed assumption is that students are randomly assigned to their classrooms, even though in reality this is rarely the case. Theoretically, however, it's more accurate to say that the models assume that students are not assigned to classes based on any unobserved/unmeasured characteristics that are associated with achievement gains. There is some tentative evidence that, with the best models and enough years of data, the mistakes from this "flaw" in the VAMs can be reduced to a "tolerable" overall level.

Even so, there are still many individual teachers whose value-added scores are severely biased by the non-random nature of their classroom assignments. In other words, even though the models "treat" all teachers the same way (as if their classes were randomly assembled), the results do not. For example, teachers assigned more challenging students are often at a disadvantage, since the models cannot account for personal crises, disruptive behavior, or other unmeasurable issues that affect performance. Making things worse, it’s usually impossible to say just how many teachers are affected, much less whom, and by how much. The decisions about how many of these mistakes are "acceptable" are also subjective and, frankly, political.
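A toy simulation (the sorting rule and effect sizes below are invented, not drawn from any real VAM) shows how this plays out: both teachers have a true effect of zero, but one is assigned the students with the lowest values of an unmeasured factor, and a model that controls only for prior scores hands that factor to the teacher:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500                                   # students per teacher
prior = rng.normal(0, 1, 2 * n)           # prior-year score (measured)
unobserved = rng.normal(0, 1, 2 * n)      # e.g., a personal crisis (never measured)

# Non-random sorting: teacher 1 is assigned the students with the
# lowest values of the unmeasured factor.
teacher1 = np.zeros(2 * n)
teacher1[np.argsort(unobserved)[:n]] = 1.0

# Both teachers' true effect is zero by construction.
score = prior + 0.5 * unobserved + rng.normal(0, 0.5, 2 * n)

# A bare-bones "VAM": current score on prior score plus a teacher indicator.
X = sm.add_constant(np.column_stack([prior, teacher1]))
fit = sm.OLS(score, X).fit()
print(fit.params)   # the teacher-1 coefficient is clearly negative
```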

The same argument applies to other core assumptions of VAMs, such as the absence of "peer effects" (the assumption that neither students’ peers nor teachers’ colleagues affect performance) and the "interval scaling" of test scores (e.g., the difference between a score of 20 and 30 is the same as that between 70 and 80). Both assumptions often do not hold up in practice and, again, affect some teachers’ value-added scores more than others’. Tolerating this set of flaws is also a human decision.
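The interval-scaling issue is easy to demonstrate with the example above. In the sketch below, the rescaling function is purely hypothetical, but it is monotone – it never reorders students – yet it turns two equal raw gains into unequal ones:

```python
import numpy as np

def rescale(x):
    # Hypothetical nonlinear score conversion that compresses the top
    # of the scale (invented purely for this illustration).
    return 100 * np.sqrt(x / 100)

# Student A moves 20 -> 30; student B moves 70 -> 80: equal raw gains.
for student, (pre, post) in {"A": (20, 30), "B": (70, 80)}.items():
    print(student, "raw gain:", post - pre,
          "| rescaled gain:", round(rescale(post) - rescale(pre), 1))
# Output: A gains about 10.1 rescaled points, B only about 5.8, so
# which teacher "added more value" depends on a scaling choice.
```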

And finally, we must also grapple with the issue of how VAMs are actually used in evaluation and/or pay systems, which relies almost completely on human judgments about what is effective and reasonable, given the model’s bias, error rates, and other limitations.

Now, I am not suggesting that we should regard VAM estimates as subjective. They are not, and it is certainly fair to say that they are considerably more objective than observations. But calling VAMs "objective measures" without qualification, and contrasting them with "subjective" observations, is misleading, at least to a non-technical audience. The models – and the way they are used – are laced with human judgments at almost every level.

Furthermore, the objective/subjective distinction might imply to some people that observations are somehow inherently inferior, even though – by almost universal agreement – a well-designed observation protocol is an essential component of effective teacher evaluation. In this sense, the objective/subjective distinction is not only misleading, but fundamentally useless. After all, if objectivity alone were our goal, there could be no criticism of seniority, which is about as objective as a measure gets.

VAMs have the potential to become a very useful tool in the struggle to improve U.S. education, but they must be used responsibly – with an eye toward their limitations as well as their strengths. People must be aware of these shortcomings. Uncritically placing VAMs on an objectivity pedestal just impedes their potential usefulness on the ground.


Clearly, VAMs and classroom observations by principals and peers measure very DIFFERENT aspects of teaching. A clear explanation of what VAM estimates capture and what classroom observations represent is indeed needed. Both are meaningful and useful, and they are NOT THE SAME. A policymaker may have preferences among the measures to implement in a school district, and may put higher weight on one or the other depending on her priorities – that is subjective!

You are raising important issues to be considered when adopting VAMs. Yes, these arguments are valid in general.

But the crucial point of your objective/subjective argument is whether the same metric, free of human judgment, is applied to make meaningful inferences. Using the exact same evaluation protocol, principals and peers can rate a particular teacher differently according to their own understanding of the questions and their own perceptions of the observed teacher’s behavior. There can be a thousand principals or external/peer evaluators in a district. This is where human judgment enters the measurement, and that is why classroom observations are more likely to be subjective.

Similarly, letter grades (As, Bs, Cs, etc.) and GPAs are student learning metrics, but they vary across teachers and schools. One teacher in one school may find a particular essay worth an A+, while another teacher (in the same or a different school) may find the same essay worth a B-. This is an example of a metric subject to human judgment, and it does not permit meaningful comparisons across teachers and schools. In that respect, it is more likely to be subjective.

Now, if a policy maker/researcher/someone is interested in meaningful comparisons across students, teachers, and schools within a district or a state, one would need to use the most objective measures one can conceive – those subject to the least human judgment. Student test scores are an example. Rather than comparing the number of A students in a district, a more meaningful measure would use a common test score. Yes, there are many issues with tests. However, tests are more likely to be objective indicators of what they are designed to test (specific items such as fractions, or broader problem-solving skills). One of the main reasons for not including a writing section in standardized achievement tests is to eliminate the need for humans to grade it – thus the extensive use of multiple-choice questions.

You are absolutely right that we need to be careful with the language – a VAM estimate of a teacher fixed effect is not equivalent to the broader notion of teacher effectiveness. And paying attention to issues of VAM choice and implementation is extremely important. But none of your arguments makes VAMs less objective.