Thoughts On Using Value Added, And Picking A Model, To Assess Teacher Performance

Our guest author today is Dan Goldhaber, Director of the Center for Education Data & Research and a Research Professor in Interdisciplinary Arts and Sciences at the University of Washington Bothell.

Let me begin with a disclosure: I am an advocate of experimenting with using value added, where possible, as part of a more comprehensive system of teacher evaluation. The reasons are pretty simple (though articulated in more detail in a brief, which you can read here). The most important reason is that value-added information about teachers appears to be a better predictor of future success in the classroom than other measures we currently use. This is perhaps not surprising when it comes to test scores, certainly an important measure of what students are getting out of schools, but research also shows that value added predicts very long run outcomes, such as college going and labor market earnings. Shouldn’t we be using valuable information about likely future performance when making high-stakes personnel decisions?

It almost goes without saying, but it’s still worth emphasizing, that it is impossible to avoid making high-stakes decisions. Policies that explicitly link evaluations to outcomes such as compensation and tenure are new, but even in the absence of such policies that are high-stakes for teachers, the stakes are high for students, because some of them are stuck with ineffective teachers when evaluation systems suggest, as is the case today, that nearly all teachers are effective.

I also believe the use of value added has helped drive a much deeper conversation about evaluating teachers, a conversation that would not be occurring were it not for the value added “threat,” or, perhaps more to the point, the threat associated with using a system that forces differentiated performance measures. Put another way, value added may be a key catalyst for broader changes to teacher evaluation. Finally, it’s clearly a judgment call, but I believe the bar for experimenting with alternatives to today’s evaluation systems is pretty low, given that most of them fail to recognize the big difference between the most and least effective teachers, a difference that both casual observation and statistical analysis show exists.

But beyond whether to use value added is the question of what approach ought to be implemented. In very general terms, the idea behind value-added models is the translation of test-based measures of student achievement growth into a gauge of teacher performance; these models, however, come in a variety of different flavors, meaning policymakers employing them have to make choices. I won’t go into too much detail here about the different statistical approaches (for more on the nitty-gritty details, go here), but the choice of statistical model does, at least in some cases, have consequences for how teacher performance is judged. Moreover, these implications can be masked by very high correlations in teacher effectiveness rankings when comparing rankings for the teacher workforce (covered by value added) as a whole.

As an example (more detail about the comparisons across different types of models can be found here), value added models and student growth percentile models, the two most common general types of models being used in teacher evaluations today, are correlated at over 0.90 -- a very strong relationship -- even though they differ substantially in terms of how, and the extent to which, they account for differences in students’ backgrounds (put very simply, value-added models, unlike most student growth percentile models being used by states, often control directly for student characteristics such as free/reduced-price lunch eligibility).

In other words, these two models strongly agree with each other in the teacher rankings they yield. Yet, despite this, we see that the value added models that include student background adjustments tend to show teachers responsible for instructing more relatively disadvantaged students as performing better than the growth percentile models (and the converse is also true). Put differently, teachers with relatively high proportions of disadvantaged students tend to get better ratings from value-added than from growth percentile models. It also appears to matter a great deal whether comparisons of teacher effectiveness are made within schools or both within and between schools.
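The arithmetic behind this point, that two sets of ratings can correlate above 0.90 and still differ systematically for one group of teachers, can be illustrated with a small simulation. This is only a sketch on synthetic data: the scores, the 0-to-1 "share of disadvantaged students," and the `penalty` parameter are all assumptions made up for illustration, and neither score is an actual value-added or student growth percentile estimate. In particular, the sketch simply assumes that the unadjusted score shifts downward as the share of disadvantaged students rises; it says nothing about which model is "right."

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def ranks(scores):
    """Rank 1 = lowest score, rank n = highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result


def simulate(n_teachers=500, penalty=1.0, seed=0):
    """Synthetic teachers scored two ways (all numbers are assumptions)."""
    rng = random.Random(seed)
    # Share of disadvantaged students in each teacher's classroom.
    share = [rng.random() for _ in range(n_teachers)]
    # Teacher effect plus estimation noise, common to both scores.
    base = [rng.gauss(0, 1) + rng.gauss(0, 0.3) for _ in range(n_teachers)]
    # "Adjusted" score: assumed to be unaffected by classroom composition.
    adjusted = list(base)
    # "Unadjusted" score: assumed to fall as the disadvantaged share rises.
    unadjusted = [b - penalty * (s - 0.5) for b, s in zip(base, share)]
    return share, adjusted, unadjusted


share, adjusted, unadjusted = simulate()
corr = pearson(adjusted, unadjusted)

# Positive gap = the teacher ranks higher under the adjusted score.
rank_gap = [ra - ru for ra, ru in zip(ranks(adjusted), ranks(unadjusted))]
high = [g for g, s in zip(rank_gap, share) if s >= 0.75]
low = [g for g, s in zip(rank_gap, share) if s <= 0.25]
mean_gap_high = sum(high) / len(high)
mean_gap_low = sum(low) / len(low)

print(f"correlation between the two scores: {corr:.2f}")
print(f"mean rank gap, high-disadvantage classrooms: {mean_gap_high:+.1f}")
print(f"mean rank gap, low-disadvantage classrooms:  {mean_gap_low:+.1f}")
```

The overall correlation summarizes agreement across the whole simulated workforce, which is why it can stay high even when the rank gap is consistently positive for one group of teachers and consistently negative for another.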

So, what can be made of this? What is the “right” model?

Unfortunately, it is quite difficult to know from a statistical standpoint. Differences in teacher rankings according to the kinds of students taught might reflect bias in the model, but they might also reflect true differences in teacher quality across different kinds of students (e.g. disadvantaged students tend to have less experienced and less credentialed teachers, and, by some estimates, teachers with lower value added), or limitations of the way that a model adjusts for students’ backgrounds. Likewise, differences in teacher rankings associated with whether a model compares teachers only within schools or within and between schools could be based on school-level factors (e.g. the environment created by principals) or on real differences in the distribution of teacher quality across schools.

(As an important aside, while modeling dilemmas are being vetted thoroughly in the case of value added, these same fundamental issues arise for any means of evaluating teachers, such as when it comes to picking an observational rubric to use.)

But the statistical standpoint is not the only relevant perspective here. Part of the reason that there is no “right” or “wrong” answer when it comes to model choice is that we ultimately care about how the use of a particular model affects the quality of the teacher workforce and, hence, student learning. This is likely to depend on using a model that produces reasonably valid estimates of teacher effectiveness, i.e., a model that yields causal estimates of a teacher’s contribution to student learning and that is suitable for the potential uses of performance evaluations, such as deciding on tenure or dismissals.

We also care about how teachers (and prospective teachers) perceive a model and how that might affect their behavior. The best model from a statistical validity standpoint might, for instance, not be the model that teachers trust, which could affect their motivation to change their practices. The bottom line is that we can’t know the “right model” up front because we do not know how teachers will react to these estimates’ use in performance evaluations. Fortunately, the experimentation with different models will afford us the opportunity to learn more about this issue over time.

In the meantime, my view is that part of the process of model adoption ought to involve policymakers applying different models to their data in order to make the differences in teacher rankings explicit to stakeholders, explaining the reasons for those differences, and hopefully getting buy-in upfront for the model that is adopted. This type of transparency would no doubt give some ammunition to those who oppose using value added. But showing that some teacher ratings may change according to the specific model used is not conceptually different than finding that teachers may be judged to be different under different rubrics used for classroom observation.
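One simple way to make such differences explicit is a side-by-side table of each teacher's rank under two models, ordered so that the largest changes appear first. The sketch below is a minimal illustration, not a description of any state's or district's procedure: the function names, the teacher labels ("T1" through "T4"), and the scores are all hypothetical.

```python
def rank_descending(scores):
    """Map each teacher to a rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=lambda teacher: scores[teacher], reverse=True)
    return {teacher: rank for rank, teacher in enumerate(ordered, start=1)}


def compare_models(scores_a, scores_b):
    """List (teacher, rank under A, rank under B, change), biggest changes first.

    Only teachers rated by both models are compared. A positive change
    means the teacher sits lower in the ordering under model B.
    """
    common = sorted(set(scores_a) & set(scores_b))
    rank_a = rank_descending({t: scores_a[t] for t in common})
    rank_b = rank_descending({t: scores_b[t] for t in common})
    rows = [(t, rank_a[t], rank_b[t], rank_b[t] - rank_a[t]) for t in common]
    rows.sort(key=lambda row: abs(row[3]), reverse=True)
    return rows


# Hypothetical scores from two hypothetical models.
model_a = {"T1": 0.42, "T2": 0.10, "T3": -0.05, "T4": -0.30}
model_b = {"T1": 0.35, "T2": -0.10, "T3": 0.12, "T4": -0.28}

report = compare_models(model_a, model_b)
for teacher, rank_in_a, rank_in_b, change in report:
    print(f"{teacher}: rank {rank_in_a} under A, rank {rank_in_b} under B ({change:+d})")
```

A table like this shows who moves and by how much, but not why; explaining the reasons for the differences still requires looking at what each model does and does not adjust for.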

In other words, value added, like any other system of evaluation, will be an imperfect measure of true performance (which is never observed). Thus, I come full circle and conclude that when it comes to using value added, the question is not whether it is “right” in some absolute sense, but rather: Does it provide us with more or better information about teachers than the other feasible systems for evaluating them?

- Dan Goldhaber


I'm unsubscribing today. I usually read all the posts on here and get some good information and some that I disregard. This is so anti-teacher & student, I'm offended. There are way too many "think tank" top down models but few get input from real educators. Good statistical luck!
Karen Walter, Ph.D. Candidate
Science Teacher & Adjunct Professor

This is the best post I have read here in a long time.

I would be interested in which parts Karen Walter thinks are anti teacher and student, because I can't identify any in what I think is a very balanced post that talks about the importance of incorporating teacher input into the choice of growth models.

The right answer is that measuring teacher performance by any value added method is not possible. The use of value added measures in high stakes decision making is bad policy and the continued use of it will lead to more litigation and the improper dismissal of educators.

When we are willing to invest in training administrators and giving them the time to do their job as evaluators, then we can expect them to work to coach struggling educators up or out. In addition, peer assistance and review programs need to be expanded and implemented on a nationwide basis.

Internationally, value added is not being used to evaluate and punish teachers. Collaboration and peer assistance are the key to instructional improvement.

As a social scientist, I am sympathetic to the search for tools that provide more information and understanding. But from a policy perspective as a parent advocate, I'm just not sure what VAM adds to the conversation. Perhaps the best that can be said for it is that VAMs offer the opportunity to have a more systematically controlled measure to match up to observational evaluations.

But the main problem is that the VAM results, in and of themselves, do not tell us what is wrong or how to fix it. The current policy debate, which tends to use VAM as a "silver bullet" solution for "getting rid" of "bad" teachers and "rewarding" "good" ones, either assumes that people are inherently good or bad teachers, or that the only way of changing that is to provide strong incentives for more work effort.

This more closely resembles evaluating workers by how many bricks they can carry rather than treating teachers as professionals (human capital) who can be helped to develop.

Further problems arise from the very narrow set of tests used to calculate VAM scores, which cover only a portion of the domain teachers are responsible for covering. Being a less than stellar math teacher could get you fired, but being a fantastic social studies teacher goes unnoticed. Since the plurality of classroom teachers are usually in elementary classrooms where they are responsible for a wide range of subjects, the narrowness of the measures used to calculate VAM not only makes it of questionable value but undermines any chance that it will be seen as legitimate by those on whom it is used.

Right now, the emphasis of so-called "reform" measures has been on sorting people: teachers, students, administrators. Left largely unaddressed is, to me, the much more important question of how to make things better. It's ironic that policies which claim to be preparing our schools for the future are in fact closely tied to a mindset developed in the early years of the industrial age.