Value-Added Versus Observations, Part Two: Validity

In a previous post, I compared value-added (VA) and classroom observations in terms of reliability – the degree to which they are free of error and stable over repeated measurements. But even the most reliable measures aren’t useful unless they are valid – that is, unless they’re measuring what we want them to measure.

Arguments over the validity of teacher performance measures, especially value-added, dominate our discourse on evaluations. There are, in my view, three interrelated issues to keep in mind when discussing the validity of VA and observations. The first is definitional – in a research context, validity is less about a measure itself than the inferences one draws from it. The second point might follow from the first: The validity of VA and observations should be assessed in the context of how they’re being used.

Third and finally, given the difficulties in determining whether either measure is valid in and of itself, as well as the fact that so many states and districts are already moving ahead with new systems, the best approach at this point may be to judge validity in terms of whether the evaluations are improving outcomes. And, unfortunately, there is little indication that this is happening in most places.

Let’s start by quickly defining what is usually meant by validity. Put simply, whereas reliability is about the precision of the answers, validity addresses whether we’re using them to answer the correct questions. For example, a person’s weight is a reliable measure, but this doesn’t necessarily mean it’s valid for gauging the risk of heart disease. Similarly, in the context of VA and observations, the question is: Are these indicators, even if they can be precisely estimated (i.e., they are reliable), measuring teacher performance in a manner that is meaningful for student learning?

Needless to say, this is an exceedingly difficult – perhaps impossible – question to answer. Due in no small part to the availability of evidence, virtually all of the discussion about VA and observations proceeds based on a somewhat abstract, academic (though still very important) standard of validity.

In the case of VA, the standard might be whether the models provide relatively unbiased estimates of teachers’ causal effects on student test scores. And, it is well-established that there are problems here (though, all else being equal, random error [reliability] is arguably a bigger concern).

Namely, different value-added models can yield different results, as can different tests under the same model. Similarly, systematic bias in the estimates may arise from unmeasured factors such as curricular variation, peer effects, consistency of tested content or classroom assignments. There is little dispute that these (and other) issues exist, though there is disagreement as to their severity.

But, as is the case with reliability, the same problems can also apply to observations. You might say observations are valid if the protocols gauge the degree to which a teacher exemplifies the practices that promote student learning. It’s quite a challenge to know whether that is indeed the case. As is the case with value-added, different observation protocols yield different results. In addition, the assignment of students to certain teachers might very well influence the results of an observation.

Since all these indicators are ostensibly trying to measure a similar phenomenon (e.g., teacher performance), one common means of getting an idea as to the validity of a given measure is to see if it corresponds with other measures. And the available evidence suggests that there is a moderate relationship between VA estimates and observation scores (also here and here), and that both predict future student performance.*

Overall, to whatever extent we can draw conclusions about the validity of VA and observations by this "research-oriented" standard, it’s fair to say that both have their strengths and weaknesses, and the issue of validity is not exclusive to either.

But this doesn’t necessarily tell us very much about whether we should use one or both measures in actual evaluations, as it ignores the second key point I’d like to raise: In policy discussions, the assessment of a measure’s validity must also consider the purposes for which it is used. In other words, a given measure might be appropriate and/or useful for some types of decision and not others (this is related to the concept of “consequential validity”).

For example, using a performance indicator like value-added to target professional development might require much lower (or different) validity standards than using the estimates to make high-stakes decisions about compensation or employment, since, among other reasons, the costs of making mistakes are considerably lower in the former situation.

Also, even if VA models or observations were perfect, it’s still possible that they would be less than effective in improving teacher practice or quality, and that actual policy use might partially threaten whatever validity they have. For instance, it is possible that high-stakes use of VA will compel so-called “teaching to the test,” which dilutes the degree to which the scores reflect “true” student learning (observations are not immune from this type of bias either). On a similar note, teacher buy-in is important: An unpopular system might increase turnover/mobility, and/or make it less likely that teachers use the results to improve their practice.

From this perspective, the implications of the evidence discussed above, though important, are somewhat limited.**

This brings us to the third and final point: The debate over VA and observations in teacher evaluations is about policy, and, in this context, it may be that the best question to ask is not whether VA and observations are valid by some absolute Platonic standard (which we cannot determine), but rather whether the manner in which they are used improves outcomes. Put simply, it’s much easier to assess the effect of policies than the absolute validity of measures they use.

Unfortunately but predictably, given that we are still at an early stage, evidence regarding the effects of the kinds of new evaluation systems currently being designed and implemented is still a bit scarce, and is mostly limited to low-stakes applications. Nevertheless, these types of studies are very important for discussions of the validity of teacher evaluations or their constituent components.

For instance, this paper reaches the encouraging conclusion that high-quality observations can improve teacher performance (as measured by value-added) among mid-career teachers. An evaluation of low-stakes use of value-added in Pennsylvania (i.e., teachers and administrators used the data to inform instruction) found no effects on student achievement, though the program was limited in its scope, and there wasn’t sufficient time to train users. In contrast, a randomized experiment in New York City, where test-based accountability is better established, found that giving principals access to value-added did have a discernible impact – it influenced their “subjective” opinions of teachers’ performance.

We know almost nothing about the validity or effects of measures in teacher evaluations for high-stakes use. Given the importance of this issue, one would expect that states implementing new systems would be making sure to have rigorous, independent program evaluations in place as an essential part of their overall plan.

They should be gathering data and closely monitoring the process and its effects – both short- and long-term, on a variety of different outcomes – at every step. And, insofar as these projects might require at least several years of data (and interviews, etc.) to reach conclusions, this effort should begin as soon as the new systems go online.

I see little indication that this is happening with teacher evaluations. Personally, my biggest concern about all this is that we’re making drastic changes but are failing to see whether and why they work (or don’t work). Instead, the design and results of new evaluation systems are being judged based on unsupported preconceptions as to what they should look like, not whether they’re accurate or improving outcomes.

Despite all the rhetoric about the validity (or lack thereof) of value-added and observations, there is little if any support for certainty. We can either argue about whether the new systems are working, or we can check.

- Matt Di Carlo

*****

* In addition, it’s worth noting that VA has been partially validated in an experimental analysis, while there is recent evidence that teacher-induced test score improvements are associated with very small increases in future earnings, educational attainment and other outcomes.

** Much of the heated debate over evaluations, especially value-added, appears to stem from differences in beliefs as to the purpose of evaluating teachers, and how the final scores should be used in decisions. Those who view evaluations as formative tools – to be used to identify strengths and weaknesses in teacher practice – tend to be more skeptical toward value-added, which is less well-suited for these purposes than observations. Yet, even the most ardent opponents of VA in evaluations often acknowledge that VA might play a useful role in evaluations as a “trigger” for some form of corrective action, such as professional development. In fact, many of these opponents are even borderline receptive when presented with a hypothetical scenario in which VA scores comprise, say, 10 percent of a teacher’s final score. People understand validity is context-specific.

Thanks, Matt...awesome pair of posts that I'll be rereading. ;0)

I guess my one question/reaction is not just about the accuracy of the VA results as much as it is about what the student tests are assessing in the first place. Current standardized tests offer feedback on only a very thin slice of the learning outcomes we want for our kids. Seems like that's a whole 'nother part of the conversation that may actually have to come before we get to the VA and observation piece.

Really great post.
I think your paragraph about the high stakes use of VA compelling teaching to the test is a critical one. Not only does it result in a watered down and perverse curriculum, but it dilutes the data in itself. In other words, measuring something causes a change in that measurement, but also in the value of that measurement.
To me, the tragedy of this is not just that it causes teaching to the test. Teaching to the test would not be that terrible if the test were well designed and supported by a good curriculum. I teach to "the test" all the time, but sometimes "the test" is a presentation, an essay test, or a long paper. I think supporters of the use of VAM hear "teach to the test" and think to themselves "Why is that so bad? If the test assesses whether you can read, then why is teaching to the test such an awful thing? Kids need to know how to read." Needless to say, they don't spend the time to look at what the test assesses. Most do not understand the difference between different elements of reading (such as decoding and vocabulary).
But when we look at what happens when we use high stakes metrics (and this is true in any field where there are high stakes metrics) we get a narrowing of pedagogy and conservative strategies. Much better to narrow your pedagogy and game the checklist than take a risk with innovation, or with skills or knowledge that may have long term benefits rather than benefits at the end of this year.
So we then have the problem that teachers are just teaching to this year's test, not next year's, or the one ten years from now.
Part of me can't understand why the business community that supports VA doesn't realize that this is exactly the same situation that perverts American business and stifles innovation, which is one of the last comparative advantages that we have in today's global economy. Enron's focus on quarterly stock price led them to game that short-term metric, not caring about the long-term value of the company. How can businessmen celebrate long-term visionaries like Steve Jobs or Bill Gates or any number of internet innovators while we go about discouraging, in our education system, the conditions that make such long-term thinking possible?
Charter schools or vouchers aren't magical innovation machines if they are held to the same high-stakes testing standards. And if the high stakes are the problem, why not free our public school teachers to innovate themselves?

Thanks for your diligent, thoughtful treatment of this topic, Matt. There is much that is troubling about value-added’s rapid rise in public policy, not the least of which is that few seem to have paused and asked some very fundamental (and seemingly obvious) questions about the reliability and validity of the measure. But, hey, let’s not let a lack of supporting empirical evidence thwart the advancement of a popular political agenda.

An examination of value added gain scores for more than 7,296 cohorts of students in Ohio for 2009-10 and 2010-11 and the change in the VA gain for those cohorts from one year to the next yielded the following:

Correlations:
1. VA gain in 2009-10 is significantly and negatively correlated to VA gain in 2010-11 (r= -0.351, p<.01). 57.7% of positive VA gain scores for a cohort in 2009-10 yielded negative VA gain scores for the same cohort in 2010-11 (2,074 out of 3,593). 61.1% of negative VA gain scores for a cohort in 2009-10 yielded positive VA gain scores for the same cohort in 2010-11 (2,262 out of 3,697).

2. VA gain in 2009-10 is significantly and negatively correlated to the change (increase/decrease) in VA gain in 2010-11 (r= -0.832, p<.01). 78.7% of positive VA gain scores for a cohort in 2009-10 yielded smaller VA gain scores for the same cohort in 2010-11 (2,831 out of 3,593). 80.9% of negative VA gain scores for a cohort in 2009-10 yielded larger VA gain scores for the same cohort in 2010-11 (2,994 out of 3,697).

Accordingly, positive VA gain scores for a cohort in 2009-10 will almost certainly decrease for the same cohort in 2010-11, and negative VA gain scores for a cohort of students in 2009-10 will almost certainly increase for the same cohort in 2010-11. Similar analysis of value added gain scores from 2005-06, 2006-07, 2007-08, and 2008-09 revealed nearly identical results.

Given the stature of value-added as an accountability metric in Ohio that is about to be elevated to a teacher effectiveness, evaluation, and compensation metric, I am alarmed by empirical evidence that suggests that instead of teacher effectiveness/impact on learning, what is really on display here is little more than regression to the mean.
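To make the regression-to-the-mean point concrete, here is a minimal sketch in Python. It is a toy model, not a re-analysis of the Ohio data: it assumes every cohort has zero true gain and that each year's score carries independent noise, so a "gain" is just the difference between two noisy scores. The function names, the cohort count and the noise level are all hypothetical. Under those assumptions the year-to-year gain correlation works out to -0.5 analytically, and the correlation between a gain and its subsequent change to about -0.87, so strongly negative figures can arise from noise alone.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_noise_only(n_cohorts=7000, noise_sd=1.0, seed=0):
    """Simulate cohorts whose true gain is zero (an assumption).

    Each cohort gets three noisy score levels; gains are differences
    between consecutive levels, so both gains share the middle score.
    Returns (corr(gain1, gain2), corr(gain1, gain2 - gain1)).
    """
    rng = random.Random(seed)
    gain1, gain2 = [], []
    for _ in range(n_cohorts):
        e0, e1, e2 = (rng.gauss(0.0, noise_sd) for _ in range(3))
        gain1.append(e1 - e0)  # first-year "gain": pure noise
        gain2.append(e2 - e1)  # second-year "gain": shares e1 with the first
    change = [g2 - g1 for g1, g2 in zip(gain1, gain2)]
    return pearson(gain1, gain2), pearson(gain1, change)


if __name__ == "__main__":
    r_gain, r_change = simulate_noise_only()
    print(f"corr(gain year 1, gain year 2):   {r_gain:.3f}")
    print(f"corr(gain year 1, change in gain): {r_change:.3f}")
```

A result like this does not show that the Ohio gains are pure noise. It only shows that negative correlations of this kind are what one would expect whenever measurement error is large relative to real differences between cohorts.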

Very interesting and informative post. It raises more questions than answers but contributes to the dialogue over the effectiveness of classroom evaluations. A recent book (review below) has the potential to move the evaluative process into the realm of professional development, where I believe there is a strong likelihood of improving not only instruction but also student achievement. Review follows:

A Value Added Decision: To Support the Delivery of the Common Core Standards by Maria C. Guilott and Gaylynn Parker. Outskirts Press, 2012 (available on Amazon and as an E-Book)

This book goes beyond its title to offer a “Values Added” aspect to bringing the cost down on staff development. In most school districts in the country the budget for professional development fosters school board knife-sharpening. “Why not cut here? After all, the staff we hired are trained professionals, why spend more money on keeping up their skills when they can very well pay for it themselves through additional course work?”

Guilott and Parker have offered a sensible, dynamic and focused staff development program through this little gem of a book. Here they create a method for principals to achieve a status as instructional leaders in their respective buildings instead of being seen as evaluators of something they know nothing about. Let’s face it. The faculty rooms of America are filled with teachers who resent being observed and evaluated by people who never taught their subject or who left the classroom because they were not good in the classroom. Good teachers know good teachers and they know who was not a good teacher.

But this system of professional development partners an administrator with other excellent teachers in their buildings; it calls for frank observation of learning and analysis of why a particular method employed increased the learning possibility for students. Notice, the target is the learning, not the teaching. A team of one administrator and a few teacher colleagues visit a classroom and focus on the kids learning, not the teacher talking. These “walk-throughs” are not evaluative; they are learning experiences for the observing teachers who join the principal on a journey through the process of good teaching and increased student achievement. The principal acts as a guide helping her staff see good learning experiences happening right in their own buildings. As Grant Wiggins has noted, this form of “look-for,” “…Puts the camera on the players instead of the coach.” Guilott and Parker’s approach puts the camera on the learners learning rather than the teacher covering stuff. The experience ends with the likelihood that the teachers experiencing the CLW (Collegial Learning Walks) will begin to make similar changes in their own practice. The process foresees a different type of professional dialogue rather than the typical faculty room talk complaining about the frustrations of dealing with kids.

Bringing the CLW approach to instructional leadership fosters action research on the part of the participants. It encourages a new relationship between administrator and teachers. It provides small and large school districts with an inexpensive reform process that can lead to increased student achievement. Finally, it meets the single most important criterion for professional improvement voiced in all the research done on instructional improvement… it allows teachers to see other teachers doing a good job; teachers want to learn from other good teachers. This little book is one of the most powerful instruments to bring about positive change in the classroom. If implemented in the prescribed manner, this process has the potential to help the American educational system turn that long awaited corner discussed in all the journals and in the media. It is a simple, elegant idea leading the average principal into the arena of truly becoming an instructional leader in his or her building. This book is a must read for teachers, principals, union leaders and superintendents. This is at the heart of a well-planned reform process that can begin at the building level. It reflects Jay McTighe’s suggestions to all school leaders who want to have their vision of professional growth flourish. This system calls for the leaders to, “Think big, act small and go for an early win in Iowa.”

In Washington, DC public schools (DCPS) value added is part of the formula that determines a teacher's Impact score, and, as such, directly affects both compensation and continued employment. Please contemplate the implications of the following inequity:

At my school, there is one first grade classroom, one second grade classroom, and one combination first/second grade classroom. In assigning children to the combination class, the rationale seems to have been to combine the lowest performing second graders with a random group of first graders. Of the ten second graders in this class, six are English language learners and/or special education students. Not surprisingly, this group consistently makes a poor showing on the standardized tests administered five times a year, while the other second grade (with no special needs students, and the top five first graders from the year before) consistently outperforms the district average. As a testing grade, second grade outcomes carry weight. At the end of the year the value added components of each teacher's professional evaluation will be vastly different, not because of any glaring disparity of skill or effort on their parts, but because of budget/enrollment factors and administrative decisions.

I wonder if anyone can argue convincingly that this is a fair or valid use of test scores to evaluate teacher performance, or if it gives any useful information about the relationship between teacher competence and student outcomes.