What Value-Added Research Does And Does Not Show

Value-added and other types of growth models are probably the most controversial issue in education today. These methods, which use sophisticated statistical techniques to attempt to isolate a teacher’s effect on student test score growth, are rapidly assuming a central role in policy, particularly in the new teacher evaluation systems currently being designed and implemented. Proponents view them as a primary tool for differentiating teachers based on performance/effectiveness.

Opponents, on the other hand, including a great many teachers, argue that the models’ estimates are unstable over time, subject to bias and imprecision, and that they rely entirely on standardized test scores, which are, at best, an extremely partial measure of student performance. Many have come to view growth models as exemplifying all that’s wrong with the market-based approach to education policy.

It’s very easy to understand this frustration. But it's also important to separate the research on value-added from the manner in which the estimates are being used. Virtually all of the contention pertains to the latter, not the former. Actually, you would be hard-pressed to find many solid findings in the value-added literature that wouldn't ring true to most educators.

For example, the most prominent conclusion of this body of evidence is that teachers are very important, that there's a big difference between effective and ineffective teachers, and that whatever is responsible for all this variation is very difficult to measure (see here, here, here and here). These analyses use test scores not as judge and jury, but as a reasonable substitute for "real learning," with which one might draw inferences about the overall distribution of "real teacher effects."

And then there are all the peripheral contributions to understanding that this line of work has made, including (but not limited to):

That experience does matter;
That the quality of peers affects teacher performance;
That teachers perform differently in different schools;
And that students’ backgrounds explain more of the variation in their performance than school-related factors.

Prior to the proliferation of growth models, most of these conclusions were already known to teachers and to education researchers, but research in this field has helped to validate and elaborate on them. That’s what good social science is supposed to do.

What this body of research does not show, however, is that it’s a good idea to use value-added and other growth model estimates as heavily weighted components in teacher evaluations or other personnel-related systems. There is, to my knowledge, not a shred of evidence that doing so will improve either teaching or learning, and anyone who says otherwise is misinformed. It's an open question.*

As has been discussed before, there is a big difference between demonstrating that teachers matter overall – that their test-based effects vary widely, and in a manner that is not just random – and being able to accurately identify the “good” and “bad” performers at the level of individual teachers. Frankly, to whatever degree the value-added literature provides tentative guidance on how these estimates might be used productively in actual policies, it suggests that, in most states and districts, it is being done in a disturbingly ill-advised manner.

For instance, the research is very clear that the scores for individual teachers are subject to substantial random error, which can be mitigated with larger samples (teachers who have taught more students), and to systematic bias, which larger samples do not fix. Yet most states have taken no steps to account for random error when incorporating these estimates into evaluations, nor have any but a precious few set meaningful sample size requirements. These omissions essentially ensure that many teachers’ scores will be too imprecise to be useful, and that most teachers’ estimates will be, at the very least, interpreted improperly.
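To make the sample size point concrete, here is a minimal sketch in Python. It is not an actual value-added model: it assumes, purely for illustration, that a teacher's estimate is just the average test score gain of his or her students, and that each student's gain is the teacher's true effect plus independent random noise. The function names and the numbers (a true effect of zero, a student-level standard deviation of 1.0, class sizes of 10, 25 and 100) are hypothetical. Under those assumptions, the random error in the estimate shrinks roughly with the square root of the number of students.

```python
import math
import random
import statistics


def simulate_estimates(true_effect, n_students, student_sd, n_trials, seed=0):
    """Simulate many estimates for one hypothetical teacher.

    Each estimate is the mean gain of n_students, where every student's
    gain is the true effect plus normally distributed noise (an assumed,
    highly simplified stand-in for a real growth model).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_trials):
        gains = [rng.gauss(true_effect, student_sd) for _ in range(n_students)]
        estimates.append(statistics.fmean(gains))
    return estimates


def theoretical_se(student_sd, n_students):
    """Standard error of a mean of n independent draws: sd / sqrt(n)."""
    return student_sd / math.sqrt(n_students)


if __name__ == "__main__":
    # Compare the spread of simulated estimates with the formula
    # for a few hypothetical numbers of students.
    for n in (10, 25, 100):
        estimates = simulate_estimates(
            true_effect=0.0, n_students=n, student_sd=1.0, n_trials=2000
        )
        spread = statistics.stdev(estimates)
        print(f"n={n:>3}  simulated spread={spread:.3f}  "
              f"formula={theoretical_se(1.0, n):.3f}")
```

In this toy setup, a teacher with 10 students gets an estimate that bounces around about three times as much as one with 100 students, even though the true effect is identical. That is the practical argument for minimum sample size requirements, and for reporting error margins alongside the scores.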

The evidence is also clear that different growth models yield different results for the same teacher, yet some states have chosen models that are less appropriate for identifying teachers’ causal effects on testing outcomes.

Finally, making all of this worse, most states are mandating that these scores count for 40-50 percent of tested teachers’ evaluations without any clue what the other components (e.g., observations) will be and how much they will vary. Many have even violated the most basic policy principles – for example, by refusing to mandate a pilot year before full implementation.

For these (and other) reasons, opponents of value-added have every reason to be skeptical of the current push to use these estimates in high-stakes decisions, and of the clumsy efforts among some advocacy organizations to erect an empirical justification for doing so. Not only is there no evidence that using these measures in high-stakes decisions will generate improvements, but the details of how it’s being done are, in most places, seemingly being ignored in a most careless, risky fashion. It's these details that will determine whether the estimates are helpful (and, used properly, they have a lot of potential).

Nevertheless, those who are (understandably) compelled to be dismissive or even hostile toward the research on value-added should consider that this line of work is about understanding teaching and learning, not personnel policies. Research and data are what they are; it’s how you use them that matters. And it's unfortunate that many self-proclaimed advocates of "data-driven decisionmaking" seem more interested in starting to make decisions than in the proper use of data.

- Matt Di Carlo

*****

* It’s true that, in many cases, the researchers provide concrete policy recommendations, but they tend to be cautious and flanked by caveats. Moreover, there are many papers in the field that do address directly the suitability of value-added estimates for use in concrete policies such as tenure decisions and layoffs, but the conclusions of these analyses are typically cautious and speculative, and none can foresee how things will play out in actual, high-stakes implementation.


I couldn't agree more! Unfortunately my state is jumping on the bandwagon, and I can't help but wonder how we can afford to implement all of these new things but can't afford the basics, like substitutes or textbooks.
Ashleigh's Education Journey

Excellent post. To continue, I recommend the post and the comments here:

http://schoolfinance101.wordpress.com/2011/09/02/take-your-sgp-and-vami…

Too many of those who champion VAM or SGP or whatever are far too cavalier in setting restrictions on their use.

Great post.

Unfortunately, decisions on VAM are left to politicians with well-funded constituencies and ideological blinders.

The author wrote, "there’s a big difference between effective and ineffective teachers, and that whatever is responsible for all this variation is very difficult to measure".

Not so, in high school and middle school science and math. Research finds that the teaching METHODS are the most important factor in student learning! Effectiveness is easy to measure.

Two examples illustrate this.

1. Physics education research of David Hestenes, at Arizona State University. Modeling Instruction in K-12 science was developed from his research. Student achievement is typically double that of conventional instruction, as measured by RESEARCH-BASED CONCEPT INVENTORIES. Instead of relying on lectures and textbooks, Modeling Instruction emphasizes active student construction of conceptual and mathematical models in an interactive learning community. Students are engaged with simple scenarios to learn to model the physical world. Models reveal the structure of the science and are sequenced in a coherent story line. They form a foundation for problem-solving and they guide projects. For more information, please contact Jane.Jackson@asu.edu. http://modeling.asu.edu

2. The TIMSS 1999 Video Study of Stigler and Hiebert. It supports principles of Modeling Instruction. This was a study of eighth-grade mathematics and science teaching in seven countries. The study involved videotaping and analyzing teaching practices in more than 1000 classrooms. They found that high-achieving nations, AS MEASURED BY TIMSS TESTS, engage students in searching for patterns & relationships, and in wrestling with key science & math concepts. Unfortunately, in the U.S. (which scored low on TIMSS), content sometimes plays no role at all; instead, science lessons engage students in a variety of activities, and math focuses on low-level skills: procedures rather than conceptual understanding, in unnecessarily fragmented lessons. See http://timssvideo.com/timss-video-study. In particular,
http://timssvideo.com/sites/default/files/Closing the Teaching Gap.pdf

REFERENCES:
Hestenes, D. (2000). Findings of the Modeling Workshop Project (1994-2000) (from Final Report submitted to the National Science Foundation, Arlington, VA). Note: the effect size was later calculated as 0.91; high! http://modeling.asu.edu/R&E/Research.html

Wells, M., Hestenes, D., and Swackhamer, G. (1995). A Modeling Method for High School Physics Instruction, Am. J. Phys. 63, 606-619
http://modeling.asu.edu/R&E/Research.html

Expert Panel Review (2001): Modeling Instruction in High School Physics. (Office of Educational Research and Improvement. U.S. Department of Education, Washington, DC)
http://www2.ed.gov/offices/OERI/ORAD/KAD/expert_panel/newscience_progs…

WEB RESOURCE:
http://fnoschese.wordpress.com/modeling-instruction/

Good post. I've been impressed with your blogging, though I often disagree.

Random - what's with the commenter above, who seems to scour the web to cut and paste the exact same comment? Weird.