Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an accessible review of the technical and practical issues surrounding these models.
This past November, I wrote a post for this blog about shifting course in the teacher evaluation movement and using value-added as a “screening device.” This means that the measures would be used: (1) to help identify teachers who might be struggling and for whom additional classroom observations (and perhaps other information) should be gathered; and (2) to identify classroom observers who might not be doing an effective job.
Screening takes advantage of the low cost of value-added and the fact that the estimates are more accurate for assessing general performance patterns across teachers, while avoiding the weaknesses of value-added: the measures are often inaccurate for individual teachers, and teachers find them confusing and not very credible when they are used for high-stakes decisions.
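As a minimal sketch of the first screening use, the idea reduces to flagging the bottom slice of value-added estimates for additional classroom observation; the estimate triggers a closer look but enters no final rating. The teacher labels, scores, and 20% cutoff below are all hypothetical:

```python
# Sketch of value-added as a screening device: flag the teachers with the
# lowest value-added estimates for additional classroom observation. The
# estimate triggers a closer look but enters no final rating. Teacher
# labels, scores, and the 20% cutoff are all hypothetical.

def flag_for_observation(va_scores, bottom_fraction=0.2):
    """Return the teachers in the bottom slice of value-added estimates."""
    n_flagged = max(1, int(len(va_scores) * bottom_fraction))
    ranked = sorted(va_scores, key=va_scores.get)
    return ranked[:n_flagged]

scores = {"T1": 2.4, "T2": 0.6, "T3": 1.9, "T4": 3.1, "T5": 1.1}
print(flag_for_observation(scores))  # ['T2']
```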
I want to thank the many people who responded to the first post. There were three main camps.
One group seemed very supportive of the idea. For example, Carol Burris, a principal who has been active in challenging the system in New York State, was among the most vocal supporters.
A second group of commenters was critical mainly because I did not go further and recommend eliminating all uses of value-added measures in teacher evaluation. I think that would be going too far: the screening approach avoids the models’ main weaknesses while still serving the useful purpose of pushing forward the Obama Administration’s call for the development of effective teacher evaluation and feedback systems (more on this below).
Of those who wanted to go further in scaling back value-added, some argued that we should curtail student testing more generally. Count me as one who thinks testing is going too far, especially in states like Florida, which have decided to test every student regardless of age or course subject. As I wrote in my book on value-added, we should at least find some evidence that this approach to teacher evaluation works in grades and subjects that are already tested before expanding it further. (As an aside, I started writing this post from Asia, where school officials tell me they are continuing to reduce the amount of testing, though their exams remain extremely high-stakes.) In my proposal, however, I was just taking the testing world as it is today and trying to come up with better uses.
There were also some responses that raised more policy questions. Chad Aldeman’s piece on the Education Sector blog suggested that the screening approach I proposed was already possible within the Race to the Top (RTTT), Teacher Incentive Fund (TIF), and ESEA waivers, all of which require that student growth or value-added be a “significant factor.” What constitutes a “significant factor,” however, is unclear. The screening idea could be used in combination with composite measures, but in its “pure form,” on which I’ll focus here, value-added is not part of the composite measure. Rather, the focus is on ensuring an effective evaluation process, but there is no role for value-added in final decisions. It’s certainly not obvious that this is compatible with the “significant factor” language.
I consulted several colleagues with recent experience working in TIF and RTTT. All seemed to agree with Aldeman that the screening idea could fit within the rules, although none came up with any examples where it had been done. This is not surprising because relying on the screening approach in the initial competitive grant proposals would have been a risky strategy.
If you were a state or school district applying for a competitive grant or waiver and you knew only a fraction of the submissions would win, you would do everything you could to make your proposal rise above the rest, not take a chance on an idea that was not yet part of the conversation, and which seemed to contradict the “significant factor” language. This is why, to my knowledge, almost every RTTT and TIF proposal included the composite index approach (see exceptions below). It is also why, in my original post, I wrote that the Department “encouraged—or required, depending on your vantage point—states to lump value-added or other growth model estimates together with other measures.” I still think that’s accurate.
Aldeman made several additional points, and he and I have subsequently had a productive conversation by email. He pointed out, for example, that some of the winning RTTT proposals involved using value-added in a “matrix,” in which teachers could not be labeled low-performing if they had high value-added scores (and vice versa for high-performing teachers). This is different from creating a composite index, although I would argue the approach shares the same weaknesses. Many ineffective teachers have high value-added scores, and some effective teachers have low value-added scores. This implies that a teacher with a low classroom observation score and a high value-added score would not be labeled ineffective under either the composite approach or the matrix approach; in the composite approach, the low and high scores would simply average out. Of course, when the two scores do line up, all the approaches yield the same answer, so those are not the relevant cases. For this reason, I do not see the matrix approach as a good solution.
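A toy example may make the objection concrete. With hypothetical scores (on a 0–4 scale), weights, and cutoffs, none drawn from any actual district’s rules, a teacher with a low observation score but a high value-added score escapes the “ineffective” label under both approaches:

```python
# Toy comparison of the composite and matrix approaches for a teacher with
# a low classroom observation score but a high value-added score. All
# scores (0-4 scale), weights, and cutoffs are hypothetical.

def composite_rating(observation, value_added, va_weight=0.5, cutoff=2.0):
    """Weighted average of the two scores; 'ineffective' below the cutoff."""
    score = va_weight * value_added + (1 - va_weight) * observation
    return "ineffective" if score < cutoff else "not ineffective"

def matrix_rating(observation, value_added, cutoff=2.0):
    """A high value-added score blocks the 'ineffective' label outright."""
    if value_added >= cutoff:
        return "not ineffective"
    return "ineffective" if observation < cutoff else "not ineffective"

# Low observation (1.0), high value-added (3.5):
print(composite_rating(1.0, 3.5))  # not ineffective (the scores average out)
print(matrix_rating(1.0, 3.5))     # not ineffective (high value-added blocks the label)
```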
In his blog post, Aldeman also lays out two problems with the screening idea:
[Harris] would have student growth come first. A low rating could lead to closer inspection of a teacher’s classroom practice. But there are two main problems with doing it this way. One is that it doesn’t work as well from a timing standpoint. Student growth scores often come much later than observation results. And two, most teacher evaluation systems historically have not done a good job of differentiating teachers and providing them the feedback they need to improve. If low student growth just leads to the same high evaluation scores, it’s hard to say student growth played a “significant part” in a teacher’s overall rating.
On the first point, it is important to note that value-added measures pose a timing problem no matter how they are used. This is partly because we have to wait for the student test scores and then wait again for the district or outside vendor to create the value-added measures. In addition, some states and districts are following researchers’ advice and using multiple years of test score data for each value-added calculation. For example, a composite measure in November of the current year would be based on a weighted average of last year’s classroom observation and a value-added measure from the year that just ended, averaged with value-added from one or more prior years. Aldeman is correct that this “mismatch” in timing is slightly exacerbated with the screening approach because the value-added from prior years would only trigger additional classroom observations later in the current year. However, the screening approach reduces the role of value-added as a direct factor in personnel decisions, so it is not clear that a slightly larger mismatch is relevant.
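A minimal numerical sketch of that timing, with made-up weights, scores, and years: by the time a November composite can be computed, every ingredient is at least a year old, whichever way the measures are used.

```python
# Made-up arithmetic for the timing problem: a composite computed in
# November of the current year can only combine last year's observation
# score with value-added from already-completed years. Weights and
# scores are hypothetical.

def november_composite(last_obs, va_by_year, obs_weight=0.5):
    """Blend last year's observation with an average of prior-year value-added."""
    va_avg = sum(va_by_year) / len(va_by_year)
    return obs_weight * last_obs + (1 - obs_weight) * va_avg

# Observation 3.0 (last year); value-added 2.0 and 2.6 (two completed years).
# Nothing from the current school year enters the measure.
print(round(november_composite(3.0, [2.0, 2.6]), 2))  # 2.65
```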
On Aldeman’s second point, a major part of my rationale for the screening approach is to avoid exactly the problem he identifies: all teachers receiving the same high ratings when in fact their performance varies, often considerably. If all teachers receive high classroom observation scores, then in the proposed screening system we would see a very low correlation between value-added and classroom observations, which would be a red flag. The need for some type of insurance policy against uniformly high ratings is one reason why I part ways with those who want to get rid of value-added altogether (see above).
It is not too late to change this. While the states and districts that won these grants and/or ESEA waivers had a strong incentive to use the composite and matrix approaches in the initial competition, they could now ask for clarification and modifications. The development of these new human capital systems is a long-term proposition, one that will require careful observation and adaptation as new ideas and evidence emerge.
Finally, in another response to my initial post, Bruce Baker mentioned that the screening idea is not entirely new. I started mentioning the idea in my value-added presentations as the RTTT process began, and included the same idea in my 2011 book. No doubt others came up with similar ideas on their own. Bruce indicated that he has the idea in his forthcoming book, and he pointed to a presentation that Steve Glazerman made at Princeton in 2011 (Steve confirmed this and provided me his slides). I know and respect Bruce and Steve and I am glad they are talking about it as well. If you know of other discussions on this topic, please add a comment to this post. The more people write about it, the more people will consider it. Thanks again to all who responded and especially to Chad Aldeman for the productive back and forth.
- Douglas N. Harris