The following is written by Morgan S. Polikoff and Matthew Di Carlo. Morgan is Assistant Professor in the Rossier School of Education at the University of Southern California.
One of the primary policy levers now being employed in states and districts nationwide is teacher evaluation reform. Well-designed evaluations, which should include measures that capture both teacher practice and student learning, have great potential to inform and improve the performance of teachers and, thus, students. Furthermore, most everyone agrees that the previous systems were largely pro forma, failed to provide useful feedback, and needed replacement.
The attitude among many policymakers and advocates is that we must implement these systems and begin using them rapidly for decisions about teachers, while design flaws can be fixed later. Such urgency is undoubtedly influenced by the history of slow, incremental progress in education policy. However, we believe this attitude to be imprudent. Read More »
One can often hear opponents of value-added referring to these methods as “junk science.” The term is meant to express the argument that value-added is unreliable and/or invalid, and that its scientific “façade” is without merit.
Now, I personally am not opposed to using these estimates in evaluations and other personnel policies, but I certainly understand opponents’ skepticism. For one thing, there are some states and districts in which design and implementation has been somewhat careless, and, in these situations, I very much share the skepticism. Moreover, the common argument that evaluations, in order to be “meaningful,” must consist of value-added measures in a heavily-weighted role (e.g., 45-50 percent) is, in my view, unsupportable.
All that said, calling value-added “junk science” completely obscures the important issues. The real questions here are less about the merits of the models per se than how they’re being used. Read More »
Controversial proposals for new teacher evaluation systems have generated a tremendous amount of misinformation. It has come from both “sides,” ranging from minor misunderstandings to gross inaccuracies. Ostensibly to address some of these misconceptions, the advocacy group Students First (SF) recently released a “myth/fact sheet” on evaluations.
Despite the need for oversimplification inherent in “myth/fact” sheets, the genre can be useful, especially about topics such as evaluation, about which there is much confusion. When advocacy groups produce them, however, the myths and facts sometimes take the form of “arguments we don’t like versus arguments we do like.”
This SF document falls into that trap. In fact, several of its claims are a little shocking. I would still like to discuss the sheet, not because I enjoy picking apart the work of others (I don’t), but rather because I think elements of both the “myths” and “facts” in this sheet could be recast as “dual myths” in a new sheet. That is, this document helps to illustrate how, in many of our most heated education debates, the polar opposite viewpoints that receive the most attention are often both incorrect, or at least severely overstated, and usually serve to preclude more productive, nuanced discussions.
Let’s take all four of SF’s “myth/fact” combinations in turn. Read More »
Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an accessible review of the technical and practical issues surrounding these models.
This past November, I wrote a post for this blog about shifting course in the teacher evaluation movement and using value-added as a “screening device.” This means that the measures would be used: (1) to help identify teachers who might be struggling and for whom additional classroom observations (and perhaps other information) should be gathered; and (2) to identify classroom observers who might not be doing an effective job.
Screening takes advantage of the low cost of value-added and the fact that the estimates are more accurate in making general assessments of performance patterns across teachers, while avoiding the weaknesses of value-added—especially that the measures are often inaccurate for individual teachers, as well as confusing and not very credible among teachers when used for high-stakes decisions.
I want to thank the many people who responded to the first post. There were three main camps. Read More »
One of the most frequent criticisms of value-added and other growth models is that they are “unstable” (or, more accurately, modestly stable). For instance, a teacher who is rated highly in one year might very well score toward the middle of the distribution – or even lower – in the next year (see here, here and here, or this accessible review).
Some of this year-to-year variation is “real.” A teacher might get better over the course of a year, or might have a personal problem that impedes their job performance. In addition, there could be changes in educational circumstances that are not captured by the models – e.g., a change in school leadership, new instructional policies, etc. However, a great deal of the the recorded variation is actually due to sampling error, or idiosyncrasies in student testing performance. In other words, there is a lot of “purely statistical” imprecision in any given year, and so the scores don’t always “match up” so well between years. As a result, value-added critics, including many teachers, argue that it’s not only unfair to use such error-prone measures for any decisions, but that it’s also bad policy, since we might reward or punish teachers based on estimates that could be completely different the next year.
The concerns underlying these arguments are well-founded (and, often, casually dismissed by supporters and policymakers). At the same time, however, there are a few points about the stability of value-added (or lack thereof) that are frequently ignored or downplayed in our public discourse. All of them are pretty basic and have been noted many times elsewhere, but it might be useful to discuss them very briefly. Three in particular stand out. Read More »
** Reprinted here in the Washington Post
2012 was another busy year for market-based education reform. The rapid proliferation of charter schools continued, while states and districts went about the hard work of designing and implementing new teacher evaluations that incorporate student testing data, and, in many cases, performance pay programs to go along with them.
As in previous years (see our 2010 and 2011 reviews), much of the research on these three “core areas” – merit pay, charter schools, and the use of value-added and other growth models in teacher evaluations – appeared rather responsive to the direction of policy making, but could not always keep up with its breakneck pace.*
Some lag time is inevitable, not only because good research takes time, but also because there’s a degree to which you have to try things before you can see how they work. Nevertheless, what we don’t know about these policies far exceeds what we know, and, given the sheer scope and rapid pace of reforms over the past few years, one cannot help but get the occasional “flying blind” feeling. Moreover, as is often the case, the only unsupportable position is certainty. Read More »
The New Teacher Project’s (TNTP) recent report on teacher retention, called “The Irreplaceables,” garnered quite a bit of media attention. In a discussion of this report, I argued, among other things, that the label “irreplaceable” is a highly exaggerated way of describing their definitions, which, by the way, varied between the five districts included in the analysis. In general, TNTP’s definitions are better-described as “probably above average in at least one subject” (and this distinction matters for how one interprets the results).
I’d like to elaborate a bit on this issue – that is, how to categorize teachers’ growth model estimates, which one might do, for example, when incorporating them into a final evaluation score. This choice, which receives virtually no discussion in TNTP’s report, is always a judgment call to some degree, but it’s an important one for accountability policies. Many states and districts are drawing those very lines between teachers (and schools), and attaching consequences and rewards to the outcomes.
Let’s take a very quick look, using the publicly-released 2010 “teacher data reports” from New York City (there are details about the data in the first footnote*). Keep in mind that these are just value-added estimates, and are thus, at best, incomplete measures of the performance of teachers (however, importantly, the discussion below is not specific to growth models; it can apply to many different types of performance measures). Read More »
** Reprinted here in the Washington Post
Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an excellent, accessible review of the technical and practical issues surrounding these models.
Now that the election is over, the Obama Administration and policymakers nationally can return to governing. Of all the education-related decisions that have to be made, the future of teacher evaluation has to be front and center.
In particular, how should “value-added” measures be used in teacher evaluation? President Obama’s Race to the Top initiative expanded the use of these measures, which attempt to identify how much each teacher contributes to student test scores. In doing so, the initiative embraced and expanded the controversial reliance on standardized tests that started under President Bush’s No Child Left Behind.
In many respects, The Race was well designed. It addresses an important problem – the vast majority of teachers report receiving limited quality feedback on instruction. As a competitive grants program, it was voluntary for states to participate (though involuntary for many districts within those states). The Administration also smartly embraced the idea of multiple measures of teacher performance.
But they also made one decision that I think was a mistake. They encouraged—or required, depending on your vantage point—states to lump value-added or other growth model estimates together with other measures. The raging debate since then has been over what percentage of teachers’ final ratings should be given to value-added versus the other measures. I believe there is a better way to approach this issue, one that focuses on teacher evaluations not as a measure, but rather as a process. Read More »
People often ask me for my “bottom line” on using value-added (or other growth model) estimates in teacher evaluations. I’ve written on this topic many times, and while I have in fact given my overall opinion a couple of times, I have avoided expressing it in a strong “yes or no” format. There’s a reason for this, and I thought maybe I would write a short piece and explain myself.
My first reaction to the queries about where I stand on value-added is a shot of appreciation that people are interested in my views, followed quickly by an acute rush of humility and reticence. I know think tank people aren’t supposed to say things like this, but when it comes to sweeping, big picture conclusions about the design of new evaluations, I’m not sure my personal opinion is particularly important.
Frankly, given the importance of how people on the ground respond to these types of policies, as well as, of course, their knowledge of how schools operate, I would be more interested in the views of experienced, well-informed teachers and administrators than my own. And I am frequently taken aback by the unadulterated certainty I hear coming from advocates and others about this completely untested policy. That’s why I tend to focus on aspects such as design details and explaining the research – these are things I feel qualified to discuss. (I also, by the way, acknowledge that it’s very easy for me to play armchair policy general when it’s not my job or working conditions that might be on the line.)
That said, here’s my general viewpoint, in two parts. First, my sense, based on the available evidence, is that value-added should be given a try in new teacher evaluations. Read More »
In education policy debates, we like the “big picture.” We love to say things like “hold schools accountable” and “set high expectations.” Much less frequent are substantive discussions about the details of accountability systems, but it’s these details that make or break policy. The technical specs just aren’t that sexy. But even the best ideas with the sexiest catchphrases won’t improve things a bit unless they’re designed and executed well.
In this vein, I want to recommend a very interesting CALDER working paper by Mark Ehlert, Cory Koedel, Eric Parsons and Michael Podgursky. The paper takes a quick look at one of these extremely important, yet frequently under-discussed details in school (and teacher) accountability systems: The choice of growth model.
When value-added or other growth models come up in our debates, they’re usually discussed en masse, as if they’re all the same. They’re not. It’s well-known (though perhaps overstated) that different models can, in many cases, lead to different conclusions for the same school or teacher. This paper, which focuses on school-level models but might easily be extended to teacher evaluations as well, helps illustrate this point in a policy-relevant manner.
Read More »
One claim that gets tossed around a lot in education circles is that “the most effective teachers produce a year and a half of learning per year, while the least effective produce a half of a year of learning.”
This talking point is used all the time in advocacy materials and news articles. Its implications are pretty clear: Effective teachers can make all the difference, while ineffective teachers can do permanent damage.
As with most prepackaged talking points circulated in education debates, the “year and a half of learning” argument, when used without qualification, is both somewhat valid and somewhat misleading. So, seeing as it comes up so often, let’s very quickly identify its origins and what it means. Read More »
D.C. Public Schools (DCPS) recently announced a few significant changes to its teacher evaluation system (called IMPACT), including the alteration of its test-based components, the creation of a new performance category (“developing”), and a few tweaks to the observational component (discussed below). These changes will be effective starting this year.
As with any new evaluation system, a period of adjustment and revision should be expected and encouraged (though it might be preferable if the first round of changes occurs during a phase-in period, prior to stakes becoming attached). Yet, despite all the attention given to the IMPACT system over the past few years, these new changes have not been discussed much beyond a few quick news articles.
I think that’s unfortunate: DCPS is an early adopter of the “new breed” of teacher evaluation policies being rolled out across the nation, and any adjustments to IMPACT’s design – presumably based on results and feedback – could provide valuable lessons for states and districts in earlier phases of the process.
Accordingly, I thought I would take a quick look at three of these changes. Read More »
In all my many posts about the interpretation of state testing data, it seems that I may have failed to articulate one major implication, which is almost always ignored in the news coverage of the release of annual testing data. That is: raw, unadjusted changes in student test scores are not by themselves very good measures of schools’ test-based effectiveness.
In other words, schools can have a substantial impact on performance, but student test scores also increase, decrease or remain flat for reasons that have little or nothing to do with schools. The first, most basic reason is error. There is measurement error in all test scores – for various reasons, students taking the same test twice will get different scores, even if their “knowledge” remains constant. Also, as I’ve discussed many times, there is extra imprecision when using cross-sectional data. Often, any changes in scores or rates, especially when they’re small in magnitude and/or based on smaller samples (e.g., individual schools), do not represent actual progress (see here and here). Finally, even when changes are “real,” other factors that influence test score changes include a variety of non-schooling inputs, such as parental education levels, family’s economic circumstances, parental involvement, etc. These factors don’t just influence how highly students score; they are also associated with progress (that’s why value-added models exist).
Thus, to the degree that test scores are a valid measure of student performance, and changes in those scores a valid measure of student learning, schools aren’t the only suitors at the dance. We should stop judging school or district performance by comparing unadjusted scores or rates between years.
Read More »
I have been writing critically about states’ school rating systems (e.g., Ohio, Florida, Louisiana), and I thought I would find one that is, at least in my (admittedly value-laden) opinion, more defensibly designed. It didn’t quite turn out as I had hoped.
One big starting point in my assessment is how heavily the systems weight absolute performance (how highly students score) versus growth (how quickly students improve). As I’ve argued many times, the former (absolute level) is a poor measure of school performance in a high-stakes accountability system. It does not address the fact that some schools, particularly those in more affluent areas, serve students who, on average, enter the system at a higher-performing level. This amounts to holding schools accountable for outcomes they largely cannot control (see Doug Harris’ excellent book for more on this in the teacher context). Thus, to whatever degree testing results can be used to judge actual school effectiveness, growth measures, while themselves highly imperfect, are to be preferred in a high-stakes context.
There are a few states that assign more weight to growth than absolute performance (see this prior post on New York City’s system). One of them is Colorado’s system, which uses the well-known “Colorado Growth Model” (CGM).*
In my view, putting aside the inferential issues with the CGM (see the first footnote), the focus on growth in Colorado’s system is in theory a good idea. But, looking at the data and documentation reveals a somewhat unsettling fact: There is a double standard of sorts, by which two schools with the same growth score can receive different ratings, and it’s mostly their absolute performance levels determining whether this is the case.
Read More »
In a New York Times article a couple of weeks ago, reporter Michael Winerip discusses New York City’s school report card grades, with a focus on an issue that I have raised many times – the role of absolute performance measures (i.e., how highly students scores) in these systems, versus that of growth measures (i.e., whether students are making progress).
Winerip uses the example of two schools – P.S. 30 and P.S. 179 – one of which (P.S. 30) received an A on this year’s report card, while the other (P.S. 179) received an F. These two schools have somewhat similar student populations, at least so far as can be determined using standard education variables, and their students are very roughly comparable in terms of absolute performance (e.g., proficiency rates). The basic reason why one received an A and the other an F is that P.S. 179 received a very low growth score, and growth is heavily weighted in the NYC grade system (representing 60 out of 100 points for elementary and middle schools).
I have argued previously that unadjusted absolute performance measures such as proficiency rates are inappropriate for test-based assessments of schools’ effectiveness, given that they tell you almost nothing about the quality of instruction schools provide, and that growth measures are the better option, albeit one that also has its own issues (e.g., they are more unstable), and must be used responsibly. In this sense, the weighting of the NYC grading system is much more defensible than most of its counterparts across the nation, at least in my view.
But the system is also an example of how details matter – each school’s growth portion is calculated using an unconventional, somewhat questionable approach, one that is, as yet, difficult to treat with a whole lot of confidence. Read More »