The U.S. Department of Education has released a very short, readable report on the comparability of value-added estimates using two different tests in Indiana – one of them norm-referenced (the Measures of Academic Progress test, or MAP), and the other criterion-referenced (the Indiana Statewide Testing for Educational Progress Plus, or ISTEP+, which is also the state’s official test for NCLB purposes).
The research design here is straightforward – fourth and fifth grade students in 46 schools across 10 districts in Indiana took both tests, their teachers’ value-added scores were calculated, and the scores were compared. Since both sets of scores were based on the same students and teachers, this is allows a direct comparison of how teachers’ value-added estimates compare between these two tests. The results are not surprising, and they square with similar prior studies (see here, here, here, for example): The estimates based on the two tests are moderately correlated. Depending on the grade/subject, they are between 0.4 and 0.7. If you’re not used to interpreting correlation coefficients, consider that only around one-third of teachers were in the same quintile (fifth) on both tests, and another 40 or so percent were one quintile higher or lower. So, most teachers were within a quartile, about a quarter of teachers moved two or more quintiles, and a small percentage moved from top to bottom or vice-versa.
Although, as mentioned above, these findings are in line with prior research, it is worth remembering why this “instability” occurs (and what can be done about it). Read More »
In the three most discussed and controversial areas of market-based education reform – performance pay, charter schools and the use of value-added estimates in teacher evaluations – 2013 saw the release of a couple of truly landmark reports, in addition to the normal flow of strong work coming from the education research community (see our reviews from 2010, 2011 and 2012).*
In one sense, this building body of evidence is critical and even comforting, given not only the rapid expansion of charter schools, but also and especially the ongoing design and implementation of new teacher evaluations (which, in many cases, include performance-based pay incentives). In another sense, however, there is good cause for anxiety. Although one must try policies before knowing how they work, the sheer speed of policy change in the U.S. right now means that policymakers are making important decisions on the fly, and there is great deal of uncertainty as to how this will all turn out.
Moreover, while 2013 was without question an important year for research in these three areas, it also illustrated an obvious point: Proper interpretation and application of findings is perhaps just as important as the work itself. Read More »
As discussed in a prior post, the research on applying value-added to teacher prep programs is pretty much still in its infancy. Even just a couple of years of would go a long way toward at least partially addressing the many open questions in this area (including, by the way, the evidence suggesting that differences between programs may not be meaningfully large).
Nevertheless, a few states have decided to plow ahead and begin publishing value-added estimates for their teacher preparation programs. Tennessee, which seems to enjoy being first — their Race to the Top program is, a little ridiculously, called “First to the Top” — was ahead of the pack. They have once again published ratings for the few dozen teacher preparation programs that operate within the state. As mentioned in my post, if states are going to do this (and, as I said, my personal opinion is that it would be best to wait), it is absolutely essential that the data be presented along with thorough explanations of how to interpret and use them.
Tennessee fails to meet this standard. Read More »
Linda Darling-Hammond’s new book, Getting Teacher Evaluation Right, is a detailed, practical guide about how to improve the teaching profession. It leverages the best research and best practices, offering actionable, illustrated steps to getting teacher evaluation right, with rich examples from the U.S. and abroad.
Here I offer a summary of the book’s main arguments and conclude with a couple of broad questions prompted by the book. But, before I delve into the details, here’s my quick take on Darling-Hammond’s overall stance.
We are at a crossroads in education; two paths lay before us. The first seems shorter, easier and more straightforward. The second seems long, winding and difficult. The big problem is that the first path does not really lead to where we need to go; in fact, it is taking us in the opposite direction. So, despite appearances, more steady progress will be made if we take the more difficult route. This book is a guide on how to get teacher evaluation right, not how to do it quickly or with minimal effort. So, in a way, the big message or take away is: There are no shortcuts. Read More »
Our guest author today is Dan Goldhaber, Director of the Center for Education Data & Research and a Research Professor in Interdisciplinary Arts and Sciences at the University of Washington Bothell.
Let me begin with a disclosure: I am an advocate of experimenting with using value added, where possible, as part of a more comprehensive system of teacher evaluation. The reasons are pretty simple (though articulated in more detail in a brief, which you can read here). The most important reason is that value-added information about teachers appears to be a better predictor of future success in the classroom than other measures we currently use. This is perhaps not surprising when it comes to test scores, certainly an important measure of what students are getting out of schools, but research also shows that value added predicts very long run outcomes, such as college going and labor market earnings. Shouldn’t we be using valuable information about likely future performance when making high-stakes personnel decisions?
It almost goes without saying, but it’s still worth emphasizing, that it is impossible to avoid making high-stakes decisions. Policies that explicitly link evaluations to outcomes such as compensation and tenure are new, but even in the absence of such policies that are high-stakes for teachers, the stakes are high for students, because some of them are stuck with ineffective teachers when evaluation systems suggest, as is the case today, that nearly all teachers are effective. Read More »
There is currently a push to evaluate teacher preparation programs based in part on the value-added of their graduates. Predictably, this is a highly controversial issue, and the research supporting it is, to be charitable, still underdeveloped. At present, the evidence suggests that the differences in effectiveness between teachers trained by different prep programs may not be particularly large (see here, here, and here), though there may be exceptions (see this paper).
In the meantime, there’s an interesting little conflict underlying the debate about measuring preparation programs’ effectiveness, one that’s worth pointing out. For the purposes of this discussion, let’s put aside the very important issue of whether the models are able to account fully for where teaching candidates end up working (i.e., bias in the estimates based on school assignments/preferences), as well as (valid) concerns about judging teachers and preparation programs based solely on testing outcomes. All that aside, any assessment of preparation programs using the test-based effectiveness of their graduates is picking up on two separate factors: How well they prepare their candidates; and who applies to their programs in the first place.
In other words, programs that attract and enroll highly talented candidates might look good even if they don’t do a particularly good job preparing teachers for their eventual assignments. But does that really matter? Read More »
Recent events in Indiana and Florida have resulted in a great deal of attention to the new school rating systems that over 25 states are using to evaluate the performance of schools, often attaching high-stakes consequences and rewards to the results. We have published reviews of several states’ systems here over the past couple of years (see our posts on the systems in Florida, Indiana, Colorado, New York City and Ohio, for example).
Virtually all of these systems rely heavily, if not entirely, on standardized test results, most commonly by combining two general types of test-based measures: absolute performance (or status) measures, or how highly students score on tests (e.g., proficiency rates); and growth measures, or how quickly students make progress (e.g., value-added scores). As discussed in previous posts, absolute performance measures are best seen as gauges of student performance, since they can’t account for the fact that students enter the schooling system at vastly different levels, whereas growth-oriented indicators can be viewed as more appropriate in attempts to gauge school performance per se, as they seek (albeit imperfectly) to control for students’ starting points (and other characteristics that are known to influence achievement levels) in order to isolate the impact of schools on testing performance.*
One interesting aspect of this distinction, which we have not discussed thoroughly here, is the idea/possibility that these two measures are “in conflict.” Let me explain what I mean by that. Read More »
As noted in a nice little post over at Greater Greater Washington’s education blog, the District of Columbia Office of the State Superintendent of Education (OSSE) recently started releasing growth model scores for DC’s charter and regular public schools. These models, in a nutshell, assess schools by following their students over time and gauging their testing progress relative to similar students (they can also be used for individual teachers, but DCPS uses a different model in its teacher evaluations).
In my opinion, producing these estimates and making them available publicly is a good idea, and definitely preferable to the district’s previous reliance on changes in proficiency, which are truly awful measures (see here for more on this). It’s also, however, important to note that the model chosen by OSSE – a “median growth percentile,” or MGP model, produces estimates that have been shown to be at least somewhat more heavily associated with student characteristics than other types of models, such as value-added models proper. This does not necessarily mean the growth percentile models are “inaccurate” – there are good reasons, such as resources and more difficulty with teacher recruitment/retention, to believe that schools serving poorer students might be less effective, on average, and it’s tough to separate “real” effects from bias in the models.
That said, let’s take a quick look at this relationship using the DC MGP scores from 2011, with poverty data from the National Center for Education Statistics. Read More »
** Reprinted here
in the Washington Post
The following is written by Morgan S. Polikoff and Matthew Di Carlo. Morgan is Assistant Professor in the Rossier School of Education at the University of Southern California.
One of the primary policy levers now being employed in states and districts nationwide is teacher evaluation reform. Well-designed evaluations, which should include measures that capture both teacher practice and student learning, have great potential to inform and improve the performance of teachers and, thus, students. Furthermore, most everyone agrees that the previous systems were largely pro forma, failed to provide useful feedback, and needed replacement.
The attitude among many policymakers and advocates is that we must implement these systems and begin using them rapidly for decisions about teachers, while design flaws can be fixed later. Such urgency is undoubtedly influenced by the history of slow, incremental progress in education policy. However, we believe this attitude to be imprudent. Read More »
One can often hear opponents of value-added referring to these methods as “junk science.” The term is meant to express the argument that value-added is unreliable and/or invalid, and that its scientific “façade” is without merit.
Now, I personally am not opposed to using these estimates in evaluations and other personnel policies, but I certainly understand opponents’ skepticism. For one thing, there are some states and districts in which design and implementation has been somewhat careless, and, in these situations, I very much share the skepticism. Moreover, the common argument that evaluations, in order to be “meaningful,” must consist of value-added measures in a heavily-weighted role (e.g., 45-50 percent) is, in my view, unsupportable.
All that said, calling value-added “junk science” completely obscures the important issues. The real questions here are less about the merits of the models per se than how they’re being used. Read More »
Controversial proposals for new teacher evaluation systems have generated a tremendous amount of misinformation. It has come from both “sides,” ranging from minor misunderstandings to gross inaccuracies. Ostensibly to address some of these misconceptions, the advocacy group Students First (SF) recently released a “myth/fact sheet” on evaluations.
Despite the need for oversimplification inherent in “myth/fact” sheets, the genre can be useful, especially about topics such as evaluation, about which there is much confusion. When advocacy groups produce them, however, the myths and facts sometimes take the form of “arguments we don’t like versus arguments we do like.”
This SF document falls into that trap. In fact, several of its claims are a little shocking. I would still like to discuss the sheet, not because I enjoy picking apart the work of others (I don’t), but rather because I think elements of both the “myths” and “facts” in this sheet could be recast as “dual myths” in a new sheet. That is, this document helps to illustrate how, in many of our most heated education debates, the polar opposite viewpoints that receive the most attention are often both incorrect, or at least severely overstated, and usually serve to preclude more productive, nuanced discussions.
Let’s take all four of SF’s “myth/fact” combinations in turn. Read More »
Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an accessible review of the technical and practical issues surrounding these models.
This past November, I wrote a post for this blog about shifting course in the teacher evaluation movement and using value-added as a “screening device.” This means that the measures would be used: (1) to help identify teachers who might be struggling and for whom additional classroom observations (and perhaps other information) should be gathered; and (2) to identify classroom observers who might not be doing an effective job.
Screening takes advantage of the low cost of value-added and the fact that the estimates are more accurate in making general assessments of performance patterns across teachers, while avoiding the weaknesses of value-added—especially that the measures are often inaccurate for individual teachers, as well as confusing and not very credible among teachers when used for high-stakes decisions.
I want to thank the many people who responded to the first post. There were three main camps. Read More »
One of the most frequent criticisms of value-added and other growth models is that they are “unstable” (or, more accurately, modestly stable). For instance, a teacher who is rated highly in one year might very well score toward the middle of the distribution – or even lower – in the next year (see here, here and here, or this accessible review).
Some of this year-to-year variation is “real.” A teacher might get better over the course of a year, or might have a personal problem that impedes their job performance. In addition, there could be changes in educational circumstances that are not captured by the models – e.g., a change in school leadership, new instructional policies, etc. However, a great deal of the the recorded variation is actually due to sampling error, or idiosyncrasies in student testing performance. In other words, there is a lot of “purely statistical” imprecision in any given year, and so the scores don’t always “match up” so well between years. As a result, value-added critics, including many teachers, argue that it’s not only unfair to use such error-prone measures for any decisions, but that it’s also bad policy, since we might reward or punish teachers based on estimates that could be completely different the next year.
The concerns underlying these arguments are well-founded (and, often, casually dismissed by supporters and policymakers). At the same time, however, there are a few points about the stability of value-added (or lack thereof) that are frequently ignored or downplayed in our public discourse. All of them are pretty basic and have been noted many times elsewhere, but it might be useful to discuss them very briefly. Three in particular stand out. Read More »
** Reprinted here in the Washington Post
2012 was another busy year for market-based education reform. The rapid proliferation of charter schools continued, while states and districts went about the hard work of designing and implementing new teacher evaluations that incorporate student testing data, and, in many cases, performance pay programs to go along with them.
As in previous years (see our 2010 and 2011 reviews), much of the research on these three “core areas” – merit pay, charter schools, and the use of value-added and other growth models in teacher evaluations – appeared rather responsive to the direction of policy making, but could not always keep up with its breakneck pace.*
Some lag time is inevitable, not only because good research takes time, but also because there’s a degree to which you have to try things before you can see how they work. Nevertheless, what we don’t know about these policies far exceeds what we know, and, given the sheer scope and rapid pace of reforms over the past few years, one cannot help but get the occasional “flying blind” feeling. Moreover, as is often the case, the only unsupportable position is certainty. Read More »
The New Teacher Project’s (TNTP) recent report on teacher retention, called “The Irreplaceables,” garnered quite a bit of media attention. In a discussion of this report, I argued, among other things, that the label “irreplaceable” is a highly exaggerated way of describing their definitions, which, by the way, varied between the five districts included in the analysis. In general, TNTP’s definitions are better-described as “probably above average in at least one subject” (and this distinction matters for how one interprets the results).
I’d like to elaborate a bit on this issue – that is, how to categorize teachers’ growth model estimates, which one might do, for example, when incorporating them into a final evaluation score. This choice, which receives virtually no discussion in TNTP’s report, is always a judgment call to some degree, but it’s an important one for accountability policies. Many states and districts are drawing those very lines between teachers (and schools), and attaching consequences and rewards to the outcomes.
Let’s take a very quick look, using the publicly-released 2010 “teacher data reports” from New York City (there are details about the data in the first footnote*). Keep in mind that these are just value-added estimates, and are thus, at best, incomplete measures of the performance of teachers (however, importantly, the discussion below is not specific to growth models; it can apply to many different types of performance measures). Read More »