The Year In Research On Market-Based Education Reform: 2012 Edition

** Reprinted here in the Washington Post

2012 was another busy year for market-based education reform. The rapid proliferation of charter schools continued, while states and districts went about the hard work of designing and implementing new teacher evaluations that incorporate student testing data, and, in many cases, performance pay programs to go along with them.

As in previous years (see our 2010 and 2011 reviews), much of the research on these three “core areas” – merit pay, charter schools, and the use of value-added and other growth models in teacher evaluations – appeared rather responsive to the direction of policy making, but could not always keep up with its breakneck pace.*

Some lag time is inevitable, not only because good research takes time, but also because there's a degree to which you have to try things before you can see how they work. Nevertheless, what we don't know about these policies far exceeds what we know, and, given the sheer scope and rapid pace of reforms over the past few years, one cannot help but get the occasional “flying blind” feeling. Moreover, as is often the case, the only unsupportable position is certainty.

In the area of merit pay, there was one large-scale program evaluation released in 2012: Mathematica’s final report on the four-year evaluation of the Chicago Teacher Advancement Program (TAP). TAP is a multifaceted program that provides bonuses, career ladders, training, and other interventions for teachers in order to improve their performance and retention. The findings were somewhat mixed. Consistent with prior research (see here and here), the Mathematica team did not find any discernible effects on testing outcomes over this (relatively short) period, but there did appear to be some impact on school-level teacher retention.**

However, most of the action in 2012 came from a set of papers that might be interpreted as attempts to understand and/or address teacher incentive programs’ failure to produce results in the U.S.

One example, which received a fair amount of attention, was this evaluation, by Roland Fryer, Steven Levitt, and colleagues. In this program, teachers were paid a bonus at the beginning of the year, with some forced to return a portion of it based on their students’ progress on tests. This kind of incentive, which exploits “loss aversion,” had a large impact among teachers in the treatment group. Although this finding is genuinely interesting from a research perspective, and definitely merits further attention, its policy implications are still less than apparent. To their credit, merit pay proponents seemed to recognize this.

A second working paper assessed incentive strength in a group-based program. Based on the idea that rewarding larger groups of teachers decreases the incentive for each individual teacher (essentially, a free-rider problem), the analysis found that impacts were slightly larger (in math, reading and social studies, but not science) for teachers who were “responsible” for larger groups of students. This, the authors speculate, may be one reason why schoolwide bonus programs (e.g., New York’s) haven’t produced results.

Finally, a conference paper using data from a schoolwide bonus program in North Carolina found that schools that just missed the cut in one year tended to exhibit large gains the following year, relative to schools that came in just above the threshold. The researchers hypothesize that this suggests teachers and administrators may respond to incentives when they receive a clear signal that rewards are possible, but that a “period of learning” may be required before these programs exhibit impacts.

On the whole, these studies (also see this paper) would seem to suggest, unsurprisingly, that teachers and administrators do respond to incentives, but not necessarily those embedded in the “traditional” models (e.g., individual bonuses for scores at the end of the year). However, most of the new merit pay policies rely heavily on the “traditional” conceptualization (from which supporters inexplicably seem to be expecting short-term testing gains), and, again, the actual policy applications of these recent findings, if any, remain unclear.

Thus, predictably, merit pay ends the year in roughly the same situation as it started. Proponents contend that the primary purpose of alternative compensation systems is less to compel effort than to attract “better candidates” to the profession, and keep them around. From this perspective, it is unlikely that we will see much in the way of strong evidence – for or against – for quite some time, and short-term testing gains may not be the most appropriate outcome by which to assess these policies (see this simulation from this year, as well as our discussion of it). In other words, merit pay remains, to no small extent, a leap of faith.

Moving on to the charter school area, 2012 was another year of extremely rapid growth in this sector. It may also represent a turning point in the direction of research on these schools.***

In contrast to previous years, very few of the analyses released employed the typical charter versus district “horse race” approach. The only notable exceptions were two CREDO reports. Their analysis of New Jersey’s small charter school sector found that the charters included had a significant positive impact vis-à-vis comparable regular public schools, though it appeared largely confined to a small group of Newark schools. Similarly, the CREDO team's evaluation of Indiana charters found statistically significant, though rather modest, positive effects statewide, mostly concentrated in Indianapolis.

Although such "horse race" studies were scarce this year, a Mathematica report addressed the methodological question of whether results from experimental evaluations of charter school impacts “match up” with those from non-experimental methods. This is important because most charter research relies on non-experimental methods (since experiments generally require lotteries). This report (following others) put forth the encouraging result that the experimental and non-experimental estimates were not particularly different, at least not enough to substantially alter the conclusions. We discussed this paper here.

There was also some progress in building the growing and arguably most important body of evidence – analyses that attempt to move beyond the “charter versus district” debate, and begin to identify the actual differences between more and less successful schools, of whatever type.

One contentious variation on this question is whether charter schools “cream” higher-performing students, and/or “push out” lower-performing students, in order to boost their results. Yet another Mathematica supplement to their 2010 report examining around 20 KIPP middle schools was released, addressing criticisms that KIPP admits students with comparatively high achievement levels, and that the students who leave are lower-performing than those who stay. This report found little evidence to support either claim (also take a look at our post on attrition and charters).

A related analysis, this one presented in a conference paper (opens in Word), found that low-performing students in a large anonymous district did not exit charters at a discernibly higher rate than their counterparts in regular public schools. On the flip side of the entry/exit equation, this working paper found that students who won charter school lotteries (but had not yet attended the charter) saw immediate “benefits” in the form of reduced truancy rates, an interesting demonstration of the importance of student motivation.

A couple of papers also looked at more concrete policies employed by charters. Most notably, a joint report from Mathematica and the Center for Reinventing Public Education focused on practices among charter management organizations (CMOs). A previous report by the same team, released in 2010, found that the charter schools run by these CMOs were, on the whole, comparable in terms of test-based performance to their regular public school counterparts, even though the sample consisted of more established organizations (which one might expect to do well).

Among the many findings presented in this useful follow-up were that the higher-performing CMOs included in the analysis were more likely to provide teacher coaching and performance pay, and that they offered, on average, more instructional time (also take a look at this initial set of findings from CREDO about the performance trajectory of new charter schools).

So, 2012 saw no major “bombshells” in the charter school literature – and that may be a good thing, since focus may be shifting to the important, albeit unsexy task of drilling down into mechanisms underlying the overall results. In a time of unprecedented charter proliferation, explaining the consistently inconsistent performance of these schools is critical, not only for guiding the authorization of new charters, but also, more importantly, for improving all schools, regardless of their governance structures.

In the third and final area of market-based reform – the use of value-added and other growth model estimates in teacher evaluations – 2012 might be remembered as the year in which a second batch of teacher-level value-added scores was published in a major newspaper. These consisted of the “teacher data reports” from New York City. As in Los Angeles in 2010, the publication provoked opposition among value-added supporters and opponents alike. We discussed and analyzed the data in this post (also see here).

But, in terms of actual original research, we’ll begin with the paper that received more attention than almost any other in recent years: the analysis of the long-term impacts of teachers, by economists Raj Chetty, John Friedman, and Jonah Rockoff (the paper was actually released very late in 2011). The enormous reaction to this working paper focused mostly on the finding that increases in estimated teacher effectiveness are associated with very small improvements in a wide variety of future student outcomes, including earnings, college attendance, and teenage pregnancy rates.

It is fair to say that these findings, in addition to being genuinely interesting and important from a research perspective, support the long-standing contention that value-added estimates do transmit some meaningful signal about teacher performance, and might play a role in teacher evaluations (though not necessarily the role that they’re being called upon to play).

Another part of the paper, which got comparatively little attention, was arguably just as significant from a policy perspective – the results addressing the question of whether the non-random assignment of students to classrooms biases value-added estimates. That is, whether some teachers are assigned students based on characteristics that are associated with testing performance, but are not captured by the models (see Jesse Rothstein’s highly influential articles on this – here and here).

Chetty, Friedman, and Rockoff devise a clever test for this bias, and find that the problem does not appear to be critical (also check out this earlier response to Rothstein). In addition, they provide a very easy way for states to test their own estimates for this bias, using data that are widely available. It is unclear whether any states have chosen to do so, however. You can read our discussion of the Chetty et al. paper here.

The issue of non-random assignment of students to classrooms, and its potential influence on value-added scores, was also the focus of this 2012 CALDER paper, which questioned the validity of the “Rothstein test,” and found that it might identify non-random sorting even when none exists (also see this conference paper, which concluded that principals do assign students in non-random ways, and that the extent of sorting varies within and between schools). On the whole, it is likely that non-random sorting does bias teacher value-added scores, but the magnitude of this bias – and how it compares with that of alternative performance measures – remains a somewhat open question.

Other analyses in the value-added area also continued to move toward the kind of concrete, policy-relevant research that might have guided the design of new evaluation systems. For instance, one big issue facing states is the choice of a model. Although the public discourse tends to portray value-added models as a kind of monolith, there are actually a bunch of different specifications, many of which are not actually value-added models per se (value-added models, which themselves come in different forms, are generally considered to be a specific type of growth model; other types, such as student growth percentile models, are also being used by states and districts). Thus, analyses of how results differ between models are quite important.

A working paper from the Center for Education Data and Research (CEDR) compared the results of different models using North Carolina data, and found relatively high correlations between most of the models tested, including the more common types being used in actual evaluation systems. There was, however, a much lower correlation between these and more complex models (in particular, those employing fixed effects), and, in all cases, differences in the compositions of classrooms influenced the results of these comparisons (also see this similar analysis from this year).

In other “nuts and bolts” papers, Mathematica researchers laid down some statistical techniques for handling the important issue of co-teaching, while this NBER working paper presented a practical method for handling test measurement error among students who take tests in three consecutive grades. On an even more basic level, a team from the University of Wisconsin simulated the potentially serious problems that might arise from simple “clerical errors” in the datasets used to calculate value-added.

It's difficult to assess the degree to which states and districts are addressing or considering issues such as error and model choice, but, in at least some cases, it's not clear that they're getting much attention at all.

Another area that is under-researched (and related to the non-random assignment issue discussed above) is value-added among high school teachers. One potentially important NBER working paper released this year found that high school teachers’ estimates may have serious problems, particularly those stemming from tracking. The author, C. Kirabo Jackson, offers two possible interpretations: either high school teachers are not as influential (in terms of test-based impacts) as elementary school teachers; or value-added is a poor tool for measuring the effectiveness of high school teachers. (Also see this extremely interesting 2012 working paper, by the same author, comparing teachers’ impact on cognitive and non-cognitive outcomes.)

Finally, a number of new analyses tackled the important, much-discussed issue of the stability of value-added estimates. First, using a dataset spanning a full ten years, a CALDER working paper found considerable volatility, but also some persistence, in teachers’ value-added scores, even over that very long time period. A second analysis (opens in Word) concluded that the precision and year-to-year stability of teachers’ value-added scores varied considerably by the types of students they had in their classes.

Third, this CEDR paper looked at stability, not across years, but rather between subjects. This is actually a somewhat under-researched topic, despite its obvious implications for using these estimates in accountability systems. In short, the authors found that the between-subject correlations are similar to those between years – modest.

Overall, then, there was a great deal of strong research on value-added and other growth models this year, much of which can inform (or, at least, could have informed) the design of teacher evaluations. In the end, though, the real test will be whether the new systems improve teacher and student outcomes.

There is one more 2012 publication worth mentioning in this area, which consists of a series of five “background papers” on value-added, published by the Carnegie Knowledge Network. Each deals with a different aspect of these estimates, and all are written by prominent researchers, who present the state of the research in a manner that is quite accessible to non-technical readers. They are an excellent resource.*****

***

So, 2012 was a year in which the research on charter schools, merit pay, and value-added continued to provide policy-relevant findings, even if, in some cases, the decisions this evidence might have guided had already been made. The next few years will be critical, as researchers monitor and evaluate the sweeping policy changes that have taken place, particularly new evaluation systems and the financial incentives that accompany them. It remains to be seen whether states and districts will be willing or able to adjust course accordingly.

- Matt Di Carlo

*****

* Needless to say, these three areas are not the only types of policies that might fall under a "market-based" umbrella (they are, however, arguably the "core" components, at least in terms of how often they're discussed by advocates and proposed by policy makers). In addition, the papers discussed here do not represent a comprehensive list of all the research in these areas during 2012. It is a selection of high-quality, mostly quantitative analyses, all of which were actually released during this year (i.e., this review doesn't include papers released in prior years, and published in 2012). This means that many of the papers above have not yet been subject to peer review, and should be interpreted with that in mind.

** There was also some initial evidence on the impacts of Denver’s ProComp program, which, like TAP, provides different types of incentives and opportunities for teachers. A non-experimental evaluation from the Center for Education Data and Research (CEDR) found some tentative indication that the program may improve test-based outcomes, though it was not possible to rule out the possibility that this was due to other factors (also see our post, by researcher Ellie Fulbeck, who took a look at ProComp’s effects on retention).

*** Given the volume of studies, this review does not include other types of school choice policies, such as vouchers. Those interested might check out this Brookings analysis of New York City’s voucher program’s effect on graduation outcomes; as well as the summary of final reports on Milwaukee’s school choice program.

**** The scarce evidence, thus far, suggests that the handful of charter models that get fairly consistent positive results are those utilizing a somewhat “blunt force” approach – more money, more time, more staff, and more rigid disciplinary policies.

***** Though not discussed in this review, it’s worth noting that there was a great deal of 2012 research about expanding the policy applications of value-added models into additional areas. Most notably, there were several analyses of principals’ impact on testing outcomes (see here and here), as well as a few important papers (here, here, and here, for example) about the potential for using these methods to estimate the effectiveness of teacher preparation programs.

Thanks for a very comprehensive, and unbiased update on the latest research. It is pretty rare to see reporting untarnished by agenda or pre-judgement.

It is important to see that matters such as teacher performance pay are seen in the wider context of attracting and retaining teachers rather than just on the effect on pupil results.

But it is yet another form of accountability, and teaching operates best in a relaxed atmosphere, free from scrutiny. But trusting teachers to do their jobs without imposing a tight rein on how they do this seems an idealistic goal, even if Finland shows that it can be done, and to very good effect.

NOTE TO READERS:

The fourth paragraph of this post originally stated that Mathematica's report presented findings from a three-year evaluation of the program. It was actually a four-year period. The post has been corrected. I apologize for the error.

I am dismayed by this piece. You are always polite. But, these studies' modest findings have been twisted to support policies for attacking teachers, unions, and public schools. And you bent way too far over backwards to say nice things about them and ignored the ways that they are being spun in order to beat down teachers and promote soul-killing bubble-in "accountability." Chetty et al, for instance, promote policies that are an existential threat to inner city teachers, and yet it excluded classes where 25% of students are on IEPs. Not having taught in the inner city, the economists (and perhaps you) do not seem to understand how important that is. Neither did you call them out for ignoring qualitative evidence that sorting DOES occur unofficially. Back in the day, (when I was in academia) that oversight would have been rejected by the entire scholarly community. In my field (history) economists who ignore standard steps for testing whether their models were linked to reality would have been ridiculed (did you hear the one about that econometric model that showed that slaves were treated well ...?) Besides, unofficial sorting should be obvious to anyone with actual experience in schools, so Chetty et al should have had the burden of proof, even if they didn't know that. But, why didn't you talk to some teachers about the way that students are sorted?

Here's my take on the CREDO study.
http://www.schoolsmatter.info/2012/12/the-hoover-institutes-amazing.html

You might not think it's funny, but I wish you had noted the extreme difference between the study's spin and the actual findings that are scattered through it.

Also, why didn't you mention this?
http://www.washingtonpost.com/blogs/answer-sheet/wp/2012/12/23/the-fund…

I'd be sincerely interested in your (and the other researchers') take on the Florida chart and the difference between teachers' projected and actual value-added. To theorists, the gap might be small. To practitioners, and for policy analysis it is huge. If every year, you see a colleague's career being ruined - simply due to imprecise models, even if the advocates of those models say that the imprecision is small to them, how long before you say "take this job and shove it." Worse, that's an average but value-added is most invalid for high-poverty schools and, probably, neighborhood secondary schools where peer effects are most negative. If that is the average inaccuracy, in the tough secondary schools, Florida is going to see an exodus of teaching talent to the schools where it is easier to raise test scores.

And that reminds me, why did you not emphasize C. Kirabo Jackson's findings? The negative implications for value-added in that paper dwarf the policy implications of the pro-value-added papers. After all, they are already firing high school teachers before asking whether value-added can be made valid for them.

And, that also raises the question of the clear political bias of ostensibly scholarly research. You don't see a pattern to the way the pro-vam side spun their findings and downplayed the evidence, scattered through their papers, that argued against their predetermined preferences?

Ooops! I'm learning the danger of writing long comments in the Washington Post and then cutting and pasting them to here. When writing a long comment, with a link, the words bounce up and down, making it impossible for me to read what I wrote. Good thing I later read what I actually wrote before others read it and assumed I'm nuts.

The chart was an EXAMPLE, not an average. I meant to ask whether gaps that seem small to a researcher would be seen differently if it was their career in jeopardy.

But, the point is the same. The size of errors makes a huge difference if you are talking reality, not theory. If a researcher gets to within 5%, for example, that's great. But, would you accept a 5% PER YEAR risk that your career would be damaged or destroyed by such an error? What about 8%? or 15% PER YEAR? What are the policy implications of imposing value-added evaluations across the nation because inaccuracies can be as small as 5%, even though the model can also be about as accurate as a coin flip? THAT is the burden of proof that those researchers should have assumed.
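
The "per year" framing in the comment above can be made concrete with a little arithmetic. What follows is only a rough sketch. It assumes that the chance of an erroneous rating is the same every year and independent from one year to the next, which none of the research discussed here establishes. The function name and the ten-year horizon are hypothetical, chosen for illustration, and the 5%, 8%, and 15% figures are simply the ones the commenter mentions, not estimates from any study.

```python
def risk_of_at_least_one_error(annual_risk: float, years: int) -> float:
    """Chance of at least one erroneous rating over a number of years.

    Assumption (for illustration only): the annual risk is constant and
    independent across years, so the chance of never being misrated is
    (1 - annual_risk) ** years.
    """
    return 1.0 - (1.0 - annual_risk) ** years


# Hypothetical annual error risks taken from the comment above.
for annual_risk in (0.05, 0.08, 0.15):
    cumulative = risk_of_at_least_one_error(annual_risk, years=10)
    print(f"{annual_risk:.0%} per year -> {cumulative:.0%} over 10 years")
```

Under these assumptions, even a small annual risk adds up over a career. If errors are correlated across years, for example because the same teachers tend to be misrated repeatedly, the cumulative figure for any one teacher would differ from what this sketch prints.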