In the three most discussed and controversial areas of market-based education reform – performance pay, charter schools and the use of value-added estimates in teacher evaluations – 2013 saw the release of a couple of truly landmark reports, in addition to the normal flow of strong work coming from the education research community (see our reviews from 2010, 2011 and 2012).*
In one sense, this building body of evidence is critical and even comforting, given not only the rapid expansion of charter schools, but also and especially the ongoing design and implementation of new teacher evaluations (which, in many cases, include performance-based pay incentives). In another sense, however, there is good cause for anxiety. Although one must try policies before knowing how they work, the sheer speed of policy change in the U.S. right now means that policymakers are making important decisions on the fly, and there is a great deal of uncertainty as to how this will all turn out.
Moreover, while 2013 was without question an important year for research in these three areas, it also illustrated an obvious point: Proper interpretation and application of findings is perhaps just as important as the work itself.
After something of a lull since the release of seminal analyses a few years ago, the research on teacher performance pay may be picking up again, as several states are in the earlier stages of implementing various types of incentives along with their new evaluation systems. This year, the first shot was fired – a study of D.C.’s teacher evaluation system (IMPACT), by which teachers receiving “highly effective” ratings are eligible for large salary increases.
Although the study’s performance pay-relevant component focused solely on the (relatively small) group of teachers near the “highly effective” threshold, the researchers did find an association between receiving an IMPACT score near that threshold and improvement in those scores the following year, at least in the second year of the system’s implementation (the researchers found no such relationship with retention). This suggests that teachers may have responded to the incentives embedded in the system – and that they were actually able to affect their scores (see our post on this here) – a conclusion not reached by most prior studies of teacher pay incentives in the U.S. (there were also results presented for teachers near the “ineffective” threshold).
(Similarly, a conference paper found that teacher incentives in Denver may in fact influence teachers’ mobility decisions [e.g., switching schools], but not always in predictable ways.)
Moving on, although it did not examine the “traditional” notion of performance pay, there was also an experimental Mathematica/IES evaluation of a program awarding large compensation increases to high-value-added teachers who transferred to low-performing schools. The results indicate that teachers who accepted the bonuses did improve student test results in elementary but not middle schools. This approach to equalizing the distribution of teacher quality (at least insofar as value-added measures that quality) requires further research, but this was an important step.
Thus, as was the case last year, the potential of performance pay to improve teacher performance remains a somewhat open question. Proponents of these incentives argue that their true purpose is less to incentivize short-term improvement (e.g., increased effort) than to attract “better candidates” to the profession, and this is an outcome that will likely take many years to occur, and even more time to examine empirically. In the meantime, the reactions to the IMPACT study in particular (from opponents and supporters alike) highlight the need to maintain a balanced and appropriate view of these types of findings and their generalizability to other contexts, as many more of these analyses will be released over the next few years, and will hopefully play a useful role in guiding policy.
Moving on to the area of charter schools, the 2013 landscape was almost completely dominated by CREDO, the Stanford-based charter school research organization responsible for the influential 2009 study of charters in 16 states.
CREDO’s first big 2013 report was an interesting analysis of charter school networks (e.g., charter management organizations, or CMOs) and how they expand. The results suggested, among other things, that CMOs vary quite substantially in terms of the test-based performance of the schools they operate, and that performance during the first few years of a school’s operation is a decent predictor of future performance (good or bad). These findings may suggest the importance of choosing carefully which schools to open (and, perhaps, close).
Later in the year, however, CREDO dropped a bombshell by updating and expanding its 2009 national report, and releasing an analysis of charters in 27 states, which together serve over 90 percent of the nation’s charter students (here is our post on the report).
As was the case in the 2009 study, CREDO found no meaningful difference in test-based performance between charter and regular public schools across the entire 27-state sample (the estimated effect was not even statistically significant in math). There was, however, considerable variation between states, with large and positive charter effects in several states (and virtually none with negative estimated impacts), as well as variation by student subgroup – charters were also more effective, on average, with certain student groups (e.g., low-income students), though most of the discrepancies were not large.
Moreover, CREDO found that the charters included in the 2009 study had improved their relative effectiveness since that time, but the difference was very modest, and seemed due mostly to a slight decrease in the test-based effectiveness of the regular public schools to which charters are compared.
This important report was in many respects a large-scale reaffirmation of what the (quite extensive) literature on charter schools has long suggested – that these schools vary, within and between states, in their test-based impact vis-à-vis comparable regular public schools.
Researchers also continued to pay special attention to the handful of models that have been proven effective. This year, for instance, Mathematica updated its national evaluation of KIPP middle schools, expanding the sample from about 20 to almost 40 schools, and once again finding positive, statistically significant and educationally meaningful impacts overall (see our post on this report). They also checked out whether variation in KIPP schools’ effects was associated with any concrete characteristics (around one in five KIPP schools did not get significant results), and found better results among schools that offered a longer school day/year, as well as those reporting more rigorous schoolwide behavior regimes.
On a similar note, a new working paper analyzed the medium-term impacts of enrollment in the Harlem Children’s Zone in New York City. The researchers found significant beneficial effects of attendance on outcomes such as college attendance, incarceration and teenage pregnancy.
Stepping back for a moment, it has been about 20 years since charter schools came on the U.S. public education landscape. The sector has grown consistently, and the pace of expansion seems to be picking up. Yet, despite a rather well-developed body of evidence on these schools, which expanded considerably in 2013, the controversy has not abated.
In reality, what these schools have taught us is that performance – at least test-based performance – is less a function of what schools are (i.e., charter versus regular public) than what they do (i.e., policies and practices). It is now more clear than ever that the time has come to move past these “horse race” studies to figuring out which policies and practices are associated with better and worse outcomes. The available evidence to date, though still very tentative, suggests that the small group of charter models that get consistently positive results tend to be those that offer massive amounts of additional time (sometimes as much as 40-45 percent more), employ intensive disciplinary policies, spend more money, and, most vaguely, maintain a “focus on student achievement.” And these may be lessons upon which all sides of the contentious charter school debate can agree.
As usual, however, of the three research areas in this annual review, teacher value-added was the most active. The biggest news this year was the release of the final Measures of Effective Teaching (MET) report, the multi-million dollar project designed to provide guidance as to how states should design new teacher evaluation systems (systems that, unfortunately, many states and districts have already finalized).
Here are just some of the important reports from MET that were released this year: a wonderfully thorough technical report on how different measures match up with each other, and how different weighting scenarios produce different results; an analysis of the reliability of classroom observations conducted by school personnel (i.e., principals); and the results of different measures when students are randomly assigned to teachers.
As with the IMPACT and CREDO reports, however, reactions to this seminal study tended to fall along pre-existing policy preference lines. Most notably, the MET press release offered very questionable interpretations, including the bold statement that the project “demonstrated that it is possible to identify great teaching.”
In reality, while a full discussion of the MET results is beyond the scope of this post, suffice it to say that the findings are not appropriate for drawing any grand conclusions about whether it’s possible to “measure good teaching,” but the project most certainly provided an unprecedented wealth of policy-relevant findings on teacher evaluation, not to mention a cache of data that will keep researchers busy for years.
The American Institutes for Research (AIR) released a series of three papers that also addressed very concrete issues faced by states and districts designing evaluation systems, including the choice and incorporation of multiple measures, the use of schoolwide value-added scores in these systems, and the important (and underdiscussed) issue of how to link teachers and students in situations where students enter the school mid-year, have multiple teachers, etc. (also check out this related 2013 conference paper looking at whether student roster verification might improve the stability of value-added estimates).
In addition, as in previous years, there were several new papers addressing whether and how these models’ results differ by context. For example, a CALDER working paper found little difference in performance trajectories over time between teachers in high- and low-poverty schools (on a related note, a different CALDER analysis focused on teachers’ early career improvements, and the degree to which this predicted future value-added).
There was also a conference paper (opens in Word) comparing teachers’ test-based effectiveness with English language learners versus that with fully English proficient students – on the whole, teachers who received strong value-added scores with one group tended to receive good scores with the other, though there did appear to be some teachers who were differentially effective with one or the other group.
Researchers from Mathematica issued a useful analysis of how value-added models might address bias from tracking, and they concluded that two classroom-level indicators (the mean and standard deviation of achievement) may help mitigate the problem among middle school math teachers, while tracking indicators may have the same effect for high school reading teachers. A different Mathematica report found that a common technique (called “shrinkage”), by which value-added estimates are adjusted based on sample size, improves the precision of results for teachers of “difficult to predict” students (e.g., those who receive subsidized lunch or have low prior achievement).
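For readers unfamiliar with “shrinkage,” the basic idea can be illustrated with a minimal sketch. This is not the procedure used in the Mathematica report, whose details are not described here; it is a generic empirical Bayes-style adjustment, and the function name, parameter names and variance values below are all hypothetical. The sketch assumes that each teacher’s raw estimate is pulled toward the overall average, with the pull being stronger when the estimate is based on fewer students.

```python
def shrink_estimate(raw_estimate, n_students, var_teacher, var_noise, grand_mean=0.0):
    """Pull a raw value-added estimate toward the grand mean.

    All names and inputs are illustrative assumptions:
      var_teacher - assumed variance of true teacher effects
      var_noise   - assumed student-level error variance
    """
    # Sampling variance of the raw estimate falls as the sample grows.
    sampling_var = var_noise / n_students
    # Reliability weight between 0 and 1: closer to 1 with more students.
    reliability = var_teacher / (var_teacher + sampling_var)
    # Weighted average of the raw estimate and the grand mean.
    return grand_mean + reliability * (raw_estimate - grand_mean)


# Same raw estimate, different numbers of students (made-up values).
small_class = shrink_estimate(0.30, n_students=10, var_teacher=0.04, var_noise=0.36)
large_sample = shrink_estimate(0.30, n_students=100, var_teacher=0.04, var_noise=0.36)
print(round(small_class, 3), round(large_sample, 3))
```

In this toy example, the teacher with only 10 students has an estimate pulled roughly halfway back toward the average, while the estimate based on 100 students barely moves. The trade-off is that adjusted estimates are less noisy but are, by design, biased toward the mean.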
A third and final Mathematica report compared the results of value-added models proper with those from the Colorado Growth Model (which many states are using in their evaluation systems). In line with previous research, the researchers concluded that the estimates from the different models were highly correlated, but that the choice of model does mediate the relationship between teachers’ scores and the students they teach. For instance, the Colorado Growth Model gave lower scores to teachers with more English language learners, and higher scores to those with students with lower prior achievement (see 2013 conference paper on issues surrounding the use of these growth percentile models).
(It is also worth taking a look at this interesting working paper by Kirabo Jackson, who uses a proxy measure for student non-cognitive skills, and finds that teachers’ effects on this measure not only predict future outcomes such as dropout, SAT-taking and college plans, but also that strictly test-based effects may fail to identify many of these effective teachers.)
Lastly, the Carnegie Knowledge Network continued to release its series of short, accessible knowledge briefs by reputable scholars, with seven new briefs issued in 2013. They are all very well done.
When it comes to research on teacher value-added, then, given the landmark MET project, along with the steady flow of strong papers chipping away at the technical and contextual aspects of these models and their use in teacher evaluations, 2013 was by all accounts a very productive year. At this point, however, the big questions of how these estimates should be used in evaluations and other policies (if at all) will be largely determined by how teachers on the ground, in multiple locations, respond to their use over the next 5-10 years.
Overall, then, the research on market-based education reform continues to build (though it struggles, to some extent inevitably but also due to the pace of change, to keep up with the policy). Going forward, particularly in the areas of performance pay and value-added in teacher evaluations, it will be very important for advocates, policymakers and the public to maintain a level head when interpreting the results of all this research, and when using it to guide design and implementation.
- Matt Di Carlo
* Standard disclaimer: These three areas are not the only types of policies that might fall under a “market-based” umbrella (they are, however, arguably the “core” components, at least in terms of how often they’re discussed by advocates and proposed by policymakers). In addition, the papers discussed here do not represent a comprehensive list of all the research in these areas during 2013. It is a selection of high-quality, mostly quantitative analyses, all of which were actually released during this year (i.e., this review doesn’t include papers released in prior years, and published in 2013). This means that many of the papers above have not yet been subject to peer review, and should be interpreted with that in mind.