The Persistence Of Both Teacher Effects And Misinterpretations Of Research About Them

In a new National Bureau of Economic Research working paper on teacher value-added, researchers Raj Chetty, John Friedman and Jonah Rockoff present results from their analysis of an incredibly detailed dataset linking teachers and students in one large urban school district. The data include students’ testing results between 1991 and 2009, as well as proxies for future student outcomes, mostly from tax records, including college attendance (whether they were reported to have paid tuition or received scholarships), childbearing (whether they claimed dependents) and eventual earnings (as reported on the returns). Needless to say, the actual analysis includes only those students for whom testing data were available, and who could be successfully linked with teachers (with the latter group of course limited to those teaching math or reading in grades 4-8).

The paper caused a remarkable stir last week, and for good reason: It’s one of the most dense, important and interesting analyses on this topic in a very long time. Much of the reaction, however, was less than cautious, particularly in the way the findings were interpreted as supporting specific policy prescriptions (also see Bruce Baker’s excellent post).

What this paper shows – using an extremely detailed dataset and sophisticated, thoroughly-documented methods – is that teachers matter, perhaps in ways that some didn’t realize. What it does not show is how to measure and improve teacher quality, which are still open questions. This is a crucial distinction, one which has been discussed on this blog numerous times (also here and here), as it is frequently obscured or outright ignored in discussions of how research findings should inform concrete education policy.

In addition to the standard finding that teacher effects on test scores vary widely, Chetty, Friedman and Rockoff report two general sets of results. The first pertains to the well-known possibility that value-added and other growth model estimates are biased by non-random classroom assignment. That is, whether some teachers are assigned students with unobserved characteristics (e.g., behavioral issues) that are both associated with testing gains and not picked up by the models. If so, this may mean that some teachers are unfairly penalized (or rewarded) based on the mix of students they get. An influential 2007 paper (published in 2009) by economist Jesse Rothstein found that there may in fact be such bias, and that it might be substantial (also see here).

The actual magnitude of this problem – and what to do about it – has been the subject of several recent papers. For example, this analysis used data in which students were in fact randomly assigned to classes, and concluded that the bias was minimal (though the samples were too small to rule it out). A different paper, responding directly to Rothstein, concluded that sophisticated models fit with multiple years of data could reduce the problem to what the authors interpreted as relatively low levels.

Chetty, Friedman and Rockoff contribute to the debate by devising a clever new way to test for bias from classroom assignment. Exploiting the movement of teachers between grades and schools, they essentially predict student gains in these teachers’ new schools/grades based on their estimated effects in their old schools/grades (i.e., from the previous year(s)). Put simply, their predictions hold up well enough that bias from sorting on unobserved characteristics does not appear to be a critical problem.*

There are two policy implications of this finding. First, while it obviously doesn’t close the book on systematic bias from classroom assignment, it does represent further evidence that this bias can be reduced, perhaps to tolerable levels, when models are sufficiently complex and are fit with multiple years of data (which is not the case in many places). Note, however, that low overall bias does not preclude substantial bias in many individual teachers’ estimates (and the important issue of random error remains). Second, the researchers’ technique – using teacher movement between schools and grades – provides a relatively easy way for states and districts to test their own estimates for bias using the data already available to them (and it will be very interesting to see how many actually bother to do so); a rough sketch of what such a check might look like appears below.
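For readers who want a concrete sense of that kind of check, here is a minimal sketch in Python. It is emphatically not the authors’ estimation code: the column names (school, grade, year, va_leave_out, mean_score_gain), the simple cohort-to-cohort differencing, and the bare-bones OLS are all illustrative assumptions, and a real implementation would require properly estimated leave-year-out value-added and a much richer set of controls.

```python
# Minimal, illustrative sketch of a switching-based forecast-bias check,
# loosely in the spirit of the Chetty/Friedman/Rockoff quasi-experiment.
# NOT the authors' code: column names and the bare-bones OLS are assumptions.
import pandas as pd
import statsmodels.api as sm

def forecast_bias_check(df: pd.DataFrame):
    """df: one row per teacher-school-grade-year, with a leave-year-out
    value-added estimate ('va_leave_out') and the mean score gain of that
    teacher's students in that year ('mean_score_gain')."""
    # Collapse to school-grade-year cells (the level at which teacher
    # switching changes the average quality of instruction a cohort receives).
    cell = (df.groupby(["school", "grade", "year"])
              .agg(mean_va=("va_leave_out", "mean"),
                   mean_gain=("mean_score_gain", "mean"))
              .reset_index()
              .sort_values(["school", "grade", "year"]))

    # Year-over-year changes within each school-grade cell; these changes are
    # driven largely by teachers entering or leaving the cell.
    cell["d_va"] = cell.groupby(["school", "grade"])["mean_va"].diff()
    cell["d_gain"] = cell.groupby(["school", "grade"])["mean_gain"].diff()
    cell = cell.dropna(subset=["d_va", "d_gain"])

    # Regress the change in cohort gains on the change implied by teacher
    # value-added. A slope near 1 is consistent with low forecast bias.
    X = sm.add_constant(cell["d_va"])
    return sm.OLS(cell["d_gain"], X).fit(
        cov_type="cluster", cov_kwds={"groups": cell["school"]})
```

Under this setup, a slope close to one on the change in mean value-added is the pattern consistent with low forecast bias; a slope well below one would suggest that the estimates systematically over-predict gains when teachers switch grades or schools.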

But it was the paper’s second general set of findings that got most of the attention – those pertaining to the longer-term benefits of teacher effectiveness. For this analysis, the authors exploit the incredible detail of their dataset to calculate associations between teacher effects (i.e., the testing gains among their students) and various future outcomes.

In short, the results suggest that teachers’ effects are persistent and pervasive – that having more effective teachers (as measured by test scores) is associated with small but discernible improvements in wellbeing later in students’ lives.

For instance, a one standard deviation increase in teacher value-added (in just one class) is associated with a slightly steeper earnings trajectory that ends up at a one percent increase in the average student’s annual earnings at age 28 (about $200), and a 0.5 percentage point increase in the likelihood of attending college [reported tuition or scholarships] by age 20 (the mean probability is about 38 percent). (More accurately, students with higher-VA teachers actually earn slightly less during their early 20s, presumably because they are more likely to attend college, but the association becomes positive and statistically significant at age 26.)
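To put those figures on a common footing, here is the back-of-the-envelope arithmetic implied by the numbers quoted above (the roughly $20,000 earnings base is an implication of those figures, not a number taken directly from the paper):

```latex
% Back-of-the-envelope arithmetic from the figures quoted above.
\[
  0.01 \times \bar{y}_{28} \approx \$200
  \quad\Longrightarrow\quad
  \bar{y}_{28} \approx \$20{,}000
\]
\[
  \Delta\Pr(\text{college by age 20}) \approx 0.5\ \text{percentage points on a base of } 38\%,
  \quad\text{i.e., roughly a } \tfrac{0.5}{38} \approx 1.3\%\ \text{relative increase}
\]
```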

There were also small positive associations between having a more effective teacher (in terms of value-added) and other outcomes, including a lower likelihood of teenage pregnancy (i.e., claiming a dependent during teenage years), higher “quality” of college attended (as measured by self-reported earnings of previous graduates of that institution) and higher “quality” of the neighborhood in which one lives (as measured by the proportion of college graduates within ZIP codes).

As always, one should interpret these results cautiously – they only apply to a small subset of teachers/students in one district, and they only test these relationships for one particular type of measure (value-added, with all the limitations it entails) – but they do suggest that teacher effects may be longer-lasting – and affect a broader range of outcomes – than has been demonstrated previously.

For instance, prior research has shown that teacher effects on test scores “decay” rapidly (put simply, students don’t retain much of what they learn). This analysis also finds evidence of significant “fade out,” but, using the unusually long time span in the data, Chetty, Friedman and Rockoff conclude that the decline stabilizes after three years, at which point about one-third of the original impact remains. In other words, a meaningful portion of the achievement gains seems to persist into later life.**
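As a rough illustration of what such a persistence profile looks like, the sketch below regresses students’ later test scores on their original teacher’s estimated value-added, one horizon at a time. The column names (va_orig, score_year_h) and the bare OLS are hypothetical placeholders; the paper’s actual estimates condition on a rich set of prior-score and demographic covariates.

```python
# Illustrative sketch of a test-score "fade-out" profile. NOT the paper's
# estimation code: column names (va_orig, score_year_h) are hypothetical,
# and real estimates would include prior-score and demographic controls.
import pandas as pd
import statsmodels.api as sm

def persistence_profile(df: pd.DataFrame, horizons=(1, 2, 3, 4)) -> pd.Series:
    """For each horizon h (years after the original class), regress the
    student's score in year h on the original teacher's value-added.
    The sequence of coefficients traces out how much of the initial impact
    survives; in the paper's data the profile levels off near one-third
    of the original effect after roughly three years."""
    coefs = {}
    for h in horizons:
        y = df[f"score_year_{h}"]
        X = sm.add_constant(df["va_orig"])
        coefs[h] = sm.OLS(y, X, missing="drop").fit().params["va_orig"]
    return pd.Series(coefs, name="persistence")
```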

The policy implications of this second set of findings, however, are far from clear. The fact that teachers matter is not in dispute. The issues have always been how to measure teacher effectiveness at the individual-level and, more importantly, whether and how it can be improved overall.

On the one hand, the connection between value-added and important future outcomes does suggest that there may be more to test-based teacher productivity measures – at least in a low-stakes context – than was previously appreciated. In other words, to whatever degree the findings of this paper can be generalized, these test-based measures may in fact be associated with long-term desired outcomes, such as earnings and college attendance. There is some strong, useful signal there.

On the other hand, this report’s findings do not really address important questions about the proper role for these estimates in measuring teacher “quality” at the individual level (as previously discussed here), particularly the critical details (e.g., the type of model used, addressing random error) that many states and districts using these estimates seem to be ignoring. Nor do they assess the appropriate relative role of alternative measures, such as principal observations, which provide important information about teacher effectiveness not captured by growth model estimates.

Most importantly, the results do not really speak directly to how teacher quality is best improved, except insofar as they add to the body of compelling evidence that teachers are important and that successful methods for improving teacher quality – if and when they are identified and implemented – could yield benefits for a broad range of outcomes over the long term.***

Most of the popular current proposals for quality improvement – such as performance pay, better recruitment, and improvements in teacher preparation and development – are still works in progress, and most have scant or no evidence as to how (and whether) they can serve to improve educational outcomes (test-based and otherwise). The results presented by Chetty, Friedman and Rockoff do not tell us much, if anything, about the desirability (or lack thereof) of any of these interventions.

These are the important questions at this point – not whether teachers matter (they do, and this paper is a big contribution in this area), but how to measure individual teacher quality, and how to improve it overall.

It’s also critical to note that this analysis does not account for the (very real) possibility that high-stakes implementation of policies using VA estimates might alter teachers’ behavior, or the supply of candidates into the profession. If, for example, increasing the stakes compels teachers to teach to the test, then the connection between teacher effects (and test results in general) and future outcomes such as earnings may be compromised, perhaps severely.

Interestingly, the authors themselves raise some of these concerns – specifically, high-stakes implementation and whether the cost of errors from using VA outweighs the benefits – as “important issues” that “must be resolved before one can determine whether VA should be used to evaluate teachers.” [emphasis mine]

This appropriately cautious conclusion stands in stark contrast with the fact that most states have already decided to do so. It also indicates that those using the results of this paper to argue forcefully for specific policies are drawing unsupported conclusions from otherwise very important empirical findings.

- Matt Di Carlo

*****

* This test, as usual, relies on a few (not insignificant) assumptions – e.g., that the “quality” of students doesn’t vary between adjacent cohorts, and that teachers don’t change schools/grades in a manner that is correlated with student characteristics. On a separate, related note, the researchers also exploit their unusually detailed dataset to examine whether there is bias from classroom assignment based on observable characteristics, most notably parents’ income (which is typically proxied by the crude free/reduced-price lunch variable). The authors conclude that the bias is not especially severe.

** Since this paper is so dense, here are a few other selected findings of note: Variation in teacher effects is about 50 percent greater in math than in reading, but the long-term association of teacher-induced gains with future outcomes (e.g., earnings, college attendance) is larger for reading than for math; for some outcomes, the benefits of teacher effects are stronger for students from high-SES backgrounds, which suggests a complementary role for family inputs; and teacher effects are a bit stronger for girls than for boys, though this difference was only marginally statistically significant.

*** Chetty, Friedman and Rockoff present a couple of “proposals” for improving teacher quality. One of them is the infamous simulation in which teachers are dismissed based exclusively on test scores. As discussed here, it is inappropriate to interpret this simulation as evidence supporting an actual policy proposal. Not only does it not apply to the vast majority of teachers (who are not in tested subjects/grades), but it is based on several somewhat questionable assumptions, including (but not limited to): that there would be no harm from increasing turnover; that replacement teachers would be of average quality; and that the policy would have no effect on the “type of person” who enters the teaching profession and whether they stay. It’s also worth noting that the simulated “benefits” of this “policy,” most notably the effect it would have on future earnings, are cut by as much as half due to random error from small samples, and such error would be highest among newer teachers. Overall, the test-based firing simulation is appropriately viewed as a stylized illustration of the wide variation in teacher effects, not an evaluation of a real policy intervention.


Good stuff, Matt - from both you and Bruce.

One question: should we be concerned about the homogeneity of the sample? In other words, doesn't the variability in both test scores and earnings affect the correlations between the two?

If the data set were more varied - say, a county-wide area that included a poor urban district, middle class suburbs, and wealthy exurbs - would the teacher effects still hold up? And would they hold up equally in each of these different areas?

I'm quite curious to know what this "urban district" is, because I wonder if the lessons learned here are applicable to a wide number of districts. Would a city like Portland yield different results than a city like Detroit?