In a previous post, I discussed the idea of “attracting the best candidates” to teaching by reviewing the research on the association between pre-service characteristics and future performance (usually defined in terms of teachers’ estimated effect on test scores once they get into the classroom). In general, this body of work indicates that, while far from futile, it’s extremely difficult to predict who will be an “effective” teacher based on their paper traits, including those that are typically used to define “top candidates,” such as the selectivity of the undergraduate institutions they attend, certification test scores and GPA (see here, here, here and here, for examples).
There is some very limited evidence that other, “non-traditional” measures might help. For example, a working paper, released last year, found a statistically discernible, fairly strong association between first-year math value-added and an index constructed from surveys administered to Teach for America candidates. There was, however, no association in reading (note that the sample was small), and no relationships in either subject found during these teachers’ second years.*
A recently published paper – which appears in the peer-reviewed journal Education Finance and Policy, and was originally released as a working paper in 2008 – represents another step forward in this area. The analysis, presented by the respected quartet of Jonah Rockoff, Brian Jacob, Thomas Kane, and Douglas Staiger (RJKS), attempts to look beyond the set of characteristics that researchers are typically constrained (by data availability) to examine.
In short, the results do reveal some meaningful, potentially policy-relevant associations between pre-service characteristics and future outcomes. From a more general perspective, however, they are also a testament to the difficulties inherent in predicting who will be a good teacher based on observable traits.
The paper’s authors focused on a sample of a few hundred new elementary school math teachers in New York City in 2006-07, giving them a battery of surveys and tests that attempted to measure (albeit imperfectly) different cognitive and non-cognitive skills.
They then looked at the associations between the subsequent performance of these teachers (as measured by value-added estimates and a few other outcomes) and both their survey results and their more “traditional” qualifications, such as certification status and exam scores. The overarching idea was to see if they could somehow combine these measures in a way that would more accurately predict teacher performance, based on information that is, or might be, gathered before hiring.
RJKS collect an impressive set of information on the teachers in their sample. In addition to traditional measures (such as undergraduate major, SAT scores, college selectivity, and ability to pass the certification exam on the first try), the data include survey/test information pertaining to personality, efficacy, knowledge of math subject matter, and beliefs and values. As would be expected, there are too many results to mention here, but I’ll summarize some of them briefly before discussing how and why I think they’re important.
Consistent with prior research, the study found that the relationships between the “traditional” measures and value-added are somewhat small and inconsistent. Because of the small sample, the convention of statistical significance (which depends on sample size) makes interpretation a bit complicated, but virtually none of the associations were statistically discernible at even the most relaxed levels of confidence. A couple worth mentioning: There was a slightly negative estimated association for NYC teaching fellows (relative to regularly certified teachers; previous research here), a slightly positive association among Teach for America corps members (see here for one of many papers on this topic), and a very imprecisely estimated positive association among math majors (a related paper here).
(Note: RJKS also look at the associations between these indicators and other outcomes, most notably subjective evaluations by mentors [see this paper about the program]. Mostly, these were also too imprecise to be useful, but it’s worth noting that there was a strong and statistically discernible [even by strict conventions] relationship between the selectivity of undergraduate institution and mentors’ scores.)
Moving on to the more interesting results – those for the “non-traditional” measures – RJKS found that the relationships between math value-added and their measures of personality and other traits were all positive, although they varied widely in terms of size and statistical significance. Consistent with some prior research (also here), there was a moderately strong, marginally significant relationship between math content knowledge and value-added, in addition to a similar (in size/significance) estimated effect of “personal efficacy” (e.g., belief in one’s ability to affect student learning).
Once again, there were more clear-cut results between these indicators and mentors’ assessments, with positive, statistically discernible associations for “conscientiousness” (e.g., organization, self-discipline), “extraversion” (e.g., talkativeness) and, again, “personal efficacy.”
The final major part of the analysis focused on “combining” all the measures discussed above (which were each examined separately in terms of their association with value-added, observation scores, and other outcomes not discussed) in a manner that could better predict teacher effectiveness. I won’t bore you with the details of factor analysis, but RJKS essentially analyzed the associations discussed above to discern underlying patterns, which they then used to construct two indexes: one measuring “cognitive” skills; and another gauging “non-cognitive” skills.
In short, both the cognitive and non-cognitive indexes maintain a moderate, statistically significant association with math value-added, but only the non-cognitive index is related to mentor evaluations. For instance, a one standard deviation increase in both indexes (e.g., the difference between the median teacher and one at the 84th percentile) is associated with an increase in student math scores of roughly 0.025 standard deviations.
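For readers curious where the "84th percentile" figure comes from, here is a minimal sketch of the arithmetic. It assumes the index scores are roughly normally distributed, which is my assumption for illustration and not something taken from the paper; the function and variable names are likewise just illustrative.

```python
import math


def normal_cdf(z: float) -> float:
    """Share of a standard normal distribution that falls below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Assumption: scores are approximately normal, so the median teacher sits at z = 0
# and a teacher one standard deviation above the median sits at z = 1.
percentile_at_plus_one_sd = 100.0 * normal_cdf(1.0)

print(f"{percentile_at_plus_one_sd:.1f}")  # 84.1
```

Under that assumption, a one standard deviation gap is the distance between the 50th and roughly the 84th percentile, which is the comparison used above.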
So what are we to make of these findings? As usual, it’s important to note a few caveats. The first is that the teachers included in this analysis had already been hired, which means that the results might suffer from selection bias – i.e., the findings might have been different had this analysis included all job applicants, as would have to be the case for these results to apply directly to hiring policy. Second, due to a logistical problem, the surveys were administered in the middle of the school year, and the results might have been affected by the fact that the new teachers had already put in some classroom time.
That said, the findings do suggest that there is potential for improvement in the kind of information that most districts gather and use during the hiring process. For example, the predictive power of both the cognitive and non-cognitive indexes was greatly enhanced by the addition of the “non-traditional” surveys and tests administered by the researchers. This idea – that there is more information out there that can improve selection – confirms prior work; additional data could significantly improve districts’ ability to screen applicants. This is the paper’s primary finding.
On the other hand, the associations in this paper – to the degree they “hold up” for all applicants – are not exactly clear-cut. The effect sizes, specifically for the value-added outcomes, are rather modest, not to mention imprecisely estimated (mostly due to the small sample). It’s certainly fair to say that the associations are large enough to be considered meaningful, and that they appear to be rather more powerful than the standard observable characteristics currently used (which is important). But they’re modest nonetheless, and it’s impossible to know whether the results would hold up in other subjects or in future years, neither of which was the case in the TFA paper mentioned above. In addition, none of this provides any insight into the related question of how to attract candidates.
It may very well be that it is worth the cost and effort to gather additional information on teacher applicants, compared with allocating resources to other strategies (including relying on performance during a teacher’s first few years). That is an empirical question, one that cannot be addressed directly by this analysis. But there’s no doubt that improving screening processes has potential – it’s not an impermeable shell.
From a more general, less policy-oriented perspective, though, consider that the authors of this paper collected a mass of data on each teacher – personality surveys, credentials, undergraduate performance, etc. – and subjected these data to exhaustive, sophisticated analysis. And yet, even with all this information, they were able to explain only a very small proportion of the total variation in new teachers’ test-based effectiveness. This hammers home how difficult it is to use pre-service characteristics to predict classroom performance, at least as measured by math tests.**
There’s just something about skillful teaching – as well as, perhaps, the difficulty in measuring teaching quality – that seems to elude even the best tools of personnel policy and econometrics.
- Matt Di Carlo
* The composite TFA index comprised several individual measures, and the overall association in first-year math seems to have been driven by indicators of achievement, leadership and perseverance. The fact that the association seems to dissipate completely in the second year carries interesting implications (which the author does not discuss at all beyond stating the finding). Nevertheless, while this paper is as yet unpublished, the relatively large size of the first-year math association suggests that TFA’s methods merit further examination.
** One of the interesting side questions hanging over a lot of these types of analyses is whether the findings among teachers – wide variation in test-based performance, imperfect ability to predict it, etc. – might also be the case among other professionals. One would suspect that this is indeed the case, though I’m sure there are variations by degree.