Relationship Counseling

A correlation between two variables measures the strength of the linear relationship between them. Put simply, two variables are positively correlated to the extent that individuals with relatively high values on one measure tend to have relatively high values on the other (and those with low values on one tend to have low values on the other), and negatively correlated to the extent that high values on one measure are associated with low values on the other.
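For readers who want to see the arithmetic, here is a minimal sketch of the standard Pearson correlation coefficient, the usual statistic behind the word "correlation." The function name and the small score lists are made up for illustration; they do not come from any of the studies discussed here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the means (unscaled covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the summed squared deviations (unscaled std. deviations).
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (sd_x * sd_y)


# Hypothetical scores for five teachers (illustrative only).
observation_scores = [2.1, 2.8, 3.0, 3.4, 3.9]
value_added_scores = [0.3, -0.1, 0.2, 0.0, 0.4]
print(round(pearson_r(observation_scores, value_added_scores), 2))
```

The result always falls between -1 and 1: values near either end indicate a strong linear relationship, and values near zero indicate a weak one.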

Correlations are used frequently in the debate about teacher evaluations. For example, researchers might assess the relationship between classroom observations and value-added measures, which is one of the simpler ways to gather information about the “validity” of one or the other – i.e., whether it is telling us what we want to know. In this case, if teachers with higher observation scores also tend to get higher value-added scores, this might be interpreted as a sign that both are capturing, at least to some extent, "true" teacher performance.

Yet there seems to be a tendency among some advocates and policy makers to get a little overeager when interpreting correlations.

For instance, in education circles, you'll sometimes hear that principal observations "predict" or "match up with" value-added scores (see here, here, here and here). It is true that most analyses have found a significant positive association between these two measures -- and, as discussed below, that is meaningful -- but correlations are a matter of degree (the term "significant" doesn't carry the same meaning in statistics as in everyday conversation; it indicates only that an association is unlikely to be due to chance, not that it is large). The oft-cited Measures of Effective Teaching (MET) Project looked at the relationship between value-added and scores on the observation protocols they tested. They found correlations ranging from about 0.10 to 0.35, and this squares with other studies that have carried out similar comparisons (see here and here, for example). These values are conventionally interpreted as weak to modest.
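One common way to get a feel for what correlations of this size mean is to square them. The squared correlation is the share of the variation in one measure that is linearly accounted for by the other. The sketch below simply applies that standard rule to the two endpoints reported above; the function name is an illustrative choice, and this is a back-of-the-envelope aid, not a calculation from the MET report itself.

```python
def shared_variance(r):
    """Proportion of variance in one measure linearly accounted for by the other."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Endpoints of the correlation range cited in the text.
for r in (0.10, 0.35):
    print(f"r = {r:.2f} -> shared variance of about {shared_variance(r):.1%}")
```

In other words, a correlation of 0.10 corresponds to roughly 1 percent of shared variation, and 0.35 to roughly 12 percent.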

Similarly, in a recent NPR interview, former D.C. schools chancellor Michelle Rhee cited the MET report, claiming that it found a “very high correlation” between student surveys and teacher value-added estimates, while the New Jersey Department of Education characterized the relationship as "tightly correlated." Several news stories offered similar blanket claims. In reality, the correlations ranged between 0.18 and 0.38, which, again, are modest at best.

Finally, in a report by The New Teacher Project, the authors calculated correlations between the four measures they use to assess candidates in their certification program: principal ratings; formal classroom observations; student surveys; and value-added. The report concludes that the “measures tend to point to similar conclusions about a teacher’s potential." This statement is not inaccurate per se, but it's a bit overblown, as most of the associations between measures were rather anemic, especially the correlations between value-added and the other three components.

(And, of course, one can find the same sorts of overstatements on the flip side of this coin, particularly when it comes to the reliability of value-added estimates. For instance, you'll often hear value-added critics summarily dismiss these measures as too unstable to be useful, even though year-to-year associations are usually at least moderate.)

Now, none of this is meant to imply that the lack of extremely strong relationships between measures means they are "invalid" or useless in teacher evaluations.

For one thing, remember that most of these indicators are themselves imprecise. As a result, even if, hypothetically, two measures were actually picking up on the exact same thing (“true teacher performance”), there is a ceiling of sorts on the strength of the relationships. Neither is precise in and of itself, and so the statistical association between them is unlikely to be extremely high either.
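The "ceiling" idea can be sketched with the classical attenuation formula from measurement theory: the observed correlation equals the true correlation multiplied by the square root of the product of the two measures' reliabilities. The sketch assumes the textbook setup in which measurement errors are random and unrelated to each other, and the reliability values are hypothetical numbers chosen only for illustration, not estimates for any actual observation or value-added measure.

```python
import math


def max_observed_correlation(reliability_x, reliability_y, true_r=1.0):
    """Expected observed correlation given the true correlation and each
    measure's reliability (classical attenuation formula; assumes errors
    are random and uncorrelated across the two measures)."""
    for rel in (reliability_x, reliability_y):
        if not 0.0 <= rel <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    if not -1.0 <= true_r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return true_r * math.sqrt(reliability_x * reliability_y)


# Hypothetical reliabilities (assumptions for illustration only).
ceiling = max_observed_correlation(reliability_x=0.4, reliability_y=0.6)
print(round(ceiling, 2))
```

With these made-up reliabilities, two measures that captured exactly the same underlying performance would still be expected to correlate at just under 0.5.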

In addition, whether or not we should necessarily expect -- or even want -- inordinately strong associations between the components of teacher evaluation systems is kind of an open question. After all, one of the major rationales for the “multiple measures” paradigm is that different indicators are more suited for picking up on different aspects of performance.

Finally, validity is about how we use or interpret measures, not a characteristic of the measures themselves. In other words, the appropriate question is: Valid for what? A given correlation may be interpreted differently depending on the answer to this question. For example, there's a huge difference between using measures for high-stakes personnel decisions and providing them to teachers and administrators for informational purposes. By the same token, the associations between measures may themselves differ depending on how the measures are used (e.g., in high- versus low-stakes contexts).

In general, bivariate correlations can be very helpful in assessing the utility of different performance indicators (though they cannot, of course, be the only tool that is used, and even strong relationships may be masking problems among subgroups of individual teachers). Any measure must be assessed relative to current or alternative measures, and this is far from an exact science (see Doug Harris' book for a good, highly accessible discussion of reliability and validity issues surrounding value-added).

At the very least, though, we should be careful about the tendency for a modest relationship between two indicators to be presented or interpreted as “the two measures are correlated” (which occasionally ends up becoming “strongly correlated”). Such oversimplifications are not only somewhat misleading, but, perhaps more importantly, they tend to hinder the kind of careful, context-specific interpretation that is required for decisions about which components should make up teacher evaluations, and how this information should be used.

- Matt Di Carlo

After reading Nate Silver's book, I can't figure out why we aren't talking about Bayesian analysis for teacher evaluation. It seems like a perfect match.