In 2007, when the D.C. City Council passed a law giving the mayor control of public schools, it required that a five-year independent evaluation be conducted to document the law’s effects and suggest changes. The National Research Council (a division of the National Academies) was charged with performing this task. As reported by Bill Turque in the Washington Post, the first report was released a couple of weeks ago.
The primary purpose of this first report was to give “first impressions” and offer advice on how the actual evaluation should proceed. It covered several areas – finance, special programs, organizational structure, etc. – but, given the controversy surrounding Michelle Rhee’s tenure, the section on achievement results got the most attention. The team was only able to analyze preliminary performance data – the same data that are used constantly by Rhee, her supporters, and her detractors to judge her tenure at the helm of DCPS.
It was one of those reports that tells us what we should already know, but too often fail to consider.
The evaluation’s primary conclusion (i.e., “first impression”) in the student achievement area was, of course, that “student test scores alone provide useful but limited information about the causes of improvements or variability in student performance.” The team also noted:
For this discussion, it is perhaps most important to underscore that most tests are not designed to support inferences about…how well students were taught, what effects their teachers had on their learning, why students in some schools or classrooms succeed while those in similar schools and classrooms do not, whether conditions in the schools have improved as a result of a policy change, or what policy makers should do to solidify gains or reverse declines.
In other words, even if test scores rise, answers to the important questions – why they increased, what that means, and how to sustain it – often remain elusive. In addition to these overall observations about the proper uses of student test data, the authors raised a series of issues that applied to DCPS specifically:
- Using proficiency rates has more significant limitations than using measures that reflect the full distribution of scores, such as scale score averages.
- Because DC is a highly mobile district and the student population changes every year, score fluctuations may be the result of changes in the characteristics of the students taking the test, rather than improvements or declines in students’ knowledge and skills.
- The DC CAS was introduced in 2006, and there is some evidence that when a new test is introduced scores first rise significantly and then level off.
- Thus, in order to draw any conclusions about the effect of PERAA [the mayoral control law] on student achievement as measured by DC CAS, further study should include longitudinal studies of cohorts of students within the District.
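To make the first bullet concrete, here is a small illustration (all numbers are hypothetical, not actual DC CAS scores, and the “proficient” cutoff is invented for the example) of how a proficiency rate can rise even while the average score falls – the rate only counts students crossing a single threshold, so it discards most of the information in the distribution:

```python
import statistics

# Hypothetical scale scores for one grade in two years (not real DC CAS data).
# A "proficient" cutoff of 50 is assumed purely for illustration.
CUTOFF = 50

year1 = [30, 45, 48, 49, 52, 60, 75, 80]  # 4 of 8 students at or above cutoff
year2 = [20, 25, 50, 50, 51, 51, 52, 53]  # 6 of 8 students at or above cutoff

def proficiency_rate(scores, cutoff=CUTOFF):
    """Share of students scoring at or above the cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)

print(proficiency_rate(year1))   # 0.5
print(proficiency_rate(year2))   # 0.75 -- the rate rises...
print(statistics.mean(year1))    # 54.875
print(statistics.mean(year2))    # 44.0  -- ...while the average falls
```

The rate jumped because several students inched just over the cutoff, even as the top of the distribution collapsed – exactly the kind of movement a proficiency rate cannot see.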
I couldn’t help but notice that these are the same points that I made in this post shortly after Rhee’s resignation, when I argued that her actual test score “legacy” is still an open question. It’s not because I’m particularly insightful (that’s also an open question). It’s because, for anyone even remotely familiar with data analysis and interpretation, these points should be obvious.
Most serious people in education policy circles are aware of them, and are generally cautious in their interpretations. But mass media coverage of education – and, it seems, many policymakers – is not always so well-attuned.
That’s why, for example, every week, Michelle Rhee presents herself to state legislatures and the public, making claims about her testing “results” that reflect every single one of these mistakes. Her supporters and detractors do the same thing (and I’m not always as careful as I should be either).
In short, to whatever degree Michelle Rhee’s reputation is based on testing results, it is based on the shakiest of foundations – causal inferences drawn from cross-sectional data that might just as well be chalked up to demographic change, data management policies, a change in test design, policies in place before her arrival, or simple random error. The truth is that we don’t even know, at least not yet, whether DCPS students actually made unusually rapid, widely-shared progress in their test scores, to say nothing of whether these changes were in any way related to anything Michelle Rhee did or didn’t do.
Consider, for example, that roughly 70 percent of the DC-CAS “gains” in math and reading proficiency during Rhee’s three-year tenure occurred in the first year. Assuming for a moment that these increases reflect actual progress among DCPS students, we all know that her primary policy reforms – the IMPACT evaluation system and the new Washington Teachers Union contract – went into effect in 2010, her third year.
Can somebody please explain to me how Michelle Rhee (or anyone else) could possibly have anything beyond marginal effects on test performance a few months after arriving? Did her very presence increase scores?
But these misinterpretations of data are, of course, hardly new or particular to Washington, DC. They are insidious, and they come from both “sides.” Raw scores go up, and market-based reform proponents proclaim success; they go down, and skeptics cry failure. In actuality, most of it means very little. Isolating the effect of specific policies, such as merit pay or test-based teacher evaluations, requires sophisticated methods, detailed longitudinal data, and, if possible, experimental research designs. And even then, the results need to be interpreted with caution.
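One way to see why cross-sectional comparisons can mislead in a high-mobility district: a toy simulation (hypothetical numbers, no real student data) in which no individual student improves at all, yet the district-wide average rises simply because lower-scoring students leave and higher-scoring students arrive between the two testing years:

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical district: each student has a fixed "true" score -- by
# construction, nobody learns more between year 1 and year 2.
stayers  = [random.gauss(50, 10) for _ in range(700)]
leavers  = [random.gauss(45, 10) for _ in range(300)]  # lower-scoring movers
entrants = [random.gauss(55, 10) for _ in range(300)]  # higher-scoring arrivals

year1 = stayers + leavers    # cohort tested in year 1
year2 = stayers + entrants   # cohort tested in year 2; stayers unchanged

print(round(statistics.mean(year1), 1))
print(round(statistics.mean(year2), 1))  # noticeably higher, from turnover alone
```

The roughly three-point “gain” here is pure composition change – which is why the NRC team recommends longitudinal studies of cohorts of students, tracking the same children over time, before attributing any movement to policy.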
Moreover, none of this addresses the inherent limitations of test scores as a measure of student learning, and, therefore, of teacher or school effectiveness. We are putting an incredible amount of faith in measures that we know become less reliable the more they are relied upon.
It seems that we’re on this test-centric path, at least in the short term. Things could be worse: most tests do in fact provide some useful information and, analyzed and interpreted with care, can be used as a partial indicator of how students, teachers, and schools are doing. In addition, better alternatives are difficult – and much more expensive – to come by.
But if we’re going to do this, let’s at least do it correctly. Let’s be careful about how we interpret cross-sectional data, and remember that fluctuations don’t necessarily reflect actual changes, especially in high-mobility districts such as DC. Let’s avoid chalking up test score increases or decreases – whether cross-sectional or longitudinal – to policies that we support or oppose, when such inferences are not directly tested. And, finally, let’s make sure that people, schools, and organizations can’t build or lose their reputations based on testing results. The scores do provide information, but not that kind.