If Your Evidence Is Changes In Proficiency Rates, You Probably Don't Have Much Evidence

Education policymaking and debates are under constant threat from an improbable assailant: Short-term changes in cross-sectional proficiency rates.

The use of rate changes is still proliferating rapidly at all levels of our education system. These measures, which play an important role in the provisions of No Child Left Behind, are already prominent components of many states’ core accountability systems (e.g., California), while several others will be using some version of them in their new, high-stakes school/district “grading systems.” New York State is awarding millions in competitive grants, with almost half the criteria based on rate changes. District consultants issue reports recommending widespread school closures and reconstitutions based on these measures. And, most recently, U.S. Secretary of Education Arne Duncan used proficiency rate increases as “preliminary evidence” supporting the School Improvement Grants program.

Meanwhile, on the public discourse front, district officials and other national leaders use rate changes to “prove” that their preferred reforms are working (or are needed), while their critics argue the opposite. Similarly, entire charter school sectors are judged, up or down, by whether their raw, unadjusted rates increase or decrease.

So, what’s the problem? In short, it’s that year-to-year changes in proficiency rates are not valid evidence of school or policy effects. These measures cannot do the job we’re having them do, even on a limited basis. This really has to stop.

The literature is replete with warnings and detailed expositions of these measures' limitations. Let's just quickly recap the major points, with links to some relevant evidence and previous posts.

  • Proficiency rates may be a useful way to present information accessibly to parents and the public, but they can be highly misleading measures of student performance, as they only tell you how many test-takers are above a given (often somewhat arbitrary) cutpoint. The problems are especially salient when the rates are viewed over time – rates can increase while average scores decrease (and vice versa), and rate changes are heavily dependent on the choice of cutpoint and on the distribution of cohorts' scores around it (a quick numerical sketch of this appears just after this list). They are really not appropriate for evaluating schools or policies, even using the best analytical approaches (for just two among dozens of examples of additional research on this topic, see this published 2008 paper and this one from 2003);
  • The data are (almost always) cross-sectional, and they mask changes in the sample of students taking the test, especially at the school and district levels, where samples are smaller (note that this issue can apply to both rates and actual scores; for more, see this Mathematica report and this 2002 published article);
  • Most of the change in raw proficiency rates between years is transitory – i.e., it is not due to the quality of a school or the efficacy of a policy, but rather to random error, sampling variation (see the second bullet), or factors, such as students’ circumstances and characteristics, that are outside of schools’ control (see this paper analyzing Colorado data, this one on North Carolina and our quick analysis of California data); a short simulation illustrating this point appears just after this list.
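
To make the cutpoint problem concrete, here is a quick numerical sketch (made-up scores and a made-up cutpoint, not real assessment data) in which the average score falls from one year to the next even as the proficiency rate rises:

```python
# Toy example (hypothetical scores and cutpoint, not real assessment data):
# the mean score *falls* from year 1 to year 2, yet the proficiency rate
# *rises*, because the rate only counts which side of a single cutpoint
# each student lands on.
CUTPOINT = 300  # hypothetical "proficient" threshold

year1 = [250, 260, 270, 299, 299, 310, 340, 380, 400, 420]
year2 = [240, 245, 250, 301, 302, 303, 305, 310, 315, 320]

for label, scores in [("Year 1", year1), ("Year 2", year2)]:
    mean = sum(scores) / len(scores)
    rate = sum(s >= CUTPOINT for s in scores) / len(scores)
    print(f"{label}: mean = {mean:.1f}, proficiency rate = {rate:.0%}")

# Output: Year 1 mean 322.8 / rate 50%; Year 2 mean 289.1 / rate 70%.
# Re-running with CUTPOINT = 320 reverses the apparent "trend" (40% vs. 10%).
```

Nothing about the schooling changed between these two hypothetical cohorts; only the shape of the score distribution around the cutpoint did.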
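
And to illustrate the transitory/sampling-variation point, the short simulation below (again, purely hypothetical – a made-up score distribution and a cohort of 60 tested students) holds a school's "true" performance perfectly constant, yet its proficiency rate still moves around from year to year simply because each cohort is a different small sample:

```python
# Toy simulation (made-up distribution, hypothetical school size): every
# cohort is drawn from the *same* score distribution, so the school's "true"
# performance never changes -- yet the observed proficiency rate still moves
# from year to year purely because of sampling variation.
import numpy as np

rng = np.random.default_rng(0)
CUTPOINT = 300
COHORT_SIZE = 60  # e.g., one tested grade in a smallish school

for year in range(1, 6):
    scores = rng.normal(loc=305, scale=35, size=COHORT_SIZE)
    rate = (scores >= CUTPOINT).mean()
    print(f"Year {year}: proficiency rate = {rate:.0%}")
```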
Look, there is obviously a tremendous amount of pressure to assess programs quickly and measure schools’ performance on a regular basis, and the rate changes are readily available and easy to calculate and understand. They do have a place – they can give you a tentative sense of student performance in any given year and, with serious caution, over multiple-year periods. In addition, of course, gauging school and policy effects is not an exact science no matter how it’s done. Results must always be interpreted carefully, and they’re never perfect.

But typical, raw rate changes reflect the rather severe limitations of both cross-sectional data and cutpoint-based measures, as well as the more general fact that test performance varies for reasons other than the quality of schooling. In other words, they don't even necessarily tell us whether students actually made testing progress, to say nothing of the degree to which schools or specific policies were responsible for those changes (or lack thereof).

The only proper way to assess the effect of schools/policies on test scores is multivariate analysis of longitudinal testing data – actual scores, not rates – which controls, to the degree possible, for confounding factors that can influence results (in the case of policy evaluation, random assignment is of course preferable, but often not feasible). This takes time, but policies and schools should not be judged based on short-term outcomes anyway, whether test-based or otherwise. It also requires investment, but that's the price of good information.
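
For what it's worth, here is a bare-bones sketch of what that kind of analysis can look like – simulated data, a single (hypothetical) demographic control and policy indicator, and made-up variable names (prior_score, frl, treated) – so take it as an illustration of the general approach rather than a full specification:

```python
# Minimal sketch (simulated data, hypothetical variable names): a simple
# "value-added"-style regression on longitudinal scores -- this year's score
# predicted from last year's score, a student characteristic, and a policy
# indicator -- rather than a comparison of raw proficiency rates.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
prior_score = rng.normal(300, 40, n)        # last year's scale score
frl = rng.integers(0, 2, n)                 # e.g., free/reduced-price lunch status
treated = rng.integers(0, 2, n)             # exposed to the policy being evaluated
score = (0.8 * prior_score + 60 - 10 * frl  # simulated "true" relationships
         + 5 * treated + rng.normal(0, 20, n))

df = pd.DataFrame({"score": score, "prior_score": prior_score,
                   "frl": frl, "treated": treated})

# The coefficient on `treated` is the policy-effect estimate, holding prior
# achievement and the (single, illustrative) student characteristic constant.
model = smf.ols("score ~ prior_score + frl + treated", data=df).fit()
print(model.params)
```

The point is simply that the quantity of interest (here, the coefficient on treated) comes from a model of actual longitudinal scores with controls, not from subtracting one year's proficiency rate from another's.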

If these kinds of systems and capabilities are not in place, they should be. In the meantime, unless interpreted with extreme caution, simple rate changes are not an acceptable alternative.

- Matt Di Carlo


I think you are taking the wrong approach here.

You should go at the assumptions, to examine when/why they might be valid.

One assumption is that these proficiency rates are meaningful, in and of themselves. The problem is that they are set rather arbitrarily -- even if not capriciously. Changes in proficiency rates could shift radically with different cut scores for proficiency. On the other hand, the more confidence we have in the setting of the proficiency cut scores (i.e., the validity of the cut scores), the more confidence we can have in changes in those proficiency rates. Unfortunately, there is little reason to have confidence in them today, and therefore little reason to have confidence in changes in proficiency rates.

Problem. Implication. Solution.

Another assumption is that changes in rates can be applied at the state, district, school, and perhaps even grade and teacher levels. This gets to a sample size issue. Comparisons at the state level might actually be meaningful. In large districts, they might be meaningful. But when looking at particular subgroups -- especially when they are small subgroups -- it gets much more problematic. Confidence intervals and margins of error are well-known statistical techniques for addressing this issue. Unfortunately, most changes we see reported do NOT fall outside the margins of error/confidence intervals. While it may be easier to ignore that fact, we can't let convenience overwhelm what we actually know.
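
As a rough illustration of that point, with hypothetical numbers and treating each year's cohort as an independent sample:

```python
# Rough illustration (hypothetical numbers): a 95% confidence interval for
# the change in a school's proficiency rate across two years, treating each
# cohort as an independent sample. With ~60 tested students per year, even a
# 7-point "gain" sits comfortably inside the margin of error.
from math import sqrt

n1, p1 = 60, 0.52  # year 1: 60 tested students, 52% proficient
n2, p2 = 60, 0.59  # year 2: 60 tested students, 59% proficient

diff = p2 - p1
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
margin = 1.96 * se  # normal-approximation margin of error

print(f"Change: {diff:+.1%}, 95% CI: [{diff - margin:+.1%}, {diff + margin:+.1%}]")
```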

When it comes to multivariate analysis of longitudinal test data, there are other problems. I think it would be good if you wrote a companion piece that was as critical of that approach as you generally are of this one.


The recent article by the Atlanta Journal Constitution on test score disparities and their possible link to cheating raises a similar issue. Without a robust statistical model and method, any change is wide open to interpretation, and we should have coherent policies in place.