Every year, around this time, the College Board publicizes its SAT results, and hundreds of newspapers, blogs, and television stations run stories suggesting that trends in the aggregate scores are, by themselves, a meaningful indicator of U.S. school quality. They’re not.
Everyone knows that the vast majority of the students who take the SAT in a given year didn’t take the test the previous year – i.e., the data are cross-sectional. Everyone also knows that participation is voluntary (as is participation in the ACT test), and that the number of students taking the test has been increasing for many years and current test-takers have different measurable characteristics from their predecessors. That means we cannot use the raw results to draw strong conclusions about changes in the performance of the typical student, and certainly not about the effectiveness of schools, whether nationally or in a given state or district. This is common sense.
Unfortunately, the College Board plays a role in stoking the apparent confusion – or, at least, they could do much more to prevent it. Consider the headline of this year’s press release: Read More »
In education policy debates, we like the “big picture.” We love to say things like “hold schools accountable” and “set high expectations.” Much less frequent are substantive discussions about the details of accountability systems, but it’s these details that make or break policy. The technical specs just aren’t that sexy. But even the best ideas with the sexiest catchphrases won’t improve things a bit unless they’re designed and executed well.
In this vein, I want to recommend a very interesting CALDER working paper by Mark Ehlert, Cory Koedel, Eric Parsons and Michael Podgursky. The paper takes a quick look at one of these extremely important, yet frequently under-discussed details in school (and teacher) accountability systems: The choice of growth model.
When value-added or other growth models come up in our debates, they’re usually discussed en masse, as if they’re all the same. They’re not. It’s well-known (though perhaps overstated) that different models can, in many cases, lead to different conclusions for the same school or teacher. This paper, which focuses on school-level models but might easily be extended to teacher evaluations as well, helps illustrate this point in a policy-relevant manner.
Read More »
The accountability provisions in Virginia’s original application for “ESEA flexibility” (or “waiver”) have received a great deal of criticism (see here, here, here and here). Most of this criticism focused on the Commonwealth’s expectation levels, as described in “annual measurable objectives” (AMOs) – i.e., the statewide proficiency rates that its students are expected to achieve at the completion of each of the next five years, with separate targets established for subgroups such as those defined by race (black, Hispanic, Asian, white), income (subsidized lunch eligibility), limited English proficiency (LEP), and special education.
Last week, in response to the criticism, Virginia agreed to amend its application, and it’s not yet clear how specifically they will calculate the new rates (only that lower-performing subgroups will be expected to make faster progress).
In the meantime, I think it’s useful to review a few of the main criticisms that have been made over the past week or two and what they mean. The actual table containing the AMOs is pasted below (for math only; reading AMOs will be released after this year, since there’s a new test).
Read More »
From my experience, education reporters are smart, knowledgeable, and attentive to detail. That said, the bulk of the stories about testing data – in big cities and suburbs, in this year and in previous years – could be better.
Listen, I know it’s unreasonable to expect every reporter and editor to address every little detail when they try to write accessible copy about complicated issues, such as test data interpretation. Moreover, I fully acknowledge that some of the errors to which I object – such as calling proficiency rates “scores” – are well within tolerable limits, and that news stories need not interpret data in the same way as researchers. Nevertheless, no matter what you think about the role of test scores in our public discourse, it is in everyone’s interest that the coverage of them be reliable. And there are a few mostly easy suggestions that I think would help a great deal.
Below are five such recommendations. They are of course not meant to be an exhaustive list, but rather a quick compilation of points, all of which I’ve discussed in previous posts, and all of which might also be useful to non-journalists. Read More »
Earlier this summer, the New York City Independent Budget Office (IBO) presented findings from a longitudinal analysis of NYC student performance. That is, they followed a cohort of over 45,000 students from third grade in 2005-06 through 2009-10 (though most results are 2005-06 to 2008-09, since the state changed its definition of proficiency in 2009-10).
The IBO then simply calculated the proportion of these students who improved, declined or stayed the same in terms of the state’s cutpoint-based categories (e.g., Level 1 ["below basic" in NCLB parlance], Level 2 [basic], Level 3 [proficient], Level 4 [advanced]), with additional breakdowns by subgroup and other variables.
The short version of the results is that almost two-thirds of these students remained constant in their performance level over this time period – for instance, students who scored at Level 2 (basic) in third grade in 2006 tended to stay at that level through 2009; students at the “proficient” level remained there, and so on. About 30 percent increased a category over that time (e.g., going from Level 1 to Level 2).
The response from the NYC Department of Education (NYCDOE) was somewhat remarkable. It takes a minute to explain why, so bear with me. Read More »
I have reviewed, albeit superficially, the test-based components of several states’ school rating systems (e.g., OH, FL, NYC, LA, CO), with a particular focus on the degree to which they are actually measuring student performance (how highly students score), rather than school effectiveness per se (whether students are making progress). Both types of measures have a role to play in accountability systems, even if they are often confused or conflated, resulting in widespread misinterpretation of what the final ratings actually mean, and many state systems’ failure to tailor interventions to the indicators being used.
One aspect of these systems that I rarely discuss is the possibility that the ratings systems are an end in themselves. That is, the idea that public ratings, no matter how they are constructed, provide an incentive for schools to get better. From this perspective, even if the ratings are misinterpreted or imprecise, they might still “work.”*
There’s obviously something to this. After all, the central purpose of any accountability system is less about closing or intervening in a few schools than about giving all schools incentive to up their respective games. And, no matter how you feel about school rating systems, there can be little doubt that people pay attention to them. Educators and school administrators do so, not only because they fear closure or desire monetary rewards; they also take pride in what they do, and they like being recognized for it. In short, my somewhat technocratic viewpoint on school ratings ignores the fact that their purpose is less about rigorous measurement than encouraging improvement. Read More »
The situation with vouchers in Louisiana is obviously quite complicated, and there are strong opinions on both sides of the issue, but I’d like to comment quickly on the new “accountability” provision. It’s a great example of how, too often, people focus on the concept of accountability and ignore how it is actually implemented in policy.
Quick and dirty background: Louisiana will be allowing students to receive vouchers (tuition to attend private schools) if their public schools are sufficiently low-performing, according to their “school performance score” (SPS). As discussed here, the SPS is based primarily on how highly students score, rather than whether they’re making progress, and thus tells you relatively little about the actual effectiveness of schools per se. For instance, the vouchers will be awarded mostly to schools serving larger proportions of disadvantaged students, even if many of those schools are compelling large gains (though such progress cannot be assessed adequately using year-to-year changes in the SPS, which, due in part to its reliance on cross-sectional proficiency rates, are extremely volatile).
Now, here’s where things get really messy: In an attempt to demonstrate that they are holding the voucher-accepting private schools accountable, Louisiana officials have decided that they will make these private schools ineligible for the program if their performance is too low (after at least two years of participation in the program). That might be a good idea if the state measured school performance in a defensible manner. It doesn’t. Read More »
In all my many posts about the interpretation of state testing data, it seems that I may have failed to articulate one major implication, which is almost always ignored in the news coverage of the release of annual testing data. That is: raw, unadjusted changes in student test scores are not by themselves very good measures of schools’ test-based effectiveness.
In other words, schools can have a substantial impact on performance, but student test scores also increase, decrease or remain flat for reasons that have little or nothing to do with schools. The first, most basic reason is error. There is measurement error in all test scores – for various reasons, students taking the same test twice will get different scores, even if their “knowledge” remains constant. Also, as I’ve discussed many times, there is extra imprecision when using cross-sectional data. Often, any changes in scores or rates, especially when they’re small in magnitude and/or based on smaller samples (e.g., individual schools), do not represent actual progress (see here and here). Finally, even when changes are “real,” other factors that influence test score changes include a variety of non-schooling inputs, such as parental education levels, family’s economic circumstances, parental involvement, etc. These factors don’t just influence how highly students score; they are also associated with progress (that’s why value-added models exist).
Thus, to the degree that test scores are a valid measure of student performance, and changes in those scores a valid measure of student learning, schools aren’t the only suitors at the dance. We should stop judging school or district performance by comparing unadjusted scores or rates between years.
Read More »
There have now been several stories in the New York news media about New York City’s charter schools’ “gains” on this year’s state tests (see here, here, here, here and here). All of them trumpeted the 3-7 percentage point increase in proficiency among the city’s charter students, compared with the 2-3 point increase among their counterparts in regular public schools. The consensus: Charters performed fantastically well this year.
In fact, the NY Daily News asserted that the “clear lesson” from the data is that “public school administrators must gain the flexibility enjoyed by charter leaders,” and “adopt [their] single-minded focus on achievement.” For his part, Mayor Michael Bloomberg claimed that the scores are evidence that the city should expand its charter sector.
All of this reflects a fundamental misunderstanding of how to interpret testing data, one that is frankly a little frightening to find among experienced reporters and elected officials.
Read More »
A while back, I argued that Florida’s school grading system, due mostly to its choice of measures, does a poor job of gauging school performance per se. The short version is that the ratings are, to a degree unsurpassed by most other states’ systems, driven by absolute performance measures (how highly students score), rather than growth (whether students make progress). Since more advantaged students tend to score more highly on tests when they enter the school system, schools are largely being judged not on the quality of instruction they provide, but rather on the characteristics of the students they serve.
New results were released a couple of weeks ago. This was highly anticipated, as the state had made controversial changes to the system, most notably the inclusion of non-native English speakers and special education students, which officials claimed they did to increase standards and expectations. In a limited sense, that’s true – grades were, on average, lower this year. The problem is that the system uses the same measures as before (including a growth component that is largely redundant with proficiency). All that has changed is the students that are included in them. Thus, to whatever degree the system now reflects higher expectations, it is still for outcomes that schools mostly cannot control.
I fully acknowledge the political and methodological difficulties in designing these systems, and I do think Florida’s grades, though exceedingly crude, might be useful for some purposes. But they should not, in my view, be used for high-stakes decisions such as closure, and the public should understand that they don’t tell you much about the actual effectiveness of schools. Let’s take a very quick look at the new round of ratings, this time using schools instead of districts (I looked at the latter in my previous post about last year’s results).
Read More »
New York State is set to release its annual testing data today. Throughout the state, and especially in New York City, we will hear a lot about changes in school and district proficiency rates. The rates themselves have advantages – they are easy to understand, comparable across grades and reflect a standards-based goal. But they also suffer severe weaknesses, such as their sensitivity to where the bar is set and the fact that proficiency rates and the actual scores upon which they’re based can paint very different pictures of student performance, both in a given year as well as over time. I’ve discussed this latter issue before in the NYC context (and elsewhere), but I’d like to revisit it quickly.
Proficiency rates can only tell you how many students scored above a certain line; they are completely uninformative as to how far above or below that line the scores might be. Consider a hypothetical example: A student who is rated as proficient in year one might make large gains in his or her score in year two, but this would not be reflected in the proficiency rate for his or her school – in both years, the student would just be coded as “proficient” (the same goes for large decreases that do not “cross the line”). As a result, across a group of students, the average score could go up or down while proficiency rates remained flat or moved in the opposite direction. Things are even messier when data are cross-sectional (as public data lmost always are), since you’re comparing two different groups of students (see this very recent NYC IBO report).
Let’s take a rough look at how frequently rates and scores diverge in New York City. Read More »
Last year, the New York City Department of Education (NYCDOE) rolled out its annual testing results for the city’s students in a rather misleading manner. The press release touted the “significant progress” between 2010 and 2011 among city students, while, at a press conference, Mayor Michael Bloomberg called the results “dramatic.” In reality, however, the increase in proficiency rates (1-3 percentage points) was very modest, and, more importantly, the focus on the rates hid the fact that actual scale scores were either flat or decreased in most grades. In contrast, one year earlier, when the city’s proficiency rates dropped due to the state raising the cut scores, Mayor Bloomberg told reporters (correctly) that it was the actual scores that “really matter.”
Most recently, in announcing their 2011 graduation rates, the city did it again. The headline of the NYCDOE press release proclaims that “a record number of students graduated from high school in 2011.” This may be technically true, but the actual increase in the rate (rather than the number of graduates) was 0.4 percentage points, which is basically flat (as several reporters correctly noted). In addition, the city’s “college readiness rate” was similarly stagnant, falling slightly from 21.4 percent to 20.7 percent, while the graduation rate increase was higher both statewide and in New York State’s four other large districts (the city makes these comparisons when they are favorable).*
We’ve all become accustomed to this selective, exaggerated presentation of testing data, which is of course not at all limited to NYC. And it illustrates the obvious fact that test-based accountability plays out in multiple arenas, formal and informal, including the court of public opinion.
Read More »
I have been writing critically about states’ school rating systems (e.g., Ohio, Florida, Louisiana), and I thought I would find one that is, at least in my (admittedly value-laden) opinion, more defensibly designed. It didn’t quite turn out as I had hoped.
One big starting point in my assessment is how heavily the systems weight absolute performance (how highly students score) versus growth (how quickly students improve). As I’ve argued many times, the former (absolute level) is a poor measure of school performance in a high-stakes accountability system. It does not address the fact that some schools, particularly those in more affluent areas, serve students who, on average, enter the system at a higher-performing level. This amounts to holding schools accountable for outcomes they largely cannot control (see Doug Harris’ excellent book for more on this in the teacher context). Thus, to whatever degree testing results can be used to judge actual school effectiveness, growth measures, while themselves highly imperfect, are to be preferred in a high-stakes context.
There are a few states that assign more weight to growth than absolute performance (see this prior post on New York City’s system). One of them is Colorado’s system, which uses the well-known “Colorado Growth Model” (CGM).*
In my view, putting aside the inferential issues with the CGM (see the first footnote), the focus on growth in Colorado’s system is in theory a good idea. But, looking at the data and documentation reveals a somewhat unsettling fact: There is a double standard of sorts, by which two schools with the same growth score can receive different ratings, and it’s mostly their absolute performance levels determining whether this is the case.
Read More »
Louisiana’s “School Performance Score” (SPS) is the state’s primary accountability measure, and it determines whether schools are subject to high-stakes decisions, most notably state takeover. For elementary and middle schools, 90 percent of the SPS is based on testing outcomes. For secondary schools, it is 70 percent (and 30 percent graduation rates).*
The SPS is largely calculated using absolute performance measures – specifically, the proportion of students falling into the state’s cutpoint-based categories (e.g., advanced, mastery, basic, etc.). This means that it is mostly measuring student performance, rather than school performance. That is, insofar as the SPS only tells you how high students score on the test, rather than how much they have improved, schools serving more advantaged populations will tend to do better (since their students tend to perform well when they entered the school) while those in impoverished neighborhoods will tend to do worse (even those whose students have made the largest testing gains).
One rough way to assess this bias is to check the association between SPS and student characteristics, such as poverty. So let’s take a quick look. Read More »
There is currently an ongoing rhetorical war of sorts over the gender wage gap. One “side” makes the common argument that women earn around 75 cents on the male dollar (see here, for example).
Others assert that the gender gap is a myth, or that it is so small as to be unimportant.
Often, these types of exchanges are enough to exasperate the casual observer, and inspire claims such as “statistics can be made to say anything.” In truth, however, the controversy over the gender gap is a good example of how descriptive statistics, by themselves, say nothing. What matters is how they’re interpreted.
Moreover, the manner in which one must interpret various statistics on the gender gap applies almost perfectly to the achievement gaps that are so often mentioned in education debates. Read More »