A couple of months ago, Bill Gates said something that received a lot of attention. With regard to his foundation’s education reform efforts, which focus most prominently on teacher evaluations, but encompass many other areas, he noted, “we don’t know if it will work.” In fact, according to Mr. Gates, “we won’t know for probably a decade.”
He’s absolutely correct. Most education policies, including (but not limited to) those geared toward shifting the distribution of teacher quality, take a long time to work (if they do work), and the research assessing these policies requires a great deal of patience. Yet so many of the most prominent figures in education policy routinely espouse the opposite viewpoint: Policies are expected to have an immediate, measurable impact (and their effects are assessed in the crudest manner imaginable).
A perfect example was the reaction to the recent release of results of the National Assessment of Educational Progress (NAEP).
Last week, the results of New York’s new Common Core-aligned assessments were national news. For months, officials throughout the state, including New York City, have been preparing the public for the release of these data.
Their basic message was that the standards, and thus the tests based upon them, are more difficult, and they represent an attempt to truly gauge whether students are prepared for college and the labor market. The inevitable consequence of raising standards, officials have been explaining, is that fewer students will be “proficient” than in previous years (which was, of course, the case) – this does not mean that students are performing worse, only that they are being held to higher expectations, and that the skills and knowledge being assessed require a new, more expansive curriculum. Therefore, interpretation of the new results versus those in previous years must be extremely cautious, and educators, parents and the public should not jump to conclusions about what they mean.
For the most part, the main points of this public information campaign are correct. It would, however, be wonderful if similar caution were evident in the roll-out of testing results in past (and, more importantly, future) years.
** Reprinted here in the Washington Post
Every year, a few major media outlets publish high school rankings. Most recently, Newsweek (in partnership with The Daily Beast) issued its annual list of the “nation’s best high schools.” Their general approach to this task seems quite defensible: To find the high schools that “best prepare students for college.”
The rankings are calculated using six measures: graduation rate (25 percent); college acceptance rate (25); AP/IB/AICE tests taken per student (25); average SAT/ACT score (10); average AP/IB/AICE score (10); and the percentage of students enrolled in at least one AP/IB/AICE course (5).
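To make the weighting concrete, here is a minimal sketch of how a composite index with these six weights could be computed. The weights come from the list above; everything else is an assumption for illustration, including the component names, the idea that each component has already been rescaled to a 0-1 range, and the example school. Newsweek's actual scaling procedure is not described here.

```python
# Weights as listed in the text (percent of the total index).
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "graduation_rate": 25,
    "college_acceptance_rate": 25,
    "ap_ib_aice_tests_per_student": 25,
    "avg_sat_act_score": 10,
    "avg_ap_ib_aice_score": 10,
    "pct_in_ap_ib_aice_course": 5,
}


def composite_score(components):
    """Weighted average of components, each ASSUMED to be pre-scaled to 0-1."""
    total_weight = sum(WEIGHTS.values())  # 100
    weighted_sum = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return weighted_sum / total_weight


# A made-up school, with every measure already on a 0-1 scale.
example_school = {
    "graduation_rate": 0.9,
    "college_acceptance_rate": 0.8,
    "ap_ib_aice_tests_per_student": 0.2,
    "avg_sat_act_score": 0.6,
    "avg_ap_ib_aice_score": 0.5,
    "pct_in_ap_ib_aice_course": 0.4,
}

example_index = composite_score(example_school)
print(round(example_index, 3))
```

Note that three quarters of the index rests on just three measures, so how those are scaled matters a great deal for the final ordering.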
Needless to say, even the most rigorous, sophisticated measures of school performance will be imperfect at best, and the methods behind these lists have been subject to endless scrutiny. However, let’s take a quick look at three potentially problematic issues with the Newsweek rankings, how the results might be interpreted, and how the system compares with that published by U.S. News and World Report.
In 1998, the National Institutes of Health (NIH) lowered the threshold at which people are classified as “overweight.” Literally overnight, about 25 million Americans previously considered as having a healthy weight were now overweight. If, the next day, you saw a newspaper headline that said “number of overweight Americans increases,” you would probably find that a little misleading. America’s “overweight” population didn’t really increase; the definition changed.
Fast forward to November 2012, during which Kentucky became the first state to release results from new assessments that were aligned with the Common Core Standards (CCS). This led to headlines such as, “Scores Drop on Kentucky’s Common Core-Aligned Tests” and “Challenges Seen as Kentucky’s Test Scores Drop As Expected.” Yet, these descriptions unintentionally misrepresent what happened. It’s not quite accurate – or at least highly imprecise – to say that test scores “dropped,” just as it would have been wrong to say that the number of overweight Americans increased overnight in 1998 (actually, they’re not even scores, they’re proficiency rates). Rather, the state adopted different tests, with different content, a different design, and different standards by which students are deemed “proficient.”
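The definitional point can be shown with a few lines of arithmetic. The sketch below uses made-up scale scores and made-up cut scores (they are not Kentucky's), and it isolates only one of the changes described above, the line at which students are deemed “proficient.” The new tests also differ in content and design, which this does not capture.

```python
def proficiency_rate(scores, cut_score):
    """Share of students scoring at or above the cut score."""
    return sum(score >= cut_score for score in scores) / len(scores)


# Hypothetical scale scores for the same ten students.
scores = [610, 625, 640, 650, 655, 660, 670, 685, 700, 720]

# Hypothetical old and new "proficient" lines.
old_rate = proficiency_rate(scores, cut_score=650)
new_rate = proficiency_rate(scores, cut_score=680)

print(f"Old standard: {old_rate:.0%} proficient")
print(f"New standard: {new_rate:.0%} proficient")
```

Not a single score changes between the two calculations, yet the rate falls from 70 to 30 percent. Describing that as scores “dropping” would be as misleading as the overweight headline.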
Over the next 2-3 years, a large group of states will also release results from their new CCS-aligned tests. It is important for parents, teachers, administrators, and other stakeholders to understand what the results mean. Most of them will rely on newspapers and blogs, and so one exceedingly simple step that might help out is some polite, constructive language-policing.
** Reprinted here in the Washington Post
In a recent Washington Post article called “Teachers leaning in favor of reforms,” veteran reporter Jay Mathews puts forth an argument that one hears rather frequently – that teachers are “changing their minds,” in a favorable direction, about the current wave of education reform. Among other things, Mr. Mathews cites two teacher surveys. One of them, which we discussed here, is a single-year survey that doesn’t actually look at trends, and therefore cannot tell us much about shifts in teachers’ attitudes over time (it was also a voluntary online survey).
His second source, on the other hand, is in fact a useful means of (cautiously) assessing such trends (though the article doesn’t actually look at them). That is the Education Sector survey of a nationally representative sample of U.S. teachers, which they conducted in 2003, 2007 and, most recently, in 2011.
This is a valuable resource. Like other teacher surveys, it shows that educators’ attitudes toward education policy are diverse. Opinions vary by teacher characteristics, context and, of course, by the policy being queried. Moreover, views among teachers can (and do) change over time, though, when looking at cross-sectional surveys, one must always keep in mind that observed changes (or lack thereof) might be due in part to shifts in the characteristics of the teacher workforce. There’s an important distinction between changing minds and changing workers (which Jay Mathews, to his great credit, discusses in this article).*
That said, when it comes to many of the more controversial reforms happening in the U.S., those about which teachers might be “changing their minds,” the results of this particular survey suggest, if anything, that teachers’ attitudes are actually quite stable.
It is a gross understatement to say that the No Child Left Behind (NCLB) law is, was – and will continue to be – a controversial piece of legislation. Although opinion tends toward the negative, there are certain features, such as a focus on student subgroup data, that many people support. And it’s difficult to make generalizations about whether the law’s impact on U.S. public education was “good” or “bad” by some absolute standard.
The one thing I would say about NCLB is that it has helped to institutionalize the improper interpretation of testing data.
Most of the attention to the methodological shortcomings of the law focuses on “adequate yearly progress” (AYP) – the crude requirement that all schools must make “adequate progress” toward the goal of 100 percent proficiency by 2014. And AYP is indeed an inept measure. But the problems are actually much deeper than AYP.
Rather, it’s the underlying methods and assumptions of NCLB (including AYP) that have had a persistent, negative impact on the way we interpret testing data.
Every year, around this time, the College Board publicizes its SAT results, and hundreds of newspapers, blogs, and television stations run stories suggesting that trends in the aggregate scores are, by themselves, a meaningful indicator of U.S. school quality. They’re not.
Everyone knows that the vast majority of the students who take the SAT in a given year didn’t take the test the previous year – i.e., the data are cross-sectional. Everyone also knows that participation is voluntary (as is participation in the ACT test), and that the number of students taking the test has been increasing for many years and current test-takers have different measurable characteristics from their predecessors. That means we cannot use the raw results to draw strong conclusions about changes in the performance of the typical student, and certainly not about the effectiveness of schools, whether nationally or in a given state or district. This is common sense.
Unfortunately, the College Board plays a role in stoking the apparent confusion – or, at least, they could do much more to prevent it. Consider the headline of this year’s press release:
From my experience, education reporters are smart, knowledgeable, and attentive to detail. That said, the bulk of the stories about testing data – in big cities and suburbs, in this year and in previous years – could be better.
Listen, I know it’s unreasonable to expect every reporter and editor to address every little detail when they try to write accessible copy about complicated issues, such as test data interpretation. Moreover, I fully acknowledge that some of the errors to which I object – such as calling proficiency rates “scores” – are well within tolerable limits, and that news stories need not interpret data in the same way as researchers. Nevertheless, no matter what you think about the role of test scores in our public discourse, it is in everyone’s interest that the coverage of them be reliable. And there are a few mostly easy suggestions that I think would help a great deal.
Below are five such recommendations. They are of course not meant to be an exhaustive list, but rather a quick compilation of points, all of which I’ve discussed in previous posts, and all of which might also be useful to non-journalists.
In all my many posts about the interpretation of state testing data, it seems that I may have failed to articulate one major implication, which is almost always ignored in the news coverage of the release of annual testing data. That is: raw, unadjusted changes in student test scores are not by themselves very good measures of schools’ test-based effectiveness.
In other words, schools can have a substantial impact on performance, but student test scores also increase, decrease or remain flat for reasons that have little or nothing to do with schools. The first, most basic reason is error. There is measurement error in all test scores – for various reasons, students taking the same test twice will get different scores, even if their “knowledge” remains constant. Also, as I’ve discussed many times, there is extra imprecision when using cross-sectional data. Often, any changes in scores or rates, especially when they’re small in magnitude and/or based on smaller samples (e.g., individual schools), do not represent actual progress (see here and here). Finally, even when changes are “real,” other factors that influence test score changes include a variety of non-schooling inputs, such as parental education levels, family’s economic circumstances, parental involvement, etc. These factors don’t just influence how highly students score; they are also associated with progress (that’s why value-added models exist).
Thus, to the degree that test scores are a valid measure of student performance, and changes in those scores a valid measure of student learning, schools aren’t the only suitors at the dance. We should stop judging school or district performance by comparing unadjusted scores or rates between years.
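A small simulation can illustrate how much rates move for reasons unrelated to schools. The sketch below is only an illustration under stated assumptions: it draws two cohorts from exactly the same score distribution (so there is no real change by construction), and it captures only the cohort-to-cohort variation in cross-sectional data, not measurement error in individual scores. The function names, the score distribution, the cut score and the group sizes are all hypothetical.

```python
import random


def simulated_rate(rng, n_students, true_mean=650, sd=30, cut_score=650):
    """Proficiency rate for one randomly drawn cohort of students."""
    scores = [rng.gauss(true_mean, sd) for _ in range(n_students)]
    return sum(s >= cut_score for s in scores) / n_students


def year_to_year_changes(n_students, n_trials=200, seed=0):
    """Rate changes (in percentage points) between two cohorts drawn
    from the SAME distribution, i.e., with no real change at all."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n_trials):
        year_one = simulated_rate(rng, n_students)
        year_two = simulated_rate(rng, n_students)
        changes.append(100 * (year_two - year_one))
    return changes


def share_at_least(changes, points):
    """Share of simulated changes at least `points` in absolute size."""
    return sum(abs(c) >= points for c in changes) / len(changes)


small_school = year_to_year_changes(n_students=50)
large_district = year_to_year_changes(n_students=5000)

print("Share of 3+ point changes, 50 students:",
      share_at_least(small_school, 3))
print("Share of 3+ point changes, 5,000 students:",
      share_at_least(large_district, 3))
```

In this setup, changes of three or more points are common for the small group and rare for the large one, even though nothing about the simulated “school” differs between years.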
There have now been several stories in the New York news media about New York City’s charter schools’ “gains” on this year’s state tests (see here, here, here, here and here). All of them trumpeted the 3-7 percentage point increase in proficiency among the city’s charter students, compared with the 2-3 point increase among their counterparts in regular public schools. The consensus: Charters performed fantastically well this year.
In fact, the NY Daily News asserted that the “clear lesson” from the data is that “public school administrators must gain the flexibility enjoyed by charter leaders,” and “adopt [their] single-minded focus on achievement.” For his part, Mayor Michael Bloomberg claimed that the scores are evidence that the city should expand its charter sector.
All of this reflects a fundamental misunderstanding of how to interpret testing data, one that is frankly a little frightening to find among experienced reporters and elected officials.
New York State is set to release its annual testing data today. Throughout the state, and especially in New York City, we will hear a lot about changes in school and district proficiency rates. The rates themselves have advantages – they are easy to understand, comparable across grades and reflect a standards-based goal. But they also suffer from severe weaknesses, such as their sensitivity to where the bar is set and the fact that proficiency rates and the actual scores upon which they’re based can paint very different pictures of student performance, both in a given year and over time. I’ve discussed this latter issue before in the NYC context (and elsewhere), but I’d like to revisit it quickly.
Proficiency rates can only tell you how many students scored above a certain line; they are completely uninformative as to how far above or below that line the scores might be. Consider a hypothetical example: A student who is rated as proficient in year one might make large gains in his or her score in year two, but this would not be reflected in the proficiency rate for his or her school – in both years, the student would just be coded as “proficient” (the same goes for large decreases that do not “cross the line”). As a result, across a group of students, the average score could go up or down while proficiency rates remained flat or moved in the opposite direction. Things are even messier when data are cross-sectional (as public data almost always are), since you’re comparing two different groups of students (see this very recent NYC IBO report).
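The hypothetical above can be worked through with a handful of made-up numbers. In this sketch, the cut score and all of the scale scores are invented for illustration; the same five students are followed across scenarios so that only the rate-versus-score issue is in play.

```python
def proficiency_rate(scores, cut_score):
    """Share of students scoring at or above the cut score."""
    return sum(s >= cut_score for s in scores) / len(scores)


def average(scores):
    """Simple mean of the scale scores."""
    return sum(scores) / len(scores)


CUT_SCORE = 650  # hypothetical "proficient" line

# Baseline year for five hypothetical students.
year_one = [600, 640, 655, 660, 700]

# Scenario A: big gains among already-proficient students.
# The average rises, but nobody crosses the line.
scenario_a = [590, 640, 690, 700, 740]

# Scenario B: one student barely crosses the line while another
# falls far behind. The rate rises, but the average falls.
scenario_b = [560, 650, 651, 652, 660]

for label, scores in [("Year one", year_one),
                      ("Scenario A", scenario_a),
                      ("Scenario B", scenario_b)]:
    print(f"{label}: rate = {proficiency_rate(scores, CUT_SCORE):.0%}, "
          f"average = {average(scores):.1f}")
```

In scenario A the average goes up by 21 points while the rate stays at 60 percent; in scenario B the rate climbs to 80 percent even as the average drops by more than 16 points.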
Let’s take a rough look at how frequently rates and scores diverge in New York City.
A recent Economist article on charter schools, though slightly more nuanced than most mainstream media treatments of the charter evidence, contains a very common, somewhat misleading argument that I’d like to address quickly. It’s about the findings of the so-called “CREDO study,” the important (albeit over-cited) 2009 national comparison of student achievement in charter and regular public schools in 16 states.
Specifically, the article asserts that the CREDO analysis, which finds a statistically discernible but very small negative impact of charters overall (with wide underlying variation), also finds a significant positive effect among low-income students. This leads the Economist to conclude that the entire CREDO study “has been misinterpreted,” because its real value is in showing that “the children who most need charters have been served well.”
Whether or not an intervention affects outcomes among subgroups of students is obviously important (though one has hardly “misinterpreted” a study by focusing on its overall results). And CREDO does indeed find a statistically significant, positive test-based impact of charters on low-income students, vis-à-vis their counterparts in regular public schools. However, as discussed here (and in countless textbooks and methods courses), statistical significance only means we can be confident that the difference is non-zero (it cannot be chalked up to random fluctuation). Significant differences are often not large enough to be practically meaningful.
And this is certainly the case with CREDO and low-income students.
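The distinction between statistical and practical significance is easy to see with a quick calculation. The sketch below is a generic illustration, not a reproduction of CREDO's methods or estimates: the effect size (0.01 standard deviations), the sample sizes, and the simple two-sample z-test with equal group sizes are all assumptions chosen to make the point.

```python
import math


def two_sample_z(mean_diff, sd, n_per_group):
    """z-statistic for a difference in means, assuming equal SDs
    and equal group sizes."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    return mean_diff / standard_error


def two_sided_p(z):
    """Two-sided p-value under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical: the same tiny difference (0.01 standard deviations)
# tested with a modest sample and with a very large one.
effect_in_sd = 0.01
small_z = two_sample_z(effect_in_sd, sd=1.0, n_per_group=500)
large_z = two_sample_z(effect_in_sd, sd=1.0, n_per_group=500_000)

print(f"n = 500 per group:     z = {small_z:.2f}, p = {two_sided_p(small_z):.3f}")
print(f"n = 500,000 per group: z = {large_z:.2f}, p = {two_sided_p(large_z):.2g}")
```

The difference is identical in both cases. Only the sample size changes, and with enough students even a trivially small difference clears the conventional 0.05 threshold.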
Last year, the New York City Department of Education (NYCDOE) rolled out its annual testing results for the city’s students in a rather misleading manner. The press release touted the “significant progress” between 2010 and 2011 among city students, while, at a press conference, Mayor Michael Bloomberg called the results “dramatic.” In reality, however, the increase in proficiency rates (1-3 percentage points) was very modest, and, more importantly, the focus on the rates hid the fact that actual scale scores were either flat or decreased in most grades. In contrast, one year earlier, when the city’s proficiency rates dropped due to the state raising the cut scores, Mayor Bloomberg told reporters (correctly) that it was the actual scores that “really matter.”
Most recently, in announcing their 2011 graduation rates, the city did it again. The headline of the NYCDOE press release proclaims that “a record number of students graduated from high school in 2011.” This may be technically true, but the actual increase in the rate (rather than the number of graduates) was 0.4 percentage points, which is basically flat (as several reporters correctly noted). In addition, the city’s “college readiness rate” was similarly stagnant, falling slightly from 21.4 percent to 20.7 percent, while the graduation rate increase was higher both statewide and in New York State’s four other large districts (the city makes these comparisons when they are favorable).*
We’ve all become accustomed to this selective, exaggerated presentation of testing data, which is of course not at all limited to NYC. And it illustrates the obvious fact that test-based accountability plays out in multiple arenas, formal and informal, including the court of public opinion.
In education discussions and articles, people (myself included) often say “achievement” when referring to test scores, or “student learning” when talking about changes in those scores. These words reflect implicit judgments to some degree (e.g., that the test scores actually measure learning or achievement). Every once in a while, it’s useful to remind ourselves that scores from even the best student assessments are imperfect measures of learning. But this is so widely understood – certainly in the education policy world, and I would say among the public as well – that the euphemisms are generally tolerated.
And then there are a few common terms or phrases that, in my personal opinion, are not so harmless. I’d like to quickly discuss three of them (all of which I’ve talked about before). All three appear many times every day in newspapers, blogs, and regular discussions. To criticize their use may seem like semantic nitpicking to some people, but I would argue that these distinctions are substantively important and may not be so widely acknowledged, especially among people who aren’t heavily engaged in education policy (e.g., average newspaper readers).
So, here they are, in no particular order.
Journalists play an essential role in our society. They are charged with informing the public, a vital function in a representative democracy. Yet, year after year, large pockets of the electorate remain poorly informed on both foreign and domestic affairs. For a long time, commentators have blamed any number of different culprits for this problem, including poverty, education, increasing work hours and the rapid proliferation of entertainment media.
There is no doubt that these and other factors matter a great deal. Recently, however, there is growing evidence that the factors shaping the degree to which people are informed about current events include not only social and economic conditions, but journalist quality as well. Put simply, better journalists produce better stories, which in turn attract more readers. On the whole, the U.S. journalist community is world class. But there is, as always, a tremendous amount of underlying variation. It’s likely that improving the overall quality of reporters would not only result in higher quality information, but it would also bring in more readers. Both outcomes would contribute to a better-informed, more active electorate.
We at the Shanker Institute feel that it is time to start a public conversation about this issue. We have requested and received datasets documenting the story-by-story readership of the websites of U.S. newspapers, large and small. We are using these data in statistical models that we call “Readers-Added Models,” or “RAMs.”