Some of the best research out there is a product not of sophisticated statistical methods or complex research designs, but rather of painstaking manual data collection. A good example is a recent paper by Morgan Polikoff, Andrew McEachin, Stephani Wrabel and Matthew Duque, which was published in the latest issue of the journal Educational Researcher.
Polikoff and his colleagues performed a task that makes most of the rest of us cringe: They read and coded every one of the over 40 state applications for ESEA flexibility, or “waivers.” The end product is a simple but highly useful presentation of the measures states are using to identify “priority” (low-performing) and “focus” (schools “contributing to achievement gaps”) schools. The results are disturbing to anyone who believes that strong measurement should guide educational decisions.
There’s plenty of great data and discussion in the paper, but consider just one central finding: How states are identifying priority (i.e., lowest-performing) schools at the elementary level (the measures are of course a bit different for secondary schools).
Having taken a look at several states’ school rating systems (see our posts on the systems in IN, OH, FL and CO), I thought it might be interesting to examine a system used by a group of charter schools – starting with the system used by charters in the District of Columbia. This is the third year the DC charter school board has released the ratings.
For elementary and middle schools (upon which I will focus in this post*), the DC Performance Management Framework (PMF) is a weighted index composed of: 40 percent absolute performance; 40 percent growth; and 20 percent what they call “leading indicators” (a more detailed description of this formula can be found in the second footnote).** The index scores are then sorted into one of three tiers, with Tier 1 being the highest, and Tier 3 the lowest.
So, these particular ratings weight absolute performance – i.e., how highly students score on tests – a bit less heavily than do most states that have devised their own systems, and they grant slightly more importance to growth and alternative measures. We might therefore expect to find a somewhat weaker relationship between PMF scores and student characteristics such as free/reduced price lunch eligibility (FRL), as these charters are judged less predominantly on the students they serve. Let’s take a quick look.
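As a rough illustration of the 40/40/20 weighting described above, the sketch below computes a composite from three component scores. Only the weights come from the description; the 0-100 scale, the component values and the function name are hypothetical, and the tier cutoffs are not modeled because they are not given here.

```python
# Weights taken from the PMF description above; everything else is illustrative.
WEIGHTS = {"absolute": 0.40, "growth": 0.40, "leading": 0.20}


def pmf_index(scores):
    """Weighted sum of component scores (assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical school: low absolute performance, but strong growth.
example = {"absolute": 40.0, "growth": 70.0, "leading": 85.0}
print(round(pmf_index(example), 1))
```

Because absolute performance carries only 40 percent of the weight in this sketch, a school with low test scores can still land a middling composite if its growth and leading indicators are strong.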
I write often (probably too often) about the difference between measures of school performance and student performance, usually in the context of school rating systems. The basic idea is that schools cannot control the students they serve, and so absolute performance measures, such as proficiency rates, are telling you more about the students a school or district serves than how effective it is in improving outcomes (which is better-captured by growth-oriented indicators).
Recently, I was asked a simple question: Can a school with very high absolute performance levels ever actually be considered a “bad school?”
This is a good question.
In the Washington Post, Emma Brown reports on a behind-the-scenes decision about how to score last year’s new, more difficult tests in the District of Columbia Public Schools (DCPS) and the District’s charter schools.
To make a long story short, the choice faced by the Office of the State Superintendent of Education, or OSSE, which oversees testing in the District, was about how to convert test scores into proficiency rates. The first option, put simply, was to convert them such that the proficiency bar was more “aligned” with the Common Core, thus resulting in lower aggregate proficiency rates in math, compared with last year’s (in other states, such as Kentucky and New York, rates declined markedly). The second option was to score the tests while “holding constant” the difficulty of the questions, in order to facilitate comparisons of aggregate rates with those from previous years.
OSSE chose the latter option (according to some, in a manner that was insufficiently transparent). The end result was a modest increase in proficiency rates (which DC officials absurdly called “historic”).
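To see why the scoring choice described above matters, consider a minimal sketch in which the same set of test scores is converted into a proficiency rate under two different cut scores. The scores, the cut scores and the function name are all hypothetical; this is not OSSE’s actual procedure, only the general idea that the rate depends on where the bar is set.

```python
def proficiency_rate(scores, cut_score):
    """Share of students scoring at or above the cut score."""
    return sum(s >= cut_score for s in scores) / len(scores)


# Hypothetical scale scores for ten students.
scores = [38, 42, 47, 51, 55, 58, 62, 66, 71, 79]

print(proficiency_rate(scores, 50))  # lower bar
print(proficiency_rate(scores, 60))  # higher bar, same students and same scores
```

Nothing about the students’ performance differs between the two lines; only the definition of “proficient” does.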
A couple of weeks ago, Mike Petrilli of the Fordham Institute made the case that absolute proficiency rates should not be used as measures of school effectiveness, as they are heavily dependent on where students “start out” upon entry to the school. A few days later, Fordham president Checker Finn offered a defense of proficiency rates, noting that how much students know is substantively important, and associated with meaningful outcomes later in life.
They’re both correct. This is not a debate about whether proficiency rates are at all useful (by the way, I don’t read Petrilli as saying that). It’s about how they should be used and how they should not.
Let’s keep this simple. Here is a quick, highly simplified list of how I would recommend interpreting and using absolute proficiency rates, and how I would avoid using them.
Last week, the results of New York’s new Common Core-aligned assessments were national news. For months, officials throughout the state, including New York City, have been preparing the public for the release of these data.
Their basic message was that the standards, and thus the tests based upon them, are more difficult, and they represent an attempt to truly gauge whether students are prepared for college and the labor market. The inevitable consequence of raising standards, officials have been explaining, is that fewer students will be “proficient” than in previous years (which was, of course, the case) – this does not mean that students are performing worse, only that they are being held to higher expectations, and that the skills and knowledge being assessed require a new, more expansive curriculum. Therefore, interpretation of the new results versus those in previous years must be extremely cautious, and educators, parents and the public should not jump to conclusions about what they mean.
For the most part, the main points of this public information campaign are correct. It would, however, be wonderful if similar caution were evident in the roll-out of testing results in past (and, more importantly, future) years.
Recent events in Indiana and Florida have resulted in a great deal of attention to the new school rating systems that over 25 states are using to evaluate the performance of schools, often attaching high-stakes consequences and rewards to the results. We have published reviews of several states’ systems here over the past couple of years (see our posts on the systems in Florida, Indiana, Colorado, New York City and Ohio, for example).
Virtually all of these systems rely heavily, if not entirely, on standardized test results, most commonly by combining two general types of test-based measures: absolute performance (or status) measures, or how highly students score on tests (e.g., proficiency rates); and growth measures, or how quickly students make progress (e.g., value-added scores). As discussed in previous posts, absolute performance measures are best seen as gauges of student performance, since they can’t account for the fact that students enter the schooling system at vastly different levels, whereas growth-oriented indicators can be viewed as more appropriate in attempts to gauge school performance per se, as they seek (albeit imperfectly) to control for students’ starting points (and other characteristics that are known to influence achievement levels) in order to isolate the impact of schools on testing performance.*
One interesting aspect of this distinction, which we have not discussed thoroughly here, is the idea/possibility that these two measures are “in conflict.” Let me explain what I mean by that.
In a new NBER working paper, economist Derek Neal makes an important point, one of which many people in education are aware but which is infrequently reflected in actual policy. The point is that using the same assessment to measure both student and teacher performance often contaminates the results for both purposes.
In fact, as Neal notes, some of the very features required to measure student performance are the ones that make possible the contamination when the tests are used in high-stakes accountability systems. Consider, for example, a situation in which a state or district wants to compare the test scores of a cohort of fourth graders in one year with those of fourth graders the next year. One common means of facilitating this comparability is administering some of the questions to both groups (or to some “pilot” sample of students prior to those being tested). Otherwise, any difference in scores between the two cohorts might simply be due to differences in the difficulty of the questions. If you cannot check that out, it’s tough to make meaningful comparisons.
But it’s precisely this need to repeat questions that enables one form of so-called “teaching to the test,” in which administrators and educators use questions from prior assessments to guide their instruction for the current year.
As noted in a nice little post over at Greater Greater Washington’s education blog, the District of Columbia Office of the State Superintendent of Education (OSSE) recently started releasing growth model scores for DC’s charter and regular public schools. These models, in a nutshell, assess schools by following their students over time and gauging their testing progress relative to similar students (they can also be used for individual teachers, but DCPS uses a different model in its teacher evaluations).
In my opinion, producing these estimates and making them available publicly is a good idea, and definitely preferable to the district’s previous reliance on changes in proficiency, which are truly awful measures (see here for more on this). It’s also, however, important to note that the model chosen by OSSE (a “median growth percentile,” or MGP, model) produces estimates that have been shown to be at least somewhat more heavily associated with student characteristics than other types of models, such as value-added models proper. This does not necessarily mean the growth percentile models are “inaccurate” – there are good reasons, such as resources and more difficulty with teacher recruitment/retention, to believe that schools serving poorer students might be less effective, on average, and it’s tough to separate “real” effects from bias in the models.
That said, let’s take a quick look at this relationship using the DC MGP scores from 2011, with poverty data from the National Center for Education Statistics.
The annual release of state testing data makes the news in every state, but Florida is one of those places where it is to some degree a national story.*
Well, it’s getting to be that time of year again. Last week, the state released its writing exam (FCAT 2.0 Writing) results for 2013 (as well as the math and reading results for third graders only). The Florida Department of Education (FLDOE) press release noted: “With significant gains in writing scores, Florida’s teachers and students continue to show that higher expectations and support at home and in the classroom enable every child to succeed.” This interpretation of the data was generally repeated without scrutiny in the press coverage of the results.
Putting aside the fact that the press release incorrectly calls the year-to-year changes “gains” (they are actually comparisons of two different groups of students; see here), the FLDOE’s presentation of the FCAT Writing results, though common, is, at best, incomplete and, at worst, misleading. Moreover, the important issues in this case are applicable in all states, and unusually easy to illustrate using the simple data released to the public.
** Reprinted here in the Washington Post
A big part of successful policy making is unyielding attention to detail (an argument that regular readers of this blog hear often). Choices about design and implementation that may seem unimportant can play a substantial role in determining how policies play out in practice.
A new paper, co-authored by Elizabeth Davidson, Randall Reback, Jonah Rockoff and Heather Schwartz, and presented at last month’s annual conference of The Association for Education Finance and Policy, illustrates this principle vividly, and on a grand scale, with an analysis of outcomes in all 50 states during the early years of NCLB.
After a terrific summary of the law’s rules and implementation challenges, as well as some quick descriptive statistics, the paper’s main analysis is a straightforward examination of why the proportion of schools meeting AYP varied quite a bit between states. For instance, in 2003, the first year of results, 32 percent of U.S. schools failed to make AYP, but the proportion ranged from one percent in Iowa to over 80 percent in Florida.
Surprisingly, the results suggest that the primary reasons for this variation seem to have had little to do with differences in student performance. Rather, the big factors are subtle differences in rather arcane rules that each state chose during the implementation process. These decisions received little attention, yet they had a dramatic impact on the outcomes of NCLB during this time period.
In his State of the City address last month, New York City Mayor Michael Bloomberg made some brief comments about the upcoming adoption of new assessments aligned with the Common Core State Standards (CCSS), including the following statement:
But no matter where the definition of proficiency is arbitrarily set on the new tests, I expect that our students’ progress will continue outpacing the rest of the State’s[,] the only meaningful measurement of progress we have.
On the surface, this may seem like just a little bit of healthy bravado. But there are a few things about this single sentence that struck me, and it also helps to illustrate an important point about the relationship between standards and testing results.
A recent article in Reuters, one that received a great deal of attention, sheds light on practices that some charter schools are using essentially to screen students who apply for admission. These policies include requiring long and difficult applications, family interviews, parental contracts, and even demonstrations of past academic performance.
It remains unclear how common these practices might be in the grand scheme of things, but regardless of how frequently they occur, most of these tactics are terrible, perhaps even illegal, and should be stopped. At the same time, there are two side points to keep in mind when you hear about charges such as these, as well as the accusations (and denials) of charter exclusion and segregation that tend to follow.
The first is that some degree of (self-)sorting and segregation of students by abilities, interests and other characteristics is part of the deal in a choice-based system. The second point is that screening and segregation are most certainly not unique to charter/private schools, and one primary reason is that there is, in a sense, already a lot of choice among regular public schools.
Some Florida officials are still having trouble understanding why they’re finding no relationship between the grades schools receive and the evaluation ratings of teachers in those schools. For his part, new Florida education Commissioner Tony Bennett is also concerned. According to the article linked above, he acknowledges (to his credit) that the two measures are different, but is also considering “revis[ing] the models to get some fidelity between the two rankings.”
This may be turning into a potentially risky situation. As discussed in a recent post, it is important to examine the results of the new teacher evaluations, but there is no reason one would expect to find a strong relationship between these ratings and the school grades, as they are in large part measuring different things (and imprecisely at that). The school grades are mostly (but not entirely) driven by how highly students score, whereas teacher evaluations are, to the degree possible, designed to be independent of these absolute performance levels. Florida cannot validate one system using the other.
However, as also mentioned in that post, this is not to say that there should be no relationship at all. For example, both systems include growth-oriented measures (albeit using very different approaches). In addition, schools with lower average performance levels sometimes have trouble recruiting and retaining good teachers. Due to these and other factors, the reasonable expectation is to find some association overall, just not one that’s extremely strong. And that’s basically what one finds, even using the same set of results upon which the claims that there is no relationship are based.
** Reprinted here in the Washington Post
Former Florida Governor Jeb Bush was in Virginia last week, helping push for a new law that would install an “A-F” grading system for all public schools in the commonwealth, similar to a system that has existed in Florida for well over a decade.
In making his case, Governor Bush put forth an argument about the Florida system that he and his supporters use frequently. He said that, right after the grades went into place in his state, there was a drop in the proportion of D and F schools, along with a huge concurrent increase in the proportion of A schools. For example, as Governor Bush notes, in 1999, only 12 percent of schools got A’s. In 2005, when he left office, the figure was 53 percent. The clear implication: It was the grading of schools (and the incentives attached to the grades) that caused the improvements.
There is some pretty good evidence (also here) that the accountability pressure of Florida’s grading system generated modest increases in testing performance among students in schools receiving F’s (i.e., an outcome to which consequences were attached), and perhaps higher-rated schools as well. However, putting aside the serious confusion about what Florida’s grades actually measure, as well as the incorrect premise that we can evaluate a grading policy’s effect by looking at the simple distribution of those grades over time, there’s a much deeper problem here: The grades changed in part because the criteria changed.