Some of the best research out there is a product not of sophisticated statistical methods or complex research designs, but rather of painstaking manual data collection. A good example is a recent paper by Morgan Polikoff, Andrew McEachin, Stephani Wrabel and Matthew Duque, which was published in the latest issue of the journal Educational Researcher.
Polikoff and his colleagues performed a task that makes most of the rest of us cringe: They read and coded every one of the over 40 state applications for ESEA flexibility, or “waivers.” The end product is a simple but highly useful presentation of the measures states are using to identify “priority” (low-performing) and “focus” (schools “contributing to achievement gaps”) schools. The results are disturbing to anyone who believes that strong measurement should guide educational decisions.
There’s plenty of great data and discussion in the paper, but consider just one central finding: How states are identifying priority (i.e., lowest-performing) schools at the elementary level (the measures are of course a bit different for secondary schools). Read More »
Having taken a look at several states’ school rating systems (see our posts on the systems in IN, OH, FL and CO), I thought it might be interesting to examine a system used by a group of charter schools – starting with the system used by charters in the District of Columbia. This is the third year the DC charter school board has released the ratings.
For elementary and middle schools (upon which I will focus in this post*), the DC Performance Management Framework (PMF) is a weighted index composed of: 40 percent absolute performance; 40 percent growth; and 20 percent what they call “leading indicators” (a more detailed description of this formula can be found in the second footnote).** The index scores are then sorted into one of three tiers, with Tier 1 being the highest, and Tier 3 the lowest.
So, these particular ratings weight absolute performance – i.e., how highly students score on tests – a bit less heavily than do most states that have devised their own systems, and they grant slightly more importance to growth and alternative measures. We might therefore expect to find a somewhat weaker relationship between PMF scores and student characteristics such as free/reduced price lunch eligibility (FRL), as these charters are judged less predominantly on the students they serve. Let’s take a quick look. Read More »
As discussed in a prior post, the research on applying value-added to teacher prep programs is pretty much still in its infancy. Even just a couple of years of would go a long way toward at least partially addressing the many open questions in this area (including, by the way, the evidence suggesting that differences between programs may not be meaningfully large).
Nevertheless, a few states have decided to plow ahead and begin publishing value-added estimates for their teacher preparation programs. Tennessee, which seems to enjoy being first — their Race to the Top program is, a little ridiculously, called “First to the Top” — was ahead of the pack. They have once again published ratings for the few dozen teacher preparation programs that operate within the state. As mentioned in my post, if states are going to do this (and, as I said, my personal opinion is that it would be best to wait), it is absolutely essential that the data be presented along with thorough explanations of how to interpret and use them.
Tennessee fails to meet this standard. Read More »
A new working paper, published by the National Bureau of Economic Research, is the first high quality assessment of one of the new teacher evaluation systems sweeping across the nation. The study, by Thomas Dee and James Wyckoff, both highly respected economists, focuses on the first three years of IMPACT, the evaluation system put into place in the District of Columbia Public Schools in 2009.
Under IMPACT, each teacher receives a point total based on a combination of test-based and non-test-based measures (the formula varies between teachers who are and are not in tested grades/subjects). These point totals are then sorted into one of four categories – highly effective, effective, minimally effective and ineffective. Teachers who receive a highly effective (HE) rating are eligible for salary increases, whereas teachers rated ineffective are dismissed immediately and those receiving minimally effective (ME) for two consecutive years can also be terminated. The design of this study exploits that incentive structure by, put very simply, comparing the teachers who were directly above the ME and HE thresholds to those who were directly below them, and to see whether they differed in terms of retention and performance from those who were not. The basic idea is that these teachers are all very similar in terms of their measured performance, so any differences in outcomes can be (cautiously) attributed to the system’s incentives.
The short answer is that there were meaningful differences. Read More »
The District of Columbia Public Schools (DCPS) has recently released the first round of results from its new principal evaluation system. Like the system used for teachers, the principal ratings are based on a combination of test and non-test measures. And the two systems use the same final rating categories (highly effective, effective, minimally effective and ineffective).
It was perhaps inevitable that there would be comparisons of their results. In short, principal ratings were substantially lower, on average. Roughly half of them received one of the two lowest ratings (minimally effective or ineffective), compared with around 10 percent of teachers.
Some wondered whether this discrepancy by itself means that DC teachers perform better than principals. Of course not. It is difficult to compare the performance of teachers versus that of principals, but it’s unsupportable to imply that we can get a sense of this by comparing the final rating distributions from two evaluation systems. Read More »
I write often (probably too often) about the difference between measures of school performance and student performance, usually in the context of school rating systems. The basic idea is that schools cannot control the students they serve, and so absolute performance measures, such as proficiency rates, are telling you more about the students a school or district serves than how effective it is in improving outcomes (which is better-captured by growth-oriented indicators).
Recently, I was asked a simple question: Can a school with very high absolute performance levels ever actually be considered a “bad school?”
This is a good question. Read More »
In the Washington Post, Emma Brown reports on a behind the scenes decision about how to score last year’s new, more difficult tests in the District of Columbia Public Schools (DCPS) and the District’s charter schools.
To make a long story short, the choice faced by the Office of the State Superintendent of Education, or OSSE, which oversees testing in the District, was about how to convert test scores into proficiency rates. The first option, put simply, was to convert them such that the proficiency bar was more “aligned” with the Common Core, thus resulting in lower aggregate proficiency rates in math, compared with last year’s (in other states, such as Kentucky and New York, rates declined markedly). The second option was to score the tests while “holding constant” the difficulty of the questions, in order to facilitate comparisons of aggregate rates with those from previous years.
OSSE chose the latter option (according to some, in a manner that was insufficiently transparent). The end result was a modest increase in proficiency rates (which DC officials absurdly called “historic”). Read More »
Advocates of the so-called “Florida Formula,” a package of market-based reforms enacted throughout the 1990s and 2000s, some of which are now spreading rapidly in other states, traveled to Michigan this week to make their case to the state’s lawmakers, with particular emphasis on Florida’s school grading system. In addition to arguments about accessibility and parental involvement, their empirical (i.e., test-based) evidence consisted largely of the standard, invalid claims that cross-sectional NAEP increases prove the reforms’ effectiveness, along with a bonus appearance of the argument that since Florida starting grading schools, the grades have improved, even though this is largely (and demonstrably) a result of changes in the formula.
As mentioned in a previous post, I continue to be perplexed at advocates’ insistence on using this “evidence,” even though there is a decent amount of actual rigorous policy research available, much of it positive.
So, I thought it would be fun, though slightly strange, for me to try on my market-based reformer cap, and see what it would look like if this kind of testimony about the Florida reforms was actually research-based (at least the test-based evidence). Here’s a very rough outline of what I came up with: Read More »
There is currently a push to evaluate teacher preparation programs based in part on the value-added of their graduates. Predictably, this is a highly controversial issue, and the research supporting it is, to be charitable, still underdeveloped. At present, the evidence suggests that the differences in effectiveness between teachers trained by different prep programs may not be particularly large (see here, here, and here), though there may be exceptions (see this paper).
In the meantime, there’s an interesting little conflict underlying the debate about measuring preparation programs’ effectiveness, one that’s worth pointing out. For the purposes of this discussion, let’s put aside the very important issue of whether the models are able to account fully for where teaching candidates end up working (i.e., bias in the estimates based on school assignments/preferences), as well as (valid) concerns about judging teachers and preparation programs based solely on testing outcomes. All that aside, any assessment of preparation programs using the test-based effectiveness of their graduates is picking up on two separate factors: How well they prepare their candidates; and who applies to their programs in the first place.
In other words, programs that attract and enroll highly talented candidates might look good even if they don’t do a particularly good job preparing teachers for their eventual assignments. But does that really matter? Read More »
A couple of weeks ago, Mike Petrilli of the Fordham Institute made the case that absolute proficiency rates should not be used as measures of school effectiveness, as they are heavily dependent on where students “start out” upon entry to the school. A few days later, Fordham president Checker Finn offered a defense of proficiency rates, noting that how much students know is substantively important, and associated with meaningful outcomes later in life.
They’re both correct. This is not a debate about whether proficiency rates are at all useful (by the way, I don’t read Petrilli as saying that). It’s about how they should be used and how they should not.
Let’s keep this simple. Here is a quick, highly simplified list of how I would recommend interpreting and using absolute proficiency rates, and how I would avoid using them. Read More »
The change in New York State tests, as well as their results, has inevitably resulted in a lot of discussion of how achievement gaps have changed over the past decade or so (and what they look like using the new tests). In many cases, the gaps, and trends in the gaps, are being presented in terms of proficiency rates.
I’d like to make one quick point, which is applicable both in New York and beyond: In general, it is not a good idea to present average student performance trends in terms of proficiency rates, rather than average scores, but it is an even worse idea to use proficiency rates to measure changes in achievement gaps.
Put simply, proficiency rates have a legitimate role to play in summarizing testing data, but the rates are very sensitive to the selection of cut score, and they provide a very limited, often distorted portrayal of student performance, particularly when viewed over time. There are many ways to illustrate this distortion, but among the more vivid is the fact, which we’ve shown in previous posts, that average scores and proficiency rates often move in different directions. In other words, at the school-level, it is frequently the case that the performance of the typical student — i.e., the average score — increases while the proficiency rate decreases, or vice-versa.
Unfortunately, the situation is even worse when looking achievement gaps. To illustrate this in a simple manner, let’s take a very quick look at NAEP data (4th grade math), broken down by state, between 2009 and 2011. Read More »
Last week, the results of New York’s new Common Core-aligned assessments were national news. For months, officials throughout the state, including New York City, have been preparing the public for the release of these data.
Their basic message was that the standards, and thus the tests based upon them, are more difficult, and they represent an attempt to truly gauge whether students are prepared for college and the labor market. The inevitable consequence of raising standards, officials have been explaining, is that fewer students will be “proficient” than in previous years (which was, of course, the case) – this does not mean that students are performing worse, only that they are being held to higher expectations, and that the skills and knowledge being assessed require a new, more expansive curriculum. Therefore, interpretation of the new results versus those in previous year must be extremely cautious, and educators, parents and the public should not jump to conclusions about what they mean.
For the most part, the main points of this public information campaign are correct. It would, however, be wonderful if similar caution were evident in the roll-out of testing results in past (and, more importantly, future) years. Read More »
As reported over at Education Week, the so-called “sequester” has claimed yet another victim: The National Assessment of Educational Progress, or NAEP. As most people who follow education know, this highly respected test, which is often called the “nation’s report card,” is a very useful means of assessing student performance, both in any given year and over time.
Two of the “main assessments” – i.e., those administered in math and reading every two years to fourth and eighth graders – get most of the attention in our public debate, and these remain largely untouched by the cuts. But, last May, the National Assessment Governing Board, which oversees NAEP, decided to eliminate the 2014 NAEP exams in civics, history and geography for all but 8th graders (the exams were previously administered in grades 4, 8 and 12). Now, in its most recent announcement, the Board has decided to cancel its plans to expand the sample for 12th graders (in math, reading, and science) to make it large enough to allow state-level results. In addition, the 4th and 8th grade science samples will be cut back, making subgroup breakdowns very difficult, and the science exam will no longer be administered to individual districts. Finally, the “long-term trend NAEP,” which has tracked student performance for 40 years, has been suspended for 2016. These are substantial cutbacks.
Although its results are frequently misinterpreted, NAEP is actually among the few standardized tests in the U.S. that receives rather wide support from all “sides” of the testing debate. And one cannot help but notice the fact that federal and state governments are currently making significant investments in new tests that are used for high-stakes purposes, whereas NAEP, the primary low-stakes assessment, is being scaled back. Read More »
Recent events in Indiana and Florida have resulted in a great deal of attention to the new school rating systems that over 25 states are using to evaluate the performance of schools, often attaching high-stakes consequences and rewards to the results. We have published reviews of several states’ systems here over the past couple of years (see our posts on the systems in Florida, Indiana, Colorado, New York City and Ohio, for example).
Virtually all of these systems rely heavily, if not entirely, on standardized test results, most commonly by combining two general types of test-based measures: absolute performance (or status) measures, or how highly students score on tests (e.g., proficiency rates); and growth measures, or how quickly students make progress (e.g., value-added scores). As discussed in previous posts, absolute performance measures are best seen as gauges of student performance, since they can’t account for the fact that students enter the schooling system at vastly different levels, whereas growth-oriented indicators can be viewed as more appropriate in attempts to gauge school performance per se, as they seek (albeit imperfectly) to control for students’ starting points (and other characteristics that are known to influence achievement levels) in order to isolate the impact of schools on testing performance.*
One interesting aspect of this distinction, which we have not discussed thoroughly here, is the idea/possibility that these two measures are “in conflict.” Let me explain what I mean by that. Read More »
In a new NBER working paper, economist Derek Neal makes an important point, one of which many people in education are aware, but is infrequently reflected in actual policy. The point is that using the same assessment to measure both student and teacher performance often contaminates the results for both purposes.
In fact, as Neal notes, some of the very features required to measure student performance are the ones that make possible the contamination when the tests are used in high-stakes accountability systems. Consider, for example, a situation in which a state or district wants to compare the test scores of a cohort of fourth graders in one year with those of fourth graders the next year. One common means of facilitating this comparability is administering some of the questions to both groups (or to some “pilot” sample of students prior to those being tested). Otherwise, any difference in scores between the two cohorts might simply be due to differences in the difficulty of the questions. If you cannot check that out, it’s tough to make meaningful comparisons.
But it’s precisely this need to repeat questions that enables one form of so-called “teaching to the test,” in which administrators and educators use questions from prior assessments to guide their instruction for the current year. Read More »