The following is written by Morgan S. Polikoff and Matthew Di Carlo. Morgan is Assistant Professor in the Rossier School of Education at the University of Southern California.
One of the primary policy levers now being employed in states and districts nationwide is teacher evaluation reform. Well-designed evaluations, which should include measures that capture both teacher practice and student learning, have great potential to inform and improve the performance of teachers and, thus, students. Furthermore, most everyone agrees that the previous systems were largely pro forma, failed to provide useful feedback, and needed replacement.
The attitude among many policymakers and advocates is that we must implement these systems and begin using them rapidly for decisions about teachers, while design flaws can be fixed later. Such urgency is undoubtedly influenced by the history of slow, incremental progress in education policy. However, we believe this attitude to be imprudent.
** Reprinted here in the Washington Post
A big part of successful policy making is unyielding attention to detail (an argument that regular readers of this blog hear often). Choices about design and implementation that may seem unimportant can play a substantial role in determining how policies play out in practice.
A new paper, co-authored by Elizabeth Davidson, Randall Reback, Jonah Rockoff and Heather Schwartz, and presented at last month’s annual conference of The Association for Education Finance and Policy, illustrates this principle vividly, and on a grand scale, with an analysis of outcomes in all 50 states during the early years of NCLB.
After a terrific summary of the law’s rules and implementation challenges, as well as some quick descriptive statistics, the paper’s main analysis is a straightforward examination of why the proportion of schools meeting AYP varied quite a bit between states. For instance, in 2003, the first year of results, 32 percent of U.S. schools failed to make AYP, but the proportion ranged from one percent in Iowa to over 80 percent in Florida.
Surprisingly, the results suggest that the primary reasons for this variation had little to do with differences in student performance. Instead, the big factors were subtle differences in rather arcane rules that each state chose during the implementation process. These decisions received little attention, yet they had a dramatic impact on the outcomes of NCLB during this time period.
Earlier this week, New Jersey Governor Chris Christie announced that the state will assume control over Camden City School District. Camden will be the fourth NJ district to undergo takeover, though this is the first time that the state will be removing control from an elected local school board, which will now serve in an advisory role (and have three additional members appointed by the Governor). Over the next few weeks, NJ officials will choose a new superintendent, and begin to revamp evaluations, curricula and other core policies.
Accompanying the announcement, the Governor’s office released a two-page “fact sheet,” much of which is devoted to justifying this move to the public.
Before discussing it, let’s be clear about something - it may indeed be the case that Camden schools are so critically low-performing and/or dysfunctional as to warrant drastic intervention. Moreover, it’s at least possible that state takeover is the appropriate type of intervention to help these schools improve (though the research on this latter score is, to be charitable, undeveloped).
That said, the “fact sheet” presents relatively little valid evidence regarding the academic performance of Camden schools. Given the sheer magnitude of any takeover decision, it is crucial for the state to demonstrate publicly that it has left no stone unturned by presenting a case that is as comprehensive and compelling as possible. However, the discrepancy between that high bar and NJ’s evidence, at least that pertaining to academic outcomes, is more than a little disconcerting.
In a story for Education Week, always reliable Stephen Sawchuk reports on what may be a trend in states’ first results from their new teacher evaluation systems: The ratings are skewed toward the top.
For example, the article notes that, in Michigan, Florida and Georgia, a high proportion of teachers (more than 90 percent) received one of the two top ratings (out of four or five). This has led to some grumbling among advocates and others, citing similarities between these results and those of the old systems, in which the vast majority of teachers were rated “satisfactory,” and very few were found to be “unsatisfactory.”
Differentiation is very important in teacher evaluations – it’s kind of the whole point. Thus, it’s a problem when ratings are too heavily concentrated toward one end of the distribution. However, as Aaron Pallas points out, these important conversations about evaluation results sometimes seem less focused on good measurement or even the spread of teachers across categories than on the narrower question of how many teachers end up with the lowest rating – i.e., how many teachers will be fired.
In his State of the City address last month, New York City Mayor Michael Bloomberg made some brief comments about the upcoming adoption of new assessments aligned with the Common Core State Standards (CCSS), including the following statement:
But no matter where the definition of proficiency is arbitrarily set on the new tests, I expect that our students’ progress will continue outpacing the rest of the State’s[,] the only meaningful measurement of progress we have.
On the surface, this may seem like just a little bit of healthy bravado. But there are a few things about this single sentence that struck me, and it also helps to illustrate an important point about the relationship between standards and testing results.
Some Florida officials are still having trouble understanding why they’re finding no relationship between the grades schools receive and the evaluation ratings of teachers in those schools. For his part, new Florida education Commissioner Tony Bennett is also concerned. According to the article linked above, he acknowledges (to his credit) that the two measures are different, but is also considering “revis[ing] the models to get some fidelity between the two rankings.”
This may be turning into a potentially risky situation. As discussed in a recent post, it is important to examine the results of the new teacher evaluations, but there is no reason one would expect to find a strong relationship between these ratings and the school grades, as they are in large part measuring different things (and imprecisely at that). The school grades are mostly (but not entirely) driven by how highly students score, whereas teacher evaluations are, to the degree possible, designed to be independent of these absolute performance levels. Florida cannot validate one system using the other.
However, as also mentioned in that post, this is not to say that there should be no relationship at all. For example, both systems include growth-oriented measures (albeit using very different approaches). In addition, schools with lower average performance levels sometimes have trouble recruiting and retaining good teachers. Due to these and other factors, the reasonable expectation is to find some association overall, just not one that’s extremely strong. And that’s basically what one finds, even using the same set of results upon which the claims that there is no relationship are based.
A few weeks ago, Students First NY (SFNY) released a report, in which they presented a very simple analysis of the distribution of “unsatisfactory” teacher evaluation ratings (“U-ratings”) across New York City schools in the 2011-12 school year.
The report finds that U-ratings are distributed unequally. In particular, they are more common in schools with higher poverty, more minorities, and lower proficiency rates. Thus, the authors conclude, the students who are most in need of help are getting the worst teachers.
There is good reason to believe that schools serving larger proportions of disadvantaged students have a tougher time attracting, developing and retaining good teachers, and there is evidence of this, even based on value-added estimates, which adjust for these characteristics (also see here). However, the assumptions upon which this Students First analysis is based are better seen as empirical questions, and, perhaps more importantly, the recommendations they offer are a rather crude, narrow manifestation of market-based reform principles.
Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an accessible review of the technical and practical issues surrounding these models.
This past November, I wrote a post for this blog about shifting course in the teacher evaluation movement and using value-added as a “screening device.” This means that the measures would be used: (1) to help identify teachers who might be struggling and for whom additional classroom observations (and perhaps other information) should be gathered; and (2) to identify classroom observers who might not be doing an effective job.
Screening takes advantage of the low cost of value-added and the fact that the estimates are more accurate in making general assessments of performance patterns across teachers, while avoiding the weaknesses of value-added—especially that the measures are often inaccurate for individual teachers, as well as confusing and not very credible among teachers when used for high-stakes decisions.
I want to thank the many people who responded to the first post. There were three main camps.
Last week, Florida State Senate President Don Gaetz (R – Niceville) expressed his skepticism about the recently released results of the state’s new teacher evaluation system. The senator was particularly concerned about how the ratings compared with schools’ “A-F” grades. He noted, “If you have a C school, 90 percent of the teachers in a C school can’t be highly effective. That doesn’t make sense.”
There’s an important discussion to be had about the results of both the school and teacher evaluation systems, and the distributions of the ratings can definitely be part of that discussion (even if this issue is sometimes approached in a superficial manner). However, arguing that we can validate Florida’s teacher evaluations using its school grades, or vice-versa, suggests little understanding of either. Actually, given the design of both systems, finding a modest or even weak association between them would make pretty good sense.
In order to understand why, there are two facts to consider.
** Reprinted here in the Washington Post
Former Florida Governor Jeb Bush has become one of the more influential education advocates in the country. He travels the nation armed with a set of core policy prescriptions, sometimes called the “Florida formula,” as well as “proof” that they work. The evidence that he and his supporters present consists largely of changes in average statewide test scores – NAEP and the state exam (FCAT) – since the reforms started going into place. The basic idea is that increases in testing results are the direct result of these policies.
Governor Bush is no doubt sincere in his effort to improve U.S. education, and, as we’ll see, a few of the policies comprising the “Florida formula” have some test-based track record. However, his primary empirical argument on their behalf – the coincidence of these policies’ implementation with changes in scores and proficiency rates – though common among both “sides” of the education debate, is simply not valid. We’ve discussed why this is the case many times (see here, here and here), as have countless others, in the Florida context as well as more generally.*
There is no need to repeat those points, except to say that they embody the most basic principles of data interpretation and causal inference. It would be wonderful if the evaluation of education policies – or of school systems’ performance more generally – were as easy as looking at raw, cross-sectional testing data. But it is not.
Luckily, one need not rely on these crude methods. We can instead take a look at some of the rigorous research that has specifically evaluated the core reforms comprising the “Florida formula.” As usual, it is a far more nuanced picture than supporters (and critics) would have you believe.
** Reprinted here in the Washington Post
2012 was another busy year for market-based education reform. The rapid proliferation of charter schools continued, while states and districts went about the hard work of designing and implementing new teacher evaluations that incorporate student testing data, and, in many cases, performance pay programs to go along with them.
As in previous years (see our 2010 and 2011 reviews), much of the research on these three “core areas” – merit pay, charter schools, and the use of value-added and other growth models in teacher evaluations – appeared rather responsive to the direction of policy making, but could not always keep up with its breakneck pace.*
Some lag time is inevitable, not only because good research takes time, but also because there’s a degree to which you have to try things before you can see how they work. Nevertheless, what we don’t know about these policies far exceeds what we know, and, given the sheer scope and rapid pace of reforms over the past few years, one cannot help but get the occasional “flying blind” feeling. Moreover, as is often the case, the only unsupportable position is certainty.
The New Teacher Project’s (TNTP) recent report on teacher retention, called “The Irreplaceables,” garnered quite a bit of media attention. In a discussion of this report, I argued, among other things, that the label “irreplaceable” is a highly exaggerated way of describing their definitions, which, by the way, varied between the five districts included in the analysis. In general, TNTP’s definitions are better described as “probably above average in at least one subject” (and this distinction matters for how one interprets the results).
I’d like to elaborate a bit on this issue – that is, how to categorize teachers’ growth model estimates, which one might do, for example, when incorporating them into a final evaluation score. This choice, which receives virtually no discussion in TNTP’s report, is always a judgment call to some degree, but it’s an important one for accountability policies. Many states and districts are drawing those very lines between teachers (and schools), and attaching consequences and rewards to the outcomes.
Let’s take a very quick look, using the publicly released 2010 “teacher data reports” from New York City (there are details about the data in the first footnote*). Keep in mind that these are just value-added estimates, and are thus, at best, incomplete measures of the performance of teachers (however, importantly, the discussion below is not specific to growth models; it can apply to many different types of performance measures).
Whatever one thinks of the heavy reliance on standardized tests in U.S. public education, one of the things on which there is wide agreement is that cheating must be prevented, and investigated when there’s evidence it might have occurred.
For anyone familiar with test-based accountability, recent cheating scandals in Atlanta, Washington, D.C., Philadelphia and elsewhere are unlikely to have been surprising. There has always been cheating, and it can take many forms, ranging from explicit answer-changing to subtle coaching on test day. One cannot say with any certainty how widespread cheating is, but there is every reason to believe that high-stakes testing increases the likelihood that it will happen. The first step toward addressing that problem is to recognize it.
A district, state or nation that is unable or unwilling to acknowledge the possibility of cheating, do everything possible to prevent it, and face up to it when evidence suggests it has occurred, is ill-equipped to rely on test-based accountability policies.
Charter schools in New Orleans, LA (NOLA) receive a great deal of attention, in no small part because they serve a larger proportion of public school students than do charters in any other major U.S. city. Less discussed, however, is the prevalence of NOLA’s “selective schools” (elsewhere, they are sometimes called “exam schools”). These schools maintain criteria for admission and/or retention, based on academic and other qualifications (often grades and/or standardized test scores).
At least six of NOLA’s almost 90 public schools are selective – one high school, four (P)K-8 schools and one serving grades K-12. When you add up their total enrollment, around one in eight NOLA students attends one of these schools.*
Although I couldn’t find recent summary data on the prevalence of selective schools in urban districts around the U.S., this is almost certainly an extremely high proportion (for instance, selective schools in New York City and Chicago, which are mostly secondary schools, serve only a tiny fraction of students in those cities).
** Reprinted here in the Washington Post
Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an excellent, accessible review of the technical and practical issues surrounding these models.
Now that the election is over, the Obama Administration and policymakers nationally can return to governing. Of all the education-related decisions that have to be made, the future of teacher evaluation has to be front and center.
In particular, how should “value-added” measures be used in teacher evaluation? President Obama’s Race to the Top initiative expanded the use of these measures, which attempt to identify how much each teacher contributes to student test scores. In doing so, the initiative embraced and expanded the controversial reliance on standardized tests that started under President Bush’s No Child Left Behind.
In many respects, The Race was well designed. It addresses an important problem – the vast majority of teachers report receiving limited quality feedback on instruction. As a competitive grants program, it was voluntary for states to participate (though involuntary for many districts within those states). The Administration also smartly embraced the idea of multiple measures of teacher performance.
But they also made one decision that I think was a mistake. They encouraged (or required, depending on your vantage point) states to lump value-added or other growth model estimates together with other measures. The raging debate since then has been over what percentage of teachers’ final ratings should be given to value-added versus the other measures. I believe there is a better way to approach this issue, one that focuses on teacher evaluations not as a measure, but rather as a process.