A couple of weeks ago, the New York State Education Department (NYSED) released data from the first year of the state’s new teacher and principal evaluation system (called the “Annual Professional Performance Review,” or APPR). In what has become a familiar pattern, this prompted a wave of criticism from advocates, much of it focused on the proportion of teachers in the state to receive the lowest ratings.
To be clear, evaluation systems that produce non-credible results should be examined and improved, and that includes those that put implausible proportions of teachers in the highest and lowest categories. Much of the commentary surrounding this and other issues has been thoughtful and measured. As usual, though, there have been some oversimplified reactions, as exemplified by this piece on the APPR results from Students First NY (SFNY).
SFNY notes what it considers to be the low proportion of teachers rated “ineffective,” and points out that there was more differentiation across rating categories for the state growth measure (worth 20 percent of teachers’ final scores), compared with the local “student learning” measure (20 percent) and the classroom observation components (60 percent). Based on this, they conclude that New York’s “state test is the only reliable measure of teacher performance” (they are actually talking about validity, not reliability, but we’ll let that go). Again, this argument is not representative of the commentary surrounding the APPR results, but let’s use it as a springboard for making a few points, most of which are not particularly original. (UPDATE: After publication of this post, SFNY changed the headline of their piece from “the only reliable measure of teacher performance” to “the most reliable measure of teacher performance.”) Read More »
Several months ago, the American Statistical Association (ASA) released a statement on the use of value-added models in education policy. I’m a little late getting to this (and might be repeating points that others made at the time), but I wanted to comment on the statement, not only because I think it’s useful to have ASA add their perspective to the debate on this issue, but also because their statement seems to have become one of the staple citations for those who oppose the use of these models in teacher evaluations and other policies.
Some of these folks claimed that the ASA supported their viewpoint – i.e., that value-added models should play no role in accountability policy. I don’t agree with this interpretation. To be sure, the ASA authors described the limitations of these estimates, and urged caution, but I think that the statement rather explicitly reaches a more nuanced conclusion: That value-added estimates might play a useful role in education policy, as one among several measures used in formal accountability systems, but this must be done carefully and appropriately.*
Much of the statement puts forth the standard, albeit important, points about value-added (e.g., moderate stability between years/models, potential for bias, etc.). But there are, from my reading, three important takeaways that bear on the public debate about the use of these measures, which are not always so widely acknowledged. Read More »
Our guest authors today are Alan J. Daly, Professor and Chair of Education Studies at the University of California San Diego, and Kara S. Finnigan, Associate Professor at the Warner School of Education at the University of Rochester. Daly and Finnigan have published numerous articles on social network analysis in education and recently co-edited Using Research Evidence in Education: From the Schoolhouse Door to Capitol Hill (Springer, 2014), which explores the use and diffusion of different types of evidence across levels of the educational system.
Teacher evaluation is a hotly contested topic, with vigorous debate happening around issues of testing, measurement, and what is considered ‘important’ in terms of student learning, not to mention the potential high stakes decisions that may be made as a result of these assessments. At its best, this discussion has reinvigorated a national dialogue around teaching practice and research; at its worst it has polarized and entrenched stakeholder groups into rigid camps. How is it we can avoid the calcification of opinion and continue a constructive dialogue around this important and complex issue?
One way, as we suggest here, is to continue to discuss alternatives around teacher evaluation, and to be thoughtful about the role of social interactions in student outcomes, particularly as it relates to the current conversation around value-added models. It is in this spirit that we ask: Is there a ‘social side’ to a teacher’s ability to add value to their students’ growth and, if so, what are the implications for current teacher evaluation models? Read More »
In a previous post, I discussed simple data from the District of Columbia Public Schools (DCPS) on teacher turnover in high- versus lower-poverty schools. In that same report, which was issued by the D.C. Auditor and included, among other things, descriptive analyses by the excellent researchers from Mathematica, there is another very interesting table showing the evaluation ratings of DC teachers in 2010-11 by school poverty (and, indeed, DC officials deserve credit for making these kinds of data available to the public, as this is not the case in many other states).
DCPS’ well-known evaluation system (called IMPACT) varies between teachers in tested versus non-tested grades, but the final ratings are a weighted average of several components, including: the teaching and learning framework (classroom observations); commitment to the school community (attendance at meetings, mentoring, PD, etc.); schoolwide value-added; teacher-assessed student achievement data (local assessments); core professionalism (absences, etc.); and individual value-added (tested teachers only).
The table I want to discuss is on page 43 of the Auditor’s report, and it shows average IMPACT scores for each component and overall for teachers in high-poverty schools (80-100 percent free/reduced-price lunch), medium poverty schools (60-80 percent) and low-poverty schools (less than 60 percent). It is pasted below. Read More »
Our guest author today is Cory Koedel, Assistant Professor of Economics at the University of Missouri.
In a 2012 post on this blog, Dr. Di Carlo reviewed an article that I coauthored with colleagues Mark Ehlert, Eric Parsons and Michael Podgursky. The initial article (full version here, or for a shorter, less-technical version, see here) argues for the policy value of growth models that are designed to force comparisons to be between schools and teachers in observationally-similar circumstances.
The discussion is couched within the context of achieving three key policy objectives that we associate with the adoption of more-rigorous educational evaluation systems: (1) improving system-wide instruction by providing useful performance signals to schools and teachers; (2) eliciting optimal effort from school personnel; and (3) ensuring that current labor-market inequities between advantaged and disadvantaged schools are not exacerbated by the introduction of the new systems.
We argue that a model that forces comparisons to be between equally-circumstanced schools and teachers – which we describe as a “proportional” model – is best-suited to achieve these policy objectives. The conceptual appeal of the proportional approach is that it fully levels the playing field between high- and low-poverty schools. In contrast, some other growth models have been shown to produce estimates that are consistently associated with the characteristics of students being served (e.g., Student Growth Percentiles). Read More »
In 2009, The New Teacher Project (TNTP) released a report called “The Widget Effect.” You would be hard-pressed to find many recent publications from an advocacy group that have had a larger influence on education policy and the debate surrounding it. To this day, the report is mentioned regularly by advocates and policy makers.
The primary argument of the report was that teacher performance “is not measured, recorded, or used to inform decision making in any meaningful way.” More specifically, the report shows that most teachers received “satisfactory” or equivalent ratings, and that evaluations were not tied to most personnel decisions (e.g., compensation, layoffs, etc.). From these findings and arguments comes the catchy title – a “widget” is a fictional product commonly used in situations (e.g., economics classes) where the product doesn’t matter. Thus, treating teachers like widgets means that we treat them all as if they’re the same.
Given the influence of “The Widget Effect,” as well as how different the teacher evaluation landscape is now compared to when it was released, I decided to read it closely. Having done so, I think it’s worth discussing a few points about the report. Read More »
The U.S. Department of Education has released a very short, readable report on the comparability of value-added estimates using two different tests in Indiana – one of them norm-referenced (the Measures of Academic Progress test, or MAP), and the other criterion-referenced (the Indiana Statewide Testing for Educational Progress Plus, or ISTEP+, which is also the state’s official test for NCLB purposes).
The research design here is straightforward – fourth and fifth grade students in 46 schools across 10 districts in Indiana took both tests, their teachers’ value-added scores were calculated, and the scores were compared. Since both sets of scores were based on the same students and teachers, this allows a direct comparison of teachers’ value-added estimates between these two tests. The results are not surprising, and they square with similar prior studies (see here, here, here, for example): The estimates based on the two tests are moderately correlated. Depending on the grade/subject, the correlations are between 0.4 and 0.7. If you’re not used to interpreting correlation coefficients, consider that only around one-third of teachers were in the same quintile (fifth) on both tests, and another 40 or so percent were one quintile higher or lower. So, most teachers were within one quintile, about a quarter of teachers moved two or more quintiles, and a small percentage moved from top to bottom or vice-versa.
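For readers who want a feel for how a correlation in this range translates into quintile movement, here is a rough simulation sketch. It is not the report's method or data: it simply assumes that the two sets of estimates are jointly normal with a chosen correlation, and the function names (`quintile_labels`, `quintile_movement`), the sample size and the seed are all made up for illustration.

```python
import random


def quintile_labels(values):
    """Label each value 0 (bottom fifth) through 4 (top fifth) by rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = rank * 5 // n
    return labels


def quintile_movement(rho, n_teachers=20_000, seed=0):
    """Simulate two test-based estimates with correlation rho (assumed
    bivariate normal) and return the share of simulated teachers whose
    quintile differs by 0, 1, 2, 3 and 4 between the two tests."""
    rng = random.Random(seed)
    test_a, test_b = [], []
    for _ in range(n_teachers):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Standard construction of a second variable with correlation rho.
        y = rho * x + (1 - rho ** 2) ** 0.5 * noise
        test_a.append(x)
        test_b.append(y)
    counts = [0] * 5
    for qa, qb in zip(quintile_labels(test_a), quintile_labels(test_b)):
        counts[abs(qa - qb)] += 1
    return [c / n_teachers for c in counts]


# Illustrative correlations spanning the range mentioned above.
for rho in (0.4, 0.55, 0.7):
    shares = quintile_movement(rho)
    print(rho, [round(s, 3) for s in shares])
```

The first number in each printed list is the share of simulated teachers in the same quintile on both tests, the second is the share that moved exactly one quintile, and so on. Real value-added estimates need not be normally distributed, so the output should be read as a rough guide to interpreting a correlation coefficient rather than a reproduction of the Indiana results.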
Although, as mentioned above, these findings are in line with prior research, it is worth remembering why this “instability” occurs (and what can be done about it). Read More »
In the three most discussed and controversial areas of market-based education reform – performance pay, charter schools and the use of value-added estimates in teacher evaluations – 2013 saw the release of a couple of truly landmark reports, in addition to the normal flow of strong work coming from the education research community (see our reviews from 2010, 2011 and 2012).*
In one sense, this building body of evidence is critical and even comforting, given not only the rapid expansion of charter schools, but also and especially the ongoing design and implementation of new teacher evaluations (which, in many cases, include performance-based pay incentives). In another sense, however, there is good cause for anxiety. Although one must try policies before knowing how they work, the sheer speed of policy change in the U.S. right now means that policymakers are making important decisions on the fly, and there is a great deal of uncertainty as to how this will all turn out.
Moreover, while 2013 was without question an important year for research in these three areas, it also illustrated an obvious point: Proper interpretation and application of findings is perhaps just as important as the work itself. Read More »
The recently released study of IMPACT, the teacher evaluation system in the District of Columbia Public Schools (DCPS), has garnered a great deal of attention over the past couple of months (see our post here).
Much of the commentary from the system’s opponents was predictably (and unfairly) dismissive, but I’d like to quickly discuss the reaction from supporters. Some took the opportunity to make grand proclamations about how “IMPACT is working,” and there was a lot of back and forth about the need to ensure that various states’ evaluations are as “rigorous” as IMPACT (as well as skepticism as to whether this is the case).
The claim that this study shows that “IMPACT is working” is somewhat misleading, and the idea that states should now rush to replicate IMPACT is misguided. It also misses the important points about the study and what we can learn from its results. Read More »
Linda Darling-Hammond’s new book, Getting Teacher Evaluation Right, is a detailed, practical guide about how to improve the teaching profession. It leverages the best research and best practices, offering actionable, illustrated steps to getting teacher evaluation right, with rich examples from the U.S. and abroad.
Here I offer a summary of the book’s main arguments and conclude with a couple of broad questions prompted by the book. But, before I delve into the details, here’s my quick take on Darling-Hammond’s overall stance.
We are at a crossroads in education; two paths lie before us. The first seems shorter, easier and more straightforward. The second seems long, winding and difficult. The big problem is that the first path does not really lead to where we need to go; in fact, it is taking us in the opposite direction. So, despite appearances, more steady progress will be made if we take the more difficult route. This book is a guide on how to get teacher evaluation right, not how to do it quickly or with minimal effort. So, in a way, the big message or takeaway is: There are no shortcuts. Read More »
A new working paper, published by the National Bureau of Economic Research, is the first high quality assessment of one of the new teacher evaluation systems sweeping across the nation. The study, by Thomas Dee and James Wyckoff, both highly respected economists, focuses on the first three years of IMPACT, the evaluation system put into place in the District of Columbia Public Schools in 2009.
Under IMPACT, each teacher receives a point total based on a combination of test-based and non-test-based measures (the formula varies between teachers who are and are not in tested grades/subjects). These point totals are then sorted into one of four categories – highly effective, effective, minimally effective and ineffective. Teachers who receive a highly effective (HE) rating are eligible for salary increases, whereas teachers rated ineffective are dismissed immediately and those receiving minimally effective (ME) for two consecutive years can also be terminated. The design of this study exploits that incentive structure by, put very simply, comparing the teachers who were directly above the ME and HE thresholds to those who were directly below them, to see whether the two groups differed in terms of retention and performance. The basic idea is that these teachers are all very similar in terms of their measured performance, so any differences in outcomes can be (cautiously) attributed to the system’s incentives.
The short answer is that there were meaningful differences. Read More »
The District of Columbia Public Schools (DCPS) has recently released the first round of results from its new principal evaluation system. Like the system used for teachers, the principal ratings are based on a combination of test and non-test measures. And the two systems use the same final rating categories (highly effective, effective, minimally effective and ineffective).
It was perhaps inevitable that there would be comparisons of their results. In short, principal ratings were substantially lower, on average. Roughly half of them received one of the two lowest ratings (minimally effective or ineffective), compared with around 10 percent of teachers.
Some wondered whether this discrepancy by itself means that DC teachers perform better than principals. Of course not. It is difficult to compare the performance of teachers versus that of principals, but it’s unsupportable to imply that we can get a sense of this by comparing the final rating distributions from two evaluation systems. Read More »
Our guest author today is Dan Goldhaber, Director of the Center for Education Data & Research and a Research Professor in Interdisciplinary Arts and Sciences at the University of Washington Bothell.
Let me begin with a disclosure: I am an advocate of experimenting with using value added, where possible, as part of a more comprehensive system of teacher evaluation. The reasons are pretty simple (though articulated in more detail in a brief, which you can read here). The most important reason is that value-added information about teachers appears to be a better predictor of future success in the classroom than other measures we currently use. This is perhaps not surprising when it comes to test scores, certainly an important measure of what students are getting out of schools, but research also shows that value added predicts very long run outcomes, such as college going and labor market earnings. Shouldn’t we be using valuable information about likely future performance when making high-stakes personnel decisions?
It almost goes without saying, but it’s still worth emphasizing, that it is impossible to avoid making high-stakes decisions. Policies that explicitly link evaluations to outcomes such as compensation and tenure are new, but even in the absence of such policies that are high-stakes for teachers, the stakes are high for students, because some of them are stuck with ineffective teachers when evaluation systems suggest, as is the case today, that nearly all teachers are effective. Read More »
In a new NBER working paper, economist Derek Neal makes an important point, one of which many people in education are aware, but that is infrequently reflected in actual policy. The point is that using the same assessment to measure both student and teacher performance often contaminates the results for both purposes.
In fact, as Neal notes, some of the very features required to measure student performance are the ones that make possible the contamination when the tests are used in high-stakes accountability systems. Consider, for example, a situation in which a state or district wants to compare the test scores of a cohort of fourth graders in one year with those of fourth graders the next year. One common means of facilitating this comparability is administering some of the questions to both groups (or to some “pilot” sample of students prior to those being tested). Otherwise, any difference in scores between the two cohorts might simply be due to differences in the difficulty of the questions. If you cannot check that out, it’s tough to make meaningful comparisons.
But it’s precisely this need to repeat questions that enables one form of so-called “teaching to the test,” in which administrators and educators use questions from prior assessments to guide their instruction for the current year. Read More »
U.S. Secretary of Education Arne Duncan recently announced that states will be given the option to postpone using the results of their new teacher evaluations for high-stakes decisions during the phase-in of the new Common Core-aligned assessments. The reaction from some advocates was swift condemnation – calling the decision little more than a “delay” and a “victory for the status quo.”
We hear these kinds of arguments frequently in education. The idea is that change must be as rapid as possible, because “kids can’t wait.” I can understand and appreciate the urgency underlying these sentiments. Policy change in education (as in other arenas) can sometimes be painfully slow, and what seem like small roadblocks can turn out to be massive, permanent obstacles.
I will not repeat my views regarding the substance of Secretary Duncan’s decision – see this op-ed by Morgan Polikoff and myself. I would, however, like to make one very quick point about these “we need change right now because students can’t wait” arguments: Sometimes, what is called “delay” is actually better described as good policy making, and kids can wait for good policy making. Read More »