Several months ago, the American Statistical Association (ASA) released a statement on the use of value-added models in education policy. I’m a little late getting to this (and might be repeating points that others made at the time), but I wanted to comment on the statement, not only because I think it’s useful to have ASA add their perspective to the debate on this issue, but also because their statement seems to have become one of the staple citations for those who oppose the use of these models in teacher evaluations and other policies.
Some of these folks claimed that the ASA supported their viewpoint – i.e., that value-added models should play no role in accountability policy. I don’t agree with this interpretation. To be sure, the ASA authors described the limitations of these estimates, and urged caution, but I think that the statement rather explicitly reaches a more nuanced conclusion: That value-added estimates might play a useful role in education policy, as one among several measures used in formal accountability systems, but this must be done carefully and appropriately.*
Much of the statement puts forth the standard, albeit important, points about value-added (e.g., moderate stability between years/models, potential for bias, etc.). But there are, from my reading, three important takeaways that bear on the public debate about the use of these measures, which are not always so widely acknowledged. Read More »
** Reprinted here in the Washington Post
The recent release of the latest New York State testing results created a little public relations coup for the controversial Success Academies charter chain, which operates over 20 schools in New York City, and is seeking to expand.
Shortly after the release of the data, the New York Post published a laudatory article noting that seven of the Success Academies had overall proficiency rates that were among the highest in the state, and arguing that the schools “live up to their name.” The Daily News followed up by publishing an op-ed that compares the Success Academies’ combined 94 percent math proficiency rate to the overall city rate of 35 percent, and uses that to argue that the chain should be allowed to expand because its students “aced the test” (this is not really what high proficiency rates mean, but fair enough).
On the one hand, this is great news, and a wonderfully impressive showing by these students. On the other, decidedly less sensational hand, it’s also another example of the use of absolute performance indicators (e.g., proficiency rates) as measures of school rather than student performance, despite the fact that they are not particularly useful for the former purpose since, among other reasons, they do not account for where students start out upon entry to the school. I personally don’t care whether Success Academy gets good or bad press. I do, however, believe that how one gauges effectiveness, test-based or otherwise, is important, even if one reaches the same conclusion using different measures. Read More »
In a previous post, I discussed simple data from the District of Columbia Public Schools (DCPS) on teacher turnover in high- versus lower-poverty schools. In that same report, which was issued by the D.C. Auditor and included, among other things, descriptive analyses by the excellent researchers from Mathematica, there is another very interesting table showing the evaluation ratings of DC teachers in 2010-11 by school poverty (and, indeed, DC officials deserve credit for making these kinds of data available to the public, as this is not the case in many other states).
DCPS’ well-known evaluation system (called IMPACT) varies between teachers in tested versus non-tested grades, but the final ratings are a weighted average of several components, including: the teaching and learning framework (classroom observations); commitment to the school community (attendance at meetings, mentoring, PD, etc.); schoolwide value-added; teacher-assessed student achievement data (local assessments); core professionalism (absences, etc.); and individual value-added (tested teachers only).
The table I want to discuss is on page 43 of the Auditor’s report, and it shows average IMPACT scores for each component and overall for teachers in high-poverty schools (80-100 percent free/reduced-price lunch), medium poverty schools (60-80 percent) and low-poverty schools (less than 60 percent). It is pasted below. Read More »
The so-called Vergara trial in California, in which the state’s tenure and layoff statutes were deemed unconstitutional, already has its first “spin-off,” this time in New York, where a newly-formed organization, the Partnership for Educational Justice (PEJ), is among the organizations and entities spearheading the effort.
Upon first visiting PEJ’s new website, I was immediately (and predictably) drawn to the “Research” tab. It contains five statements (which, I guess, PEJ would characterize as “facts”). Each argument is presented in the most accessible form possible, typically accompanied by one citation (or two at most). I assume that the presentation of evidence in the actual trial will be a lot more thorough than that offered on this webpage, which seems geared toward the public rather than the more extensive evidentiary requirements of the courtroom (also see Bruce Baker’s comments on many of these same issues surrounding the New York situation).
That said, I thought it might be useful to review the basic arguments and evidence PEJ presents, not really in the context of whether they will “work” in the lawsuit (a judgment I am unqualified to make), but rather because they’re very common, and also because it’s been my observation that advocates, on both “sides” of the education debate, tend to be fairly good at using data and research to describe problems and/or situations, yet sometimes fall a bit short when it comes to evidence-based discussions of what to do about them (including the essential task of acknowledging when the evidence is still undeveloped). PEJ’s five bullet points, discussed below, are pretty good examples of what I mean. Read More »
The Washington Post reports on an issue that we have discussed here on many occasions: The incompleteness of the testing results released annually by the District of Columbia Public Schools (DCPS), or, more accurately, the Office of the State Superintendent of Education (OSSE), which is responsible for testing in DC schools.
Here’s the quick backstory: For the past 7-8 years or so, DCPS/OSSE have not released a single test score for the state assessment (the DC-CAS). Instead, they have released only the percentage of students whose scores meet the designated cutoff points for the NCLB-style categories of below basic, basic, proficient and advanced. I will not reiterate all of the problems with these cutpoint-based rates and how they serve to distort the underlying data, except to say that they are by themselves among the worst ways to present these data, and there is absolutely no reason why states and districts should not release both rates and average scale scores.
The Post reports, however, that one organization — the Broader, Bolder Approach to Education — was able to obtain the actual scale score data (by subgroup and grade) for 2010-2013, and that this group published a memo-style report alleging that DCPS’ public presentation of their testing results over the past few years has been misleading. I had a mixed reaction to this report and the accompanying story. Read More »
There is a tendency in education circles these days, one that I’m sure has been discussed by others, and of which I myself have been “guilty,” on countless occasions. The tendency is to use terms such “effective/ineffective teacher” or “teacher performance” interchangeably with estimates from value-added and other growth models.
Now, to be clear, I personally am not opposed to the use of value-added estimates in teacher evaluations and other policies, so long as it is done cautiously and appropriately (which, in my view, is not happening in very many places). Moreover, based on my reading of the research, I believe that these estimates can provide useful information about teachers’ performance in the classroom. In short, then, I am not disputing whether value-added scores should be considered to be one useful proxy measure for teacher performance and effectiveness (and described as such), both formally and informally.
Regardless of one’s views on value-added and its policy deployment, however, there is a point at which our failure to define terms can go too far, and perhaps cause confusion. Read More »
A couple of weeks ago, the website Vox.com published an article entitled, “11 facts about U.S. teachers and schools that put the education reform debate in context.” The article, in the wake of the Vergara decision, is supposed to provide readers with the “basic facts” about the current education reform environment, with a particular emphasis on teachers. Most of the 11 facts are based on descriptive statistics.
Vox advertises itself as a source of accessible, essential, summary information — what you “need to know” — for people interested in a topic but not necessarily well-versed in it. Right off the bat, let me say that this is an extraordinarily difficult task, and in constructing lists such as this one, there’s no way to please everyone (I’ve read a couple of Vox’s education articles and they were okay).
That said, someone sent me this particular list, and it’s pretty good overall, especially since it does not reflect overt advocacy for given policy positions, as so many of these types of lists do. But I was compelled to comment on it. I want to say that I did this to make some lofty point about the strengths and weaknesses of data and statistics packaged for consumption by the general public. It would, however, be more accurate to say that I started doing it and just couldn’t stop. In any case, here’s a little supplemental discussion of each of the 11 items: Read More »
There is an ongoing debate about widespread administration of standardized tests to kindergartners. This is of course a serious decision. My personal opinion about whether this is a good idea depends on several factors, such as how good the tests will be and, most importantly, how the results will be used (and I cannot say that I am optimistic about the latter).
Although the policy itself must be considered seriously on its merits, there is one side aspect of testing kindergarteners that fascinates me: It would demonstrate how absurd it is to judge school performance, as does NCLB, using absolute performance levels – i.e., how highly students score on tests, rather than their progress over time.
Basically, the kindergarten tests would inevitably shake out the same way as those administered in later grades. Schools and districts serving more disadvantaged students would score substantially lower than their counterparts in more affluent areas. If the scores were converted to proficiency rates or similar cut-score measures, they would show extremely low pass rates in urban districts such as Detroit. Read More »
Unlike many of my colleagues, I don’t have a negative view of the Gates Foundation’s education programs. Although I will admit that part of me is uneasy with the sheer amount of resources (and influence) they wield, and there are a few areas where I don’t see eye-to-eye with their ideas (or grantees), I agree with them on a great many things, and I think that some of their efforts, such as the Measuring Effective Teachers project, are important and beneficial (even if I found their packaging of the MET results a bit overblown).
But I feel obliged to say that I am particularly impressed with their recent announcement of support for a two-year delay on attaching stakes to the results of new assessments aligned with the Common Core. Granted, much of this is due to the fact that I think this is the correct policy decision (see my opinion piece with Morgan Polikoff). Independent of that, however, I think it took intellectual and political courage for them to take this stance, given their efforts toward new teacher evaluations that include test-based productivity measures.
The announcement was guaranteed to please almost nobody. Read More »
Over the past few years, one can find a regular flow of writing attempting to explain the increase in teacher attrition. Usually, these explanations come in the form of advocacy – that is, people who don’t like a given policy or policies assert that they are the reasons for the rise in teachers leaving. Putting aside that these arguments are usually little more than speculation, as well as the fact that they often rely on highly limited approaches to measuring attrition (e.g., teacher experience distributions), there is a prior issue that must be addressed here: Is teacher attrition really increasing?
The short answer, at least at the national level and over the longer term, is yes, but, as usual, it’s more complicated than a simple yes/no answer.
Obviously, not all attrition is “bad,” as it depends on who’s leaving, but any attempt to examine levels of or trends in teacher attrition (leaving the profession) or mobility (switching schools) requires good data. When looking at individual districts, one often must rely on administrative datasets that make it very difficult to determine whether teachers left the profession entirely or simply moved to another district (though remember that whether teachers leave the profession or simply switch schools doesn’t really matter to individual schools, since they must replace the teachers regardless). In addition, the phenomenon of teachers leaving for a temporary period and then returning (e.g., after childbirth) is more common than many people realize. Read More »
A few weeks ago, I wrote a post that made a fairly simple point about the practice of expressing estimated charter effects on test scores as “days of additional learning”: Among the handful of states, districts, and multi-site operators that consistently have been shown to have a positive effect on testing outcomes, might not those “days of learning” be explained, at least in part, by the fact that they actually do offer additional days of learning, in the form of much longer school days and years?
That is, there is a small group of charter models/chains that seem to get good results. There are many intangible factors that make a school effective, but to the degree we can chalk this up to concrete practices or policies, additional time may be the most compelling possibility. Although it’s true that school time must be used wisely, it’s difficult to believe that the sheer amount of extra time that the flagship chains offer would not improve testing performance substantially.
To their credit, many charter advocates do acknowledge the potentially crucial role of extended time in explaining their success stories. And the research, tentative though it still is, is rather promising. Nevertheless, there are a few important points that bear repeating when it comes to the idea of massive amounts of additional time, particularly given the fact that there is a push to get regular public schools to adopt the practice. Read More »
A recent story in the Chicago Tribune notes that Illinois’ NCLB waiver plan sets lower targets for certain student subgroups, including minority and low-income students. This, according to the article, means that “Illinois students of different backgrounds no longer will be held to the same standards,” and goes on to quote advocates who are concerned that this amounts to lower expectations for traditionally lower-scoring groups of children.
The argument that expectations should not vary by student characteristics is, of course, valid and important. Nevertheless, as Chad Aldeman notes, the policy of setting different targets for different groups of students has been legally required since the enactment of NCLB, under which states must “give credit to lower-performing groups that demonstrate progress.” This was supposed to ensure, albeit with exceedingly crude measures, that schools weren’t punished due to the students they serve, and how far behind were those students upon entry into the schools.
I would take that a step further by adding two additional points. The first is quite obvious, and is mentioned briefly in the Tribune article, but too often is obscured in these kinds of conversations: Neither NCLB nor the waivers actually hold students to different standards. The cut scores above which students are deemed “proficient,” somewhat arbitrary though they may be, do not vary by student subgroup, or by any other factor within a given state. All students are held to the same exact standard. Read More »
A recent story in the New York Times reports that, according to an Obama Administration-commissioned panel, the measures being used to evaluate the performance of healthcare providers are unfairly penalizing those that serve larger proportions of disadvantaged patients (thanks to Mike Petrilli for sending me the article). For example, if you’re grading hospitals based on simple, unadjusted re-admittance rates, it might appear as if hospitals serving high poverty populations are doing worse — even if the quality of their service is excellent — since readmissions are more likely for patients who can’t afford medication, or aren’t able to take off from work, or don’t have home support systems.
The panel recommended adjusting the performance measures, which, for instance, are used for Medicare reimbursement, using variables such as patient income and education, as this would provide a more fair accountability system – one that does not penalize healthcare institutions and their personnel for factors that are out of their control.
There are of course very strong, very obvious parallels here to education accountability policy, in which schools are judged in part based on raw proficiency rates that make no attempt to account for differences in the populations of students in different schools. The comparison also reveals an important feature of formal accountability systems in other policy fields. Read More »
At the end of February, the District of Columbia Council’s Education Committee held its annual hearing on the performance of the District’s Public Schools (DCPS). The hearing (full video is available here) lasted over four hours, and included discussion on a variety of topics, but there was, inevitably, a block of time devoted to the discussion of DCPS testing results (and these questions were the focus of the news coverage).
These exchanges between Council members and DCPS Chancellor Kaya Henderson focused particularly on the low-stakes Trial Urban District Assessment (TUDA).* Though it was all very constructive and not even remotely hostile, it’s fair to say that Ms. Henderson was grilled quite a bit (as is often the case at these kinds of hearings). Unfortunately, the arguments from both sides of the dais were fraught with the typical misinterpretations of TUDA, and I could not get past how tragic it is to see legislators question the superintendent of a large urban school district based on a misinterpretation of what the data mean – and to hear that superintendent respond based on the same flawed premises.
But what I really kept thinking — as I have before in similar contexts — was how effective Chancellor Henderson could have been in answering the Council’s questions had she chosen to interpret the data properly (and I still hold out hope that this will become the norm some day). So, let’s take a quick look at a few major arguments that were raised during the hearing, and how they might have been answered. Read More »
Anyone who follows education policy debates might hear the term “standard deviation” fairly often. Most people have at least some idea of what it means, but I thought it might be useful to lay out a quick, (hopefully) clear explanation, since it’s useful for the proper interpretation of education data and research (as well as that in other fields).
Many outcomes or measures, such as height or blood pressure, assume what’s called a “normal distribution.” Simply put, this means that such measures tend to cluster around the mean (or average), and taper off in both directions the further one moves away from the mean (due to its shape, this is often called a “bell curve”). In practice, and especially when samples are small, distributions are imperfect — e.g., the bell is messy or a bit skewed to one side — but in general, with many measures, there is clustering around the average.
Let’s use test scores as our example. Suppose we have a group of 1,000 students who take a test (scored 0-20). A simulated score distribution is presented in the figure below (called a “histogram”). Read More »