Some Florida officials are still having trouble understanding why they’re finding no relationship between the grades schools receive and the evaluation ratings of teachers in those schools. For his part, new Florida education Commissioner Tony Bennett is also concerned. According to the article linked above, he acknowledges (to his credit) that the two measures are different, but is also considering “revis[ing] the models to get some fidelity between the two rankings.”
This may be turning into a potentially risky situation. As discussed in a recent post, it is important to examine the results of the new teacher evaluations, but there is no reason one would expect to find a strong relationship between these ratings and the school grades, as they are in large part measuring different things (and imprecisely at that). The school grades are mostly (but not entirely) driven by how highly students score, whereas teacher evaluations are, to the degree possible, designed to be independent of these absolute performance levels. Florida cannot validate one system using the other.
However, as also mentioned in that post, this is not to say that there should be no relationship at all. For example, both systems include growth-oriented measures (albeit using very different approaches). In addition, schools with lower average performance levels sometimes have trouble recruiting and retaining good teachers. Due to these and other factors, the reasonable expectation is to find some association overall, just not one that’s extremely strong. And that’s basically what one finds, even using the same set of results upon which the claims that there is no relationship are based.
** Reprinted here in the Washington Post
Former Florida Governor Jeb Bush was in Virginia last week, helping push for a new law that would install an “A-F” grading system for all public schools in the commonwealth, similar to a system that has existed in Florida for well over a decade.
In making his case, Governor Bush put forth an argument about the Florida system that he and his supporters use frequently. He said that, right after the grades went into place in his state, there was a drop in the proportion of D and F schools, along with a huge concurrent increase in the proportion of A schools. For example, as Governor Bush notes, in 1999, only 12 percent of schools got A’s. In 2005, when he left office, the figure was 53 percent. The clear implication: It was the grading of schools (and the incentives attached to the grades) that caused the improvements.
There is some pretty good evidence (also here) that the accountability pressure of Florida’s grading system generated modest increases in testing performance among students in schools receiving F’s (i.e., an outcome to which consequences were attached), and perhaps higher-rated schools as well. However, putting aside the serious confusion about what Florida’s grades actually measure, as well as the incorrect premise that we can evaluate a grading policy’s effect by looking at the simple distribution of those grades over time, there’s a much deeper problem here: The grades changed in part because the criteria changed.
Last week, Florida State Senate President Don Gaetz (R – Niceville) expressed his skepticism about the recently-released results of the state’s new teacher evaluation system. The senator was particularly concerned about his comparison of the ratings with schools’ “A-F” grades. He noted, “If you have a C school, 90 percent of the teachers in a C school can’t be highly effective. That doesn’t make sense.”
There’s an important discussion to be had about the results of both the school and teacher evaluation systems, and the distributions of the ratings can definitely be part of that discussion (even if this issue is sometimes approached in a superficial manner). However, arguing that we can validate Florida’s teacher evaluations using its school grades, or vice-versa, suggests little understanding of either. Actually, given the design of both systems, finding a modest or even weak association between them would make pretty good sense.
In order to understand why, there are two facts to consider.
** Reprinted here in the Washington Post
Former Florida Governor Jeb Bush has become one of the more influential education advocates in the country. He travels the nation armed with a set of core policy prescriptions, sometimes called the “Florida formula,” as well as “proof” that they work. The evidence that he and his supporters present consists largely of changes in average statewide test scores – NAEP and the state exam (FCAT) – since the reforms started going into place. The basic idea is that increases in testing results are the direct result of these policies.
Governor Bush is no doubt sincere in his effort to improve U.S. education, and, as we’ll see, a few of the policies comprising the “Florida formula” have some test-based track record. However, his primary empirical argument on their behalf – the coincidence of these policies’ implementation with changes in scores and proficiency rates – though common among both “sides” of the education debate, is simply not valid. We’ve discussed why this is the case many times (see here, here and here), as have countless others, in the Florida context as well as more generally.*
There is no need to repeat those points, except to say that they embody the most basic principles of data interpretation and causal inference. It would be wonderful if the evaluation of education policies – or of school systems’ performance more generally – were as easy as looking at raw, cross-sectional testing data. But it is not.
Luckily, one need not rely on these crude methods. We can instead take a look at some of the rigorous research that has specifically evaluated the core reforms comprising the “Florida formula.” As usual, it is a far more nuanced picture than supporters (and critics) would have you believe.
A while back, I argued that Florida’s school grading system, due mostly to its choice of measures, does a poor job of gauging school performance per se. The short version is that the ratings are, to a degree unsurpassed by most other states’ systems, driven by absolute performance measures (how highly students score), rather than growth (whether students make progress). Since more advantaged students tend to score more highly on tests when they enter the school system, schools are largely being judged not on the quality of instruction they provide, but rather on the characteristics of the students they serve.
New results were released a couple of weeks ago. This was highly anticipated, as the state had made controversial changes to the system, most notably the inclusion of non-native English speakers and special education students, which officials claimed they did to increase standards and expectations. In a limited sense, that’s true – grades were, on average, lower this year. The problem is that the system uses the same measures as before (including a growth component that is largely redundant with proficiency). All that has changed is the students that are included in them. Thus, to whatever degree the system now reflects higher expectations, it is still for outcomes that schools mostly cannot control.
I fully acknowledge the political and methodological difficulties in designing these systems, and I do think Florida’s grades, though exceedingly crude, might be useful for some purposes. But they should not, in my view, be used for high-stakes decisions such as closure, and the public should understand that they don’t tell you much about the actual effectiveness of schools. Let’s take a very quick look at the new round of ratings, this time using schools instead of districts (I looked at the latter in my previous post about last year’s results).
About a week ago, Florida officials went into crisis mode after revealing that the proficiency rate on the state’s writing test (FCAT) dropped from 81 percent to 27 percent among fourth graders, with similarly large drops in the other two grades in which the test is administered (eighth and tenth). The panic was almost immediate. For one thing, performance on the writing FCAT is counted in the state’s school and district ratings. Many schools would end up with lower grades and could therefore face punitive measures.
Understandably, a huge uproar was also heard from parents and community members. How could student performance decrease so dramatically? There was so much blame going around that it was difficult to keep track – the targets included the test itself, the phase-in of the state’s new writing standards, and test-based accountability in general.
Despite all this heated back-and-forth, many people seem to have overlooked one very important, widely-applicable lesson here: That proficiency rates, which are not “scores,” are often extremely sensitive to where you set the bar.
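To make the point about bar-setting concrete, here is a minimal sketch. The scores, the 0-100 scale, and both cut scores are invented for illustration; they are not FCAT data or the state's actual cutoffs. The function name `proficiency_rate` is likewise hypothetical.

```python
def proficiency_rate(scores, cut_score):
    """Percent of students scoring at or above the cut score."""
    return 100.0 * sum(s >= cut_score for s in scores) / len(scores)


# Hypothetical test scores for ten students (not real FCAT results).
scores = [52, 58, 60, 61, 63, 65, 66, 70, 74, 81]

# Same students, same scores; only the bar moves.
rate_low_bar = proficiency_rate(scores, cut_score=60)
rate_high_bar = proficiency_rate(scores, cut_score=70)

print(rate_low_bar, rate_high_bar)
```

Nothing about the students' performance differs between the two calls. Because many of these invented scores sit between the two bars, raising the cut score moves a large share of students from "proficient" to "not proficient" at once.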
A while back, I noted that states and districts should exercise caution in assigning weights (importance) to the components of their teacher evaluation systems before they know what the other components will be. For example, most states that have mandated new evaluation systems have specified that growth model estimates count for a certain proportion (usually 40-50 percent) of teachers’ final scores (at least those in tested grades/subjects), but it’s critical to note that the actual importance of these components will depend in no small part on what else is included in the total evaluation, and how it’s incorporated into the system.
In slightly technical terms, this distinction is between nominal weights (the percentage assigned) and effective weights (the percentage that actually ends up being the case). Consider an extreme hypothetical example – let’s say a district implements an evaluation system in which half the final score is value-added and half is observations. But let’s also say that every teacher gets the same observation score. In this case, even though the assigned (nominal) weight for value-added is 50 percent, the actual importance (effective weight) will be 100 percent, since every teacher receives the same observation score, and so all the variation between teachers’ final scores will be determined by the value-added component.
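The hypothetical above can be sketched in a few lines. This assumes one particular way of measuring effective weight: each component's weighted covariance with the final score, divided by the variance of the final score, so that the two shares sum to one. That definition is my assumption for illustration, not something any state's system prescribes, and the teacher scores are invented.

```python
def _cov(xs, ys):
    """Population covariance of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def effective_weights(value_added, observation, w_va=0.5, w_obs=0.5):
    """Share of final-score variance attributable to each component.

    w_va and w_obs are the nominal weights. The returned shares are one
    (assumed) definition of effective weights; they sum to 1.
    """
    final = [w_va * v + w_obs * o for v, o in zip(value_added, observation)]
    var_final = _cov(final, final)
    if var_final == 0:
        raise ValueError("final scores do not vary; shares are undefined")
    share_va = w_va * _cov(value_added, final) / var_final
    share_obs = w_obs * _cov(observation, final) / var_final
    return share_va, share_obs


# Invented scores: value-added varies, every observation score is identical.
value_added = [1.0, 2.0, 3.0, 4.0]
observation = [3.0, 3.0, 3.0, 3.0]

va_share, obs_share = effective_weights(value_added, observation)
print(va_share, obs_share)
```

With identical observation scores, the value-added share comes out at 100 percent despite its 50 percent nominal weight, which is the extreme case described above. Giving the observation scores some spread pulls the two shares back toward their nominal values.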
This issue of nominal versus effective weights is very important, and, with exceptions, it gets almost no attention. And it’s not just important in teacher evaluations. It’s also relevant to states’ school/district grading systems. So, I think it would be useful to quickly illustrate this concept in the context of Florida’s new district grading system.
There is some controversy over the fact that Florida’s recently-announced value-added model (one of a class often called “covariate adjustment models”), which will be used to determine merit pay bonuses and other high-stakes decisions, doesn’t include a direct measure of poverty.
Personally, I support adding a direct income proxy to these models, if for no other reason than to avoid this type of debate (and to facilitate the disaggregation of results for instructional purposes). It does bear pointing out, however, that the measure that’s almost always used as a proxy for income/poverty – students’ eligibility for free/reduced-price lunch – is terrible as a poverty (or income) gauge. It tells you only whether a student’s family has earnings below (or above) a given threshold (usually 185 percent of the poverty line), and this masks most of the variation among both eligible and non-eligible students. For example, families with incomes of $5,000 and $20,000 might both be coded as eligible, while families earning $40,000 and $400,000 are both coded as not eligible. A lot of hugely important information gets ignored this way, especially when the vast majority of students are (or are not) eligible, as is the case in many schools and districts.
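As a rough illustration of how much a binary eligibility flag throws away, consider the four incomes from the example above. The $30,000 cutoff below is a placeholder I chose so the example works; the real threshold depends on household size and year, and the function name `lunch_eligible` is hypothetical.

```python
# Placeholder cutoff for illustration only; not the actual eligibility line.
HYPOTHETICAL_THRESHOLD = 30_000


def lunch_eligible(income, threshold=HYPOTHETICAL_THRESHOLD):
    """Code a family as eligible (True) if income is at or below the threshold."""
    return income <= threshold


incomes = [5_000, 20_000, 40_000, 400_000]
flags = [lunch_eligible(income) for income in incomes]

# Four very different incomes collapse into two distinct values.
distinct_values = len(set(flags))

print(list(zip(incomes, flags)), distinct_values)
```

A model that sees only the flag cannot tell the $5,000 family from the $20,000 family, or the $40,000 family from the $400,000 one.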
That said, it’s not quite accurate to assert that Florida and similar models “don’t control for poverty.” The model may not include a direct income measure, but it does control for prior achievement (a student’s test score in the previous year[s]). And a student’s test score is probably a better proxy for income than whether or not they’re eligible for free/reduced-price lunch.
Even more importantly, however, the key issue about bias is not whether the models “control for poverty,” but rather whether they control for the range of factors – school and non-school – that are known to affect student test score growth, independent of teachers’ performance. Income is only one part of this issue, which is relevant to all teachers, regardless of the characteristics of the students that they teach.
Just this week, Florida announced its new district grading system. These systems have been popping up all over the nation, and given the fact that designing one is a requirement of states applying for No Child Left Behind waivers, we are sure to see more.
I acknowledge that the designers of these schemes have the difficult job of balancing accessibility and accuracy. Moreover, the latter requirement – accuracy – cannot be directly tested, since we cannot know “true” school quality. As a result, to whatever degree it can be partially approximated using test scores, disagreements over what specific measures to include and how to include them are inevitable (see these brief analyses of Ohio and California).
As I’ve discussed before, there are two general types of test-based measures that typically comprise these systems: absolute performance and growth. Each has its strengths and weaknesses. Florida’s attempt to balance these components is a near total failure, and it shows in the results.