** Reprinted here in the Washington Post
Former Florida Governor Jeb Bush has become one of the more influential education advocates in the country. He travels the nation armed with a set of core policy prescriptions, sometimes called the “Florida formula,” as well as “proof” that they work. The evidence that he and his supporters present consists largely of changes in average statewide test scores – NAEP and the state exam (FCAT) – since the reforms started going into place. The basic idea is that increases in testing results are the direct result of these policies.
Governor Bush is no doubt sincere in his effort to improve U.S. education, and, as we’ll see, a few of the policies comprising the “Florida formula” have some test-based track record. However, his primary empirical argument on their behalf – the coincidence of these policies’ implementation with changes in scores and proficiency rates – though common among both “sides” of the education debate, is simply not valid. We’ve discussed why this is the case many times (see here, here and here), as have countless others, in the Florida context as well as more generally.*
There is no need to repeat those points, except to say that they embody the most basic principles of data interpretation and causal inference. It would be wonderful if the evaluation of education policies – or of school systems’ performance more generally – were as easy as looking at raw, cross-sectional testing data. But it is not.
Luckily, one need not rely on these crude methods. We can instead take a look at some of the rigorous research that has specifically evaluated the core reforms comprising the “Florida formula.” As usual, it is a far more nuanced picture than supporters (and critics) would have you believe.
The easiest way to approach this review is to take the core components of the “Florida formula” one at a time. The plan seems to consist of several catchy-sounding concepts, each of which is embodied in a concrete policy or policies (noted below in parentheses).
Hold schools accountable (“A-F” school grading systems): In the late-1990s, Florida was one of the first states to adopt its own school grading system, now ubiquitous throughout the nation (see this post for a review of how Florida currently calculates these grades and what they mean).
The main purposes of these rating systems are to inform parents and other stakeholders and incentivize improvement and innovation by attaching consequences and rewards to the results. Starting in the late 1990s, the grades in Florida were high-stakes – students who attended schools that received an F for multiple years were made eligible for private school vouchers (the voucher program itself was shut down in 2006, after being ruled unconstitutional by the state’s Supreme Court).
In addition to the voucher threat, low-rated schools received other forms of targeted assistance, such as reading coaches, while high-rated schools were eligible for bonuses (discussed below). In this sense, the grading system plays a large role in Florida’s overall accountability system (called the “A+ Accountability Plan”).
One of the best analyses of the system’s effect is presented in this paper, which was originally released (in working form) in 2007. Using multiple tests, as well as surveys of principals over a five-year period during the early 2000s, the authors sought to assess both the test-based impact of the grading system and, importantly, how low-rated schools responded to the accountability pressure in terms of changes in concrete policy and practice.
The researchers concluded that test-based performance did indeed improve among the schools that had received F grades during the early 2000s, relative to similar schools that had received a higher grade. The difference was somewhat modest but large enough to be educationally meaningful, and it persisted in future years. A fair amount of the improvement appeared to be associated with specific steps that the schools had taken, such as increasing their focus on lower-performing students and lengthening instruction time (also see here). This, along with the inclusion of low-stakes exam data in the analysis, suggests that the improvements were not driven, at least not entirely, by “gaming” strategies, such as so-called “teaching to the test.”**
A different paper, also released in 2007 but using older data from the mid-1990s to early-2000s, reached the same conclusion – that F-rated schools responded to the pressure and were able to generate modest improvements in their performance. In this analysis, however, there was more evidence of “gaming” responses, such as focusing attention on students directly below the proficiency cutpoints and redirecting instruction toward subjects, writing in particular, in which score improvements were perceived as easier to achieve (also see here and here).
It’s important to note, however, that these findings apply only to F-rated schools, which, in any given year, make up only 2-3 percent of the state’s schools. There is some tentative evidence (paper opens as PDF) that schools receiving D grades improved a little bit (relative to those receiving C’s), but that schools receiving A-C grades did not vary in their performance (this doesn’t necessarily mean they didn’t improve, only that, according to this analysis, they didn’t do any better than schools receiving higher grades).
(Interesting side note: These findings for D-rated schools, which did not face the voucher threat, in addition to other analyses [see this 2005 working paper and this very recent conference paper], suggest that the impact of the grading system may have as much to do with the response to the stigma of receiving a poor grade as that to the threat of voucher eligibility.)
Overall, based on this work, which is still growing, it’s fair to say that the small group of Florida schools that received low ratings, when faced with the threat of punishment for and/or stigma attached to those ratings, responded in strategic (though not always entirely “desirable”) ways, and that this response may have generated small but persistent test-based improvements. This is consistent with research on grading systems elsewhere (see here and here for national analyses of accountability effects).
However, the degree to which these increases reflected “real” improvements in student/school performance is not easy to isolate. For instance, at least some schools seem to have responded in part by focusing on students near the cutoffs, and even the more “desirable” strategies may have less-than-ideal side effects – e.g., a school’s increased focus on some students/subjects may come at the expense of others. These are very common issues when assessing test-based accountability systems’ impact on testing outcomes.
Give families the power to choose (charters and various types of vouchers): Florida’s various school choice-related policies have been subject to a decent amount of empirical scrutiny. In a sense, some of the papers discussed in the previous sub-section represent examples of such work, since eligibility for one of the state’s voucher programs was tied directly to the grading system.
In addition, this interesting paper examined whether the introduction of a Florida tax credit-funded “scholarship” program (sometimes called “neovouchers”), by which low-income students attend private schools, spurred improvement among affected schools. Consistent with some other analyses of competitive effects elsewhere, the results suggest that the extent of competition (in this case, put simply, the ease of access to and stock of private schools in the area) was associated with increased performance of public schools. The magnitudes of these associations were somewhat modest (and, interestingly, were reduced considerably when Dade County, home to Miami, was excluded from the data).
Similarly, this analysis of a Florida voucher program for students with disabilities found that the number of participating private schools in the area was associated with higher performance among public school students diagnosed with mild disabilities, but not among those with more severe disabilities.
Thus, while it’s still early in the evidentiary game, these results once again indicate that schools responded to the threat of losing students to private schools, at least during the first couple of years after the introduction of this pressure.
Turning to charter schools, this 2006 article finds that Florida charters initially produce inferior gains in math and reading, but, by their fifth year, their average performance is modestly higher in reading and statistically indistinguishable in math. The analysis also suggests that nearby regular public schools respond to the pressure from charters, with (very minor) differences showing up in math, but not reading.
CREDO’s 2009 analysis of Florida charters found statistically significant negative effects, but the size of these effects was extremely small. Similarly, this RAND report on charter performance in eight locations, including Florida, found no significant differences between the test-based performance of the state’s charters and comparable regular public schools. Finally, the state’s charter high schools seem to do a bit better, at least according to this paper, which found a substantial positive impact of charter attendance on the likelihood of graduation and college attendance.
On the whole, then, the evidence suggests that, depending on the age profile of Florida’s charter sector at any given time, the impact of Florida charters is rather mixed, and, if there are noteworthy test-based impacts either way, they are likely rather small. Once again, this squares with the research on charters in other states. Over time, it’s possible that the maturation of the state’s charters, as well as, perhaps, the competition they (and vouchers) engender among nearby district schools, might generate different results, but that remains to be seen.
Set high expectations (retention/remediation of third graders): There has been plenty of often-contentious disagreement about social promotion, and the debate goes back many years. Nevertheless, in 2003, Florida began holding back (and remediating) students with very low scores on third grade reading exams.***
The evidence on Florida’s policy is just starting to come out, as the first third grade cohorts are just now finishing their K-12 education. For instance, a recent analysis of Florida’s retention policy suggests that students who were held back did substantially better than their counterparts who just barely made the cut (i.e., they were promoted to fourth grade). These (relative) impacts seem to have persisted through seventh grade (the point at which the data end).
Critics might dispute these findings, but, for the sake of our discussion here, let’s just say they lend initial support to the (plausible) conclusion that an extra year of schooling for low-scoring students, particularly when accompanied by extensive remediation efforts, might improve testing results in the medium-term.
It is still a bit too early to get a sense of the longer-term effects of this policy, which only affects a small subset of students every year. Also, there are of course financial costs associated with retention, as well as other types of outcomes to consider (e.g., graduation, non-cognitive outcomes).
(Side note: Another element of Florida’s “high standards” component is increased graduation requirements, but there doesn’t seem to be any high-quality evidence on this policy as yet, and any attempt to evaluate it should probably rely on post-graduation outcomes, such as college attainment.)
Funding for school/student success (direct resources to low-rated schools/students and rewarding those receiving high ratings): This component of the formula, which is also connected to the grading system discussed above, targets funding (around $700 million) toward low-performing (i.e., low-rated) schools and students, and/or rewards those schools that improve. For instance, some of it goes toward programs such as summer school and dropout prevention, while some is used to award bonuses to teachers and principals.
I was unable to find any empirical examinations of the test-based effect of these specific policies (or their cost-effectiveness), though it is quite possible that part of their impact, if any, is reflected in some of the analyses of the “A+ Plan” discussed above. For instance, if an F-rated school improved, that outcome may be due in part to the targeted assistance, and not just the accountability pressure. One might also have trouble separating the “funding for success” policy itself from the interventions for which it pays. For example, if the dropout prevention programs show results, should the takeaway be that the funding policy is working, or that it’s wise to invest in dropout prevention?
On a related note, Florida, due to the “A+ Accountability Plan,” has had teacher performance pay in place since 2000 (for example, teachers are given bonuses for students passing AP exams). Once again, there don’t seem to be any high-quality evaluations of this particular intervention, but, if the evidence on other U.S. bonus programs is any indication (see here, here and here), the incentives are unlikely to have had an effect on short-term testing results.
Overall, then, the “funding for student success” component of the formula per se is a difficult target for empirical research, and there’s not much out there. Insofar as it is a resource allocation policy, its evaluation should probably focus primarily on how it affects funding, rather than testing outcomes (see here).
Quality educators (alternative certification and, more recently, evaluations and other personnel policies): Like many other states, Florida is currently overhauling several of its teacher personnel policies, including the design of new performance evaluations and “tenure reform.” The impact of these policies, which fall under the “quality educators” component of the “Florida formula,” remains to be seen, as they have not yet been fully implemented. During the 2000s, however, the primary manifestation of the “quality educators” component was opening up alternative certification routes.
The idea here, put simply, is to attract a wider pool of candidates into the teaching profession. Due in part to recent reforms, Florida maintains several alternative paths by which individuals may enter teaching, and the number of teachers entering the classroom through these paths has increased a great deal.
A recent, very extensive analysis of alternative pathways to teaching in Florida suggests that the qualifications and (test-based) performance of teachers entering the classroom vary considerably by program. Candidates from some programs do quite well (at least by the standard of value-added compared with traditionally-certified teachers), whereas others do not. On the whole, alternatively-certified teachers perform comparably to their traditionally-certified colleagues. This squares with evidence elsewhere (also see here and here).
It’s reasonable to argue that these additional pathways might have helped with teacher shortages in many locations, and it’s also possible that they served to attract some qualified individuals into the profession who would have chosen a different career path had they been required to travel the traditional route. It’s tough to say whether these teachers are, on the whole, better than those who would have been hired via traditional programs.
The verdict: Rendering a blanket judgment about a package of reforms as varied as the “Florida formula” is almost always unwise. It’s usually the case that some policies work, while others might not. In addition, even if the evidence is positive, it would not necessarily mean that a reform or reforms should be replicated elsewhere. School systems are complex and path-dependent; that which works well in one place may not work well in another.
That said, the available evidence on these policies, at least those for which some solid evidence exists, might be summarized as mixed but leaning toward modestly positive, with important (albeit common) caveats. A few of the reforms may have generated moderate but meaningful increases in test-based performance (with all the limitations that this implies) among the students and schools they affected. In a couple of other cases, there seems to have been little discernible impact on testing outcomes (and/or there is not yet sufficient basis to draw even highly tentative conclusions). It’s a good bet – or at least wishful thinking – that most of the evidence is still to come.
In the meantime, regardless of one’s opinion on whether the “Florida formula” is a success and/or should be exported to other states, the assertion that the reforms are responsible for the state’s increases in NAEP scores and FCAT proficiency rates during the late 1990s and 2000s not only violates basic principles of policy analysis, but it is also, at best, implausible. The reforms’ estimated effects, if any, tend to be quite small, and most of them are, by design, targeted at subgroups (e.g., the “lowest-performing” students and schools). Thus, even large impacts are not guaranteed to show up at the aggregate statewide level (see the papers and reviews in the first footnote for more discussion).
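To put rough numbers on the dilution point, here is a minimal back-of-the-envelope sketch. Everything in it is a hypothetical assumption for illustration: the function name, the 0.20 standard deviation effect, and the simplification that schools are equally sized and that unaffected schools do not change at all. The 3 percent share echoes the proportion of F-rated schools mentioned earlier.

```python
def statewide_change(share_affected: float, effect_on_affected: float) -> float:
    """Average change across all schools when only a share of them is affected.

    Assumes equal-sized schools and zero change everywhere else
    (both are simplifying assumptions, not findings from the research).
    """
    return share_affected * effect_on_affected


# Hypothetical inputs: 3 percent of schools improve by 0.20 standard deviations.
diluted = statewide_change(share_affected=0.03, effect_on_affected=0.20)
print(f"Implied statewide change: {diluted:.3f} standard deviations")
```

Under these assumptions, an effect that would be considered sizable for the targeted schools translates into a statewide average change of well under one hundredth of a standard deviation.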
In this sense, the first-order problem with the publicity accorded the “Florida formula,” and especially its proliferation to other states, is less about the merits of the policies themselves than the fact that they are being presented as a means to relatively immediate, large improvements in overall performance. Unfortunately, this problem – unrealistic promises based on invalid evidence – is not at all limited to Florida.****
We seem to not only expect, but demand instant gratification from interventions that, when they have any track record at all, have never been shown to have anything resembling such an impact. This is often harmful to the policy process. For example, it fosters a disincentive to invest in programs that will not have an immediate effect and, perhaps, an incentive to misuse evidence. Moreover, policies that don’t produce huge results might be shut down even if they’re working; the most effective policies are often those that have a modest, persistent effect that accumulates over time.
Whether we like it or not, real improvements at aggregate levels are almost always slow and incremental. There are no “miracles,” in Florida or anywhere else. The sooner we realize that, and start choosing and judging policies based on attainable expectations that accept the reality of the long haul, the better.
- Matt Di Carlo
* This analysis, which only covered the first few years of Florida’s “A+ Accountability Plan,” and looked at both high- and low-stakes tests, found that the unadjusted increases, particularly on the low-stakes test and in reading, were due largely to changes in the cohorts of students taking the exams. In addition, in 2011, the National Education Policy Center (NEPC) published a review, authored by Professor William Mathis (University of Colorado at Boulder), of a presentation given by Governor Bush about the evidence on the “Florida formula.” Professor Mathis’ review also takes a component-by-component approach, but places primary focus on assessing the presentation’s claim that the Florida reforms generated large increases in NAEP/FCAT results (though he does also discuss the literature, and makes a bunch of the same points made above). A similar NEPC review, written by Professor Madhabi Chatterji (Teachers College), put forth similar arguments about the FCAT/NAEP evidence, this time in response to a think tank report about the “effect” of Florida’s reforms on the state’s achievement gap. Finally, for more general treatments of the issues surrounding interpretation of cross-sectional testing data, see here, here and here.
** A review of this paper by researcher Damian Betebenner, though generally positive, noted that parts of the narrative overstated the degree to which the findings represented causal evidence – i.e., they couldn’t rule out the possibility that something other than accountability pressure was responsible. In addition, Mr. Betebenner pointed out that the low-stakes test used by the researchers (the Stanford-9) did not preclude “teaching to the test.”
*** One of the criticisms of Governor Bush’s test-based evidence, which relies a great deal on fourth grade cohort changes in NAEP performance, was that the retention policy temporarily altered the composition of the fourth grade cohort taking the next round of exams (for example, see the Mathis and Chatterji reviews linked in the first footnote). In other words, if you hold back low-performing third graders, the fourth grade scores the next year will appear to jump. There was indeed a spike in fourth grade scores on NAEP, one which coincided roughly with the retention policy. However, insofar as NAEP results are not, by themselves, policy evidence, that objection is not germane to this review (though it’s certainly true that limiting social promotion severely complicates the examination of trends in raw student testing performance during this time period).
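The composition effect described in this footnote can be illustrated with a toy calculation. The ten scores and the cutoff below are invented for illustration only; they are not NAEP or FCAT data, and no student's score changes in the example.

```python
from statistics import mean

# Hypothetical reading scores for ten third graders (not real data).
scores = [180, 190, 200, 205, 210, 215, 220, 225, 230, 240]
cutoff = 195  # hypothetical retention threshold

# Students at or above the cutoff move on to fourth grade; the rest are held back.
promoted = [s for s in scores if s >= cutoff]

mean_all = mean(scores)          # average if every student were promoted
mean_promoted = mean(promoted)   # average of the cohort that actually moves on

print(f"All students: {mean_all}, promoted only: {mean_promoted}")
```

The promoted group's average is higher than the full group's average simply because the lowest scorers were removed, which is why a jump in raw fourth grade scores right after a retention policy begins cannot, by itself, be read as improvement.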
**** In fairness, the NAEP/FCAT “evidence” on the Florida formula, though invalid, does span 10-15 years, so it is at least not as much of a “quick fix” promise as, say, those made by advocates for the reforms in the District of Columbia.