Creating A Valid Process For Using Teacher Value-Added Measures

Posted on November 28, 2012

** Reprinted in the Washington Post

Our guest author today is Douglas N. Harris, associate professor of economics and University Endowed Chair in Public Education at Tulane University in New Orleans. His latest book, Value-Added Measures in Education, provides an excellent, accessible review of the technical and practical issues surrounding these models. 

Now that the election is over, the Obama Administration and policymakers nationally can return to governing. Of all the education-related decisions to be made, the future of teacher evaluation has to be front and center.

In particular, how should “value-added” measures be used in teacher evaluation? President Obama’s Race to the Top initiative expanded the use of these measures, which attempt to identify how much each teacher contributes to student test scores. In doing so, the initiative embraced and expanded the controversial reliance on standardized tests that started under President Bush’s No Child Left Behind.

In many respects, Race to the Top was well designed. It addressed an important problem – the vast majority of teachers report receiving little high-quality feedback on their instruction. As a competitive grant program, it was voluntary for states to participate (though involuntary for many districts within those states). The Administration also smartly embraced the idea of multiple measures of teacher performance.

But the Administration also made one decision that I think was a mistake. It encouraged—or required, depending on your vantage point—states to lump value-added or other growth model estimates together with other measures. The raging debate since then has been over what percentage of teachers’ final ratings should be based on value-added versus the other measures. I believe there is a better way to approach this issue, one that treats teacher evaluation not as a measure, but as a process.

The idea of combining the measures has some advantages. For example, as I wrote in my book about value-added measures, combined measures have greater reliability and probably better validity as well. But there is also one major issue: Teachers by and large do not like or trust value-added measures. There are some good reasons for this: The measures are not very reliable, and therefore bounce around from year to year in ways that have nothing to do with actual performance. There is even more debate about whether the measures, in any given year, provide useful information about “true” teacher performance (i.e., whether they are valid).

The larger problem is that policymakers have tended to look at the teacher evaluation problem like measurement experts rather than school leaders. Measurement experts naturally want valid and reliable measures—ones that accurately capture teacher effectiveness. School leaders, on the other hand, can and should be more concerned with whether the entire process leads to valid and reliable conclusions about teacher effectiveness. The process includes measures, but also clear steps, checks and balances, and opportunities to identify and fix evaluation mistakes. It is that process, perhaps as much as the measures themselves, that instills trust in the system among educators. But the idea of combining multiple measures has short-circuited discussion of how the multiple measures—and especially value-added—could be used to create a better process.

One possible process comes from the medical profession. It is common for doctors to “screen” for major diseases using procedures that identify virtually all of the people who have the disease, but that also flag some who do not (the latter being false positives). Those who test positive on the screening test are then given a “gold standard” test that is more expensive but almost perfectly accurate. Doctors do not average the screening test together with the gold standard test to create a combined index. Instead, the two pieces are considered in sequence.
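To make the screen-then-confirm logic concrete, here is a toy calculation in Python. Every number in it (prevalence, sensitivity, the false-positive rate) is invented purely for illustration and is not drawn from the post or from any medical source.

```python
# Toy illustration of a screen-then-confirm sequence. All numbers are
# hypothetical and chosen only to show the mechanics.
prevalence = 0.05       # assumed fraction of people who truly have the disease
sensitivity = 1.00      # idealized: the screen catches everyone who has it
false_pos_rate = 0.10   # assumed: the screen also flags 10% of healthy people

# Fraction of the population the cheap screen flags for follow-up.
flagged = prevalence * sensitivity + (1 - prevalence) * false_pos_rate

# Among those flagged, the share who are truly ill (the rest are false positives).
truly_ill_share = (prevalence * sensitivity) / flagged

print(f"Screen flags {flagged:.1%} of the population")           # 14.5%
print(f"Of those flagged, {truly_ill_share:.1%} are truly ill")  # ~34.5%
# The expensive, near-perfect "gold standard" test is then run only on the
# flagged 14.5% -- applied in sequence, not averaged with the screen.
```

The sequencing is what keeps costs down: the expensive test is reserved for the small flagged group, and the screen’s false positives are cleared at the second stage.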

Ineffective teachers could be identified the same way.

Value-added measures could become the educational equivalent of screening tests. They are generally inexpensive and somewhat inaccurate. As in medicine, a value-added score, combined with some additional information, should lead us to conduct additional classroom observations to identify truly low-performing teachers and to provide feedback that helps those teachers improve. If, after continued observation and a reasonable amount of time, all else fails, administrators could counsel the teacher out or pursue a formal dismissal procedure.

The most obvious problem with this approach is that value-added measures, unlike the medical screening tests, do not capture all potential low-performers.  They are statistically noisy, for example, and so many low-performers will get high scores by chance.  For this reason, value-added would not be the sole screener.  Instead, some other measure could also be used as a screener.  If teachers failed on either measure, then that would be a reason for collecting additional information. (This approach also solves another problem discussed later.)
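As a rough sketch of how such a “fail on either screener” rule might look, here is a short Python fragment. The field names, cutoffs, and sample data are all hypothetical illustrations; an actual system would need carefully validated measures and thresholds.

```python
from dataclasses import dataclass

@dataclass
class TeacherRecord:
    name: str
    value_added: float   # standardized value-added estimate (noisy)
    other_screen: float  # a second, independent screening measure

# Hypothetical cutoffs: flag anyone more than 1 SD below average.
VA_CUTOFF = -1.0
OTHER_CUTOFF = -1.0

def needs_follow_up(t: TeacherRecord) -> bool:
    """Flag a teacher for additional classroom observations if EITHER
    screener is low. The OR rule catches more true low performers than a
    single noisy screen, at the cost of more false positives -- tolerable
    here because being flagged triggers observation, not dismissal."""
    return t.value_added < VA_CUTOFF or t.other_screen < OTHER_CUTOFF

teachers = [
    TeacherRecord("A", value_added=0.3, other_screen=0.1),
    TeacherRecord("B", value_added=-1.4, other_screen=0.2),   # flagged by value-added
    TeacherRecord("C", value_added=-0.2, other_screen=-1.2),  # flagged by the other screen
]

for t in teachers:
    action = "schedule additional observations" if needs_follow_up(t) else "no further action"
    print(f"Teacher {t.name}: {action}")
```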

There is a second way in which value-added could be used as a screener – not of teachers, but of their teacher evaluators. To explain how, I need to say more about the “other” measures in an evaluation system. Almost every school system that has moved to alternative teacher evaluations has chosen to also use classroom observations by peers, master teachers, and/or school principals. The Danielson Framework, PLATO, and others are now household names among educators. Classroom observations have many advantages: They allow the observer to take account of the local context. They yield information that is more useful to teachers for improving practice.  And we can increase their reliability by observing teachers more often.

The difficulty is that these measures, too, have validity and reliability issues.  Two observers can look at the same classroom and see different things.  That problem is more likely when the observers vary in their training. Also, some observers might know teachers’ value-added scores and let those color their views during the observations – they might think, “I already know this teacher is not very good so I will give her a low score.”

Value-added measures might actually be used to fix these problems with classroom observations. To see how, note that researchers have found consistent, positive correlations between value-added and classroom observation scores. The correlations are far from perfect (mainly because of statistical noise), but they provide a benchmark against which we can compare (validate, if you will) the scores of individual observers. Inaccurate classroom observation scores would likely show up as low correlations with value-added. Conversely, if observers let teachers’ value-added scores influence their ratings, the correlations might be suspiciously high, which would also be a red flag.
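One way to imagine implementing this observer check is sketched below in Python. It is speculative: the correlation bands, observer names, and ratings are invented, and a real analysis would need many more ratings per observer, plus corrections for the statistical noise the post mentions.

```python
from collections import defaultdict
from statistics import correlation  # requires Python 3.10+

# (observer, observation_score, value_added) triples -- toy data for illustration.
ratings = [
    ("Obs1", 3.1, 0.4), ("Obs1", 2.2, -0.3), ("Obs1", 2.8, 0.1), ("Obs1", 1.9, -0.6),
    ("Obs2", 2.0, 0.5), ("Obs2", 3.5, -0.4), ("Obs2", 2.9, 0.2), ("Obs2", 2.4, -0.1),
]

# Hypothetical red-flag bands: research finds moderate positive correlations,
# so both near-zero and near-perfect values would be suspect.
LOW_FLAG, HIGH_FLAG = 0.1, 0.9

# Group each observer's observation scores alongside the teachers' value-added.
by_observer = defaultdict(lambda: ([], []))
for obs, obs_score, va in ratings:
    by_observer[obs][0].append(obs_score)
    by_observer[obs][1].append(va)

for obs, (obs_scores, va_scores) in by_observer.items():
    r = correlation(obs_scores, va_scores)
    if r < LOW_FLAG:
        print(f"{obs}: r={r:.2f} -- unusually low; scores may be inaccurate")
    elif r > HIGH_FLAG:
        print(f"{obs}: r={r:.2f} -- suspiciously high; scores may be colored by value-added")
    else:
        print(f"{obs}: r={r:.2f} -- within the expected range")
```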

In these cases, an additional observer might be used to make sure the information is accurate. In other words, value-added can screen the performance of not only teachers, but observers as well. Used in these ways, value-added would be a key part of the system but without being the determining factor in personnel decisions.

This screening approach would solve a host of problems.

  1. The screening approach maintains the new and important focus on teacher evaluation and the use of student test scores in those systems.  The NEA and AFT themselves have been rightly critical of traditional-style evaluation systems because they provide so little useful feedback to teachers. Screening with value-added places the emphasis on formative, feedback-based measures such as observations.
  2. The screening approach represents a “feedback loop” in which both value-added and observations are used to ensure that the other is functioning well – i.e., observations are used to verify the identification of low-performing teachers based on value-added (and help them improve), while value-added is used to identify observers whose performance may be lacking.  All measures have their flaws and value-added can help address these.
  3. The screening approach ensures that value-added measures are never the primary determinants of high-stakes personnel decisions. Rather, in this alternative proposal, value-added would only serve to trigger a closer look at a teacher’s performance, but the actual decisions would be based on classroom observations by experts. These have much greater support among teachers and provide more useful feedback.
  4. The screening approach helps schools focus their evaluation resources where they count: on low-performing teachers and low-performing classroom observers. This is crucial in tough economic and fiscal times, when schools must allocate resources carefully.
  5. The screening approach can be applied to all teachers, not just those in tested grades and subjects. A common criticism of value-added is that it cannot be applied to all teachers. With the approach I am proposing, only the initial screening process would differ (e.g., a single classroom observation that all teachers would receive) and the remainder of the process could be based on a more standard set of measures (additional classroom observations).
  6. The screening approach, because it works in all grades and subjects, avoids the unfortunate response, in states such as Florida, of expanding testing to every grade and subject. Teaching to the test is a real problem, and expanded testing will only make it worse. Value-added could still serve to check observers in non-tested grades and subjects, so long as those same observers also rate some teachers in tested classrooms.
  7. The screening approach ensures that there is enough information that educational leaders will be able to sleep at night, knowing they are making the best possible personnel decisions – that their tough choices will not be overturned by lawsuits alleging arbitrary and capricious firings.

Since I started with a medical analogy, some might want to call this a “triage” approach. The term fits in some ways but not in others. In both cases, the focus is on allocating resources in cost-effective ways: higher-performing teachers get less attention, just as healthier patients do. On the other hand, this approach differs from medical triage in that triage entails devoting few resources to those least likely to make it. Here, by contrast, part of the point is to collect more information on struggling teachers so that personnel decisions can be made with confidence and in keeping with legal requirements.

The screening approach certainly wouldn’t solve all the problems with the new teacher evaluation systems. The choice of additional measures beyond value-added, and the implementation of these measures, are critical. So are the ways in which the evaluations are used in personnel decisions.

Value-added measures have played a valuable role in sparking this important debate, but they need not do all the heavy lifting for our reformed teacher evaluation systems. We need more than a number; we need a process for identifying low-performing teachers and helping them get better.

- Douglas Harris


Comments

  • Kudos to Dr. Harris for providing a clear approach as to how data can be used to drive improvement, and for acknowledging and addressing the political challenges of implementing teacher evaluation reforms.

    Comment by Cara
    November 28, 2012 at 10:43 AM
  • This is a slightly more palatable approach, but I reject the medical analogy. Medical tests are so narrow and so simple. The classroom is a thoroughly dynamic combination of dozens of people, and dozens of factors known and unknown – and it differs each year. I am hardly reassured by the fact that economists find enough patterns in the data to rationalize the use of measures that are neither valid nor reliable, and then reassure teachers that everything will be okay. As it applies to my own subject area (HS English), the use of VAM could not even attempt to measure most of what I teach, because the tests are simply that awful – and we can’t even look at the tests in order to determine their potential (unlikely) usefulness in improving teaching. Let’s dispense with VAM for evaluation and dedicate our time and energy to developing and constantly improving evaluation methods that enjoy greater support among those engaged in the work. After all, none of the top schools or systems in the world got there by using VAM for evaluation.

    Comment by David B. Cohen
    November 28, 2012 at 11:28 AM
  • Something to consider when using two different measures and treating ‘failure’ (or whatever one wants to call it) as low performance on either measure is that you increase the false positive rate – i.e., more teachers who are not actually performing poorly will be flagged on the strength of one poor measurement. If there is a cost to overestimating poor performance (e.g., administrative time, teacher demoralization, etc.), this would seem to be a problem.

    November 28, 2012 at 1:51 PM
  • Dr. Harris’ system is an improvement, but it’s mostly for the PR-ish reason that it’s easy to understand and doesn’t come with accompanying rhetoric designed to scare teachers. As for the key reason of ensuring balance — #3 on the list — most current systems do in fact “ensure that value-added measures are never the primary determinants of high-stakes personnel decisions.” That’s why value-added measures generally make up less than 40% of an evaluation. If you don’t perform poorly on the other 60%-90%, you’re not going to lose your job.

    The fact that this system is seen as a real improvement — and it is — shows that much of the hubbub over value-added measures is about politics and not policy.

    Comment by Eric
    November 28, 2012 at 6:34 PM
  • Eric, I disagree. This is a significant departure and improvement. First, it is used not to rate teachers on a bell curve, which is what most systems do, but to identify those who need the most support for intensive intervention. Right now, I am spending an inordinate amount of my time observing strong teachers who frankly need one yearly observation and perhaps some short drop-ins. Last year, before APPR in NY, we could spend our time working with teachers who really needed support. That is at the heart of this system.
    Second, because the number generated from VAM is not part of an overall evaluation score, it avoids the unintended consequences of intense teaching to the test, and the fear teachers now experience when difficult-to-teach students are assigned to their class.
    Third, the structure of many systems places far more weight on the VAM or achievement score portion than appears at first glance. In NY, student achievement is 40% – however, because you need 65 points to get out of ineffective, which is “the dismissal zone,” that 40% takes on significant weight: if you score ineffective on that portion, you are ineffective overall. Which state has a system where VAM is a true 10%? I know of none.
    Doug Harris gets the problem. I assure you, as a longtime principal of an excellent school, this is not about politics. It is about opposition to an evaluation system that will harm school improvement efforts.

    Comment by carolcorbettburris
    November 28, 2012 at 7:43 PM
  • I agree that the system is an improvement — and said so — particularly because of the efficiencies you mention regarding not focusing on strong teachers. But in terms of avoiding things like teaching to the test the system is not logically different. Your point seems to be that you don’t have to worry about test scores because you will only raise red flags if your observations are bad. But even under standard systems that weight observations enough to give them a “veto” you don’t have to worry about test scores if your observations aren’t bad.

    I think what you’re really saying is that Dr. Harris’ system is better because it guarantees observations this “veto,” and thus ensures that less weight will be placed on test scores. That’s a fair point to make, but it has nothing to do with the clever design of the system. My point is that if you take two systems where test scores are not weighted enough to cause a dismissal on their own, the decision over whether to count observations and test scores together or in sequence will have no effect on a teacher’s chances of being dismissed. The fact that you think this is the case supports my first point — using observations and value-added measures in sequence merely sounds less arbitrary.

    Comment by Eric
    November 28, 2012 at 9:15 PM
  • Eric–I disagree as well. As Bruce Baker, Matt (I believe), Carol, and others have pointed out, initial weights that have VAMs at less than 50% of an overall eval score can end up with VAMs acting as a sole determinant of effectiveness if the variance in VAMs is much greater than the variance in the other components. Non-VAM achievement measures tend to be worrisome as well becayuuse they don;t control for the factors outside the control of the teacher (not that VAMs control for all these factors, but at least they try to).

    Comment by Ed Fuller
    November 28, 2012 at 9:53 PM
  • And the convincing empirical basis for the use of value-added measures in educational improvement is? Well?

    There are more effective ways to spend the huge and ever-growing amount of money involved in this VAM enterprise, which has dubious or simply useless outcomes. Measurement policies aimed at ‘teacher effectiveness’ have led nowhere but to the demoralization of teachers and the erosion of the teaching profession.

    That is typically *not* in the interest of our students.

    Comment by Hannes Minkema
    November 28, 2012 at 11:06 PM
  • We “could” do this, and it may even “work” to some degree, but is it the best way to do the initial triage?

    Any consideration of that question has to include the human cost to our students of the extensive testing necessary to produce VAM measures that have any worth at all (as well as the financial costs, the costs in staff time…).

    Comment by Thomas J. Mertz
    November 29, 2012 at 8:26 AM
    Ed–My argument is more of a semantic one at this point. The idea of only using VAMs after observations isn’t functionally different from a system where VAMs make up X% of the evaluation, with X low enough that VAMs can’t be solely responsible for dismissal. If in fact there are lots of systems where high test-score variance can lead teachers to be dismissed based on VAMs alone when the designers did not intend it that way, then this system does fix that, but I’m skeptical of how often that occurs in practice. That’s why I think point #3 in the post is less a consequence of a novel system design and more a result of ensuring a low enough VAM component. This matters because, at this point, it may be easier to convince people to adopt lower VAM components than to adopt a new type of sequential system.

    Comment by Eric
    November 29, 2012 at 9:33 AM
  • I’ve heard this said many times lately, but it is true – teaching (and education) cannot be simplified into an algorithm. The diversity of any given classroom on any given day, let alone any given year, will always make the evaluation process a messy one. VAM is yet another effort to try to neaten up a process that defies simplification. It also places on standardized tests an even heavier weight than they were ever meant to bear. The push for VAM is part of a larger catch-22 created by NCLB. Most states adopting it ignore the studies citing the ineffectiveness of such measures because, in the larger picture, including VAM will give them the opportunity to apply for waivers from the more punitive elements of NCLB. It’s a sad cycle.

    Comment by Tracie Weisz
    November 29, 2012 at 12:40 PM
  • “The measures are not very reliable and therefore bounce around from year to year in ways that have nothing to do with actual performance.” This quote is later followed by step one of the process – “The screening approach maintains the new and important focus on teacher evaluation and the use of student test scores in those systems.” How is making anything that is not very reliable a reasonable first step in a teacher’s evaluation, no matter what weight it is given?

    There is also this statement – “The most obvious problem with this approach is that value-added measures, unlike the medical screening tests, do not capture all potential low-performers. They are statistically noisy, for example, and so many low-performers will get high scores by chance.” Again, if the measures are not reliable, just what percentage of low performers is captured, and just how many high performers get low scores by chance? If we are right to be so critical of traditional-style evaluations, how does adding an unreliable component help, since neither the measures nor the traditional-style evaluations are apparently reliable in identifying low performers?

    Actually, step one conflates two different ideas: the identification of low performers and useful feedback to teachers. If the traditional-style evaluation does not provide useful feedback to a teacher, how do unreliable measures that have nothing to do with performance help?

    In the guise of seeming reasonable, this screening process in step two tries to give legitimacy to using value-added measures in conjunction with observations by suggesting two unreliable processes can somehow provide a “feedback loop” in which “both value-added and observations are used to ensure that the other is functioning well – i.e., observations are used to verify the identification of low-performing teachers based on value-added (and help them improve), while value-added is used to identify observers whose performance may be lacking.” This presupposes value-added measures can be tied directly to teacher activities an observer should be able to see. Furthermore, we have now added another layer of complexity to the process, in which a teacher is considered a poor performer because of an unreliable measure, and an observer is possibly considered a poor observer because the observer’s feedback is not aligned with that same unreliable measure. Unless we have observers for the observers, how does this situation not bias an observer’s judgement toward validating that value-added measure?

    Maybe value-added measures could be useful as a starting point for reflecting on, conversing about, or focusing observations on current practices, though all of those things could take place without value-added measures. The real question is why step three does not say that the screening approach ensures value-added measures are never used as determinants of high-stakes personnel decisions.

    Comment by Scott E
    December 19, 2012 at 5:00 AM
