Reuniting 'Is' and 'Ought' in Empirical Legal Scholarship.
INTRODUCTION
I. RELATING THE MEASURABLE TO THE GOOD
   A. Normative Metrics
   B. Medical Research
   C. Economics
   D. Empirical Legal Scholarship
II. JUDICIAL CITATION COUNTS
III. REVERSAL RATES
IV. MEASURING THE RULE OF LAW: STUDIES OF INTERJUDGE DISPARITY
   A. The Normative Implications of Disparity
   B. Consistency, Predictability, and Comparative Justice
   C. Determinacy and Correctness
   D. Conclusion
V. BRIDGING THE GAP BETWEEN 'IS' AND 'OUGHT'
   A. Prioritizing Normative Goals
   B. Rethinking Empirical Legal Methodology
   C. Accommodating Subjective Phenomena
   D. Emphasizing Generalizable Results
CONCLUSION
A century ago, Roscoe Pound set forth his agenda for a "sociological jurisprudence" (1) that would study the "actual social effects of legal institutions and legal doctrines." (2) Pound sought to use empirical social science to advance normative goals: he "regard[ed] law as a social institution which may be improved by intelligent human effort" and proposed that social science could "discover the best means of furthering and directing such effort." (3) Two decades later, Karl Llewellyn issued his call for a "realistic jurisprudence" that would use empirical social science to study the determinants and consequences of judicial decisions. (4) Llewellyn was also motivated by normative ends, believing that in order to investigate whether the "law does what it ought," one must "first answer what it is doing now." (5) Pound and Llewellyn sparred over their respective visions, (6) but it is important to remember that they shared a common aim: to use empirical social science to improve the law.
The early legal empiricists were mindful of the challenges of connecting positive and normative approaches to legal scholarship. In his exchange with Pound, Llewellyn famously called for a "temporary divorce of Is and Ought." (7) He believed that the separation of 'is' and 'ought' was necessary for scientific credibility, but that it must be temporary in order to serve the goals of legal reform. (8) In the years that followed, however, legal empiricists struggled to balance the competing demands of social science and legal reform. (9) Some failed to separate 'is' and 'ought,' allowing their normative commitments to
influence their factual findings. (10) Others failed to reunite 'is' and 'ought,' producing "a mindless amassing of statistics without reference to any guiding theory whatsoever." (11) Years later, a disillusioned Llewellyn mocked his fellow realists for their pointless empirical projects. (12) He wrote: "I read all the results, but I never dug out what most of the counting was good for." (13)
The early legal empiricists had worthy ambitions, but their accomplishments were meager. (14) There were many reasons for their failure, (15) but prominent among them was their inability to develop any kind of theoretical framework for making their empirical findings relevant to normative legal scholarship. (16) Today, empirical legal scholarship is flourishing again, (17) and contemporary empiricists are far more sophisticated than their predecessors. Many law professors now have advanced social science training (18) and employ sophisticated methodologies from other disciplines to analyze and interpret data. Like the early empiricists, however, they are still struggling to balance the methodological imperatives of social science with the desire for legal reform. Often, the quest for scientific credibility leads contemporary empiricists to lose sight of the normative goals of legal scholarship. Some empirical studies make efforts to relate their findings to normative questions about law, and some even offer policy prescriptions, but such studies rarely explain how they derive an 'ought' from an 'is.' Even a cursory examination of the premises underlying such claims often reveals them to be untenable.
Empirical research projects need not generate immediate prescriptions, but even positive legal research should address topics that have some importance for legal scholarship. Because the law is a normative practice and exists to serve social purposes, determining what is important in legal scholarship requires some reference to the normative goals of law. (19) Thus, any empirical research that purports to be relevant to legal scholarship requires some framework for connecting 'is' and 'ought.'
As Barry Friedman has prominently argued, empirical legal scholars should "ask, at the outset of every project, why we ... might care about what is being studied." (20) Yet it is not enough to admonish legal empiricists to pay more attention to normative implications. In many settings, there are complex relationships between the phenomena that are readily measured and the values that can justify legal reform. Intuition alone cannot suffice to relate observable data to normative claims; legal scholarship needs conceptual frameworks and empirical methods that can bridge the gap between 'is' and 'ought.' Developing such frameworks will require a sustained agenda that integrates empirical methodology with legal theory.
Part I of this Article begins by considering how other disciplines have developed methods for relating quantitative empirical findings to normative claims. Typically, this is accomplished by formulating a normative metric that quantifies the goodness of the results. Using medicine and economics as examples, Part I shows how scholars in these disciplines have developed frameworks and methods for connecting the positive and the normative.
Empirical legal scholars, by contrast, often seek normative relevance by examining measurable phenomena that have some intuitive but only vaguely specified connection to a normative goal. Many studies simply conflate the measurable with the good, justifying policy proposals on the basis of the measurable quantities alone. Parts II-IV provide illustrations of this approach for three commonly discussed judicial statistics. Part II focuses on judicial citation counts, Part III examines reversal rates, and Part IV critiques measures of interjudge disparity. These statistics are often used in empirical legal scholarship to capture conceptions of good judicial decisionmaking, and all three have been used to justify bold policy proposals.
For example, scholars have argued that judicial citation counts should be used to determine a shortlist for Supreme Court nominations, (21) to assess the merits of judicial selection procedures, (22) to determine whether judges are overpaid, (23) and even to examine whether men or women make better judges. (24) Studies documenting interjudge disparities played a prominent role in the enactment of the United States Sentencing Guidelines (25) and have also been used to justify reforms in Social Security (26) and immigration adjudication. (27) Reversal rates have been cited in debates about whether to split the Ninth Circuit (28) and used to appraise reforms in immigration adjudication. (29) Because such measures lack intrinsic normative force, however, policy arguments based on these measures alone are untenable. These measures may well have some relevance to normative concerns, but the studies are seldom explicit about their normative goals, how the data relate to these goals, and what premises are needed to justify the conclusions.
Part V discusses ways that legal empiricists can bridge the gap between 'is' and 'ought.' Most fundamentally, legal empiricists need to prioritize normative questions; research should focus on what is important, not what is easily measurable. In addition, empiricists need to rethink some aspects of empirical legal methodology. The choice of methods should be driven by questions, not the other way around. Empiricists should not try to seek objective, assumption-free conclusions, but rather should indicate how findings can be combined with assumptions to generate meaningful conclusions. Finally, due to the nature of the questions that arise in legal scholarship and the limits of experimentation, legal scholars should pay more attention to how their findings can generalize to new settings.
I. RELATING THE MEASURABLE TO THE GOOD
Empirical research is inherently descriptive, yet legal scholarship is predominantly normative. (30) Bridging the gap between 'is' and 'ought' therefore requires some form of normative premise. When empirical legal scholars seek to relate their empirical findings to normative claims about the law or legal institutions, however, their claims often have vague, unstated foundations. There is frequently a striking contrast between the effort devoted to making credible statistical inferences and the lax attitude toward articulating premises that can connect empirical findings to normative claims about law.
The challenge of relating empirical findings to normative claims is hardly unique to legal scholarship. Many professional disciplines and applied sciences--such as medicine, engineering, education, and environmental studies--harness scientific knowledge in the pursuit of social purposes. Although most empirical research in the social science disciplines is positive, the research questions of these disciplines are similarly motivated by normative ends.
This Part discusses the use of normative metrics in disciplines other than law. In some settings, the relevant metrics are directly measurable, and the results are self-interpreting. This Part then examines frameworks for connecting 'is' and 'ought' in medical research and in economics, which use more sophisticated theories and methods to relate empirical findings to normative goals. In contrast to law, scholars in these disciplines are explicit about how empirical findings are used to support normative claims.
A. Normative Metrics
In quantitative studies, a normative premise is typically formulated in terms of a metric that maps states of the world into levels of goodness. A function f would constitute a normative metric if f(A) > f(B) whenever state A is preferred over state B. In economics, for example, the function f typically represents economic surplus or some conception of social welfare. Similarly, research on criminal justice might evaluate policing policies in terms of crime rates, (31) medical researchers examine health outcomes and survival rates, (32) and education research often examines academic achievement. (33)
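The definition above can be made concrete with a toy sketch. The states, attributes, and weights below are purely illustrative assumptions, not values drawn from any study; the point is only that a normative metric is a function mapping states of the world to a scalar level of goodness, so that state A is preferred to state B exactly when f(A) > f(B):

```python
# Toy sketch of a normative metric f: states -> goodness.
# The attributes and weights are illustrative assumptions.

def f(state):
    """Score a state of the world; higher is normatively better."""
    # Example premise: low crime and high graduation rates are good.
    return -2.0 * state["crime_rate"] + 1.0 * state["graduation_rate"]

state_a = {"crime_rate": 0.03, "graduation_rate": 0.90}
state_b = {"crime_rate": 0.05, "graduation_rate": 0.85}

# State A is preferred over state B iff f(A) > f(B).
assert f(state_a) > f(state_b)
```

The choice of attributes and weights is itself a normative commitment; the function merely makes that commitment explicit and checkable.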
Any policy claim derived from an empirical study is only as credible as the normative metric that is employed. Justifying a metric requires two steps. First, one needs a theory of the good. For example, the metrics described above are premised on the desirability of low crime, economic efficiency, good health, or academic achievement. These normative premises are uncontroversial, even if there might be disagreement about how tradeoffs should be made among competing goals.
Second, one needs to relate observable phenomena to the measure of goodness. When good or bad outcomes are directly measurable--such as when the outcomes of a medical trial are "survival" and "death"--the results will be self-interpreting and no deeper theory is needed. If such a trial is well controlled, simple statistical methods may be adequate to assess the impact of a treatment and to justify prescriptive claims.
In many settings, however, the normative metric will not be directly measurable, but rather must be inferred from other observable variables. In these settings, more complex inferential methods and deeper theories are needed to justify normative claims. The following Sections will discuss concepts and methods that other disciplines have developed to relate measurable outcomes to normative claims.
B. Medical Research
Medicine is a prominent example of a discipline that is both scientific and prescriptive. Medical research uses scientific methods to examine the effects of various treatments, but the practice of medicine has explicit normative goals: the "diagnosis, treatment, and prevention of disease." (34) Thus, the commonly accepted metrics for evaluating medical treatments are outcomes that represent "how a patient feels, functions, or survives." (35)
When these outcomes are directly measurable, the normative implications of a medical trial may be obvious. Often, however, these normatively relevant outcomes cannot be readily measured, especially when the effects of a treatment may not accrue until many years after the treatment is administered. In such trials, medical researchers often use "surrogate outcomes" to proxy for the clinically meaningful outcomes. For instance, when the effect of a drug regimen on heart disease and life expectancy might not be observed for many years, a study might examine whether the drug regimen significantly reduces levels of blood cholesterol. Here, blood cholesterol is a surrogate; it has no normative significance beyond its tendency to promote coronary disease.
In this example, a treatment cannot be justified merely on the basis of its estimated effect on cholesterol levels. To justify an intervention, any surrogate measure must be validated by showing that an effect of the treatment on the surrogate will correspond to an effect on a meaningful clinical outcome. Validation of the surrogate measure requires two steps. First, one must specify the clinical outcome that the surrogate is intended to measure, such as survival, comfort, or functional capacity. (36) Second, one must explain the relationship between the surrogate and the clinical outcome and show how inferences about the former can facilitate inferences about the latter. This step requires both a statistical association between the surrogate and the clinical outcome and an understanding of the causal relationship between the two.
Biostatisticians have developed a rich literature on the use of surrogate measures, providing a variety of complex conditions under which surrogates can be used to support inferences about true outcomes. (37) In particular, correlation between the surrogate and the true measure is not sufficient to justify the use of a surrogate in a clinical trial. (38) Often, there may be multiple causal pathways between a disease and a true clinical outcome, only one of which is captured by the surrogate measure. In such a situation, measuring the impact of a treatment on the surrogate will fail to capture the impact of the treatment on the true outcome. (39)
For example, it would be infeasible to measure the effectiveness of youth anti-smoking programs by examining the proportion of treated youth who die prematurely from lung cancer. Researchers might instead use subsequent smoking behavior as a surrogate for premature lung cancer death. (40) Similarly, reduction in tumor size might be a valid surrogate for survival rates in estimating the impact of chemotherapy regimens on lung cancer patients. Both cigarette smoking and tumor size are highly correlated with lung cancer deaths, and both have a direct causal impact. But smoking rates could not be used as a surrogate for measuring the effectiveness of chemotherapy, and tumor size could not be used as a surrogate for anti-smoking campaigns. (41)
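The multiple-pathway problem can be illustrated with a toy simulation; every variable name and effect size below is a hypothetical assumption. A treatment acts only on the causal pathway that the surrogate tracks, while the true outcome is driven mainly by a second pathway, so the surrogate shows a large benefit even though the true outcome barely moves:

```python
# Toy simulation of a misleading surrogate (illustrative numbers only).
# Disease severity drives the true outcome through two pathways; the
# surrogate tracks only pathway 1, and the treatment also acts only
# on pathway 1.
import random

random.seed(0)

def simulate(treated, n=10_000):
    surrogates, outcomes = [], []
    for _ in range(n):
        severity = random.gauss(0, 1)
        pathway1 = severity + random.gauss(0, 0.1)
        pathway2 = severity + random.gauss(0, 0.1)
        if treated:
            pathway1 -= 1.0          # treatment blocks pathway 1 only
        surrogate = pathway1         # surrogate captures pathway 1
        outcome = 0.1 * pathway1 + 0.9 * pathway2  # mostly pathway 2
        surrogates.append(surrogate)
        outcomes.append(outcome)
    mean = lambda xs: sum(xs) / len(xs)
    return mean(surrogates), mean(outcomes)

s_ctrl, o_ctrl = simulate(treated=False)
s_trt, o_trt = simulate(treated=True)

# Large apparent benefit on the surrogate...
print(f"surrogate effect: {s_trt - s_ctrl:+.2f}")
# ...but almost no effect on the true outcome.
print(f"outcome effect:   {o_trt - o_ctrl:+.2f}")
```

Note that in the untreated population the surrogate and the true outcome are highly correlated (both are driven by severity), which is exactly why correlation alone cannot validate a surrogate.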
In a number of instances, drugs have been approved on the basis of their effect on surrogate measures but were subsequently discovered to have harmful effects on clinical outcomes. (42) As these experiences show, the relationship between surrogates and meaningful outcomes cannot simply be asserted, but must be carefully scrutinized. Understanding the causal relationship between medical interventions, surrogates, and clinically meaningful outcomes is essential to the validation of any surrogate measure.
C. Economics

Economics, like many of the social sciences, combines positive research with normative goals. Economists study the production, consumption, and distribution of goods and services, but scholarship in economics is not merely motivated by idle curiosity about producers, consumers, and markets. Rather, the study of economics is motivated by an understanding that economic activity serves social purposes and that certain policies may advance or hinder those purposes.
Empirical economists often have access to voluminous data on prices and levels of output for goods in various markets. Such data, however, typically have no intrinsic normative significance; one would not justify a policy merely on the basis of its tendency to affect prices or output levels. Unlike in medicine, there typically are no measurable phenomena that can be used as surrogates for economic well-being. To assess the desirability of outcomes, economists have formulated concepts such as consumer and producer surplus, which represent the gains from trade in a market. (43)
Note that surplus is a purely abstract concept, with no analog in the natural world. It is defined by reference to supply and demand curves, which represent the quantities producers would supply and consumers would demand at various counterfactual prices. Surplus, supply curves, and demand curves cannot be physically measured, but rather must be estimated by combining data on prices and output levels at different points in time with theoretical assumptions about consumer and producer behavior.
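As a minimal sketch of how theoretical assumptions convert price data into a welfare measure, consider the textbook case of a linear demand curve. The demand parameters and prices below are hypothetical, assumed for illustration rather than estimated from data; given them, consumer surplus is the triangle between the demand curve and the market price, and a policy's effect can be expressed as the change in that area:

```python
# Consumer surplus under an assumed linear demand curve Q = a - b*P.
# Parameters and prices are illustrative assumptions, not estimates.

def consumer_surplus(a, b, price):
    """Area between the inverse demand curve and the price line."""
    quantity = a - b * price
    if quantity <= 0:
        return 0.0
    choke_price = a / b            # price at which demand falls to zero
    return 0.5 * quantity * (choke_price - price)  # triangle area

# Demand: Q = 100 - 2P. Compare surplus before and after a policy
# that raises the consumer price from 20 to 25.
before = consumer_surplus(a=100, b=2, price=20)   # 900.0
after = consumer_surplus(a=100, b=2, price=25)    # 625.0
print(f"surplus falls by {before - after:.1f}")   # surplus falls by 275.0
```

The welfare conclusion is only as good as the assumed demand curve; with different assumptions about behavior, the same price and quantity data would yield a different surplus figure.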
Because the framework for relating observable data to surplus is so well established, (44) economists do not need to revisit its fundamentals every time they evaluate a proposed policy. Indeed, it can be easy to overlook the assumptions underlying any calculation of surplus. (45) Nevertheless, the concept of surplus allows economists to organize data on prices and output levels into a measure of economic well-being that can assess the impact of various policies and provide justification for proposed reforms.
D. Empirical Legal Scholarship
In contrast to medicine and economics, legal scholarship lacks frameworks for connecting empirical findings to normative claims. Occasionally, when legal changes can be assessed in terms of outcomes that have direct normative significance, there is no need for sophisticated theory. For example, in studies that examine the impact of tort reforms on medical complications in childbirth (46) or the effect of school desegregation decisions on black dropout rates, (47) the normative significance of the outcomes is clear.
At other times, frameworks from other disciplines are sufficient to evaluate outcomes. For example, studies that evaluate the impact of school finance decisions on academic achievement may use test scores as outcome variables. (48) Similarly, studies may use economic concepts to appraise the welfare impacts of proposed mergers. (49)
The methods of other disciplines are most likely to be adequate when the outcomes of interest resemble those that arise in those disciplines. If one views law purely as a means to achieve policy goals, then the methods of the social sciences can often be used with little adaptation. Studies such as those discussed above only need to comprehend law well enough to understand the timing and expected impact of legal changes. Such studies, however, are conducted from an external point of view; one does not need to take any position on the validity of legal events in order to appraise them from a policy perspective. (50)
Some empirical legal research, however, appears to be motivated by values internal to law. Although these values are rarely made explicit, such studies appear to be animated by concerns about deciding cases correctly, treating likes alike, or writing good judicial opinions. Such concepts are not directly measurable, however, and the methods of the social sciences are often inadequate for connecting measurable outcomes to these concepts.
Developing methods for evaluating the effects of legal rules and institutions according to criteria internal to law ought to be a priority for legal empiricists. Influential theorists such as Ronald Dworkin, (51) Joseph Raz, (52) and John Rawls (53) have argued that institutions should be evaluated by their tendency to protect rights and promote justice. (54) Many debates about interpretive methods focus on their tendency to promote accurate interpretation, substantive justice, and the rule of law. (55) And the Supreme Court's administrative due process jurisprudence evaluates procedures according to their "capacity for accurate factfinding and appropriate application of substantive legal norms to the facts as found." (56)
The fundamental challenge is that such internal values cannot be directly measured. Instead of developing theories that can relate observable data to these values, as scholars have done in medical research and in economics, empirical studies in law often substitute proxy variables that have some asserted but unspecified connection to the motivating values. Parts II-IV examine three such measures--citation counts, reversal rates, and interjudge disparities--that empirical scholars commonly use to evaluate judges and legal institutions. Each of these outcome measures has an intuitive relevance to a normative goal, but the relationship is vague and under-theorized. Because scholars are rarely explicit about the relationship between these measures and the intended measure of merit--indeed, the measure of merit is rarely defined--the empirical evidence cannot justify the normative claims.
II. JUDICIAL CITATION COUNTS
In a series of recent articles, Stephen Choi, Mitu Gulati, and various coauthors have advocated the use of quantifiable metrics to address normative questions about judicial appointment, promotion, retention, and compensation. They have argued that nominations to the Supreme Court should be determined on the basis of three empirical indicators: citation counts, the number of opinions authored, and the rate at which judges disagree with colleagues of the same political party. (57) Using these measures in a series of studies, they found evidence that "female judges ... perform better than male judges" (58) and that "elected judges are superior to appointed judges." (59) The authors also used these same performance measures to estimate the effects of judicial compensation, finding that "it is as likely that judges are overpaid as that they are underpaid." (60)
These authors were not the first to apply a quantitative analysis to the study of judicial citations. As early as 1936, one study tabulated the number of citations to each state's courts from other state courts and the U.S. Supreme Court. (61) The goals of the early citation studies, however, were purely descriptive. Scholars typically characterized citation counts as a measure of influence but did not use them to justify prescriptive claims. At times, these measures were given normative interpretations; for example, Judge Richard Posner claimed that Learned Hand's citation counts confirmed that he "was indeed a great judge." (62) But until recently, no one argued that these measures should guide judicial appointments or the design of legal institutions.
Before we cut judges' pay and jettison judicial independence, however, we should scrutinize how the authors derived their normative claims from their empirical findings. They do not claim that citations themselves are a measure of goodness; in fact, they acknowledge that their measures "do not provide a perfect metric for judging skill" (63) and are merely "rough proxies." (64) Thus, the fact that Judge A has more citations than Judge B does not directly justify an assertion that Judge A is better than Judge B.
But once they acknowledge that citations do not actually measure quality, how can they use aggregate comparisons between groups of judges to justify claims about the relative quality of elected judges versus appointed judges, or male judges versus female judges? Citation counts could arguably be viewed as surrogates (65) for some "true" measure of judicial quality, but if so, their use as surrogates must be validated. Choi and Gulati justify the validity of citation counts largely on theoretical grounds, analogizing the body of precedent to a "market" for judicial opinions. (66) Because the "price" of citing opinions is zero, judges will compete on quality. As they put it, "[a]ll judges will cite the best opinions," (67) and therefore, the best judges will garner the most citations.
Many critics, however, have questioned how well citation counts actually correlate with merit. (68) In addition, there are plausible arguments that citation measures may be correlated with judicial "vices." (69) One claim is that citation counts reward originality, so these measures will reward judges who change the law rather than follow it. (70) Another argument is that unclear opinions may create uncertainty and generate more litigation, thus generating more citations. (71) Finally, "an opinion notorious for being 'wrong' might also lead to many cites." (72)
Because citation counts might be associated with judicial vices as well as judicial virtues, theory alone cannot validate the use of citation counts as a surrogate for quality. Determining which judicial characteristics constitute virtue and vice is a matter of normative theory. But for any conception of judicial quality, determining whether citations are more strongly associated with virtue or vice is an empirical question. Of course, this is impossible to test without first specifying a normative benchmark. (73)
Many studies simply assume the validity of citation counts as a surrogate for quality, acknowledging that citations could be correlated with judicial vice, but dismissing this possibility as unlikely. (74) A mere positive correlation, however, is not sufficient to validate citations as a surrogate for quality. (75) Scholars who seek to use citation measures to inform policy decisions must be able to convey uncertainty about their assessments of judicial merit. This cannot be done without a nuanced understanding of the relationship between citations and the conception of merit that is employed.
Choi and Gulati also defend the proposed use of citation counts in the selection of Supreme Court justices on the ground that "objective factors will do better than what we have now: a biased and nontransparent process overwhelmed by politics." (76) The use of objective measures, however, cannot displace normative debates about judicial merit. Citation counts cannot be validated as a surrogate without first articulating a conception of merit.
In addition, there are many objective measures that could potentially be used to evaluate judges. How could one choose among them without some normative baseline for comparison? To illustrate, compare the citation measures proposed by Choi and Gulati with alternative measures proposed by Robert Anderson. (77) Whereas Choi and Gulati do not distinguish among positive, negative, and neutral citations in their measures, Anderson interprets negative citations as evidence of low quality and ignores neutral citations. Choi and Gulati count only citations from outside a judge's circuit, whereas Anderson counts citations from both inside and outside a judge's circuit. Finally, Choi and Gulati count citations to all opinions authored by a judge, while Anderson counts citations to all decisions in which the judge was on the panel.
Not surprisingly, these two methodologies yield very different rankings. (78) Even if both of these measures could potentially be useful for measuring judicial performance, how could one know which measure to use? Is the difference between the two measures primarily methodological, in the sense that one method might be a better surrogate for a common conception of judicial quality? Or is the difference primarily normative, in the sense that the measures serve as surrogates for competing conceptions of judicial merit?
Anderson characterizes the differences between the measures as both methodological and normative. He justifies the exclusion of negative citations on normative grounds, arguing that negative citations may be appropriate for measuring influence, but that only positive citations are appropriate for measuring quality. (79) Similarly, he justifies examining panel membership rather than opinion authorship on the grounds that it "capture[s] collegial factors that should enter into a measure of good judging." (80) But he also claims that part of the difference is methodological, arguing that using panel membership is appropriate because it "mitigate[s] the effects of selection bias in opinion assignment." (81)
To the extent that the difference between the two measures is methodological, one cannot assess which is a better surrogate without specifying a conception of judicial merit. And to the extent the difference is normative, the use of objective and quantifiable measures cannot displace normative debates about judicial merit. Either way, one cannot choose between these two measures without taking a position in the normative debate that the citation studies are purporting to circumvent.
Nevertheless, citation counts could conceivably be validated subjectively. Even though conceptions of judicial quality are inherently subjective, objective data could still be used to inform those subjective judgments. Scholars, for example, could survey informed observers about their perceptions of judges' relative competence or the quality of particular opinions. On certain dimensions of judicial quality, there is likely to be strong agreement. To take some extreme examples, everyone would agree that Chief Justice John Marshall was a greater judge than his contemporary Gabriel Duvall, (82) or that Learned Hand (83) was superior to his colleague Martin Manton, who went to prison for accepting bribes. (84) On other dimensions, however, assessments of judicial quality are likely to be disputed. For example, a comparison of Justices Sotomayor and Alito will likely depend on one's ideological leanings.
Such surveys could reveal the degree to which conceptions of judicial merit are shared and the degree to which they are disputed. The grounds for disagreement could potentially be approximated by a small number of salient dimensions, such as liberalism versus conservatism or pragmatism versus formalism. Empirical studies of citations can never tell us what kind of judge we ought to prefer, but they might conceivably shed light on how judges measure along these dimensions of judicial merit. To the extent that there are commonly shared conceptions of quality, these objective measures might at least be able to distinguish good judges on each side of the ideological spectrum from mediocre ones. Of course, survey responses do not indicate merit in an objective sense, but at least they would correspond to the conceptions of merit that are prevalent in scholarly dialogue or democratic deliberation.
It is essential, however, that citation measures be validated as surrogates for some discoverable measure of quality. In theory, it may seem plausible that good judges would be more productive and write better opinions, and that better opinions would generate more citations. But judicial craft is only one factor--and possibly a minor one--in determining how often a case is cited. Even a cursory examination can show that citation counts do not correspond very well to commonly held perceptions of judicial merit.
Consider two canonical torts cases--Palsgraf v. Long Island Railroad Co. (85) and United States v. Carroll Towing Co. (86)--which are taught in virtually every first-year law school torts class. Judge Cardozo's opinion in Palsgraf, which has been described as "[p]erhaps the most celebrated of all tort cases," (87) has been cited 218 times in published opinions in federal and state courts. (88) Judge Hand's opinion in Carroll Towing, which formulated the "Learned Hand rule" for negligence liability and has been described as one of the "two most influential opinions that Hand ever wrote," (89) has been cited a total of 177 times. (90) By comparison, the opinion in Bonnet v. City of Prichard, (91) which holds that all Fifth Circuit decisions handed down prior to October 1, 1981 are binding precedent in the Eleventh Circuit, has been cited 4311 times. (92)
Similarly, Marbury v. Madison (93) has been cited 252 times in Supreme Court opinions, barely more than once per term. (94) McCulloch v. Maryland (95) has been cited 326 times in Supreme Court opinions, (96) less than twice per term. But United States v. Detroit Timber & Lumber Co., (97) which held that the syllabus is not part of the opinion of the Court, has been cited 4362 times in the U.S. Reports. (98)
These examples illustrate that frequency of citation does not necessarily correspond to commonly held perceptions of the importance of a holding or the quality of the written opinion. A more detailed comparison of Supreme Court decisions confirms this same pattern. Figure 1 compares a selection of Supreme Court decisions, displaying how often each case was cited per year in reported federal cases. (99) Canonical constitutional cases are dwarfed by holdings on frequently litigated issues such as standards for summary judgment and pleading requirements. Brown v. Board of Education (100) is cited 29 times per year, whereas Anderson v. Liberty Lobby (101) and Celotex Corp. v. Catrett (102)--two decisions providing standards for summary judgment--are each cited more than 1600 times per year. Since it was decided in 1986, Anderson v. Liberty Lobby has been cited almost 45,000 times, roughly as many times as every case decided by the Marshall Court combined. (103)
Although judicial merit may well influence how often a judge's opinions are cited, these examples show that citation counts are strongly influenced by factors unrelated to merit, such as how often an issue is presented in litigation. Conceivably, such factors might be less relevant when comparing citation counts at the level of individual judges. If each judge decides a mix of high- and low-profile cases over time, then citation counts aggregated by judge might better correlate with commonly held perceptions of merit. Such a claim is difficult to test, largely because judicial merit is contested, and even subjective perceptions are difficult to quantify. But citation statistics for Judge Learned Hand and his contemporaries on the Second Circuit, as reported in an article by Judge Richard Posner, (104) raise serious questions about the validity of these measures, even when aggregated by judge. Using Posner's results, I compiled statistics on opinions authored and citations per year for judges who were active from 1925 until 1939, when Learned Hand and Martin Manton served together. (105) The statistics are based on published majority opinions, and the citation counts include only citations by federal courts of appeals.
It may provide some reassurance that Learned Hand dominates his contemporaries, including Manton, in citations per year. But Manton has more citations per year than highly respected judges such as Thomas Swan and Augustus Hand. (106) Moreover, in opinions per year--the measure of "productivity" used by Choi and Gulati--Manton easily outpaces all of the other Second Circuit judges, including Learned Hand.
Manton's deficiencies were not merely ethical; he was held in low esteem even before evidence of his corruption had surfaced. Learned Hand had a poor opinion of Manton, perceiving him as "incapable of turning out memoranda and opinions that could earn him the respect from the bar or bench." (107) Chief Justice Taft believed that Manton "never should have been appointed to the bench in the first place." (108) Other prominent contemporaries described him as "unfit for the bench" (109) and "one of Wilson's worst appointments."(110) Yet in terms of two quantitative measures commonly used to evaluate judges--"opinion quality" and "productivity"--Manton compares quite favorably to most of his Second Circuit contemporaries. If sufficient weight were given to judicial "productivity," Manton might even rank above Learned Hand.
The fact that such quantitative measures cannot distinguish highly respected circuit judges from a judge widely regarded as one of the worst in history raises serious questions about whether these measures are valid surrogates for quality. Perhaps a more careful analysis of judicial citations might yield useful information about some conceptions of judicial merit. The analysis here, for example, did not distinguish between positive and negative citations, or between in-circuit and out-of-circuit citations. It may also be possible to control for outlier opinions that are highly cited, such as those involving summary judgment. Various statistical adjustments could potentially lead to more refined citation measures that more accurately reflect some conception of judicial merit. The multiplicity of possible adjustments, however, forces a choice among them, and that choice requires some external conception of merit against which the competing adjustments can be compared.
III. REVERSAL RATES
Reversal rates are a commonly used outcome measure in empirical studies of judicial decisionmaking and are widely used to justify normative claims about judges and courts. In the last decade, more than 1000 law review articles included some mention of reversal or affirmance rates, although many uses were purely descriptive. (111) Like citation counts, reversal statistics are easy to calculate but can be difficult to interpret.
Many scholars have advocated using reversal rates as indicators of judicial quality, (112) and some state courts use reversal rates in judicial performance evaluations. (113) Reversal rates are also commonly used to evaluate circuits, with one study even assigning letter grades to the various circuits based on how often they are reversed by the Supreme Court. (114) In debates about splitting the Ninth Circuit, scholars and judges have often discussed the Ninth Circuit's high reversal rate and debated its normative significance. (115) Judges themselves have considered reversal rates in trial courts and administrative proceedings in determining whether procedures were adequate under the due process clause. (116) A growing literature in patent law has examined how often the Federal Circuit reverses claim construction decisions by district judges and debated the implications of the reversal rate. (117) One recent study evaluated economic training programs for district court judges by measuring how often their decisions in antitrust cases were appealed and reversed. (118)
Reversal rates have been prominently featured in debates about reforming asylum adjudication. In 2002, then-Attorney General John Ashcroft adopted "streamlining" rules for the Board of Immigration Appeals (BIA), which permitted decisions by immigration judges to be affirmed by a single BIA member in an unsigned opinion. (119) The rate at which the BIA reversed immigration judge decisions plummeted, leading some commentators to criticize the streamlining reforms for allowing errors to go uncorrected. (120) Ashcroft contended that these reversal rates had no significance (121) but then went on to claim that "the BIA streamlining reforms were a profound success" because fewer than ten percent of BIA decisions were reversed by circuit courts. (122) Yet in a widely noted opinion, Judge Richard Posner cited the BIA's reversal rate in the Seventh Circuit as evidence that immigration adjudication had "fallen below the minimum standards of legal justice." (123)
Legal scholars seem to think that reversal rates are worth discussing, but they rarely articulate why these rates are purportedly meaningful. Often, reversal rates are conflated with error rates or imbued with unwarranted normativity. One study, for example, found that more than two-thirds of death sentences in state courts are ultimately overturned on appeal. (124) After performing a highly sophisticated statistical analysis to examine what factors predicted reversal, the authors considered policy options to reduce reversal rates. First, they proposed rules requiring disclosure of exculpatory evidence and more funding for defense lawyers at the trial stage. (125) But then they noted that reversal rates could also be reduced by limiting the grounds for reversal and withdrawing funding for attorneys who represent death-row inmates at the post-conviction stage. (126) The authors did not actually advocate the latter proposals, acknowledging that "[t]he positive impact of such policies is questionable." (127) The fact that they simultaneously considered increasing funding for trial lawyers and defunding appellate lawyers, however, suggests that they were asking the wrong question. By conflating reversals with errors, (128) the authors lost sight of their normative goals. Defunding appellate lawyers may well reduce reversals, but this should serve as a reminder that the reduction of reversal rates is not a worthy end in itself.
As with citation counts, reversal rates do not have any intrinsic normative significance; they are only useful insofar as they can shed light on other normatively significant quantities, such as error rates. A reversal is a good outcome when the lower court was wrong, but it is a bad outcome when the lower court was correct. If the applicable law is indeterminate, a reversal reflects the fact that the higher and the lower courts are exercising discretion differently. Reversal rates, however, aggregate "good reversals" and "bad reversals," as well as "ambiguous reversals" when the law is indeterminate.
Although reversal rates are commonly used to measure error rates of lower courts, they accurately reflect error only when four conditions are satisfied: (1) the law is always determinate; (2) both courts are addressing the same legal question and relying on the same sources of law; (3) all cases are appealed; and (4) the higher court is always correct. Scholars can debate whether and when the first two conditions hold, (129) but the third is rarely satisfied and the fourth is almost always implausible. Thus, additional assumptions are necessary to draw normative conclusions from reversal rates.
The proportion of cases that are appealed is especially relevant when the reviewing court is the U.S. Supreme Court, which hears only a tiny fraction of petitioned cases. For example, Judge Jerome Farris observed that the Supreme Court reversed the Ninth Circuit in 28 out of 29 cases it reviewed in 1997. (130) Yet he defended the Ninth Circuit by arguing that the Court let stand more than 99% of all Ninth Circuit decisions from the previous year. (131)
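A back-of-the-envelope calculation shows how both of these figures can be accurate at once: the reversal rate among reviewed cases and the share of all decisions disturbed answer different questions, because the Supreme Court selects which cases to hear. The sketch below takes the 28-of-29 figure from Judge Farris's example; the total of 4,000 annual Ninth Circuit merits decisions is an assumed round number for illustration only, not a figure from the sources cited here.

```python
# Toy illustration of selective review: a very high reversal rate among
# the few cases the Supreme Court hears can coexist with a very low
# reversal rate across all of a circuit's decisions.

total_decisions = 4000   # assumed annual merits decisions (illustrative)
reviewed = 29            # cases the Supreme Court agreed to review
reversed_cases = 28      # reversals among the reviewed cases

rate_among_reviewed = reversed_cases / reviewed
rate_overall = reversed_cases / total_decisions

print(f"Reversal rate among reviewed cases:   {rate_among_reviewed:.1%}")  # 96.6%
print(f"Share of all decisions reversed:      {rate_overall:.2%}")         # 0.70%
print(f"Share of all decisions left standing: {1 - rate_overall:.2%}")     # 99.30%
```

Both sides of the Ninth Circuit debate can thus cite the same underlying data: which denominator matters is itself a normative question.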
A further complication is that reviewing courts do not necessarily consider the same legal issues as lower courts. When lower court decisions are reviewed under a deferential standard, a reversal might be stronger evidence of error, or at least strong disagreement. Failure to reverse, however, does not show that the higher court believed that the lower court judgment was correct.
In addition, higher courts and lower courts are often bound by different sources of law, even when resolving the same dispute. A circuit court panel may reach a result that is compelled by circuit precedent, but the Supreme Court would not be bound by the same circuit precedent. The Supreme Court also has the authority to overrule its own precedent, whereas a circuit court is obligated to follow such precedent until it is overruled by the Supreme Court. (132) Thus, reversal by the Supreme Court may well represent the application of different legal principles rather than disagreement about the same legal principles. In other words, the Supreme Court can overrule a circuit court, and both can still be correct.
Consider Judge Richard Posner's opinion in Khan v. State Oil Co., (133) an antitrust case involving maximum resale price maintenance. Judge Posner believed the outcome was controlled by the Supreme Court's holding in Albrecht v. Herald Co. (134) Posner criticized Albrecht at length, describing it as "unsound when decided, and ... inconsistent with later decisions by the Supreme Court." (135) He continued: "It should be overruled. Someday, we expect, it will be." (136) In a not-so-subtle signal to the Supreme Court, Posner wrote, "Yet despite all its infirmities, its increasingly wobbly, moth-eaten foundations, Albrecht has not been expressly overruled." (137)
Presumably, Judge Posner was not disappointed when the Supreme Court reversed him unanimously and overruled Albrecht, (138) relying extensively on his reasoning in the Seventh Circuit decision. (139) In this example, it would certainly be reasonable to assert that the Supreme Court was correct to overrule Albrecht, but that Posner was also correct to follow Albrecht despite his disagreement with its holding. From this point of view, the reversal does not reflect poorly on Posner; it resulted from the fact that the Seventh Circuit and the Supreme Court were bound by different sources of law. To the contrary, this reversal demonstrates Posner's influence, since he was able to convince the Court to hear the case and overrule a longstanding precedent that he disfavored.
To support any kind of conclusion about error rates, reversal rates must be interpreted in conjunction with some kind of assumptions about the relative competence of higher and lower courts and the determinacy of the law in the cases being analyzed. (140) In debates about the performance of the Ninth Circuit, for example, Judge Diarmuid O'Scannlain has cited the Ninth Circuit's high reversal rate in the Supreme Court as evidence that the Ninth Circuit "got it wrong" in a large majority of the cases that were reviewed. (141) Arthur Hellman has argued that, irrespective of whether the ultimate outcome is correct, "it is not healthy when an intermediate court is reversed repeatedly by the highest court in the structure." (142) But others have argued that the reversal rate reflects positively on the Ninth Circuit. According to Michelle Landis Dauber, the problem was "not that the 9th Circuit [was] getting the law wrong" but rather that "the Rehnquist Court [was] changing the law." (143) Judge Stephen Reinhardt, the most frequently reversed circuit judge in the federal courts, (144) is said to view his reversal rate as a "mark of distinction." (145) Judge Richard Posner, on the other hand, argues that reversal rates are meaningless statistics because "reversals by the Supreme Court often involve disagreement rather than the correction of error, and ... the Supreme Court has neither the capacity nor the incentive to review more than a tiny percentage of federal courts of appeals decisions." (146)
The above commentators agree about what the Ninth Circuit's reversal rate is, but they have sharply differing views about its normative implications. Reversal rates may be objective, but they must be interpreted in conjunction with contestable assumptions about the relative competence of higher and lower courts, the institutional obligations of the lower courts, and the determinacy of the law in the cases under examination. Judge O'Scannlain's conclusions appear to rest on a belief that the Supreme Court is usually correct when it disagrees with the Ninth Circuit; Judge Reinhardt's and Dauber's viewpoints are premised upon a more negative view of the Supreme Court. Hellman's position is predicated on a view that inferior courts should try to predict how the Supreme Court will rule, but Dauber disagrees, arguing that "the job of an intermediate court does not entail ... trying to divine what the current members of the Supreme Court might do if and when they get the case." (147) Posner's view, on the other hand, reflects his view of the Court as a "political body" (148) rather than as a tribunal resolving legally determinate disputes.
As these conflicting interpretations demonstrate, scholars must be explicit about the premises that underlie their normative conclusions. These premises, moreover, must be plausible. Simple but implausible assumptions such as "the higher court is always correct" may support straightforward interpretations of reversal rates, but such conclusions have little value. What is needed are methods for combining objective data on reversals with plausible assumptions to generate useful conclusions that can inform policymaking.
A study of jury verdicts by Bruce Spencer provides an instructive example. Spencer examined disagreement between juries and judges in trial courts, but the same approach applies to disagreement between higher and lower courts within the judicial hierarchy. Using data in which trial judges had been surveyed about the correct outcome, Spencer estimated the accuracy of jury verdicts under the assumption that the judge is at least as likely to be correct as the jury. (149) Of course, he could have considered alternative assumptions as well. Stronger assumptions--such as that the judge is twice as likely as the jury to be correct--would have yielded sharper inferences. Similarly, weaker assumptions--such as that the judge is correct at least 10% of the time--would have yielded weaker inferences. By interpreting the data according to a variety of assumptions, empirical scholars can make their findings interpretable to an audience with a diverse range of viewpoints.
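Spencer's bounding strategy can be illustrated with a stylized sketch. The model and figures below are hypothetical and are not Spencer's actual estimator: the sketch assumes that in cases where judge and jury disagree exactly one of them is correct, treats the judge-to-jury odds of being correct in those cases as the analyst's assumption, and uses an invented 78% agreement rate. It shows how the strength of the assumption governs the sharpness of the resulting bound on jury accuracy.

```python
# Hypothetical sketch (not Spencer's actual model) of bounding jury
# accuracy from judge-jury agreement data under varying assumptions.
# Illustrative model assumptions:
#   - In disagreement cases, exactly one decisionmaker is correct.
#   - In agreement cases, both may be correct, so the upper bound takes
#     the best case for the jury there.

def jury_accuracy_upper_bound(agreement_rate, judge_to_jury_odds=1.0):
    """Upper bound on jury accuracy, assuming the judge is
    `judge_to_jury_odds` times as likely as the jury to be the
    correct one in disagreement cases."""
    disagreement = 1.0 - agreement_rate
    # Jury is correct in at most 1 / (1 + odds) of disagreement cases.
    jury_share = 1.0 / (1.0 + judge_to_jury_odds)
    return agreement_rate + disagreement * jury_share

# 0.78 is an invented round agreement rate, for illustration only.
for odds in (1.0, 2.0):  # "at least as likely" vs. "twice as likely"
    bound = jury_accuracy_upper_bound(0.78, odds)
    print(f"judge-to-jury odds {odds}: jury accuracy at most {bound:.1%}")
```

Under these assumptions, assuming only that the judge is at least as likely to be correct yields a looser bound (89.0%), while the stronger twice-as-likely assumption yields a sharper one (85.3%), mirroring the tradeoff described above.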
[Figure 1 omitted: illustration not available due to copyright restrictions.]
Title Annotation: Introduction through III. Reversal Rates, p. 117-146
Author: Fischman, Joshua B.
Publication: University of Pennsylvania Law Review
Date: Dec 1, 2013