# Bayesian statistics: how to quantify uncertainty

In science, as in life, we are continually presented with new evidence that potentially alters our view of reality. We have hypotheses about reality and these hypotheses are subject to adjustment or replacement in the light of new information. However, we are rightly reluctant to discard well-established models of reality every time some new information calls them into question. To do so would place us at the mercy of poorly collected data and chance outliers. New evidence must be weighed against pre-existing evidence and alternative explanations for new data must be considered.

The statistical tests most commonly used in scientific publications (known as frequentist statistics) do not take all of the above considerations into account. Results of clinical trials are usually interpreted using the P value (or the related concept of the confidence interval), which is calculated using data only from an individual trial and therefore takes no account of pre-existing evidence. The statistical approach that is specifically designed to assess the overall weight of evidence for or against a hypothesis, and to compare competing hypotheses, is known as Bayesian statistics. It is beyond the scope of this editorial to fully explore the relative merits of Bayesian versus frequentist statistics. However, this editorial aims to give clinicians with minimal technical statistical background (which includes the author) a brief introduction to some important limitations of frequentist statistics, the Bayesian methods that address those limitations, and Bayesian statistics' own limitations.

Bayes' theorem can be expressed in terms of the odds in favour of one hypothesis over another (1):

P(H₁|data)/P(H₂|data) = P(H₁)/P(H₂) × P(data|H₁)/P(data|H₂)

This elegant formula is easily understood once the notation is explained. The notation P(A|B) represents a conditional probability: the probability of "A given B". The left side of the equation gives the odds in favour of one hypothesis (H₁) compared with a competing hypothesis (H₂) after considering new data. This is known as the posterior odds. The first ratio on the right side of the equation is known as the prior odds: the relative probability of the two hypotheses based on evidence prior to acquiring the new data. The second ratio on the right side is known as the likelihood ratio and is a measure of how strongly the new data supports one hypothesis over the other. P(data|H₁) is the probability of observing the new data if one assumes H₁ is true, and P(data|H₂) is the probability of observing the new data assuming the alternative hypothesis. To express the above formula in words:

Posterior odds = prior odds × likelihood ratio

A couple of simplified examples might help explain Bayes' formula. The 95th centile for height of Australian males is 188 cm. Assume I meet a man whom I hypothesise to be Australian, but then I measure his height at 189 cm. Based on his height, what is the probability the man actually is Australian? Clearly, there is not enough information to answer the question: the man is taller than 95% of Australians, but it would seem silly to conclude there is a 95% probability he is not Australian. If we look at Bayes' formula we can see all the other factors that need to be considered. First, the prior odds: how confident am I in the initial hypothesis that he is Australian? Depending on his accent and whether I met the man in Sydney, Chicago or Shanghai, I might assign very different prior odds. Second, the likelihood ratio: we need to know if the 95th centile for height differs between nationalities. If the likelihood of height >188 cm is also less than 5% in non-Australians, then height is a poor discriminator, the likelihood ratio will be close to 1, and the man's height ought not to influence my confidence in the hypothesis that he is Australian. This example is intended to illustrate the problem of inverse probability. In symbols, this is expressed as P(A|B) ≠ P(B|A). The probability of an Australian male being >188 cm tall is not the same as the probability of a male >188 cm being Australian.
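In code, the odds form of Bayes' theorem is a single multiplication. The sketch below replays the height example with assumed numbers: the 3:1 prior odds are hypothetical, and the non-Australian height figure follows the text's "suppose it is also less than 5%" scenario.

```python
def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes' theorem in odds form: posterior = prior x likelihood ratio."""
    return prior_odds * likelihood_ratio

# From the text: P(height > 188 cm | Australian) = 0.05 (the 95th centile).
# Suppose, as the text hypothesises, the same is true of non-Australians:
lr = 0.05 / 0.05     # likelihood ratio = 1: height is a poor discriminator
prior = 3.0          # hypothetical prior odds, e.g. the man was met in Sydney
print(posterior_odds(prior, lr))   # 3.0: the height datum changes nothing
```

A likelihood ratio of exactly 1 leaves the prior odds untouched, which is the point of the example.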

A second, more real-world example is the interpretation of a diagnostic test such as a screening test for cancer. Bayes' formula can be used to calculate the odds that a patient has either cancer (Ca) or no cancer (Ca⁻) when a test is positive (T⁺):

P(Ca|T⁺)/P(Ca⁻|T⁺) = P(Ca)/P(Ca⁻) × P(T⁺|Ca)/P(T⁺|Ca⁻)

The likelihood ratio incorporates the sensitivity and specificity of the diagnostic test. P(T⁺|Ca) is the sensitivity, i.e. the probability of a positive test when disease is present. P(T⁺|Ca⁻) is the probability of a positive test when the patient does not have the disease, which is equal to (1 − specificity). If our test has a sensitivity of 90% and a specificity of 80%, the likelihood ratio is 0.9/0.2 = 4.5. For example, if a patient is randomly selected from a population in which the prevalence of the cancer is 0.1% (prior odds of 1 to 999), then a positive test increases the odds to 4.5 to 999, a probability still below 0.5%. However, if the patient is known to be in a high-risk group with a 10% prevalence (odds of 1 to 9), then the same positive test increases the odds to 4.5 to 9 (a probability of 33%). Many readers will already be aware of these concepts, although they may be more familiar with the terms positive predictive value (rather than posterior odds) and pre-test probability (rather than prior odds).
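The worked numbers in this example can be checked with a few lines of Python. This is a sketch of the odds arithmetic only, using the sensitivity, specificity and prevalences quoted in the text:

```python
def posterior_odds(prior_odds, sensitivity, specificity):
    """Odds form of Bayes' theorem for a positive test result."""
    likelihood_ratio = sensitivity / (1 - specificity)
    return prior_odds * likelihood_ratio

def odds_to_probability(odds):
    return odds / (1 + odds)

# The test from the text: sensitivity 90%, specificity 80% -> LR = 4.5.
# Screening setting: prevalence 0.1%, i.e. prior odds 1:999.
post = posterior_odds(1 / 999, 0.90, 0.80)
print(round(odds_to_probability(post), 4))   # 0.0045: still under 0.5%

# High-risk group: prevalence 10%, i.e. prior odds 1:9.
post = posterior_odds(1 / 9, 0.90, 0.80)
print(round(odds_to_probability(post), 2))   # 0.33, the 33% from the text
```

The same positive result means something very different in the two settings, purely because the prior odds differ.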

Bayes' theorem can also be applied to the interpretation of clinical trials. Some readers may be wondering if this Bayesian approach is a bit too complex. Surely, interpretation of clinical trials is more straightforward: can't we simply use the P value to tell us the probability the null hypothesis is true? Unfortunately, although it is a commonly held belief, and is even stated in many statistics textbooks, the P value is not the probability of the null hypothesis being true; nor is it the probability of an alpha error, or the probability that the observed result is due to chance alone (1,2). This is another example of the problem of inverse probability: there are two probabilities that should not be confused, because they are not interchangeable: 1) the probability of a result given the null hypothesis and 2) the probability of the null hypothesis given a result. The P value gives the first probability, but it is the second that we really want to know when it comes to informing clinical decision-making.

Interestingly, these two probabilities actually represent different meanings of the word "probability". The P value is an example of a frequentist probability. It answers the question, "How frequently would the observed result occur if the trial was repeated many times and if the null hypothesis is true?". Conversely, the probability that the null hypothesis is true is not a question of frequency: it's either true or it isn't. This second type of probability answers the question, "Given the available data, how confident should we be that the hypothesis is true?". This is known as subjective probability because it changes depending on the current state of knowledge of the individual making the assessment. Subjective probability is often of prime interest in real world decision-making. Other examples include the probability of a war in the next five years, the probability of dangerous sea level rises from global warming, and the probability of an individual patient surviving an operation.
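The frequentist meaning of probability can be made concrete with a small simulation. This is a sketch with invented trial sizes, not data from any real study: each simulated "trial" compares two arms drawn from the same distribution, so the null hypothesis is true by construction, and we count how often the usual 5% significance criterion is nonetheless met.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the simulation is reproducible

n_trials, n_per_arm = 10_000, 50
significant = 0
for _ in range(n_trials):
    # Both arms drawn from the same normal distribution: H0 is true.
    a = [random.gauss(0, 1) for _ in range(n_per_arm)]
    b = [random.gauss(0, 1) for _ in range(n_per_arm)]
    diff = statistics.mean(a) - statistics.mean(b)
    se = math.sqrt(2 / n_per_arm)          # standard error of the difference
    if abs(diff) > 1.96 * se:              # two-sided 5% criterion
        significant += 1

# The long-run frequency of "significant" results approximates alpha = 0.05.
print(significant / n_trials)
```

The simulation answers only the frequentist question (how often would this happen under the null?); it says nothing about the subjective probability that any particular hypothesis is true.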

The practicalities of applying Bayesian statistics in health care can be only briefly touched on here, but they are comprehensively described in an excellent review by Spiegelhalter and colleagues (3). In the first formula above, the symbols H₁ and H₂ can be replaced with H₀ for the null hypothesis (no effect) and Hₐ for an alternative hypothesis (e.g. a clinically significant effect). To derive the likelihood ratio, the Bayesian statistician compares the likelihood of the data given the null hypothesis, P(data|H₀)*, with the likelihood of the data given the alternative hypothesis, P(data|Hₐ). The likelihood ratio encapsulates the strength of evidence provided by the particular data set. Furthermore, with negative studies the likelihood ratio discriminates between underpowered and adequately powered studies. With an underpowered study, the likelihood ratio is close to 1 because the data is consistent with either hypothesis. In contrast, a likelihood ratio strongly favouring the null hypothesis indicates that the negative study was adequately powered and the data should be used to reduce our confidence in the alternative hypothesis.
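As an illustration only, the likelihood ratio for a trial result can be computed by evaluating the likelihood of the observed effect under each hypothesis. The effect sizes and standard errors below are invented for the sketch, and normal likelihoods are assumed:

```python
import math

def normal_pdf(x, mean, sd):
    """Density of the normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical trial: an observed treatment effect of 4 units.
# H0: true effect = 0.  HA: a clinically significant effect of 5 units (assumed).
observed = 4.0

# Underpowered study: large standard error.
lr = normal_pdf(observed, 5.0, 5.0) / normal_pdf(observed, 0.0, 5.0)
print(round(lr, 2))   # 1.35: the data barely discriminate between hypotheses

# Well-powered study with the same observed effect: small standard error.
lr = normal_pdf(observed, 5.0, 1.0) / normal_pdf(observed, 0.0, 1.0)
print(round(lr))      # about 1800: strong evidence favouring the alternative
```

The same point estimate thus carries very different evidential weight depending on the precision of the study, which is exactly what the likelihood ratio captures.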

One particular advantage of the Bayesian approach is that it takes emphasis away from the artificial dichotomy of declaring a result either significant or non-significant. In Bayesian statistics, all data counts as evidence. Whether or not new evidence from an individual trial causes us to shift our view of reality depends on the strength of the new evidence, the degree to which the new evidence is different to previous evidence, and the strength of the previous evidence.

The most difficult and controversial step in Bayesian analysis is determining the prior odds. This is especially the case when there is little objective data on which to base the 'priors'. For example, consider a clinical trial comparing a novel drug with the best currently available drug. Presumably, such a trial would not have been conducted unless someone thought there was a possibility the new drug would represent an improvement; but how to assign a prior probability that the new drug is truly superior? In the complete absence of any information, the simplest approach is to assume 'equal priors', i.e. prior odds of 1:1 (probability of 50%). This approach is considered arbitrary and unsatisfactory by many commentators. Another approach is to include 'subjective priors', which may incorporate the opinions of experts. Unfortunately, humans are notoriously poor at converting their knowledge into probabilities (4) and being expert in the subject area does not help. Exhaustive methods have been described for converting expert opinion into a valid probability distribution that can then be used to derive the prior odds (3). It is beyond the scope of this editorial to describe the mathematical challenges when defining prior probability distributions and then combining those distributions with the likelihoods of new data. Suffice it to say that these problems were often intractable before the advent of modern computing.

Discovering a useful new therapy is difficult, so it is to be expected that most novel therapies do not turn out to be useful. In other words, it is over-optimistic to assume prior odds of 50:50 that a new therapy will be useful. That being the case, Bayes' theorem tells us that when the level of statistical significance is set at P <0.05, the probability of being wrong when rejecting the null hypothesis is usually far greater than 5%. In a controversial and oft-cited article, one author has concluded that the majority of published findings are false (5)! Fortunately, science is self-correcting, and false or over-optimistic findings are eventually moderated by subsequent information, ideally in the form of larger, better-designed trials. Examples of therapies that were enthusiastically adopted then later found to be harmful include magnesium after myocardial infarction (6) and perioperative beta blockade (7).

Rather than calculate the probability of a particular hypothesis being true, it is often more informative to consider a range within which the true value may lie. Bayesian statistics can be used to calculate the credible interval, which defines a degree of confidence that the true value lies within a specified interval (2,3). The credible interval is similar to, but not the same as, the more familiar confidence interval. Like the posterior odds, the credible interval is calculated from both the prior information and the new data. For a further brief explanation of credibility analysis, readers are referred to a web page that includes a free online calculator for determining the critical odds ratio, which can be used to quickly determine whether a new result seems credible in the light of prior knowledge (8).
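A credible interval can be sketched in the simplest conjugate case, where both the prior and the likelihood are normal. The sceptical prior and trial result below are invented numbers for illustration, and a normal approximation is assumed throughout:

```python
import math

def normal_posterior(prior_mean, prior_sd, data_mean, data_se):
    """Conjugate normal update: precision-weighted average of prior and data."""
    w_prior = 1 / prior_sd ** 2
    w_data = 1 / data_se ** 2
    post_mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
    post_sd = math.sqrt(1 / (w_prior + w_data))
    return post_mean, post_sd

# Sceptical prior centred on no effect; a new trial reports a large effect.
mean, sd = normal_posterior(prior_mean=0.0, prior_sd=2.0,
                            data_mean=5.0, data_se=2.0)
lo, hi = mean - 1.96 * sd, mean + 1.96 * sd    # 95% credible interval
print(round(mean, 2), (round(lo, 2), round(hi, 2)))   # 2.5 (-0.27, 5.27)
```

Note how the sceptical prior pulls the trial's estimate of 5 back towards zero, and the resulting credible interval still includes no effect: the new result alone is not yet credible against the prior.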

Bayes' theorem combines all the elements required to derive a probability that a hypothesis is true, including considerations of study power. In contrast, the frequentist statistic, the P value, answers only a limited hypothetical question and is often mistakenly interpreted to mean more than it does. Why, then, are frequentist statistics ubiquitous in the medical literature (9)? First, frequentist statistics are much easier to compute. Second, as evidence accumulates for or against a hypothesis, frequentist statistics and Bayesian statistics eventually converge on the same conclusions. Third, deriving the prior odds from 'soft' sources such as expert opinion is problematic and anathema to many scientists. A Bayesian may counter this by pointing out that science routinely interprets new data in the light of prior confidence in hypotheses. Carl Sagan's famous aphorism, "Extraordinary claims require extraordinary evidence", is a pithy statement of this principle.

One of the few certainties in the practice of medicine is that we will never eliminate its uncertainty. Bayesian analysis does not eliminate uncertainty, but it attempts to put the degree of uncertainty into numbers. To summarise this author's take-home messages: 1) the certainty that a hypothesis is true cannot be based on a single trial, but must also take into account all the available information from outside the trial; 2) the P value tends to underestimate the probability of being wrong when accepting a positive result as valid; 3) it is unlikely Bayesian analysis will replace the citing of frequentist statistics in the medical literature in the near future, partly owing to the practical and philosophical difficulties in deriving the prior probabilities; and 4) notwithstanding the previous comment, wider appreciation of Bayesian principles could make practitioners more sceptical, with perhaps less of a tendency towards fads that arise and then need to be corrected.

REFERENCES

(1.) Goodman SN. Introduction to Bayesian methods I: measuring the strength of evidence. Clin Trials 2005; 2:282-290.

(2.) Sterne JA, Davey Smith G. Sifting the evidence-what's wrong with significance tests? BMJ 2001; 322:226-231.

(3.) Spiegelhalter DJ, Myles JP, Jones DR, Abrams KR. Bayesian methods in health technology assessment: a review. Health Technol Assess 2000; 4:1-130.

(4.) Tversky A, Kahneman D. Judgment under uncertainty: heuristics and biases. Science 1974; 185:1124-1131.

(5.) Ioannidis JPA. Why most published research findings are false. PLoS Med 2005; 2:e124.

(6.) ISIS-4 (Fourth International Study of Infarct Survival) Collaborative Group. ISIS-4: a randomised factorial trial assessing early oral captopril, oral mononitrate, and intravenous magnesium sulphate in 58,050 patients with suspected acute myocardial infarction. Lancet 1995; 345:669-685.

(7.) POISE Study Group, Devereaux PJ, Yang H, Yusuf S, Guyatt G, Leslie K et al. Effects of extended-release metoprolol succinate in patients undergoing non-cardiac surgery (POISE trial): a randomised controlled trial. Lancet 2008; 371:1839-1847.

(8.) Bayesian credibility analysis. From http://statpages.org/bayecred.html Accessed September 2011.

(9.) Bland JM, Altman DG. Bayesians and frequentists. BMJ 1998; 317:1151-1160.

T. J. MCCULLOCH

Department of Anaesthetics, University of Sydney, Sydney, New South Wales, Australia

Footnote

* P(data|H₀) is similar to the P value, except that the P value is the likelihood of a difference as great as or greater than the observed effect, whereas Bayesian statistics considers only the observed data.
COPYRIGHT 2011 Australian Society of Anaesthetists