
Measuring quality of care in psychiatric emergencies: construction and evaluation of a Bayesian index.

A primary purpose of emergency medical service systems is to ensure that the needs of the emergency victim are matched with the appropriate form of care, and to do so as quickly as possible so that the victim's illness or injury is not exacerbated. Health care systems today, however, are increasingly struggling with the trade-off between the goal of providing quality care and the goal of containing burgeoning health care costs. As a result, health care administrators are becoming more and more concerned about identifying precisely what makes a difference in the quality of patient care. In simplest terms, they want to know what improvements in the service delivery system are worth spending money on.

The growing demand for better tools of evaluation is especially strong among emergency departments (EDs), which must serve patients suffering from a wide array of physical and psychiatric symptoms. Useful tools particularly lacking are those for evaluating the ED treatment of emergency psychiatric patients, defined by Soreff (1981) as persons needing immediate assistance for an intrapsychic, interpersonal, or biological condition that the victim or others can no longer tolerate.

Psychiatric emergencies make up a significant part of the ED patient population: 10 percent of all emergency department visits nationwide (Bassuk, Winter, and Apsler 1983). ED physicians must assume responsibility both for those patients who use the ED to gain access to the mental health system (Bassuk and Berson 1980; Rund 1986) and for those who have been discharged by that system and thus have nowhere else to go (Nurius 1983-1984; Schulberg and Burns 1985). The quality of care these patients receive may be jeopardized by any number of factors (Jacobs 1982-1983; Knesper 1982), including the fast-paced, medically (as opposed to psychiatrically) oriented environment of the ED itself, which discourages detailed assessment of patients' mental health status (Soreff 1981; Schulberg and Burns 1985).

Thus, there is a pressing need to evaluate the effectiveness of ED care for psychiatric patients. Evaluating medical care is always difficult, but evaluating psychiatric care is even more so, because when the outcome measure typically used in medical evaluations--mortality--is used in evaluating psychiatric care, it generates too little variation to be useful. In an examination of the ED records on 2,231 emergency psychiatric patients for the purposes of the present study, for example, we found that only one patient had died after arriving at the ED.

Another measure--recidivism, or the patient's unscheduled return to the ED--is difficult to interpret because it is unclear if an increase in recidivism is a sign of better or of worse mental health care. Yet another possible measure--the patient's compliance with the ED's follow-up instructions--can be used in evaluating the extent to which the ED has accomplished its purpose when referring psychiatric patients to the mental health system. But this measure applies only to those patients who have the option to comply, namely, those discharged home. As it turns out, this population, although small (only 10 percent of the emergency psychiatric patients we examine in this article), does demonstrate a significant variation in this outcome. Nonetheless, patient compliance as a measure offers little insight into the source of variations within and across EDs in specific practices.

In this article, we present a measure that can be applied to all patients treated for psychiatric emergencies and that helps identify processes that may be the source of variations in quality. This measure is applicable to the entire population of psychiatric patients in the ED, and our evaluations of it have shown it to be a reliable, valid predictor of the quality of care, yielding sufficient variation to be a useful tool in ED evaluations. The next section describes principles we consider to be important in defining and measuring quality. Then we introduce the subjective Bayesian approach to measuring quality and describe the development of the index for measuring quality. Finally, we present the results of the evaluations we conducted using the index.


The literature documenting attempts to measure the quality of health care offers several general principles that, in our opinion, command substantial agreement in the field. These principles, which served as the underlying rationale for the methodology outlined in the next section, are now listed.

Quality is a judgmental concept. We agree with other authors about this characteristic of quality (Kincaid 1981; Willemain 1983; Georgopoulos 1986). The real operating definition of quality is based on values and perceptions. The key implication of this observation for the development of quality indexes is that they may be most useful if they are based on the judgments of experts: specifically insightful customers and respected practitioners in the field (Gustafson 1992a).

Different dimensions of quality are not equally important. The various aspects of health care quality can differ in their influence on patient outcomes depending not only on the physical or emotional status of the patient but also on the environment and other nonmedical characteristics of the patient's condition. These different influences must be reflected in their relative contribution to the quality of care index.

Dimensions of quality should incorporate structural, process, and outcome elements. Much has been said about the need for quality indexes that measure patient outcomes (Kane, Bell, Hosek, et al. 1984). But structural and process elements of health care (Donabedian 1980)--such as experience of health services personnel with mentally ill patients in crisis and the adequacy of the evaluation of a patient's danger to self and/or others--must also be included to reflect adequately the quality of patient care.

Measures of quality must be subject to standards of reliability and validity. Tests of interrater and intrarater reliability are crucial in determining whether the measure employed yields consistent results among different assessors and with the same assessor over time. The instruments and procedures used in applying the measure must promote this consistency. Furthermore, tests of construct validity are critical for determining whether the measure employed actually measures the intended variable.

Measures of quality should guide process improvement. The ultimate objective of any quality assurance activity should be the improvement in care delivered to patients (Berwick 1989). As such, the measure should not only quantify quality but also explain the reasons for good or bad quality in such a way that care processes can be improved.

Much of the work measuring the quality of health care violates several of these principles. Although efforts to develop indexes of quality have been numerous in some fields, such as long-term care (Gustafson et al. 1990; Gustafson 1992b), the emphasis in the field of emergency psychiatric care has been to develop guidelines for quality rather than measures of quality (Barton 1986).

One exception to this emphasis is the measure developed by Georgopoulos (1986), who asked ED physicians and nurses to rate the quality of care given to psychiatric patients, rape victims, and drug abuse cases. He then averaged the scores given by physicians within each of 30 hospitals and correlated these 30 composite observations with the composite quality scores given by nurses in the same EDs. The average scores given by physicians and by nurses correlated .58 for psychiatric illness cases, .59 for rape cases, and .35 for drug abuse cases. These scores were not assigned to specific patients but to the institution as a whole for each type of case. Thus, the measure represented a first step in demonstrating the necessity of the approach, but its usefulness is limited in terms of evaluating care for different patients or explaining differences in quality within a given ED.

The Bayesian index described in this article, on the other hand, is patient-specific; is based on the subjective judgments of experts; accounts for relative differences in patient characteristics; and includes different structural, process, and outcome dimensions of quality emergency psychiatric care. Moreover, as explained in the next sections, its validity and reliability have been demonstrated.


Bayesian Models for Indexing the Quality of Care

No matter what strategy is followed, the development of an index to measure quality requires that panels of experts identify a set of components or dimensions ($S_1, S_2, \ldots, S_n$) to constitute such an index. Once dimensions have been defined and measures or scales have been developed for each of them, the next challenge usually is to aggregate these dimensions into a single index. A variety of techniques exist to do so. These can be broadly classified into two main strategies: empirically derived methods based on sufficiently large data sets and subjectively derived methods based on panels of experts.

Empirically derived methods can consist, for example, of constructing a scoring rule using the dimensions ($S_1, S_2, \ldots, S_n$) as predictors of a previously collected measure of quality considered as a dependent variable. Statistical methods such as discriminant analysis, logistic regression, linear regression, and so on can be used depending on the nature of the data and the hypothesized underlying model. These approaches require both a complete data set and an existing holistic measure of quality that can serve as the dependent variable in the model being developed. As noted before, such a measure is usually not available in the context of interest. Another method used in the absence of such an outcome measure consists of collecting data on the dimensions ($S_1, S_2, \ldots, S_n$) and, through principal component analysis or factor analysis, creating a few linear combinations of these dimensions that will account for most of the observed variance in the original variables. These linear combinations are then usually aggregated into a single index by summing them using equal weights.

Subjectively derived methods take a very different perspective. The functional aggregation of the dimensions into an index is first determined through eliciting properties of the dimensions by investigating the preference structure of the experts who identified these dimensions. Then the necessary parameters, such as preferential weights, are quantified using specifically designed elicitation techniques. These methods, grounded in the field of decision theory, essentially encompass two main approaches: multiattribute utility (MAU) and Bayesian statistical models (Keeney and Raiffa 1976; von Winterfeldt and Edwards 1986). Both methodologies are designed to ensure that the index captures the consensus of panels of experts. The MAU models are usually well suited for evaluation types of problems, while the Bayesian models are usually well suited for diagnostic and prediction types of problems. Although MAU models perform well in some evaluations of health care quality (Gustafson et al. 1990), they are less powerful when constrained by missing data or by the irrelevance of some components to some patients. In effect, in such situations, the structure of the MAU model varies with the data, the components available, or both. Both of these issues arise in the evaluation of psychiatric emergencies: the ED records from which the data are collected are often incomplete or in part illegible, and the aspects of care important in treating substance abuse patients may not be the same as those that are important in treating psychiatric patients.

The structure of a Bayesian model remains unchanged if the components are conditionally independent of each other. Therefore, in the context of psychiatric emergencies, we considered this modeling strategy to hold promise and we selected it. A Bayesian model is derived by investigating the ways in which experts make and revise their predictions of good versus bad care of a patient given some indications of what has been done or not done during the care process. The problem can be expressed more formally. Suppose we have two hypotheses: $H_1$, that the patient's care was of acceptable quality, and $H_2$, that the care was not acceptable. Suppose further that several data ($S_1, S_2, \ldots, S_n$) are available to help make this judgment. Our objective is to calculate a score indicating the probability that the first hypothesis, $H_1$, is true, given a set of indicators ($S_1, S_2, \ldots, S_n$) that provides relevant information on the care the patient has received. Thus, the quality score for a patient will be given by $Q(S_1, S_2, \ldots, S_n) = P(H_1 \mid S_1, S_2, \ldots, S_n)$. Since it is a probability, this score will range from 0 to 1, with 0 indicating low quality of care and 1 indicating high quality of care. Using Bayes' theorem, one can decompose this probability in the manner now shown, under the assumption that the data ($S_1, S_2, \ldots, S_n$) are conditionally independent.

$$P(H_1 \mid S_1, S_2, \ldots, S_n) = \frac{P(S_1 \mid H_1) \times P(S_2 \mid H_1) \times \cdots \times P(S_n \mid H_1) \times P(H_1)}{P(S_1, S_2, \ldots, S_n)} \tag{1}$$

Similarly, one can decompose the probability of |H.sub.2~ being true:

$$P(H_2 \mid S_1, S_2, \ldots, S_n) = \frac{P(S_1 \mid H_2) \times P(S_2 \mid H_2) \times \cdots \times P(S_n \mid H_2) \times P(H_2)}{P(S_1, S_2, \ldots, S_n)} \tag{2}$$

Dividing Equation 1 by Equation 2, one obtains the ratio form of Bayes' theorem:

$$\frac{P(H_1 \mid S_1, S_2, \ldots, S_n)}{P(H_2 \mid S_1, S_2, \ldots, S_n)} = \frac{P(S_1 \mid H_1) \times P(S_2 \mid H_1) \times \cdots \times P(S_n \mid H_1)}{P(S_1 \mid H_2) \times P(S_2 \mid H_2) \times \cdots \times P(S_n \mid H_2)} \times \frac{P(H_1)}{P(H_2)} \tag{3}$$

The left-hand side of Equation 3 is termed the "posterior odds" of hypothesis $H_1$ to $H_2$ and is usually denoted $\Omega(S_1, S_2, \ldots, S_n)$. The right-hand side of Equation 3 has two terms. The first is the product of n "likelihood ratios" $P(S_i \mid H_1)/P(S_i \mid H_2)$ indicating the relative likelihood of each datum $S_i$ being associated with $H_1$ rather than with $H_2$. The second term is the "prior odds," $P(H_1)/P(H_2)$. Often, under the assumption of a priori absence of information regarding what the prior odds are, one sets these odds to 1. The posterior odds thus become:

$$\Omega(S_1, S_2, \ldots, S_n) = \frac{P(S_1 \mid H_1)}{P(S_1 \mid H_2)} \times \frac{P(S_2 \mid H_1)}{P(S_2 \mid H_2)} \times \cdots \times \frac{P(S_n \mid H_1)}{P(S_n \mid H_2)} \tag{4}$$

Finally, since $P(H_2 \mid S_1, S_2, \ldots, S_n) = 1 - P(H_1 \mid S_1, S_2, \ldots, S_n)$, the quality score $Q(S_1, S_2, \ldots, S_n)$ can be obtained from the posterior odds ratio as follows:

$$Q(S_1, S_2, \ldots, S_n) = \frac{\Omega(S_1, S_2, \ldots, S_n)}{1 + \Omega(S_1, S_2, \ldots, S_n)} \tag{5}$$

Thus, in order to construct such a model, one needs to estimate each likelihood ratio for each possible piece of information $S_i$.
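The calculation in Equations 3 through 5 can be sketched in a few lines of Python. This is only a minimal illustration under the conditional-independence assumption stated above: the function names and the three likelihood ratios are hypothetical and are not taken from the index itself.

```python
from math import prod


def posterior_odds(likelihood_ratios, prior_odds=1.0):
    """Equations 3 and 4: prior odds times the product of the likelihood ratios."""
    return prior_odds * prod(likelihood_ratios)


def quality_score(likelihood_ratios, prior_odds=1.0):
    """Equation 5: convert the posterior odds into a probability of acceptable care."""
    odds = posterior_odds(likelihood_ratios, prior_odds)
    return odds / (1.0 + odds)


# Hypothetical likelihood ratios P(S_i | H1) / P(S_i | H2) for three indicators.
example_ratios = [2.0, 0.5, 3.0]
score = quality_score(example_ratios)  # posterior odds 3.0, score 0.75
```

With prior odds of 1 and no indicators at all, the sketch returns .5, which reflects the absence of information either way.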

Once developed, this Bayesian model is not sensitive to small amounts of missing data because it does not require reweightings of each of the other components. Instead, the model simply multiplies the data that are available, and comparable judgments about quality can be made based on more or less information. This holds true, of course, because of the assumption of conditional independence. If this assumption did not hold, then the likelihood ratios would be dependent on the other components.

The validity of the conditional-independence assumption has to be checked during the development tasks to ensure that the chosen model form is adequate for the problem at hand. The assumption of conditional independence was checked during the estimation tasks of the likelihood ratios, when the group of experts was asked to examine if the likelihood ratio estimates for one particular datum $S_i$ would change when varying the other components of the model. The estimates did not change, and we assumed that conditional independence was present. The model will suffer, however, if many of the data are missing (just as a judge may be reluctant to reach a decision unless the amount of data available exceeds a certain threshold). Similarly, if an attribute is irrelevant for a particular type of patient, it is not used in the Bayesian calculation and reweighting is not needed. The Bayesian model does suffer if the relative weight of an attribute changes but does not become irrelevant for different types of patients. This is not a problem for our quality index, however, because the relative importance of attributes was perceived by the experts to remain the same for different types of patients.
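The treatment of missing or irrelevant attributes described above might look as follows in code. This is a sketch under the conditional-independence assumption; the attribute names, levels, and ratios are invented for illustration, and returning the count of attributes used is our own addition, included so that a reviewer can see when a score rests on very little data.

```python
from math import prod


def quality_score_with_gaps(observed_levels, ratio_table, prior_odds=1.0):
    """Multiply likelihood ratios only for attributes that were observed.

    observed_levels: dict mapping attribute name -> recorded level, or None
                     when the chart is silent or the attribute is irrelevant.
    ratio_table:     dict mapping (attribute, level) -> likelihood ratio
                     P(level | H1) / P(level | H2).
    Returns the quality score and the number of attributes actually used.
    """
    used = [
        ratio_table[(attribute, level)]
        for attribute, level in observed_levels.items()
        if level is not None  # skipped attributes need no reweighting
    ]
    odds = prior_odds * prod(used)
    return odds / (1.0 + odds), len(used)


# Hypothetical attributes, levels, and ratios, for illustration only.
ratio_table = {
    ("history", "adequate"): 2.0,
    ("history", "absent"): 0.25,
    ("lab_tests", "appropriate"): 1.5,
    ("lab_tests", "none"): 0.5,
}
chart = {"history": "adequate", "lab_tests": None}  # laboratory data missing
score, n_used = quality_score_with_gaps(chart, ratio_table)
```

A missing attribute simply contributes no factor to the product, which is equivalent to a likelihood ratio of 1.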

Subjective Estimation

Bayesian models are based on the judgments of experts. Subjective judgment is used to estimate the prior odds and likelihood ratios. Because of the primacy of subjective judgments in this model-building strategy, we now briefly discuss the evidence of people's ability to estimate these kinds of parameters. Several authors have reviewed the findings in the literature on subjective estimation (Hogarth 1980; Huber 1974; Peterson and Beach 1967; Slovic and Lichtenstein 1971; von Winterfeldt and Edwards 1986). The essence of these findings is that although people can estimate subjective probabilities quite well (Lichtenstein and Fischhoff 1977), they tend to be overconfident in making their estimations and tend to believe they know more than they actually do, leading to systematic biases in the estimates (Lichtenstein, Fischhoff, and Phillips 1978). There are at least two desirable properties of subjective probability assessments (von Winterfeldt and Edwards 1986). One is extremeness: one would like any assessment of probabilities to be close to 0 or 1, since it would give far more useful guidance about what to expect than an assessment near .5. The second property is calibration. Calibration cannot be evaluated for any single assessment. However, when one has a number of similar assessments, calibration can be evaluated by comparing how close the subjective probability estimates are to the observed relative frequency of occurrence of the events being assessed. Extremeness and calibration often pull in opposite directions, and it is difficult to measure the quality of subjective probability assessment.
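One common way to make the calibration comparison concrete is to group a set of assessments by stated probability and compare each group's average stated probability with the observed relative frequency. The sketch below assumes equal-width bins and uses invented data; it is not the procedure used in the studies cited above.

```python
from collections import defaultdict


def calibration_table(stated_probs, outcomes, n_bins=5):
    """Return (mean stated probability, observed frequency, count) per bin."""
    bins = defaultdict(list)
    for p, happened in zip(stated_probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the top bin
        bins[index].append((p, happened))
    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_stated = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(1 for _, happened in pairs if happened) / len(pairs)
        table.append((mean_stated, observed, len(pairs)))
    return table


# Hypothetical assessments: stated probabilities and whether the event occurred.
stated = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1]
occurred = [True, True, True, False, False, False]
table = calibration_table(stated, occurred)
```

In this invented example the assessor states .9 for events that occur three times out of four, a mild case of the overconfidence described above.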

Nonetheless, researchers have developed several ways to improve subjective probability estimates. Training is one example. In a study to examine the need for and value of learning, Lichtenstein, Fischhoff, and Phillips (1978) found that for 40 percent of their subjects with no training at all, one could not reject the hypothesis that the assessor was perfectly calibrated. (They also found that training did not help or hurt those subjects.) After one training session with the remaining subjects, the number of "perfectly calibrated" assessors increased to 84 percent. Ten additional training sessions did not improve calibration beyond the first training session. The point of this and other calibration and training research (Murphy and Winkler 1974) is that assessors can be categorized in three groups: those who do not need training, those who can be "perfectly calibrated" with minimal training, and those few who cannot become "perfectly calibrated" with the training provided.

A second way to improve subjective estimates is through group estimation. Gustafson et al. (1973) compared the accuracy of 288 untrained subjects using one individual and three group estimation processes. The three group processes were talk-estimate (approximating an interacting group), estimate-feedback-estimate (an approximation of a Delphi process with no face-to-face interaction), and estimate-talk-estimate (a variant of the nominal group process) (Delbecq, Van de Ven, and Gustafson 1975). The estimate-talk-estimate process yielded estimates at least 30 percent closer to the empirically based standard than estimates provided by the individual process and by the other two group processes. The average difference between the actual and the estimated probability was .06 for the estimate-talk-estimate process.

A third way to improve subjective estimation is to use experts. Several studies have shown that people who know more about a subject give better-calibrated estimates than do less knowledgeable people (Adams and Adams 1961; Delbecq, Van de Ven, and Phillips 1975). Lichtenstein and Fischhoff (1977), for instance, employed over 300 subjects in a series of experiments to test the effect of expertise on simple and difficult tasks of probability estimation. The results clearly indicate that expertise improves performance.


We convened two expert panels, each comprising five physicians, to participate in the development of the quality index. One panel constructed the index and the other helped validate it.

Panel Selection

To select physicians for participation in this project, we asked the administrator of the emergency medical services office of each of five states--Connecticut, Maine, Massachusetts, New Hampshire, and Rhode Island--to nominate two physicians who had achieved distinction as experts in treating emergency psychiatric patients in the ED. We assigned the ten nominees to one of the two research panels. The development panel consisted of four psychiatrists and one nonpsychiatrist, all of whom had extensive experience treating acute psychiatric problems in the ED and two of whom were nationally recognized authorities on the subject. The validation panel consisted of three psychiatrists and two nonpsychiatrists, who also had extensive experience in the ED.

Panel Meetings and Tasks

The development of the Bayesian index took place in several discrete steps. First, a facilitator interviewed by telephone the members of the development panel to elicit a list of potential indicators, from which the researchers would develop a straw model for preliminary discussion when the panelists met in person. Next, a few of the researchers met with the development panel to present the straw model and ask the physician experts to identify the types of information likely to be found in ED charts that would allow them to judge the quality of care a psychiatric patient had received. These meetings with the development panel took place in two half-day sessions: the first, an afternoon session to identify these variables, specify levels for each, and define both the variables and their levels in terms of the data likely to be found in ED charts; and the second, a session the next morning to quantify the variables and their levels. Figure 1 is a display of 26 variables identified, classified into six categories, by the adequacy (or appropriateness) of:

1. The ED staff's involvement in admitting the patient and initiating treatment

2. The patient history obtained

3. The physical and mental status examinations

4. The laboratory tests

5. The diagnosis

6. The treatment itself and the disposition decision.

These 26 variables, listed in the Appendix with their levels and predicted likelihood ratios, are symptom-specific rather than diagnosis-specific so that they will fit with the unique characteristics of the ED. An example of calculating the quality score using equations 4 and 5 is shown in Table 1. The first column presents the attributes of the model. The second lists the specific level on each attribute for a sample patient. The third column lists the likelihood ratios associated with the attributes in this specific example. For instance, the first attribute is ED staff awareness of support. In this specific case no reference was made to the patient's social support. This was considered to be a bad sign in terms of quality--bad enough, in fact, that the expert panel assigned a likelihood ratio of 3.32 in favor of the "poor quality" hypothesis. Multiplying the likelihood ratios for each attribute yields a posterior odds of .2, a number that could be interpreted as a 17 percent posterior probability of good care.
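The arithmetic in this example can be checked directly. With prior odds of 1, posterior odds of .2 convert to a probability through Equation 5. Assuming that the ratio of 3.32 is stated as "poor quality" to "good quality," it enters the product in Equation 4 as its reciprocal:

$$Q = \frac{\Omega}{1 + \Omega} = \frac{0.2}{1.2} \approx 0.17, \qquad \frac{P(S_1 \mid H_1)}{P(S_1 \mid H_2)} = \frac{1}{3.32} \approx 0.30$$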

The rationale for including these variables in the Bayesian quality index was based on the expertise of the panel convened and on several basic selection criteria that both the researchers and the panelists agreed on. More specifically, the first panel session began with a general discussion of usually accepted standards for good-quality care in the ED. The panelists then listed treatment variables and selected the group listed in Figure 1 by culling only those that would be applicable to all types of psychiatric emergencies. During the discussion, the panelists also matched specific types of psychiatric emergencies to the variables selected and, where necessary, identified additional variables to match specific types of cases. Two criteria limited the panelists' choice of variables:

1. The variable had to be clearly relevant to the quality of emergency psychiatric care.

2. Data to document the level of the variable had to be routinely available in typical ED records.


The latter requirement placed a significant but essential limitation on the panel's selection of variables. The panel had earlier concluded that if this study were to be useful in the short run, the attributes named as criteria to assess quality needed to be accessible in today's medical record.

In order to validate the index, a third task was asked of the development panel members: to rate on a scale of 0 to 100 the quality of care received by 102 actual emergency psychiatric patients, based on abstracts from ED records, with 0 indicating low quality and 100 indicating high quality. Hospital staff members, under supervision by a physician, had randomly selected the abstracts from ED records in each of the five participating hospitals. The abstracts included information on the variables selected for the Bayesian quality index as well as data on the type and severity of the case. Each panelist rated all 102 cases. No attempt was made to discuss and rescore those cases demonstrating wide variance. The validation panel, like the development panel, met with the researchers in an afternoon session and a session the next morning. The validation panel was asked to rate these same 102 cases. However, the process of rating the cases followed the estimate-talk-estimate procedure mentioned earlier (Gustafson, Fryback, Rose, et al. 1983; Luke, Strauss, and Gustafson 1977). Specifically, each panelist individually rated each case, and then the panelists came together to compare their ratings for each case and to discuss their differences. Finally, they individually reevaluated each case, bearing in mind the previous discussion. In addition, the validation panel was presented a set of 45 new full case records (as opposed to abstracts), and was asked to follow the same process to arrive at an average holistic rating of quality of care for each of these 45 cases.

The next section presents the results of our evaluation studies.


The quality index is an empirically grounded indicator that can lead to useful inferences about the relationships of the concept of quality of care to other attributes. However, the index has to pass several evaluation tests before it can be used for a variety of empirical studies. At least three important properties are necessary for such empirical measurements: validity, reliability, and discriminating ability. First, in a very general sense, the indicator is valid if it does what it is intended to do. Second, the indicator is reliable if it yields the same results on repeated trials. Finally, the indicator is able to discriminate if it is sensitive enough to capture observed differences among individual cases.

According to Carmines and Zeller (1979), three types of validity are basic: content validity, criterion-related validity, and construct validity. Content validity refers to the extent to which an empirical measure covers the domain of content of the theoretical concept. In order to ensure content validity in this study, we based the development of the model on the judgments of a group of experts in the content area. However, it is never possible to determine the specific extent to which an empirical measure should be considered content-valid. Criterion-related validity represents the degree of correspondence between the measure and a criterion variable. Construct validity focuses on the extent to which a measure performs in accordance with theoretical expectations. In the context of quality of care, however, validity is not easy to establish. It is not possible to attain a perfectly valid indicator; validity, instead, is a matter of degree. The same holds for reliability: the measurement of any phenomenon always contains a certain amount of chance error.

We conducted four separate studies to evaluate the performance of the Bayesian index. These studies attempt to establish the criterion-related validity, the construct validity, the reliability, and the discriminating ability of the index. We describe now the methods used for carrying out these studies; then we discuss the results of each of these evaluations in turn. A summary of these methods is shown in Table 2.


During the development phase of the quality index, we collected data to establish criterion-related validity: both the development panel and the validation panel rated the quality of the care provided in 102 randomly selected psychiatric emergencies documented in abstracts of ED charts. In addition, the validation panelists rated full case records on 45 randomly selected patients using the estimate-talk-estimate process to achieve consensus on the quality of care these patients received.

Then, two nurse abstractors collected 104 data items on 2,231 randomly selected emergency psychiatric patients treated in the EDs of two New England hospitals between March 1985 and March 1986. Among these 104 data items, 26 items are used to calculate the quality score. The remaining data items comprise the necessary data to examine the validity, reliability, and discriminant ability of the quality index as well as demographic, admission, severity, history, outcome, and discharge data for each patient. Of the patients in this data base, 218 were discharged home with specific follow-up instructions. We contacted the agencies to which they were referred to determine whether they complied with the recommendations. Finally, during the 12 months of data collection 70 cases were randomly selected, and the data collection was duplicated by a second abstractor for the purpose of interrater reliability testing.

We tested the ability of experts to achieve a consensus on their ratings of quality of care by calculating the Kendall coefficient of concordance among expert rankings (Siegel and Castellan 1988). Then, to test for criterion-related validity, we used Spearman rank-order correlation coefficients. Logistic regression was used to test for construct validity. Distributions of quality scores were used to show discriminant ability. Finally, Pearson product moment correlation analysis and Spearman rank-order correlation were used to test for test-retest reliability.


Criterion-Related Validity. There are no known criterion variables against which the quality index can be compared to establish criterion-related validity. Efforts to define concepts of quality of care often are frustrated by disagreements among health experts. Nonetheless, it has been observed that different experts will rate the performance of systems similarly while at the same time they are disagreeing on the dimensions and weights to use in their definition of performance (Barton 1986). Thus, our first concern in evaluating the criterion-related validity of the index was to determine whether physicians could in fact reach a "natural consensus" on the quality of emergency psychiatric care that could serve as a criterion. In other words, would physicians independently assign similar overall quality scores? If a natural consensus exists among experts' perceptions of quality, it is then sensible to use the average experts' perceptions as the criterion against which the index should be evaluated. Both the development panel and the validation panel rated the quality of the care provided in 102 psychiatric emergencies documented in abstracts of ED charts.

Table 3 shows the matrix of the Spearman rank-order correlation coefficients between the development and validation panelists' ratings of quality of care for these 102 cases. It also shows the Spearman rank-order correlation coefficients between the average ratings across the development panelists, the average ratings across the validation panelists, and the Bayesian quality score. The Spearman correlations between the individual panelists range from .326 to .739, thus showing a moderate to high level of agreement among the panelists. The overall agreement among k experts is given by the Kendall coefficient of concordance W. This coefficient of concordance is related to the average of the Spearman rank-order correlation coefficients $r_s$ between all possible pairs of experts as:

$$W = \frac{(k - 1)\,\mathrm{ave}(r_s) + 1}{k}$$

A high or significant value of W may be interpreted to mean that the k experts are applying the same overall standard in ranking the N observations under study. The Kendall coefficient of concordance among the development panelists is W = .633 (p < .001). The correlation coefficients between the validation panelists are slightly higher than the coefficients for the development panelists. This is because the validation panelists re-rated the cases after jointly reviewing and discussing each case. The validation panelists achieved better consensus on their ratings, as shown by the Kendall coefficient of concordance among them, which amounts to .741 (p < .001).
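The relationship between W and the average pairwise Spearman coefficient can be checked numerically. The following is a minimal sketch, assuming no tied ranks (the case in which the identity is exact); the three rating vectors are hypothetical and are not taken from the study data.

```python
from itertools import combinations

def ranks(values):
    """Rank values from 1 to N (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman r_s using the no-ties formula 1 - 6*sum(d^2) / (N(N^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall_w(ratings):
    """Kendall W for k raters (rows) rating the same N cases (columns)."""
    k, n = len(ratings), len(ratings[0])
    ranked = [ranks(row) for row in ratings]
    totals = [sum(r[j] for r in ranked) for j in range(n)]
    mean_total = k * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (k ** 2 * (n ** 3 - n))

# Hypothetical quality ratings from three experts on six cases.
ratings = [
    [0.9, 0.4, 0.7, 0.2, 0.6, 0.8],
    [0.8, 0.5, 0.6, 0.1, 0.7, 0.9],
    [0.7, 0.3, 0.9, 0.2, 0.5, 0.6],
]

k = len(ratings)
pairs = list(combinations(ratings, 2))
avg_rs = sum(spearman(a, b) for a, b in pairs) / len(pairs)

w_direct = kendall_w(ratings)
w_from_rs = ((k - 1) * avg_rs + 1) / k  # the identity stated in the text
```

Both routes give the same value of W for untied rankings; with ties, a correction term would be needed and the two would agree only approximately.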

These coefficients are higher than those we have found in similar efforts to test indexes of the severity of illnesses and injuries (Gustafson, Fryback, Rose, et al. 1983). It should be noted that for these indexes, the range of the dependent variable was comparable. Thus, the results of this analysis of natural consensus were more encouraging than we had expected. We had anticipated more variation in the experts' judgments, and thus lower correlation levels than we found. And the value-laden nature of the quality concept had led us to expect even lower correlations than those we had found for experts' judgment about severity. Because the correlations were so high, and because the ratings were based on the care given real rather than hypothetical patients, we believed it appropriate to proceed under the assumption that consensus could be used as the standard of performance for judging the validity of the index. Since the validation panelists re-rated the cases after a group discussion and reached better consensus (W = .741), we used the average of the validation panelists' ratings after discussion as the standard against which to judge the performance of the quality index.

Once we accepted natural consensus as the standard for evaluating the index, we evaluated the performance of the index against it by comparing the validation panelists' average ratings with the index ratings. The Spearman correlation coefficient between the Bayesian model ratings and the validation panelists' average ratings was .663 (p < .001). This result suggests that the index performed about as well as the indexes used to evaluate severity (Gustafson, Fryback, Rose, et al. 1983). Moreover, a coefficient of .663 is within the range of the Spearman rank-order correlation coefficients between the individual development panelists' ratings and the average validation panelists' ratings (these coefficients ranged from .614 to .725, as shown in Table 3). Thus, according to the Spearman rank-order correlation coefficient, the quality index performs as well as an independent expert.

Nonetheless, we viewed this result between the index and the validation panel ratings as an important but only preliminary indicator of performance, because the panelists used only abstracted data instead of the full case records to establish their scores. The next test allowed validation panelists to read full case records on 45 patients and to use the estimate-talk-estimate process to achieve consensus on the quality of care they received. We then compared the Bayesian scores with those validation panel ratings. The Spearman rank-order correlation coefficient between the Bayesian model and that average was .689 (p < .001) for the 45 case records, and leads us to believe that the index has criterion validity.

Construct Validity. There are no agreed-upon theoretically derived relationships of quality with other variables that one could use to establish construct validity of the index. However, if the quality index does in fact measure quality, and if emergency care does any good, one would expect that quality scores would correlate with patient outcomes. We compared quality scores with one indicator of patient outcomes: patient compliance with the ED's follow-up instructions. In the overall data base of 2,231 cases, 218 patients, discharged home, had the option of complying with specific follow-up instructions; the rest either had nonspecific instructions or were admitted to an inpatient facility and thus had no option to comply. Patient compliance was determined by follow-up contact with the agency to which the patient had been referred, to determine if he or she had kept the appointment. Table 4 represents an analysis of the relationship between the quality scores and the probability of compliance for substance abuse patients versus other patients.

The development panel of physicians was asked to sort the 218 patients into two groups: one group for those receiving high-quality care and one for those not receiving it. An examination of the Bayesian quality scores for each case suggested .80 as an appropriate cutoff point for distinguishing between high-quality care and care of lesser quality. This distinction was used in the analysis presented in Table 4. High-quality care shows a strong association with compliance for substance abuse patients. Seventy-four percent of the substance abuse patients whose care received quality scores over .80 complied with the follow-up instruction they received, whereas only 47 percent of those whose care received lesser quality scores complied. The trend is similar, but not as strong, for non-substance abuse patients (88 percent versus 82 percent). However, it should be noted that the sample of 218 patients is biased toward substance abusers and away from psychotic patients. Thus, these results are going in the direction we expected but are not definitive proof of the construct validity of the index.

Table 4: Relationships between Patients' Compliance with the ED's Follow-Up Instructions and the Bayesian Quality Score, for Substance Abuse and Non-Substance Abuse Patients (N = 218)

                     Percent of Patients Complying when Quality Score Is:
Substance Abusers    Less than .80        More than .80
Yes                  47% (15/32)          74% (37/50)
No                   82% (36/44)          88% (81/92)
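
The percentages in Table 4 follow directly from the cell counts. The sketch below recomputes them and, as an illustration only, adds an odds ratio for each patient group; the odds ratio is not reported in the article and is included here simply as one way to compare the strength of the two associations.

```python
# (complied, total) counts transcribed from Table 4.
table4 = {
    ("substance abuse", "below .80"): (15, 32),
    ("substance abuse", "above .80"): (37, 50),
    ("other", "below .80"): (36, 44),
    ("other", "above .80"): (81, 92),
}

def percent_complying(complied, total):
    """Compliance rate as a whole-number percentage."""
    return round(100 * complied / total)

def odds_ratio(high, low):
    """Odds of compliance with high-quality care relative to lesser-quality care.

    Each argument is a (complied, total) pair. This statistic is an
    illustrative addition, not a figure from the article.
    """
    high_odds = high[0] / (high[1] - high[0])
    low_odds = low[0] / (low[1] - low[0])
    return high_odds / low_odds

percentages = {cell: percent_complying(*counts) for cell, counts in table4.items()}
total_patients = sum(total for _, total in table4.values())

or_substance = odds_ratio(table4[("substance abuse", "above .80")],
                          table4[("substance abuse", "below .80")])
or_other = odds_ratio(table4[("other", "above .80")],
                      table4[("other", "below .80")])
```

The four cell totals sum to the 218 patients in the follow-up sample, and the larger odds ratio for substance abuse patients reflects the stronger association described in the text.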

Compliance also shows a highly significant partial correlation coefficient with Bayesian quality scores when analyzed by logistic regression. Table 5 shows the results of a logistic model containing the quality score and the patient's age, chronicity, and substance abuse measures. In this analysis, the quality score was the most significant predictor of compliance (p < .005). These results lead us to believe that the quality index is construct-valid.

Reliability. By "reliability" we mean the extent to which multiple abstractors, given the same case to abstract, would produce data that would lead to the same index scores. Seventy psychiatric emergency patients were abstracted simultaneously and independently by two nurse abstractors. In comparing the test abstracts on the basis of their agreement on each of the 104 items comprising the study instrument (including the 26 items making up the quality index), we found that for each item taken independently at least 90 percent of the cases were identical. The critical question, however, was not the level of agreement among each of the abstract variables, but the implications of their joint reliability for the quality score. Hence, we calculated quality scores for each test case. In this situation, since the quality scores were calculated in the same way for both nurse abstractors, we were interested in the extent of linear association between the quality scores. Hence, as a measure of reliability we used the Pearson product-moment correlation coefficient between the quality scores generated by each of the two nurses. This correlation coefficient was .83. In order to correct for a potential shift in each rater's assessments, the intraclass correlation coefficient was also calculated and amounted to .74. As still another indication, the Spearman correlation coefficient was .774. These results suggest that the information needed to use the quality index can be reliably obtained by trained nurse abstractors.
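
The difference between the Pearson and intraclass coefficients can be illustrated with a small sketch. It assumes a one-way random-effects intraclass correlation, ICC(1,1), for two raters; the article does not say which ICC form was used, and the scores below are hypothetical. A constant shift between the two abstractors leaves the Pearson coefficient at 1 but lowers the intraclass coefficient.

```python
from statistics import mean

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def icc_oneway(x, y):
    """One-way random-effects ICC(1,1) for two raters scoring the same cases.

    Assumed form: (MSB - MSW) / (MSB + (k - 1) * MSW) with k = 2 raters.
    """
    n, k = len(x), 2
    case_means = [(a + b) / 2 for a, b in zip(x, y)]
    grand = mean(case_means)
    msb = k * sum((m - grand) ** 2 for m in case_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(x, y, case_means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical quality scores: the second abstractor scores every case 0.10 higher.
nurse_a = [0.20, 0.40, 0.60, 0.80, 0.90]
nurse_b = [score + 0.10 for score in nurse_a]

r = pearson(nurse_a, nurse_b)       # linear association only
icc = icc_oneway(nurse_a, nurse_b)  # also penalizes the systematic shift
```

This is why an intraclass coefficient below the Pearson coefficient, as reported above, is the expected pattern when raters differ by a systematic offset.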

Table 5: Results of a Logistic Regression with Patient Compliance as the Dependent Variable and the Bayesian Quality Score, the Patient's Age, Chronicity, and Substance Abuse as the Independent Covariables (N = 218)

Variable               Measure    t      p      Coefficient
Dependent Variable
  Compliance(*)        Yes/No
Independent Variables
  Quality Score        0 to 1     7.2    .005   2.1
  Age                  Years      2.0    .05    -0.025
  Chronicity           Yes/No     1.4    .10    0.66
  Substance Abuse      Yes/No     2.7    .01    0.95

(*) The estimated proportion of patients complying was 79 percent; the actual proportion complying was 80 percent.
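
One way to read the coefficients in Table 5 is as changes in the log-odds of compliance. The sketch below converts them to odds ratios, assuming the "Coefficient" column reports standard logistic regression coefficients. Because the table gives no intercept, absolute compliance probabilities cannot be reconstructed from it, and the odds ratios shown are an interpretation rather than figures from the article.

```python
import math

# Coefficients transcribed from Table 5 (assumed to be on the log-odds scale).
coefficients = {
    "quality_score": 2.1,     # score ranges from 0 to 1
    "age_years": -0.025,      # per year of age
    "chronicity": 0.66,       # yes vs. no
    "substance_abuse": 0.95,  # yes vs. no
}

def odds_ratio(coefficient, delta=1.0):
    """Multiplicative change in the odds of compliance for a change of
    `delta` in the predictor, holding the other predictors fixed."""
    return math.exp(coefficient * delta)

# Effect of moving across the full 0-to-1 range of the quality score.
or_full_range = odds_ratio(coefficients["quality_score"])

# Effect of a more modest improvement, e.g. from .60 to .80.
or_point_two = odds_ratio(coefficients["quality_score"], delta=0.2)

# Effect of ten additional years of age.
or_ten_years = odds_ratio(coefficients["age_years"], delta=10)
```

Under this reading, a higher quality score raises the odds of compliance, while greater age lowers them slightly.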

The Ability of the Bayesian Index to Detect Variations in Patterns of ED Practices. For this study, nurse abstractors collected 104 data items on 2,231 randomly selected emergency psychiatric patients treated in the EDs. These data were used to test the ability of the index to detect differences in practice patterns within and between the two hospitals. The main concern here is to ensure that the index has enough discriminating power. For this study, we conducted two separate analyses bearing on the relationship of the quality index to variations in practices. The first analysis examined variations in the quality scores across patients. The second tested the ability of the index to detect differences in the quality of care among physicians.

Figure 2 presents the frequency distribution of the quality scores for the patients whose cases were abstracted. Figure 2 includes all patients. The shapes of the score distributions for alcohol abuse patients only and for all other patients are consistent with Figure 2. The index rated a substantial number of the cases as having received either very poor or very good care. But a sizable percentage also received scores somewhere between the two extremes, suggesting the ability to distinguish well between poor and good care for these categories of patients.

One important use of a quality index would be to identify opportunities to improve the practice patterns of providers. For the index to be useful in this application, it must demonstrate significant differences among providers. To test this capability, we arrayed the scores for 18 different physicians, each of whom had more than 30 cases in the data base. We then re-sorted the quality scores into those for substance abuse patients and those for other patients. The average quality scores of the 18 physicians ranged from .47 to .84 for all patients; .51 to .86 for substance abuse patients; and .44 to .78 for all other patients. Although there appear to be differences in relative quality, in general it appears that the physicians who gave higher-quality care to substance abuse patients also gave higher-quality care to the other patients, and vice versa. These analyses suggest that the index is able to discriminate among different levels of quality of care for different categories of patients and for different physicians as well. Moreover, the index is able to identify practice patterns differentiating physicians who typically score low from those who typically score high. We found, for instance, that physicians scoring above average were 25 percent more likely to contact the patient's therapist, 24 percent more likely to contact their social support, 25 percent more likely to evaluate dangerousness, 23 percent more likely to evaluate substance abuse, 17 percent more likely to hospitalize, and 17 percent more likely to send their patients to detoxification. These differences set the stage for feedback to clinicians aimed at promoting behavior change (Sateia, Gustafson, and Johnson 1989).
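
The physician-level comparison described above amounts to grouping case scores by physician, keeping only physicians with more than 30 cases, and averaging. A minimal sketch follows; the function name, record layout, and sample data are hypothetical and stand in for the study's abstracted records.

```python
from collections import defaultdict
from statistics import mean

def physician_averages(cases, min_cases=30):
    """Average quality score per physician.

    `cases` is an iterable of (physician_id, quality_score) pairs.
    Only physicians with more than `min_cases` cases are kept,
    mirroring the inclusion rule described in the text.
    """
    scores_by_physician = defaultdict(list)
    for physician_id, score in cases:
        scores_by_physician[physician_id].append(score)
    return {
        physician_id: mean(scores)
        for physician_id, scores in scores_by_physician.items()
        if len(scores) > min_cases
    }

# Hypothetical records: physicians A and B have 31 cases each, C has only 10.
cases = (
    [("A", 0.5)] * 31
    + [("B", 0.9)] * 31
    + [("C", 0.7)] * 10
)

averages = physician_averages(cases)
score_range = (min(averages.values()), max(averages.values()))
```

Running the same function separately on substance abuse cases and on all other cases would give the per-category ranges of the kind reported above.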


The results of this study suggest that the Bayesian quality index is a valid, reliable, and potentially useful measure of emergency psychiatric care. It is based on 26 variables routinely available in typical ED records. The index predicts physician judgments of quality, is reliable, exhibits sufficient variation in scores, and is strongly associated with patient compliance. Hence, it appears that the quality index not only performs well in laboratory and field tests but also has characteristics that will make it a useful tool in improving emergency psychiatric care. It thus provides a solution to the pressing need to evaluate the effectiveness of ED care for psychiatric patients. However, we should emphasize that the index applies only to patients already recognized as having psychiatric problems. It does not address the important triage problem of identifying depression and other psychiatric problems.

This was the first attempt we know of to apply a Bayesian model to the task of measuring the quality of health care. The results suggest that the model holds promise as a technique not only for measuring and improving quality but also for measuring the severity of patients' illnesses and injuries, as well as improving the patterns of physician practices.



The authors wish to acknowledge the important contribution of these people in the conduct of this research: Jerry Rose, M.B.A.; Jacek Franasczek, M.D.; Douglas Jacobs, M.D.; Hertzier Knox, M.D.; Diane Kowal, B.S.N.; Anthony Krembs, M.D.; Debbie Mitchell, M.S.N.; Andrew Slaby, M.D.; and Stephen Soreff, M.D.


Adams, R., and P. Adams. "Realism of Confidence Judgment." Psychological Review 68, no. 1 (1961): 33-45.

Barton, G. M. "Handbook of Emergency Psychiatry for Clinical Administrators." Emergency Health Services Review 3 (1986): 2-3.

Bassuk, E., and S. Berson. "Psychiatric Emergencies: An Overview." American Journal of Psychiatry 137, no. 1 (1980): 1-11.

Bassuk, E., L. Winter, and R. Apsler. "Cross-Cultural Comparison of British and American Psychiatric Emergencies." American Journal of Psychiatry 140, no. 2 (1983): 183.

Berwick, D. M. "Continuous Improvement as an Ideal in Health Care." New England Journal of Medicine 320, no. 1 (1989): 53-56.

Carmines, E. G., and R. A. Zeller. Reliability and Validity Assessment. Beverly Hills, CA: Sage Publications, 1979.

Delbecq, A., A. Van de Ven, and D. H. Gustafson. Group Techniques for Program Planners. Chicago: Scott, Foresman and Co., 1975.

Donabedian, A. Explorations in Quality Assessment and Monitoring. Vol. I, The Definitions of Quality and Approaches to Its Assessment. Ann Arbor, MI: Health Administration Press, 1980.

Georgopoulos, B. S. Organizational Structure, Problem-Solving, and Effectiveness. San Francisco: Jossey-Bass, 1986.

Gustafson, D. H., R. U. Shukla, A. Delbecq, and G. W. Walster. "A Comparative Study of Differences in Subjective Likelihood Estimates Made by Individuals, Interacting Groups, Delphi Groups, and Nominal Groups." Organizational Behavior and Human Performance 9, no. 2 (1973): 280-91.

Gustafson, D. H., D. G. Fryback, F. J. H. Rose, C. T. Prokop, D. E. Detmer, F. J. C. Rossmeissl, C. M. Taylor, F. Alemi, and A. J. Carnazzo. "An Evaluation of Multiple Trauma Severity Indices Created by Different Index Development Strategies." Medical Care 21, no. 7 (1983): 674-91.

Gustafson, D. H., F. Sainfort, R. VanKonigsveld, and D. R. Zimmerman. "The Quality Assessment Index (QAI) for Measuring Nursing Home Quality." Health Services Research 25, no. 1 (April 1990): 97-127.

Gustafson, D. H. "Expanding on the Role of Patient as Consumer." Quality Review Bulletin 17, no. 10 (October 1991).

-------. "Lessons Learned from an Early Attempt to Implement CQI Principles in a Regulatory System." Quality Review Bulletin 18, no. 10 (October 1992): 333-39.

Hogarth, R. Judgment and Choice. Chichester, England: John Wiley & Sons, Inc., 1980.

Huber, S. "Methods for Quantifying Subjective Probabilities and Multi-Attribute Utilities." Decision Sciences 5, no. 4 (1974): 430-558.

Jacobs, D. "Evaluation and Care of Suicidal Behavior in Emergency Settings." International Journal of Psychiatry in Medicine 12, no. 4 (1982-1983): 295-310.

Kane, R. L., R. Bell, S. Hosek, S. Riegler, and R. H. Kane. Outcome-Based Reimbursement of Nursing Home Care. Publication no. R-1517-NCHSR. Santa Monica, CA: RAND Corporation, 1984.

Keeney, R. L., and H. Raiffa. Decisions with Multiple Objectives: Preferences and Value Tradeoffs. New York: John Wiley & Sons, Inc., 1976.

Kincaid, W. H. "The Future of Quality Assurance in Patient Care." In Proceedings of the Fifth Annual Symposium on Computer Applications in Medical Care. Washington, DC: November 1981.

Knesper, D. J. "A Study of Referral Failures for Potentially Suicidal Patients: A Method of Medical Care Evaluation." Hospital and Community Psychiatry 33, no. 1 (1982): 49-52.

Lichtenstein, S., and B. Fischhoff. "Do Those Who Know More Also Know More about How Much They Know?" Organizational Behavior and Human Performance 20, no. 2 (1977): 159-83.

Lichtenstein, S., B. Fischhoff, and L. Phillips. "Calibration of Probabilities: The State of the Art." In Decision Making and Change in Human Affairs. Edited by H. Jungermann and G. de Zeeuw. Amsterdam: D. Reidel, 1978.

Luke, R., F. Stauss, and D. H. Gustafson. "Comparison of Five Methods of Estimating Subjective Probability Distribution." Organizational Behavior and Human Performance 19, no. 2 (1977): 162-79.

Murphy, A., and R. Winkler. "Probability Forecast: A Survey of National Weather Service Forecasters." Bulletin of the American Meteorological Society 55, no. 12 (1974): 1449-53.

Nurius, P. S. "Emergency Psychiatric Services: A Study of Changing Utilization Patterns and Issues." International Journal of Psychiatry in Medicine 13, no. 3 (1983-1984): 239-54.

Peterson, C. R., and L. R. Beach. "Man as an Intuitive Statistician." Psychological Bulletin 68, no. 1 (1967): 29-46.

Rund, D. A. "Symposium Overview of Psychiatric Emergencies: Challenge and Opportunities for Emergency Health Services." Emergency Health Services Review 3, no. 1 (1986): 5-9.

Sateia, M. J., D. H. Gustafson, and S. W. Johnson. "Quality Assurance for Psychiatric Emergencies: An Analysis of Assessment and Feedback Methodologies." In Improving Quality in Psychiatric Clinics of North America. Philadelphia, PA: W. B. Saunders Co., 1990.

Schulberg, H. C., and B. J. Burns. "The Nature and Effectiveness of General Hospital Psychiatric Services." General Hospital Psychiatry 7, no. 2 (1985): 249-57.

Siegel, S., and N. J. Castellan. Nonparametric Statistics for the Behavioral Sciences. 2d ed. New York: McGraw-Hill, 1988.

Slovic, P., and S. Lichtenstein. "Comparison of Bayesian and Regression Approaches to the Study of Information Processing in Judgment." Organizational Behavior and Human Performance 6, no. 5 (1971): 649-744.

Soreff, S. M. Management of the Psychiatric Emergency. New York: John Wiley & Sons, Inc., 1981.

von Winterfeldt, D., and W. Edwards. Decision Analysis and Behavioral Research. Cambridge, MA: Cambridge University Press, 1986.

Willemain, T. R. "Survey Based Indices for Nursing Home Quality Incentive Reimbursement." Health Care Financing Review 4 (Spring 1983): 3.

Figure 1: Classification of the Variables Underlying the Bayesian Quality Index

1 Social support identified
2 Social support contacted
3 Therapist contact
4 MD involvement
5 Psychiatrist involvement
6 Waiting time
7 Restraints

8 Psychiatric
9 Medication
10 Substance abuse
11 Allergies

12 Vital signs
13 Mental status exam (affect, hallucinations/delusions, suicidal ideation, homicidal ideation, basic intelligence, judgment, insight, sensorium)
14 Dangerousness evaluation

15 Lithium level (if on lithium)
16 ECG (if patient overdosed on antidepressants)
17 ECG (if patient is started on antidepressants)
18 Toxicology screen

19 Compatibility of diagnosis with the history

20 Hospitalization if dangerous
21 Medication supply not lethal
22 Notification if patient left against medical advice
23 Report if child abuse
24 Detoxication if substance abuse
25 Follow-up arrangements if given antidepressants or antipsychotic medication
26 Follow-up arrangements appropriate
COPYRIGHT 1993 Health Research and Educational Trust
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 1993 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Gustafson, David H.; Sainfort, Francois; Johnson, Sandra W.; Sateia, Michael
Publication:Health Services Research
Date:Jun 1, 1993