Severity of illness: red herring or horse of different color?
In this final article in a three-part series, the author discusses the various issues that surround the use of severity-of-illness measures. Particular attention is paid to the use of such measures to assess the performance of providers in the area of quality.
Since the dawn of medicine, patients have been interested in the outcomes of their care. Recently, consumers have become interested in using outcome data to choose providers on the basis of their performance. Yet, even today, the most elementary information on outcomes is lacking. Attention has focused on mortality, a wholly inadequate measure of health outcome. Neither health status nor mortality data have been available routinely from medical practices.
Severity and Effectiveness
Neither severity nor treatment difficulty, as defined in part 2 of this article, can tell us anything about the effectiveness of care. Severity is a prediction of untreated outcome. Subtracting a valid severity score from a comparable observed outcome would, of course, measure effectiveness provided one could attribute the outcome to the intervention. Subtracting pre- and posttreatment severity scores would not, because the posttreatment severity score is a prediction of future untreated outcome. Treatment difficulty is a prediction of treated outcome. Thus, without knowledge of natural history (or a valid severity score), we cannot determine effectiveness. Moreover, if one knew the natural history one would want to judge effectiveness from measured outcome data, not from a prediction (treatment difficulty score). Subtracting pre- and posttreatment difficulty scores would, of course, not measure effectiveness. In fact, it would only be useful in measuring the variance between predicted and actual outcomes.
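The arithmetic of these distinctions can be sketched as follows. This is a hypothetical illustration only: the function names and all numbers are invented, and outcomes are expressed here as probabilities of survival at a fixed horizon.

```python
# Hypothetical sketch of the distinctions above. Outcomes are
# expressed as probabilities of survival at a fixed horizon;
# all numbers are invented for illustration.

def effectiveness(observed_outcome, severity_score):
    """Observed (treated) outcome minus the severity score,
    i.e., the predicted untreated outcome."""
    return observed_outcome - severity_score

def prediction_error(observed_outcome, difficulty_score):
    """Observed outcome minus the treatment difficulty score,
    i.e., the predicted treated outcome. This measures only how
    far the prediction missed, not effectiveness."""
    return observed_outcome - difficulty_score

severity = 0.30    # predicted survival without treatment
difficulty = 0.80  # predicted survival with treatment
observed = 0.75    # observed survival among treated cases

gain = effectiveness(observed, severity)       # apparent treatment effect
miss = prediction_error(observed, difficulty)  # variance between prediction and result
```

The first subtraction yields an apparent treatment effect (provided the outcome can be attributed to the intervention); the second yields only the gap between prediction and result.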
Even a valid severity measurement system would permit assessment only of the effectiveness of medical care. Assessing the effectiveness of specific interventions for specific health problems would, of course, depend on the ability to group cases into homogeneous classes with respect to both intervention and problem. While such data could be collected, this may be an insurmountable task, as figure 1, page 6, illustrates. Because there is generally great variation in both treatments and health problems, practice data can only indicate where research might be useful. Practice data cannot yield definitive effectiveness information, and their uncritical use may misinform therapeutic choices because of:
* Inappropriate outcome measures. Even death within a year after discharge is likely to be an appropriate outcome in only a limited number of conditions, because the value of most treatments can only be reflected by measuring health status.
* Insufficient elapsed time. Data collected at discharge are much too proximal to therapy for an assessment of effectiveness.
* Heterogeneity of intervention. Even though treatments for a given problem can be segregated--for example, into medical and surgical--each treatment represents a variety of interventions. There is the possibility that the ineffective interventions will mask the effective ones. Even if the interventions are effective overall, there is a low probability of describing which interventions contribute how much to the observed effectiveness.
* Quality of implementation. Given a homogeneous intervention, there is likely to be heterogeneity in the quality of its implementation, even with the same provider.
* Treatment selection. The provider chooses the treatment on the basis of various and unknown factors. A further factor is unknown diagnostic accuracy or, at least, homogeneity.
* Patient selection. Patients choose providers, and the type of treatment providers offer may be a factor in this choice. Unlike in a clinical trial, patients are not assigned randomly to treatment and control groups, so there is nothing to eliminate the potential bias of patient factors that affect treatment outcomes but that are unknown to or cannot be measured by investigators.
* Lack of controls. At best, one treatment can be compared to another. Treatment cannot be compared to no treatment. If the effectiveness of any treatment rested solely on the placebo effect, so that all treatments appeared similar in their observed effectiveness, this would not be revealed.
Severity and Adjusting Outcomes
This discussion focuses on outcome (mortality) data collected by providers, and other data that are in or could be included in the patient's medical or administrative records, that could be used to adjust outcomes. The availability of such data and the integrity of their collection are the Achilles heel of such systems. Unless collection is audited routinely, the data cannot be relied upon. Not only could providers manipulate data, but their ability to provide accurate data depends on their competence, the very attribute the system is supposed to measure. Unless diagnoses were validated, comparisons among providers might be risky. For the near future, at least, validating diagnoses will likely mean examining individual patient records. If such information were available, a provider's cases could be divided into two groups: (1) those meeting criteria, for which mortality rates could be calculated, and (2) those not meeting diagnostic validation criteria, which is not the same as saying the patient did not have the problem. Some adjustment for patient factors would still have to be made, because different providers are likely to treat different percentages of patients with characteristics associated with mortality. Again, such factors are likely to have been drawn from the medical record, and the adequacy of medical records varies.

The use of severity scores for adjusting outcomes assumes their validity. Treatment difficulty scores are not useful for adjusting outcomes, because they are predictions of what one is measuring. A valid treatment difficulty score (were it to exist) could be compared to an observed score, at discharge for example, to assess provider performance. The treatment difficulty score represents a prediction of posttreatment outcome; the observed score represents actual outcome. The approach is another form of population-based quality assessment.
Perhaps a valid approach to mortality data could be developed that does not depend on diagnosis or on obtaining medical record data. However, given the complexity of the problem, it may take some time to develop. Moreover, a parallel data collection system would have to be developed, if one were not to rely on the medical record. Experience to date is not too encouraging. For example, the model used by Shortell and Hughes to examine inhospital mortality rates for selected diagnoses in states with and without stringent rate and certificate-of-need controls explained only 10 percent of the variance.(1) An excellent model, if one could be devised, would explain at least 90 percent of the variance.

A mortality adjustment system could be validated as follows. A small number of diagnoses could be selected for test purposes. The cases would be taken from a random sample of institutions. The providers' mortality experience would be calculated with and without adjustment and rank-ordered, from best (lowest mortality) to worst. Simultaneously, individual medical records would be reviewed independently by panels of expert clinicians to determine whether they met practice standards. Each clinician would follow a structured review protocol, and any disagreements among reviewers would be discussed and resolved. The proportion of each provider's cases meeting practice standards would be calculated, and providers would be rank-ordered accordingly. The panels' judgments would be considered definitive. If the adjustment method were valid, there would be a perfect correlation between adjusted mortality rates and levels of acceptable care, both for the sample as a whole and for each of its constituent diagnoses. A less acceptable, but nonetheless good, result would be a perfect correlation among providers' rank orders using the two measures.
A lesser correlation coefficient could be accepted, but the lower limit of acceptability would have to be set prior to the analysis to prevent the result from influencing the standard.

One consideration makes population-based approaches to assessing provider performance inherently flawed. To be truly valid, any system would have to adjust not only for technical factors that influence patient outcomes but also for patients' choices. If the patient's choice of therapy influenced the outcome, e.g., mortality, and varied from that which would produce the lowest mortality rate, resultant data would have to be adjusted to permit valid comparisons. For example, assume that there are two treatments for a hypothetical stage II of a health problem. One is simple and expedient, the other disfiguring, but the mortality rate for the first is twice that of the second. If one provider's patients freely choose the simple therapy more often and another's the disfiguring one because of the provider's advice, the first provider's performance (even if adjusted for technical factors) would appear worse than the second's, even though, by permitting the patient to choose more freely, the first provider could be considered to be delivering better quality care.
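The rank-order comparison proposed above can be sketched as a Spearman rank correlation. All provider data here are invented, and the simple formula assumes no tied values; a perfect result (rho = 1.0) would correspond to the ideal outcome described.

```python
# Sketch of the proposed validation: rank providers by adjusted
# mortality and by the proportion of cases meeting practice
# standards, then compute the Spearman rank correlation.
# All data are invented; the formula assumes no tied values.

def ranks(values):
    """Rank positions (1 = smallest value), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rho via the sum-of-squared-rank-differences formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

adjusted_mortality = [0.02, 0.05, 0.03, 0.08]  # lower is better
meets_standards = [0.95, 0.70, 0.90, 0.60]     # higher is better

# Negate the standards proportions so "best" sorts first on both
# measures; rho would be 1.0 if the two rank orders agreed exactly.
rho = spearman(adjusted_mortality, [-p for p in meets_standards])
```

The acceptance threshold for rho would, as noted above, have to be fixed before the analysis.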
Severity and Quality of Care
Severity measurement has been proposed as a means of assessing the quality of care by comparing admission, discharge, and mid-stay severity scores. This use of severity measures assumes that patients entering hospitals improve uniformly during the course of their stays. This situation is depicted by line A in figure 2, page 8. A patient whose discharge severity score was worse than his admission severity score might be the victim of inadequate care (line B). One whose mid-stay severity score was worse than his admission or discharge severity score might be the victim of iatrogenesis (line C). In reality, one might expect that some patients' severity scores would worsen before they improved, reflecting the natural course of the disease and its treatment (line D). Further, some patients' conditions might worsen, even with the best treatment (line E). The proper comparison would be between a provider's case experience and what would be expected, which might be taken as that observed in all practices (a statistical or normative standard). Without this knowledge, which we presently lack, one is reduced to trying to use severity scores as a quality of care screen, to target cases or providers for review.

Targeting cases. Here, severity measures are used as a screening device to identify cases that might represent compromised quality of care and require peer review. Such severity measures represent an alternative to generic screening, in which a uniform battery of questions serves the same purpose. Essentially, if a patient's discharge severity score is greater than his or her admission score, the case is flagged as a potential quality problem. In any quality of care screening system, one must be concerned with sensitivity (ability to detect quality problems) and specificity (ability to reject cases that are not quality problems). These concepts may be further quantified as positive and negative predictive values.
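The four screening statistics just named can be computed from a 2x2 table comparing the severity-score flag with the definitive peer-review finding. The counts below are entirely hypothetical.

```python
# The four screening statistics named above, computed from a
# hypothetical 2x2 table comparing the severity-score flag with
# the definitive peer-review finding (all counts invented).

def screen_stats(tp, fp, fn, tn):
    return {
        "sensitivity": tp / (tp + fn),  # true problems correctly flagged
        "specificity": tn / (tn + fp),  # non-problems correctly passed
        "ppv": tp / (tp + fp),          # flagged cases that are real problems
        "npv": tn / (tn + fn),          # passed cases that are truly acceptable
    }

# tp: flagged and confirmed; fp: flagged but acceptable;
# fn: missed problems; tn: passed and acceptable.
stats = screen_stats(tp=30, fp=70, fn=10, tn=890)
```

With these invented counts, a screen can be fairly sensitive yet have a low positive predictive value, which is exactly the concern when flagged cases all go to peer review.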
Severity measurement systems' performance in this regard is unknown. Theoretically, the performance of available systems is likely to be poor, for two reasons. First, it is assumed that all patients entering the hospital should improve. However, the condition of patients in some diagnostic categories might be expected to worsen, because either no effective treatment exists or those that do exist can only slow decline rather than arrest it or restore the patient to health. Second, the number of scale points used to describe severity is limited. Most patients entering hospitals are of low severity as measured by existing systems. These patients can only score the same or higher, a source of false negatives. Patients at the other extreme can only score the same or lower, a source of false positives. Only scores in the limited middle range can stay the same, improve, or deteriorate.

Targeting providers. Providers may be targeted for review in one of several ways. First, providers whose proportions of cases with discharge severity levels worse than admission levels are higher than their colleagues' could be targeted. This approach is a variation of the case-by-case use of admission and discharge severity levels. Alternatively, providers could be compared on the basis of differences in their average admission and discharge severity levels. This approach assumes the measures form a ratio scale, and there is little or no evidence for this. Either approach results in cases to be peer-reviewed, the only practical way that exists to judge quality of care. Thus, severity scores represent case screening devices of unknown sensitivity and specificity.
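The limited-scale problem can be sketched concretely: a change in true severity smaller than one scale band is invisible to a discharge-worse-than-admission flag. The scale mapping and numbers below are invented for illustration.

```python
# Sketch of the coarse-scale problem noted above: changes in true
# severity smaller than one scale band are invisible to the
# discharge-worse-than-admission flag. The mapping of continuous
# severity onto a 5-point scale and all numbers are invented.

SCALE_POINTS = 5

def to_scale(true_severity):
    """Map a continuous severity (0.0-1.0) onto a 1-5 point scale."""
    return min(SCALE_POINTS, 1 + int(true_severity * SCALE_POINTS))

def flagged(admission, discharge):
    """Flag rule from the text: discharge score worse than admission."""
    return to_scale(discharge) > to_scale(admission)

within_band = flagged(0.62, 0.78)  # real deterioration inside one band: missed
across_band = flagged(0.75, 0.91)  # same-sized deterioration across bands: caught
```

Two deteriorations of identical magnitude are thus treated differently depending only on where they fall relative to the scale's band boundaries.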
Severity Adjustments to Payment
Great interest has been expressed in the use of severity measures for payment purposes, a job for which they are least suited. Costs are incurred because of a provider's response to a patient's condition, not because of the condition itself. If severity is defined in terms of the predicted probability of death within a specified period from untreated disease, patients with the same severity score, but different conditions, may be treated appropriately at very different costs. For example, the prognosis for some infections is very poor without treatment, yet highly effective treatment costs little. The prognosis may be equally poor for heart failure, but treatment is expensive; a heart transplant is extremely expensive. A lesser provider response would certainly be less expensive, whether or not less effective. One could argue that the provider's response should be unconstrained by cost considerations. Even if this wishful state were to exist, the problem of paying for care would still remain. Studies of so-called severity measures' ability to predict care costs have found that they explain relatively little of the variance.(2) Sometimes providers may select inappropriate responses or fail to implement appropriate responses properly, affecting costs upward or downward.

What paymasters seem to be seeking is a resource intensity (not severity) measure. Treatment intensity may be defined as the amounts of different resources required to treat a patient. Treatment intensity scores are predicted resource requirements, not observed resources consumed. Further, treatment intensity is concerned with resource requirements and not cost, which depends on the price of each resource required. Optimal treatment includes what is necessary medically to achieve maximum effectiveness, consistent with patient preferences. Clearly, patients could be grouped by similarity of resource requirements (or expected cost).
In some cases, resource intensity and a valid severity measure would correlate highly. A useful resource intensity measure could be derived empirically, much in the way that DRGs were developed. The task, however, would be more complex, as subclassifications of patients within each medical diagnosis would be required and a price established for each one. This task is clearly feasible. Further, if the resultant system achieved only a one percent improvement in efficiency, we could spend at least $50 million on its development and still break even. While exposition of how such a system could be developed and tested is beyond the scope of this article, the essential attributes are clear. One would need to identify the patient subclassifications within medical diagnoses that significantly affect resource requirements. These resource requirements could then be priced, using a national scale or many local scales. Alternatively, existing care could be so divided and costs derived empirically. Providers would then receive the set payment (for each subclass of medical diagnosis) in the same way they now receive the DRG payment. The system would, of course, still have the same drawbacks: Claims would have to be audited, and prices would have to be fixed and reviewed with changes in technology or practice and in patients' preferences. Monitoring the quality of care would remain essential.

For any diagnostic classification, patients will vary in their treatment preferences. In some classifications, this variation may be small. In others, where a choice exists between heroic and conservative interventions, it may be considerable. Clearly, payment could be based on the average mix of patients' preferences, considered in the aggregate. However, the case-mix problem remains, albeit in modified form. Further, payers could influence providers (and patients) by paying only for the least expensive alternative.
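The payment mechanics described above can be sketched minimally: each medical diagnosis is split into resource-intensity subclasses, each carrying a fixed price, and the provider receives the set payment for the subclass, DRG-style. The diagnoses, subclasses, and prices below are all hypothetical.

```python
# Minimal sketch of the payment mechanics described above. Each
# medical diagnosis is split into resource-intensity subclasses,
# each carrying a fixed prospective price; the provider receives
# the set payment per discharge, DRG-style. Diagnoses, subclasses,
# and prices are all hypothetical.

PRICES = {
    ("pneumonia", "low_intensity"): 4_000,
    ("pneumonia", "high_intensity"): 11_000,
    ("heart_failure", "medical"): 9_000,
    ("heart_failure", "surgical"): 45_000,
}

def payment(diagnosis, subclass):
    """Fixed prospective payment for a diagnosis subclass."""
    return PRICES[(diagnosis, subclass)]

# A provider's revenue for a small set of discharges:
cases = [("pneumonia", "low_intensity"), ("heart_failure", "medical")]
total = sum(payment(d, s) for d, s in cases)
```

The drawbacks noted above attach directly to this table: the claimed subclass must be auditable, and each price must be revisited as technology, practice, and patient preferences change.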
Paying for the technically optimal treatment may not alleviate the problem entirely, because the provider would still have an incentive to offer cheaper treatment.
Severity of illness is a useful theoretical concept. Severity measurement systems predict health outcomes after a specified period from untreated disease, and hence quantify expected natural history. However, a validated severity measurement system may never be possible. Use of unvalidated systems means taking them on faith, because of the present, and likely enduring, lack of knowledge about natural history. Further, the idea that an accurate prognosis can be made without regard to what is wrong with the patient appears naive. Given the complexity of patients and their diseases, the notion that a 5-point or other limited scale is a sufficiently sensitive way to express severity also appears naive. Were valid severity measurement systems to exist (and none does now), they could be used to assess the effectiveness of treatment, but not to assess provider performance or to adjust payment for care, although, depending on their exact nature, they might be used to adjust outcomes of care. Such measures would be useful for examining variations in provider performance and health care costs. Some so-called severity measurement systems may in fact be treatment difficulty systems. Again, no such validated systems exist. However, their development is more feasible than that of severity measurement systems because of the possibility of observing treatment outcomes. Were valid treatment difficulty measurement systems to exist, they could be used to assess provider performance by comparing predicted to achieved outcomes. Where resources are limited and the primary ethical consideration is the expectation of benefit from the use of resources, such systems could also be used to ration care, although not to assess effectiveness or to adjust payments. The pursuit of severity measures to adjust payments is based on the false assumption that the cost of care is predicated solely on the patient's condition.
In fact, care costs result from the provider's response to the patient's condition and preferences, not the patient's condition itself. A resource intensity measurement system, a refinement of the present DRG approach, could be devised to alleviate many, if not all, of the case-mix problems confronting the prospective pricing scheme. However, a valid resource intensity measure would be more complex and difficult to devise and maintain. Further, the need to audit claims and periodically update payments for changes in resource input prices, technology and medical practice, and patients' preferences would remain. Patients' preferences are the Achilles heel of treatment difficulty measures to assess provider performance and of resource intensity measures to determine payments for care. Patients' preferences can properly be taken into account in assessing provider performance by case-by-case quality assessment using, for example, computerized screens and structured peer review. They can be taken into account fully in paying for care only by giving patients discretion over how to spend the money that others make available to them. Thus, in terms of assessing the quality of medical care, severity measures are a red herring, serving to divert attention from what really needs to be done. In terms of paying for care, they are a painted pony, a horse of a different (and wrong) color, since their use for this purpose is based on a false assumption. [Figure 2 Omitted]
References

1. Shortell, S., and Hughes, E. "The Effects of Regulation, Competition, and Ownership on Mortality Rates among Hospital Inpatients." New England Journal of Medicine 318(17):1100-7, April 28, 1988.

2. Jencks, S., and others. "Evaluating and Improving the Measurement of Hospital Case-Mix." Health Care Financing Review, pp. 1-11, Nov. 1984, Supplement.
Peter G. Goldschmidt, MD, DrPH, DMS, is Vice President for Research and Development, Quality Standards in Medicine, Inc., Bethesda, Md. Address inquiries to: Peter G. Goldschmidt, MD, DrPH, DMS, 5101 River Road, #1913, Bethesda, Md. 20816.
Date: Nov. 1, 1989