# Graphical interpretation of analytical data from comparison of a field method with a reference method by use of difference plots.

Westgard et al. [1-3] outlined the basic principles for method
comparison in a clear, easy to follow manual. They also introduced the
concept of allowable analytical error and gave an overview of published
performance criteria. They recommended that the estimated analytical
imprecision and bias be compared with these performance criteria in
method evaluation as well as in method comparison. Their approach made
use of a scatter-plot and calculations based on regression lines, but
with confidence limits and judgment of acceptability based on the
criteria for allowable analytical error.

These principles of comparing analytical performance with performance criteria, however, have not been universally accepted, and recent publications have criticized the misuse of correlation coefficients [4] and overinterpretation of regression lines in method comparison [5-7]. Bland and Altman [4] recommended the difference plot (or bias plot or residual plot) as an alternative approach for method comparison. On the abscissa they used the mean value of the methods to be compared, to avoid regression towards the mean, and on the ordinate they plotted the calculated difference between measurements by the two methods. They further estimated the mean and standard deviation of differences and displayed horizontal lines for the mean and for [+ or -]2 x the standard deviation. However, they missed the concept of a more objective criterion for acceptability. Recently, Hollis [5] has recommended difference plots as the only acceptable method for method comparison studies for publication in Annals of Clinical Biochemistry, but without specifying criteria for acceptability.

However, a few difference plots with evaluation of acceptability according to defined criteria have been published, e.g., in evaluation of estimated biological variation compared with analytical imprecision [8], and in external quality assessment of plasma proteins for the possibilities of sharing common reference intervals [9].

Maybe the scarcity of such publications is more a question of interpretation of the data by plotting than a strict choice between scatter-plot and difference plot, as discussed by Stockl [10] recently. Investigators seem to rely too much on regression lines and r-values, without doing the equally important interpretation of the data points of the plot. This is becoming more and more disadvantageous with the increasing number of Reference Methods available for comparison with field methods, because in these cases, it is not a question of finding some relationships, but simply of judging the field method to be acceptable or not.

NCCLS has recently published guidelines for method comparison and bias estimation by using patients' samples [11], where both scatter-plots and bias plots are advised. The document also recommends plotting of single determinations as mean values and stresses the need of visual inspection of data. Further, comparison with performance criteria is recommended, but these criteria are not specified and they are not used in the graphical interpretation. Recently, Houbouyan et al. [12] used ratio plots in their validation protocol of analytical hemostasis systems, where they used a preset, but arbitrarily chosen, acceptance limit of inaccuracy of 15%.

In the following, we will use the difference plot (or bias plot) in combination with simple statistics for the principal judgment of the identity or acceptability of a field method. The difference plot makes it easier to apply the concept; in principle, however, the same evaluations could be performed for a scatter-plot in relation to the line of identity (y = x).

The aim of this contribution is to pay attention to the hypothesis of identity and the concept of acceptable analytical quality in method comparison, especially when one of the methods is a Reference Method.

BASIC CONSIDERATIONS

The basis for comparing an analytical field method with an analytical Reference Method measuring the same quantity (analyte) is the hypothesis of identity within inherent imprecision or within preset analytical quality specifications. The field method, whether a kit or a self-produced analytical procedure, is applied for routine analyses in laboratory medicine. The method must be demonstrated to have adequate accuracy (trueness), precision, and specificity (lack of aberrant-sample bias [13]) for its intended use. The Reference Method, in contrast, is a "thoroughly investigated method, clearly and exactly describing the necessary conditions for procedures, for the measurement of one or more property values that has been shown to have accuracy and precision commensurate with its intended use and that can therefore be used to assess the accuracy of other methods for the same measurement, particularly in permitting the characterization of a reference material" [14].

Further characteristics of the methods are the costs, the complexity, the equipment, the time used for production of the results, and so forth--the field method generally being cheap and well suited for routine work and the Reference Method usually being applicable only in a few competent and specially equipped laboratories.

The two approaches to the comparison are:

1. Identity. The results from the field method do not deviate from the Reference Method results by more than the inherent imprecision of both methods.

2. Analytical quality specifications. The results from the field method do not deviate from the Reference Method results by more than the acceptance limits derived from the analytical goals.

In both cases the null hypothesis is that the measured differences for all samples are zero. In the first case, the acceptance limits are defined by the inherent analytical imprecisions, and in the latter case the acceptance limits are defined by the analytical quality specifications.

By assuming the ideal situation, the theoretical limits for testing the null hypothesis can be drawn in a difference plot, i.e., a mean difference [mean([delta])] equal to 0, and a standard deviation of differences [[sigma]([delta])] calculated from the imprecisions of the two methods. The ultimate hypothesis, then, is that the points are distributed within these bounds. If they fall outside of these bounds, the hypothesis is rejected. Alternatively, the performance characteristics are tested against defined analytical quality specifications.

ACCEPTANCE LIMITS DEFINED BY INHERENT ANALYTICAL IMPRECISION

Constant analytical standard deviations are presumed. A number of patients' samples are used for the comparison of the field method with the Reference Method. The result of the ith patient sample by the two methods is [x.sub.iF] and [x.sub.iR] for field and Reference Method, respectively, and the difference, [x.sub.id], is [x.sub.iF] - [x.sub.iR]. Further, the theoretical variance of the differences is the sum of the two variances denoting the inherent imprecision of field and Reference Methods: [[sigma].sup.2]([delta]) = [[sigma].sup.2.sub.F] + [[sigma].sup.2.sub.R] (the theoretical [sigma] values being estimated independently for each of the two methods, or estimated from replicate measurements during the comparison). When means of duplicates are used, then [sigma]([delta]) should be divided by [square root of (2)].

When the two methods are identical, the expectation is that ~68% of differences will be distributed symmetrically around 0 within 0 [+ or -] 1[sigma]([delta]), and 95% will be within 0 [+ or -] 1.96[sigma]([delta]). If the distribution of differences fits these criteria, then it is not possible to find any difference between results from the two methods within the inherent imprecision. If this is not the case, then the methods are not identical.
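The expected coverage can be checked numerically before any plotting. The following is a minimal sketch, assuming the differences are available as a plain list; the function name, the example differences, and the value used for [sigma]([delta]) are hypothetical and are not taken from the study data.

```python
def coverage_within_limits(differences, sigma_delta):
    """Return the fractions of differences inside 0 +/- 1 sigma and 0 +/- 1.96 sigma."""
    n = len(differences)
    within_1 = sum(abs(d) <= sigma_delta for d in differences)
    within_196 = sum(abs(d) <= 1.96 * sigma_delta for d in differences)
    return within_1 / n, within_196 / n


# Hypothetical differences (field minus Reference Method), for illustration only.
example_differences = [-4.0, -2.5, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.5, 7.0]
frac_1, frac_196 = coverage_within_limits(example_differences, sigma_delta=3.0)

print(f"within 0 +/- 1 sigma:    {frac_1:.0%} (about 68% expected under identity)")
print(f"within 0 +/- 1.96 sigma: {frac_196:.0%} (about 95% expected under identity)")
```

With only a few points, the observed fractions can differ noticeably from 68% and 95% by chance, so the counts are a guide to inspection rather than a formal test.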

Before the measured points are plotted on any figure, the hypothesis can be illustrated in a difference plot, as shown in Fig. 1A for the example of S-creatinine. The horizontal line y = 0 illustrates the hypothesis of [x.sub.id] = 0; the other lines indicate within 0 [+ or -] 1[sigma]([delta]) for 68% and within 0 [+ or -] 1.96[sigma]([delta]) for 95% of the differences, respectively. The outer lines indicate the 95% prediction interval for the expected distribution of points. In this example, [[sigma].sub.F] = 3.10 and [[sigma].sub.R] = 0.50 and, therefore, [sigma]([delta]) = 3.15 [micro]mol/L.

[FIGURE 1 OMITTED]

It is educational to describe the hypothesis before plotting the points, because most investigators will start interpretation of the points in the form of functional relationships and thereby forget about the hypothesis.

With the hypothesis in mind, one can now plot the data points as shown in Fig. 1B. The data points are generated for Reference Method values between 50 and 150 [micro]mol/L and computer-simulated gaussian-distributed differences based on mean([delta]) = -0.5 [micro]mol/L and [sigma]([delta]) = 3.00 [micro]mol/L; from these simulated data the calculated mean([delta]) was -0.84 [micro]mol/L and s([delta]) was 3.27 [micro]mol/L. The data points are distributed roughly according to the hypothesis, with 16 points (70%) within 0 [+ or -] 1[sigma]([delta]) and 21 points (91%) within 0 [+ or -] 2[sigma]([delta]), leaving 2 points (9%) outside the limits. Because these two points do not seem to deviate much from the general distribution, the conclusion could be that finding just 2 points outside (and close to) the 95% prediction interval is expected and acceptable. The field method is therefore indistinguishable from the Reference Method within the analytical imprecision, and the evaluation can stop. Statistically, the mean difference can be evaluated by a t-test and the distribution of differences by an F-test.
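The two statistical checks mentioned above can be sketched from the summary figures quoted for the simulated data (mean([delta]) = -0.84, s([delta]) = 3.27, n = 23, hypothesized [sigma]([delta]) = 3.15). This is an illustrative sketch only: the function names are hypothetical, the critical value is a tabulated constant entered by hand, and the variance ratio is shown as a plain statistic to be compared with tabulated critical values rather than as a complete test.

```python
import math


def mean_difference_t(mean_diff, sd_diff, n):
    """t statistic for the null hypothesis that the true mean difference is 0."""
    return mean_diff / (sd_diff / math.sqrt(n))


def variance_ratio(sd_diff, sigma_delta):
    """Ratio of the observed variance of differences to the hypothesized variance."""
    return (sd_diff / sigma_delta) ** 2


n = 23
t_stat = mean_difference_t(-0.84, 3.27, n)
ratio = variance_ratio(3.27, 3.15)

# Two-sided 5% critical value of Student's t for n - 1 = 22 degrees of freedom,
# taken from standard tables (entered by hand, not computed here).
T_CRIT_22DF = 2.074

print(f"t = {t_stat:.2f}, |t| < {T_CRIT_22DF}: {abs(t_stat) < T_CRIT_22DF}")
print(f"variance ratio s^2 / sigma^2 = {ratio:.2f}")
```

Under these assumptions, the mean difference is not significantly different from 0 and the observed variance is close to the hypothesized one, in line with the graphical conclusion.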

This approach is very narrow, and some uncertainty related to unknown factors may be taken into account. Such variations could be the variation between the two tubes/vials with serum from the same individual or the underestimation of [[sigma].sub.F]. Therefore, possible additional sources of uncertainty always should be taken into account when appropriate. However, the design of a method comparison has to be carefully planned so as to exclude additional uncertainties. Further, any addition of "acceptable" uncertainty should be well thought through and handled with caution.

For the experienced scientist, interpreting the difference plot is easy. A more objective criterion, however, for graphical validation of the distribution of points, and especially the more extreme points, is to apply the concept of tolerance intervals, where the standard deviate (z or c) is replaced by a tolerance factor, k, with a value dependent on the percentage of points (here 95%) and the confidence with which this percentage should be obtained. The k-value is determined by the assumptions about the new distribution, whether the mean or the standard deviation is unknown, or both. The k-value further depends on the number of points, n [15, 16]. Although we have a hypothesis about mean difference and standard deviation, these figures are unknown in practice, so the [k.sub.7] for unknown mean and standard deviation may be the most relevant to use. For n = 23 and 95% confidence for 95% of the points, the tolerance factor, [k.sub.7], is 2.67 and the tolerance interval is 0 [+ or -] 2.67[sigma]. This is illustrated in Fig. 1C, where all points are distributed within the chosen tolerance interval.
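When tables of tolerance factors are not at hand, the factor can be approximated. The sketch below is not the tabulated method of refs. 15 and 16; it assumes Howe's approximation for a two-sided normal tolerance factor (unknown mean and standard deviation) together with the Wilson-Hilferty approximation for the chi-square quantile, so that only the Python standard library is needed. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_quantile_wh(p, df):
    """Wilson-Hilferty approximation to the p-quantile of chi-square with df degrees of freedom."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * df)
    return df * (1.0 - a + z * math.sqrt(a)) ** 3


def tolerance_factor(n, coverage=0.95, confidence=0.95):
    """Approximate two-sided normal tolerance factor k (Howe's formula)."""
    z = NormalDist().inv_cdf((1.0 + coverage) / 2.0)
    chi2_low = chi2_quantile_wh(1.0 - confidence, n - 1)
    return math.sqrt((n - 1) * (1.0 + 1.0 / n) * z * z / chi2_low)


k = tolerance_factor(23)
print(f"approximate k for n = 23, 95%/95%: {k:.2f}")
```

For n = 23 the approximation agrees with the quoted value of 2.67 to two decimals; for small n, tabulated values should be preferred.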

The present approach of theoretical expected distribution is compared with the approach of Bland and Altman [4] in Fig. 1D by inserting the new lines determined by mean([delta]) [+ or -] 2 s([delta]) estimated from the data points.

The difference between the present concept and the Bland and Altman concept is clear from Fig. 1D. Accordingly, we illustrate the 95% prediction interval before any data points are applied, in contrast to Bland and Altman, who simply illustrate the statistics of the points. In this example the mean values are clearly different, whereas the standard deviations are rather close to each other.
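The contrast between the two sets of lines can be made concrete with the figures quoted for the simulated example. This is a small sketch with hypothetical function names: the first set of limits is fixed by the hypothesis before any data are seen, whereas the second is computed from the observed points.

```python
def hypothesis_limits(sigma_delta, z=1.96):
    """95% prediction interval under the hypothesis of identity (centered on 0)."""
    return (-z * sigma_delta, z * sigma_delta)


def bland_altman_limits(mean_diff, sd_diff, factor=2.0):
    """Descriptive limits: observed mean difference +/- 2 observed standard deviations."""
    return (mean_diff - factor * sd_diff, mean_diff + factor * sd_diff)


# Values quoted in the text for the simulated S-creatinine example (micromol/L).
hyp_low, hyp_high = hypothesis_limits(3.15)
ba_low, ba_high = bland_altman_limits(-0.84, 3.27)

print(f"hypothesis of identity: {hyp_low:.2f} to {hyp_high:.2f}")
print(f"Bland and Altman:       {ba_low:.2f} to {ba_high:.2f}")
```

The widths of the two intervals are similar here, but the Bland and Altman interval is shifted by the observed mean difference.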

To illustrate the relations to x-y plots, we first add to Fig. 1B 11 points (triangles), as shown in Fig. 2 (top). The specimens producing these points are assumed to contain some "nonspecific" components [in the S-creatinine example, perhaps the specimens are from diabetics, where (e.g.) glucose could result in nonspecific reactions by the Jaffe methods]. In the difference plot they separate clearly from the other points, and the mean difference and standard deviation change to +0.16 and 4.08 [micro]mol/L, respectively. In the x-y plot (Fig. 2, bottom), where the points should be related to the line of identity (y = x), it is difficult to see the difference between the two sets of points. If we turn to calculation of r-values, r decreases from 0.993 to 0.991, the slope of the regression line changes from 1.020 to 1.025, and the intercept changes from -2.54 to -1.17 [micro]mol/L.
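The insensitivity of r compared with the standard deviation of differences can be reproduced with a small deterministic data set. The sketch below uses made-up values (not the data of Fig. 2): a baseline series with small alternating differences, to which a group of specimens with a constant positive difference is added. All names and numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Baseline: Reference Method values 50..150 with differences cycling +3, -3, 0.
ref = [50 + 5 * i for i in range(21)]
diff = [(3, -3, 0)[i % 3] for i in range(21)]
field = [r + d for r, d in zip(ref, diff)]

# Added specimens with a constant "nonspecific" difference of +8.
extra_ref = [60 + 10 * i for i in range(9)]
extra_field = [r + 8 for r in extra_ref]

ref_all = ref + extra_ref
field_all = field + extra_field
diff_all = [f - r for f, r in zip(field_all, ref_all)]

r_before, sd_before = pearson_r(ref, field), stdev(diff)
r_after, sd_after = pearson_r(ref_all, field_all), stdev(diff_all)

print(f"r:     {r_before:.3f} -> {r_after:.3f}")
print(f"s(d):  {sd_before:.2f} -> {sd_after:.2f}")
```

In this constructed case r changes only in the third decimal, while the standard deviation of differences increases by well over half.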

[FIGURE 2 OMITTED]

The information from difference plots and x-y plots (when using the line of identity) is the same, but it is easier to expand the differences in the difference plot and the calculations of variances are simpler, whereas the 45[degrees] angle in the x-y plot makes comparable calculations more difficult. This is also seen from Westgard et al. [2, 3], where the simple calculations of (e.g.) bias are easier to interpret in combination with figures.

ACCEPTANCE LIMITS DEFINED BY GOALS FOR ANALYTICAL QUALITY

A more relevant approach for comparing a field method with a Reference Method is to use the analytical goals (analytical quality specifications) as acceptance limits. These specifications may be related to the clinical use of laboratory data [17, 18] or more generally to the application of common reference intervals [19, 20] and monitoring patients [21, 22]. Two European groups, one under the auspices of EGE-Lab (European Group for the Evaluation of Reagents and Analytical Systems in Laboratory Medicine) [23, 24], and another group under the auspices of European EQA-Organizers (External Quality Assessment Organizers) [25], have given recommendations for analytical quality specifications based on the same biological concepts and with identical criteria for acceptable analytical bias and imprecision, but with a different concept for combining these--the first (EGE-Lab) defining the recommendations for analytical bias and imprecision separately and the latter (EQA-Organizers) combining these two aspects. In both European recommendations, the analytical quality specification for imprecision is [CV.sub.analytical] [less than or equal to]0.5 [CV.sub.within-subject variation] as proposed by Cotlove et al. [21], and that for analytical bias is [absolute value of ([B.sub.analytical])] [less than or equal to]0.25 [CV.sub.total biological variation] [19]. The EGE-Lab concept accepts both a maximum bias and a maximum imprecision simultaneously; the EQA-Organizer concept describes a functional relationship between the two in the form of a maximum allowable combination of imprecision and bias.

The latter is close to the original concept of Gowans et al. [19], defining the acceptable analytical percentage bias for S-creatinine as 2.8% [25] when imprecision is negligible. According to the EGE-Lab concept [23, 24], however, both a bias of 2.8% and an imprecision of 2.2% are acceptable simultaneously. This means that for single determinations 0 [+ or -] (bias + 1.65 x imprecision) is acceptable [26]; i.e., 95% of the single points must lie within the limits of 0 [+ or -] (2.8% + 1.65 x 2.2%) = 0 [+ or -] 6.4%, as illustrated in Fig. 3 (left). This criterion is fulfilled as shown in Fig. 3 (middle). The concept, however, is one-sided, i.e., is valid only in one direction. Therefore, the standard deviation of differences should be judged against the imprecision criterion separately. The standard deviation of the field method is 3.1 [micro]mol/L and the CV = 3.9%, which exceeds the imprecision specification of 2.2%.
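The arithmetic behind the combined limit, and the separate check of imprecision, can be written out as follows. This is a minimal sketch with hypothetical names, using only the S-creatinine figures quoted above.

```python
def total_allowable_limit(bias_pct, cv_pct, z=1.65):
    """Limit for single determinations: bias + z x imprecision, in percent."""
    return bias_pct + z * cv_pct


BIAS_SPEC_PCT = 2.8  # allowable bias for S-creatinine quoted in the text
CV_SPEC_PCT = 2.2    # allowable imprecision quoted in the text

limit_pct = total_allowable_limit(BIAS_SPEC_PCT, CV_SPEC_PCT)

# The imprecision has to be judged separately from the combined limit.
cv_field_pct = 3.9
cv_field_ok = cv_field_pct <= CV_SPEC_PCT

print(f"acceptance limits for single points: 0 +/- {limit_pct:.1f}%")
print(f"field method CV {cv_field_pct}% within {CV_SPEC_PCT}%: {cv_field_ok}")
```

A method can therefore keep 95% of its points inside the combined limits and still fail the imprecision specification, which is the situation described for Fig. 3 (middle).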

Figure 3 (right) illustrates an example of acceptable analytical imprecision and bias. The CV is 2.1% and the estimated bias (mean difference) is +1.3 [micro]mol/L (95% confidence interval, 0.6-2.0 [micro]mol/L). Note that the mean difference is different from 0 but is acceptable according to the criterion.

For the purpose of method comparison, the value for maximum allowable bias might be expanded because of the uncertainty (confidence interval) of the Reference Method. This cannot be seen from the actual comparison, but because the Reference Method is allowed to have some uncertainty, then this must be allowed also for the field method. We propose using the factor 1.2, in light of a recent concept that requires for Reference Methods a total error of <0.2 times that of the routine method [27]. In the present case, the acceptance limits for bias should thus be 0 [+ or -] 3.36%, as also used in the example below.
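The expansion of the bias limit is a single multiplication, shown here together with a conversion of the percentage limit to concentration units. This is a sketch under the assumption that the factor 1.2 is read as 1 plus the 0.2 allowance for the Reference Method; the function names and the example concentration of 100 [micro]mol/L are hypothetical.

```python
def expanded_bias_limit(bias_pct, reference_error_fraction=0.2):
    """Allowable bias expanded by the uncertainty allowed for the Reference Method."""
    return bias_pct * (1.0 + reference_error_fraction)


def bias_limit_in_units(limit_pct, concentration):
    """Convert a percentage limit to concentration units at a given concentration."""
    return concentration * limit_pct / 100.0


expanded_pct = expanded_bias_limit(2.8)
limit_at_100 = bias_limit_in_units(expanded_pct, 100.0)

print(f"expanded allowable bias: {expanded_pct:.2f}%")
print(f"at 100 micromol/L this is +/- {limit_at_100:.2f} micromol/L")
```

Because the limit is a percentage, its size in concentration units grows with the concentration, so in a difference plot it appears as a pair of sloping lines.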

TWO EXAMPLES UTILIZING COMPARISON DATA

The data used are from a paper on a candidate reference method for determining S-creatinine, which was used for comparison of four field methods [28]. Data from two of these comparisons are used, but only for concentration values <150 [micro]mol/L, the only region in which there are sufficient data for our presentation.

Practical example 1. Based on the duplicate analyses performed, analytical imprecision is calculated within the interval 50-150 [micro]mol/L for both the Reference Method (HPLC) and the field method, giving [[sigma].sub.R] = 0.568 [micro]mol/L and [[sigma].sub.F] = 0.791 [micro]mol/L. When means of duplicates are used for the comparison, the calculated theoretical [sigma]([delta]) should be divided by [square root of (2)], giving an expected distribution of 95% (the 95% prediction interval) of the points within 0 [+ or -] 2[sigma]([delta]) = 0 [+ or -] 1.4 [micro]mol/L. Further, the expanded allowable bias of 3.36% is used for the analytical quality specifications according to both EGE-Lab and EQA-Organizers concepts.
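The prediction interval quoted for this example follows directly from the two imprecision estimates. The sketch below reproduces the calculation; the function name and the `replicates` parameter are hypothetical, and the same number of replicates is assumed for both methods.

```python
import math


def sigma_of_differences(sigma_field, sigma_ref, replicates=1):
    """Theoretical SD of differences between means of `replicates` measurements per method."""
    return math.sqrt(sigma_field ** 2 + sigma_ref ** 2) / math.sqrt(replicates)


# Imprecision estimates quoted in the text (micromol/L); means of duplicates are compared.
sigma_delta = sigma_of_differences(0.791, 0.568, replicates=2)
half_width = 2 * sigma_delta

print(f"sigma(delta) for means of duplicates: {sigma_delta:.3f} micromol/L")
print(f"95% prediction interval: 0 +/- {half_width:.1f} micromol/L")
```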

[FIGURE 3 OMITTED]

The graphical evaluations are shown in Fig. 4(top). Here, the calculated distribution is considerably broader than the one assumed from the analytical point of view, whereas the measured mean bias is within the limits of acceptance of the bias because the mean (-1.6) and 95% confidence interval (-0.5 to -2.7 [micro]mol/L) are within the 3.36% allowable bias. Further, the points are distributed within the total EGE-Lab criteria with only one real outlier. The difference between the present concept of 95% prediction interval and the Bland and Altman description of the actual points is clear from the Figure.

At first glance, the field method should be acceptable, with CV = 1% and bias <3.36%, but the distribution of points (Fig. 4, top) reveals a much broader distribution (CV = 5%), which emphasizes an unknown uncertainty. The problem is further underlined by adding single determinations to the measurements in the Figure, illustrating the reproducibility of the individual differences. This uncertainty is close to 5% and may originate not only from vial-to-vial variation but also from aberrant-sample bias, whether from nonspecific reactions or interference in the field method. Thus the difference plot and calculation of the standard deviation of differences are tools to disclose aberrant-sample bias in field methods.

Fuentes-Arderiu and Fraser [30] have proposed that the combined effects of imprecision and interference should be used in the concept of specifications for imprecision, but the problem has not been dealt with in either the EGE-Lab or EQA-Organizer concepts.

Practical example 2. In the other example (Fig. 4, bottom), all points are displaced from the acceptance area, showing a considerable mean bias (~20 [micro]mol/L) and also considerable uncertainty from aberrant-sample bias. The calculated imprecision, based on assays of duplicates, was ~1%, but the difference plot reveals the errors clearly.

[FIGURE 4 OMITTED]

The CLIA criteria are much wider than the European recommendations but, being total error specifications, they are easy to apply in the plot. For concentrations <175 [micro]mol/L the acceptable deviation is [+ or -]26.5 [micro]mol/L ([+ or -]0.3 mg/dL), and [+ or -]15% at higher concentrations. The upper line of the CLIA criterion is shown in Fig. 4 (bottom). Because 7 of the 54 points are outside the line, the method is also unacceptable from the standpoint of proficiency testing.
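Because the CLIA criterion is a total error limit, counting the points outside it is straightforward. The following sketch encodes the limit as described above; the function names and the four example points are hypothetical and are not the data of Fig. 4.

```python
def clia_limit(concentration_umol_l):
    """Allowable absolute deviation for S-creatinine, in micromol/L, as described in the text."""
    if concentration_umol_l < 175.0:
        return 26.5
    return 0.15 * concentration_umol_l


def count_outside_clia(reference_values, differences):
    """Number of points whose absolute difference exceeds the CLIA limit."""
    return sum(abs(d) > clia_limit(x) for x, d in zip(reference_values, differences))


# Hypothetical points: Reference Method value and difference (field minus reference).
example_ref = [60.0, 100.0, 140.0, 200.0]
example_diff = [20.0, 28.0, -10.0, 31.0]
n_outside = count_outside_clia(example_ref, example_diff)

print(f"{n_outside} of {len(example_ref)} points outside the CLIA limit")
```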

GENERAL DISCUSSION

In the current discussion of difference plot vs x-y plots and the application of regression lines or functions, evaluations of data (vis-a-vis best relationship and comparison of a field method with a Reference Method) have often been mixed, as pointed out recently by Stockl [10]. Stockl emphasizes that graphical presentation of data (x-y plot or difference plot) should not be mixed with statistical interpretation of data.

As long as the purpose is to find the best functional relationship between two methods in order to correct one with the other, then an x-y plot and calculation of the regression line with an estimation of the scatter via [s.sub.y·x] may be most relevant, with visual inspection of the scatter and a residual plot. The present task, however, is not to find a functional relationship between two methods, but to judge a field method in relation to a Reference Method.

When a field method is compared with a Reference Method for acceptability according to certain criteria, whether analytical or biological, then the visual inspection of all data is essential. Whether one uses a simple x-y plot or a difference plot is not critical, as long as the area of interest is expanded and the single points are assessed according to the hypothesis of identity between the two methods. The hypothesis is that the measured values are identical (and not that they are unrelated, which is the basic hypothesis for correlation studies), which means that the hypothesis is described in an x-y plot as the line of identity (y = x) and in the difference plot as the line y = 0. When a ratio plot is more appropriate than a difference plot, e.g., when analytical CV is close to being constant, the same evaluations can be performed.

The advantages of difference plots (and ratio plots) are keeping the hypothesis of identity in mind and the ease of expanding the difference ordinate according to the investigator's purpose. The power of the graphical illustration in Figs. 1A and 3 (left) lies in the simplicity and the clear definition of the hypotheses.

For the experienced interpreter, most situations can be evaluated by visual inspection of the plots, whether a difference plot or an x-y plot. If more objective criteria are wanted, calculations of mean difference with confidence intervals are a powerful tool, as is a table of [k.sub.7]-values (5) for estimation of tolerance intervals.

Most important, however, is the inspection of the distribution of the difference points, especially when samples expected to have matrix effects are marked, and calculation of the standard deviation of differences. When this exceeds the estimated analytical imprecision, it is an indication of aberrant-sample bias. In principle, the r-value from correlation between x and y reflects this, but in practice the r-value is insensitive, as illustrated in Fig. 2. Further, calculation of the standard deviation of differences gives an estimate of aberrant-sample bias compared with the theoretical imprecision. The same information can be obtained from the [s.sub.y·x] estimate from regression analysis.
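One way to put a number on the excess variation is to subtract the theoretical variance from the observed variance of the differences. This is a sketch under an assumption not stated in the text, namely that aberrant-sample effects and analytical imprecision are independent and additive in variance; the function name is hypothetical, and the inputs are the values quoted earlier for the S-creatinine illustration (s([delta]) = 4.08 after adding the extra specimens, [sigma]([delta]) = 3.15).

```python
import math


def excess_sd(sd_observed, sigma_expected):
    """SD component not explained by analytical imprecision (0 if there is no excess)."""
    excess_var = sd_observed ** 2 - sigma_expected ** 2
    return math.sqrt(excess_var) if excess_var > 0 else 0.0


excess = excess_sd(4.08, 3.15)
print(f"estimated excess SD: {excess:.2f} micromol/L")
```

The result is only a rough indicator, because the observed standard deviation itself carries sampling uncertainty and the excess may also include vial-to-vial variation.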

The plotting of single determinations can give information about imprecision and aberrant-sample bias as well and, if needed, a functional curve can be calculated and drawn. In this context we mention that replicate measurements always should be performed in this type of comparison, and that specimens should be stored for evaluation of possible outliers.

Krouwer and Monti [29] presented a graphical method for evaluation of laboratory assays (a mountain plot). They computed the percentile for each ranked difference between the two methods, and by "turning" at the 50th percentile produced a histogram-like function (the mountain). This method is relevant for detecting large infrequent errors (differences) but lacks the aspect of concentration relationship. These investigators, therefore, recommend use of their plot together with difference plots. Introduction of analytical quality specifications in the mountain plots may be useful in method evaluations.
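The folding step of the mountain plot can be sketched in a few lines. The percentile convention used here, 100 x (rank - 0.5)/n, is an assumption made for illustration and may differ from the one used by Krouwer and Monti; the function name and the example differences are hypothetical.

```python
def mountain_plot_points(differences):
    """Return (difference, folded percentile) pairs for a mountain plot."""
    ordered = sorted(differences)
    n = len(ordered)
    points = []
    for rank, d in enumerate(ordered, start=1):
        pct = 100.0 * (rank - 0.5) / n                     # percentile of the ranked difference
        folded = pct if pct <= 50.0 else 100.0 - pct       # "turning" at the 50th percentile
        points.append((d, folded))
    return points


points = mountain_plot_points([1.0, -2.0, 0.5, 3.0])
for d, folded in points:
    print(f"difference {d:+.1f}: folded percentile {folded:.1f}")
```

Plotting the folded percentile against the difference gives the peak near the median difference and the tails that reveal large, infrequent errors; the concentration at which each difference occurred is lost, as noted above.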

Bland and Altman have pointed to the presentation of difference plots, where they recommend mean values of both methods on the abscissa [4]; however, the risk of regression towards the mean is negligible in studies where field methods are compared with Reference Methods, because the results from the Reference Method (used on the abscissa) are assumed to have negligible error. They further calculate and present the standard deviation of the measured data, which is relevant information but not related to the hypothesis of identity.

This form of graphical testing of the hypothesis of identity has been used for biological data [8], where the standard deviation of measured differences was compared with the analytical imprecision. A more stringent method of testing measured values from field methods against target values has been applied for plasma proteins in The Nordic Protein Project [9]. Here, the target values of serum pools were assigned from the European Community Bureau of Certified Reference Material, CRM 470 [31], by the method recommended by IFCC [32], and were used for the abscissa in a plot with acceptance lines according to acceptable bias [9] and with the measurements illustrated by the mean difference with a 90% confidence interval.

Another graphical method of prediction of single measurements of clinical data has been published for serial measurements of International Normalized Ratios of prothrombin times [33]. Here, the differences between consecutive measurements in patients were compared with the expected variation estimated from patients under steady-state conditions. The abscissa was used for the latest result, resulting in a regression towards the mean, which was considered acceptable for the purpose (i.e., not to investigate correlations). The usefulness of the nomogram was improved by adding vertical lines for the therapeutic interval.

The goals used for acceptance are relevant for the use of common reference intervals [19, 20] and have been recommended by two European groups [23-25], with different consequences for the acceptance. This problem is related to the phenomenon of aberrant-sample bias (matrix effects) and has not been fully clarified by the proposed goal from either EGE-Lab or EQA-Organizers; Fuentes-Arderiu and Fraser [30], however, have proposed that the aberrant-sample bias be included in the precision goal. Other analytical goals have been postulated [17, 18, 34] and may be relevant for other evaluations. Goals based on biological data, however, are general in nature (related to common reference intervals and monitoring of patients) and are not restricted to a specific clinical application.

The CLIA criteria for S-creatinine are [+ or -]0.3 mg/dL (3 mg/L, or 26.5 [micro]mol/L) or [+ or -]15% for concentrations >175 [micro]mol/L [28]. These are criteria for total error, but they are very wide compared with the European recommendations. The CLIA criteria are intended for use with proficiency testing (and therefore need to be wide), whereas the European recommendations are so-called educational criteria, which relate directly to the desirable performance criteria for optimum monitoring of patients and the sharing of common reference intervals within geographical areas with populations that are homogeneous for the quantity of analyte. The criteria proposed by Ehrmeyer et al. [35] for minimum intralaboratory performance characteristics to pass CLIA--CV <33% and bias <20%, proficiency testing criteria--can be applied as well as the European criteria, but for the latter the total error will be determining. The CLIA criteria may be applied even better in difference plots, given the total error concept.

The validity of the analytical conclusions of this type of evaluation relies on the Reference Method chosen. It must be correct and specific and so forth, with negligible imprecision; otherwise, the conclusions of the comparison will be weakened according to any possible flaws of the Reference Method.

In conclusion, we find the visual inspection of plots to be essential for method comparison. When a field method is compared with a Reference Method, the hypothesis of identity within analytical imprecision or within stated analytical quality specifications should be applied. Both x-y plots and difference plots are useful, but we find the difference plot is easier to handle and interpret and facilitates the calculations of uncertainty.

Received December 10, 1996; revision accepted June 17, 1997.

References

[1.] Westgard JO, de Vos DJ, Hunt M, Quam EG, Carey RN, Garber CC. Method evaluation. American Society for Medical Technology, Houston, 1978.

[2.] Westgard JO, Hunt MR. Use and interpretation of common statistical tests in method-comparison studies. Clin Chem 1973;19:49-57.

[3.] Westgard JO, Carey RN, Wold S. Criteria for judging precision and accuracy in method development and evaluation. Clin Chem 1974;20:825-33.

[4.] Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307-10.

[5.] Hollis S. Analysis of method comparison studies [Guest Editorial]. Ann Clin Biochem 1996;33:1-4.

[6.] Pollock MA, Jefferson SG, Kane JW, Lomax K, MacKinnin G, Winnard CB. Method comparison--a different approach. Ann Clin Biochem 1992;29:556-60.

[7.] Svendsen A, Holmskov U, Hyltoft Petersen P, Jensenius JC. Difference and ratio plots: simple tools for improved presentation and interpretation of data. Unexpected possibilities for the use of conglutinin binding assay in inflammatory rheumatic diseases. J Immunol Methods 1995;178:211-8.

[8.] Hyltoft Petersen P, Felding P, Horder M, Tryding N. Effect of posture on concentrations of serum proteins in healthy adults. Dependency of the molecular size of proteins. Scand J Clin Lab Invest 1980;40:623-8.

[9.] Hyltoft Petersen P, Blaabjerg O, Irjala K, Icen A, Bjoro K. The Nordic Protein Project. In: Hyltoft Petersen P, Blaabjerg O, Irjala K, eds. Assessing quality in measurements of plasma proteins. The Nordic Protein Project and related projects. Helsinki, Finland: NORDKEM (Nordic Clinical Chemistry Project), 1994:87-116.

[10.] Stockl D. Beyond the myths of difference plots [Letter]. Ann Clin Biochem 1996;33:575-7.

[11.] NCCLS. Method comparison and bias estimation using patient samples; approved guidelines. Document EP9-A (ISBN 1-56238-283-7). Wayne, PA: NCCLS, 1995.

[12.] Houbouyan L, Boutiere B, Contant G, Dautzenberg MD, Fievet P, Potron G, et al. Validation protocol of analytical hemostasis systems: measurement of anti-Xa activity of low-molecular-weight heparins. Clin Chem 1996;42:1223-30.

[13.] Dybkaer R. Vocabulary for describing the metrological quality of a measurement procedure. Upsala J Med Sci 1993;98:445-86.

[14.] International Organization of Standardization. Terms and definitions used in connection with reference materials. ISO Guide 30, 2nd ed. Geneva: ISO, 1992.

[15.] Bliss CI. Statistics in biology. New York: McGraw-Hill, 1967:558 pp.

[16.] Documenta Geigy. Mathematics and statistics. Basle, Switzerland: Ciba-Geigy, 1975.

[17.] Horder M, ed. Assessing quality requirements in clinical chemistry. Scand J Clin Lab Invest 1980;40(Suppl 155):144pp.

[18.] de Verdier C-H, ed. Medical need for quality specifications in laboratory medicine. Upsala J Med Sci 1990;95:161-309.

[19.] Gowans EMS, Hyltoft Petersen P, Blaabjerg O, Horder M. Analytical goals for the acceptance of common reference intervals for laboratories throughout a geographical area. Scand J Clin Lab Invest 1988;48:757-64.

[20.] Hyltoft Petersen P, Gowans EMS, Blaabjerg O, Horder M. Analytical goals for the estimation of non-Gaussian reference intervals. Scand J Clin Lab Invest 1989;49:727-37.

[21.] Cotlove E, Harris EK, Williams GZ. Biological and analytical components of variation in long-term studies of serum constituents in normal subjects. III. Physiological and medical implications. Clin Chem 1970;16:1028-32.

[22.] Elevitch R, ed. College of American Pathologists Conference Report. Conference on analytical goals in clinical chemistry, at Aspen, Colorado. Skokie, IL: CAP, 1976:129pp.

[23.] Fraser CG, Hyltoft Petersen P, Ricos C, Haeckel R. Proposed quality specifications for the imprecision and inaccuracy of analytical systems for clinical chemistry. Eur J Clin Chem Clin Biochem 1992;30:311-7.

[24.] Fraser CG, Hyltoft Petersen P, Ricos C, Haeckel R. Quality specifications. In: Haeckel R, ed. Evaluation methods in laboratory medicine. Weinheim: VCH, 1992:87-99.

[25.] Stockl D, Baadenhuijsen H, Fraser CG, Libeer J-C, Hyltoft Petersen P, Ricos C. Desirable routine analytical goals for quantities assayed in serum. Discussion paper from the members of the external quality assessment (EQA) working group A on analytical goals in laboratory medicine. Eur J Clin Chem Clin Biochem 1995;33:157-69.

[26.] Fraser CG, Hyltoft Petersen P. Quality goals in external quality assessment are best based on biology. Scand J Clin Lab Invest 1993;53(Suppl 212):8-9.

[27.] Stockl D, Reinauer H. Development of criteria for the evaluation of reference method values. Scand J Clin Lab Invest 1993;53(Suppl 212):16-8.

[28.] Thienpont LM, van Landuyt KG, Stockl D, de Leenheer AP. Candidate reference method for determining serum creatinine by isocratic HPLC: validation with isotope dilution gas chromatography-mass spectrometry and application for accuracy assessment of routine test kits. Clin Chem 1995;41:995-1003.

[29.] Krouwer JS, Monti KL. A simple, graphical method to evaluate laboratory assays. Eur J Clin Chem Clin Biochem 1995;33:525-7.

[30.] Fuentes-Arderiu X, Fraser CG. Analytical goals for interference. Ann Clin Biochem 1991;28:383-5.

[31.] Whicher JT, Ritchie RF, Johnson AM, Baudner S, Bienvenu J, Blirup-Jensen S, et al. New international reference preparation for proteins in human serum (RPPHS). Clin Chem 1994;40:934-8.

[32.] Blirup-Jensen S, Just Svendsen P. A new reference preparation for proteins in human serum. Upsala J Med Sci 1994;99:251-8.

[33.] Flensted Lassen J, Kjeldsen J, Antonsen S, Hyltoft Petersen P, Brandslund I. Interpretation of serial measurements of International Normalized Ratio (INR) in monitoring oral anticoagulant treatment. A graphical combination of therapeutic interval and reference change. Clin Chem 1995;41:1171-6.

[34.] US Department of Health and Human Services. Medicare, Medicaid, and CLIA programs; regulations implementing the Clinical Laboratory Improvement Amendments of 1988 (CLIA). Final rule. Fed Regist 1992;57:7002-186.

[35.] Ehrmeyer SS, Laessig RH, Leinweber JE, Oryall JJ. 1990 Medicare/CLIA final rules for proficiency testing: minimum intralaboratory performance characteristics (CV and bias) needed to pass. Clin Chem 1990;36:1736-40.

PER HYLTOFT PETERSEN, (1) * DIETMAR STOCKL, (2) OLE BLAABJERG, (1) BENT PEDERSEN, (1) ERLING BIRKEMOSE, (1) LINDA THIENPONT, (2) JENS FLENSTED LASSEN (3) and JENS KJELDSEN (4)

Departments of (1) Clinical Chemistry and (4) Medical Gastroenterology, Odense University Hospital, DK-5000 Odense C, Denmark. (2) Laboratorium voor Analytische Chemie, Faculteit Farmaceutische Wetenschappen, Universiteit Gent, Harelbekestraat 72, B-9000 Gent, Belgium. (3) Department of Clinical Chemistry, Vejle Sygehus, DK-7100 Vejle, Denmark.

(5) Some [k.sub.7]-values for 95% tolerance factors for the 95% interval: 3.38 for n = 10; 2.75 for n = 20; 2.55 for n = 30; 2.38 for n = 50; 2.30 for n = 70; and 2.23 for n = 100.

* Author for correspondence. Fax +45 65 41 19 11; e-mail phy@imbmed.ou.dk.

The scarcity of published difference plots that are evaluated against defined criteria may be more a question of how plotted data are interpreted than of a strict choice between scatter-plot and difference plot, as discussed recently by Stockl [10]. Investigators seem to rely too much on regression lines and r-values, without doing the equally important interpretation of the individual data points of the plot. This is becoming more and more of a disadvantage as the number of Reference Methods available for comparison with field methods increases, because in these cases the task is not to find some relationship, but simply to judge whether the field method is acceptable.

NCCLS has recently published guidelines for method comparison and bias estimation by using patients' samples [11], where both scatter-plots and bias plots are advised. The document also recommends plotting of single determinations as well as mean values and stresses the need for visual inspection of data. Further, comparison with performance criteria is recommended, but these criteria are not specified and they are not used in the graphical interpretation. Recently, Houbouyan et al. [12] used ratio plots in their validation protocol of analytical hemostasis systems, where they used a preset, but arbitrarily chosen, acceptance limit of inaccuracy of 15%.

In the following, we will use the difference plot (or bias plot) in combination with simple statistics for the principal judgment of the identity or acceptability of a field method. The difference plot makes it easier to apply the concept; in principle, however, the same evaluations could be performed for a scatter-plot in relation to the line of identity (y = x).

The aim of this contribution is to draw attention to the hypothesis of identity and the concept of acceptable analytical quality in method comparison, especially when one of the methods is a Reference Method.

BASIC CONSIDERATIONS

The basis for comparing an analytical field method with an analytical Reference Method measuring the same quantity (analyte) is the hypothesis of identity within inherent imprecision or within preset analytical quality specifications. The field method, whether a kit or a self-produced analytical procedure, is applied for routine analyses in laboratory medicine. The method must be demonstrated to have adequate accuracy (trueness), precision, and specificity (lack of aberrant-sample bias [13]) for its intended use. The Reference Method, in contrast, is a "thoroughly investigated method, clearly and exactly describing the necessary conditions for procedures, for the measurement of one or more property values that has been shown to have accuracy and precision commensurate with its intended use and that can therefore be used to assess the accuracy of other methods for the same measurement, particularly in permitting the characterization of a reference material" [14].

Further characteristics of the methods are the costs, the complexity, the equipment, the time used for production of the results, and so forth, the field method generally being cheap and well suited for routine work and the Reference Method usually being applicable only in a few competent and specially equipped laboratories.

The two approaches to the comparison are:

1. Identity. The results from the field method do not deviate from the Reference Method results by more than the inherent imprecision of both methods.

2. Analytical quality specifications. The results from the field method do not deviate from the Reference Method results by more than the acceptance limits derived from the analytical goals.

In both cases the null hypothesis is that the measured differences for all samples are zero. In the first case, the acceptance limits are defined by the inherent analytical imprecisions, and in the latter case the acceptance limits are defined by the analytical quality specifications.

By assuming the ideal situation, the theoretical limits for testing the null hypothesis can be drawn in a difference plot, i.e., a mean difference [mean([delta])] equal to 0, and a standard deviation of differences [[sigma]([delta])] calculated from the imprecisions of the two methods. The ultimate hypothesis, then, is that the points are distributed within these bounds. If they fall outside of these bounds, the hypothesis is rejected. Alternatively, the performance characteristics are tested against defined analytical quality specifications.

ACCEPTANCE LIMITS DEFINED BY INHERENT ANALYTICAL IMPRECISION

Constant analytical standard deviations are presumed. A number of patients' samples are used for the comparison of the field method with the Reference Method. The result of the ith patient sample by the two methods is [x.sub.iF] and [x.sub.iR] for field and Reference Method, respectively, and the difference, [x.sub.id], is [x.sub.iF] - [x.sub.iR]. Further, the theoretical variance of the differences is the sum of the two variances denoting the inherent imprecision of field and Reference Methods: [[sigma].sup.2]([delta]) = [[sigma].sup.2.sub.F] + [[sigma].sup.2.sub.R] (the theoretical [sigma] values being estimated independently for the two methods of the comparison, or estimated from replicate measurements during the comparison). When means of duplicates are used, then each [sigma], and hence [sigma]([delta]), should be divided by [square root of (2)].

When the two methods are identical, the expectation is that ~68% of differences will be distributed symmetrically around 0 within 0 [+ or -] 1[sigma]([delta]), and 95% will be within 0 [+ or -] 1.96[sigma]([delta]). If the distribution of differences fits these criteria, then it is not possible to find any difference between results from the two methods within the inherent imprecision. If this is not the case, then the methods are not identical.

Before the measured points are plotted on any figure, the hypothesis can be illustrated in a difference plot, as shown in Fig. 1A for the example of S-creatinine. The horizontal line y = 0 illustrates the hypothesis of [x.sub.id] = 0; the other lines indicate 0 [+ or -] 1[sigma]([delta]) for 68% and 0 [+ or -] 1.96[sigma]([delta]) for 95% of the differences, respectively. The outer lines indicate the 95% prediction interval for the expected distribution of points. In this example, [[sigma].sub.F] = 3.10 and [[sigma].sub.R] = 0.50 [micro]mol/L and, therefore, [sigma]([delta]) = 3.15 [micro]mol/L.
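The limit lines of Fig. 1A follow directly from the two imprecision estimates. The following is a minimal Python sketch, assuming independent errors and constant standard deviations as stated above; the function names `sigma_delta` and `hypothesis_limits` are illustrative and do not come from the original paper.

```python
import math

def sigma_delta(sigma_f, sigma_r, replicates=1):
    """Theoretical SD of the differences (field minus reference).

    The variances of the two methods add. When means of `replicates`
    measurements are compared on both methods, the SD is divided by
    sqrt(replicates).
    """
    return math.sqrt(sigma_f ** 2 + sigma_r ** 2) / math.sqrt(replicates)

def hypothesis_limits(sd, z=1.96):
    """Lower and upper limit lines around the hypothesis line y = 0."""
    return (-z * sd, z * sd)

# S-creatinine example from the text (values in micromol/L)
sd = sigma_delta(3.10, 0.50)
print(round(sd, 2))
print(hypothesis_limits(sd, z=1.0))   # lines expected to contain ~68% of points
print(hypothesis_limits(sd))          # lines expected to contain 95% of points
```

With the inputs as printed, the sketch gives about 3.14 [micro]mol/L, which agrees with the quoted 3.15 [micro]mol/L to within rounding of the inputs.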

[FIGURE 1 OMITTED]

It is educational to describe the hypothesis before plotting the points, because most investigators will start interpretation of the points in the form of functional relationships and thereby forget about the hypothesis.

With the hypothesis in mind, one can now plot the data points as shown in Fig. 1B. The data points are generated for Reference Method values between 50 and 150 [micro]mol/L with computer-simulated gaussian-distributed differences based on mean([delta]) = -0.5 [micro]mol/L and [sigma]([delta]) = 3.00 [micro]mol/L; from these simulated data the calculated mean([delta]) was -0.84 [micro]mol/L and s([delta]) was 3.27 [micro]mol/L. The data points are distributed roughly according to the hypothesis, with 16 points (70%) within 0 [+ or -] 1[sigma]([delta]) and 21 points (91%) within 0 [+ or -] 2[sigma]([delta]), leaving 2 points (9%) outside the limits. Because these two points do not seem to deviate much from the general distribution, the conclusion could be that finding just 2 points outside (and close to) the 95% prediction interval is expected and acceptable. The field method is therefore indistinguishable from the Reference Method within the analytical imprecision, and the evaluation can stop. Statistically, the mean difference can be evaluated by a t-test and the distribution of differences by an F-test.
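The counting and the two test statistics described above can be sketched in a few lines of Python. This is only an illustration under the stated hypothesis (mean 0, known theoretical [sigma]([delta])); the function name and the returned keys are hypothetical, and critical values for the t and variance-ratio statistics must still be looked up separately.

```python
import math
import statistics

def evaluate_against_hypothesis(differences, sigma_d):
    """Summarise measured differences against the hypothesis mean 0, SD sigma_d."""
    n = len(differences)
    mean_d = statistics.fmean(differences)
    s_d = statistics.stdev(differences)
    within_1 = sum(abs(d) <= sigma_d for d in differences)
    within_196 = sum(abs(d) <= 1.96 * sigma_d for d in differences)
    return {
        "n": n,
        "mean": mean_d,
        "sd": s_d,
        "fraction_within_1sd": within_1 / n,
        "fraction_within_1.96sd": within_196 / n,
        # t statistic for H0: mean difference = 0 (n - 1 degrees of freedom)
        "t": mean_d / (s_d / math.sqrt(n)),
        # ratio of observed to theoretical variance (H0: ratio = 1)
        "variance_ratio": s_d ** 2 / sigma_d ** 2,
    }
```

For the summary values quoted above (n = 23, mean -0.84, s = 3.27), the t statistic is about -1.2, well inside the two-sided 5% critical value of about 2.07 for 22 degrees of freedom.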

This approach is very narrow, and some uncertainty related to unknown factors may be taken into account. Such variations could be the variation between the two tubes/vials with serum from the same individual or the underestimation of [[sigma].sub.F]. Therefore, possible additional sources of uncertainty always should be taken into account when appropriate. However, the design of a method comparison has to be carefully planned so as to exclude additional uncertainties. Further, any addition of "acceptable" uncertainty should be well thought through and handled with caution.

For the experienced scientist, interpreting the difference plot is easy. A more objective criterion, however, for graphical validation of the distribution of points, and especially the more extreme points, is to apply the concept of tolerance intervals, where the standard deviate (z or c) is replaced by a tolerance factor, k, with a value dependent on the percentage of points (here 95%) and the confidence with which this percentage should be obtained. The k-value is determined by the assumptions about the new distribution, whether the mean or the standard deviation is unknown, or both. The k-value further depends on the number of points, n [15, 16]. Although we have a hypothesis about mean difference and standard deviation, these figures are unknown in practice, so the [k.sub.7] for unknown mean and standard deviation may be the most relevant to use. For n = 23 and 95% confidence for 95% of the points, the tolerance factor, [k.sub.7], is 2.67 and the tolerance interval is 0 [+ or -] 2.67[sigma]. This is illustrated in Fig. 1C, where all points are distributed within the chosen tolerance interval.
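When tables such as those in [15, 16] are not at hand, the tolerance factor for unknown mean and standard deviation can be approximated numerically. The sketch below is not the tabulation used in the paper: it assumes Howe's approximation for the two-sided factor and the Wilson-Hilferty approximation for the chi-square quantile, chosen only because both need nothing beyond the Python standard library. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def chi2_quantile_wh(prob, df):
    """Wilson-Hilferty approximation to the chi-square quantile."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3

def tolerance_factor(n, coverage=0.95, confidence=0.95):
    """Approximate two-sided tolerance factor (Howe's method) for a
    gaussian sample with unknown mean and unknown standard deviation."""
    df = n - 1
    z = NormalDist().inv_cdf((1.0 + coverage) / 2.0)
    chi2 = chi2_quantile_wh(1.0 - confidence, df)
    return z * math.sqrt(df * (1.0 + 1.0 / n) / chi2)

for n in (10, 20, 23, 30, 50, 70, 100):
    print(n, round(tolerance_factor(n), 2))
```

For the sample sizes listed in footnote (5) and for n = 23, this approximation reproduces the tabulated [k.sub.7]-values to within about 0.01; for critical work the published tables should take precedence.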

The present approach of theoretical expected distribution is compared with the approach of Bland and Altman [4] in Fig. 1D by inserting the new lines determined by mean([delta]) [+ or -] 2 s([delta]) estimated from the data points.

The difference between the present concept and the Bland and Altman concept is clear from Fig. 1D. Accordingly, we illustrate the 95% prediction interval before any data points are applied, in contrast to Bland and Altman, who simply illustrate the statistics of the points. In this example the mean values are clearly different, whereas the standard deviations are rather close to each other.
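The contrast between the two sets of lines can be made concrete with the numbers of the simulated example. In this short Python sketch the function names are illustrative; the first interval is fixed by the hypothesis before any data are seen, and the second is computed from the data in the manner of Bland and Altman [4].

```python
def hypothesis_interval(sigma_d, z=1.96):
    """Prediction interval fixed in advance: 0 +/- z * sigma(delta)."""
    return (-z * sigma_d, z * sigma_d)

def bland_altman_interval(mean_d, s_d, multiplier=2.0):
    """Descriptive limits from the data: mean +/- 2 s of the differences."""
    return (mean_d - multiplier * s_d, mean_d + multiplier * s_d)

# Values from the simulated S-creatinine example (micromol/L)
print(hypothesis_interval(3.15))
print(bland_altman_interval(-0.84, 3.27))
```

The two intervals have similar widths here, but the second is centred on the observed mean difference rather than on 0, which is the displacement visible in Fig. 1D.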

To illustrate the relations to x-y plots, we first add to Fig. 1B 11 points (triangles), as shown in Fig. 2 (top). The specimens producing these points are assumed to contain some "nonspecific" components [in the S-creatinine example, perhaps the specimens are from diabetics, where (e.g.) glucose could result in nonspecific reactions by the Jaffe methods]. In the difference plot they separate clearly from the other points, and the mean difference and standard deviation of differences change to +0.16 and 4.08 [micro]mol/L, respectively. In the x-y plot (Fig. 2, bottom), where the points should be related to the line of identity (y = x), it is difficult to see the difference between the two sets of points. If we turn to calculation of r-values, r decreases from 0.993 to 0.991, the slope of the regression line changes from 1.020 to 1.025, and the intercept changes from -2.54 to -1.17 [micro]mol/L.
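The insensitivity of r to a subgroup of biased samples can be reproduced with a small constructed data set. The numbers below are hypothetical and are not the data of Fig. 2: 23 points with small differences around 0, plus 11 added points that all carry a bias of +8 [micro]mol/L. The helper names are illustrative.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient computed from first principles."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def summarise(ref, diff):
    """Return r of the x-y plot and the SD of the differences."""
    field = [r + d for r, d in zip(ref, diff)]
    return pearson_r(ref, field), statistics.stdev(diff)

# Hypothetical base data: reference values 50..150, differences with mean 0
half = [3, -2, 1, -3, 2, -1, 0, 3, -2, 1, -3]
base_ref = [50 + 100 * i / 22 for i in range(23)]
base_diff = half + [2] + half[::-1]

# Hypothetical "nonspecific" samples: constant positive bias
extra_ref = [55 + 9 * i for i in range(11)]
extra_diff = [8.0] * 11

r_base, sd_base = summarise(base_ref, base_diff)
r_all, sd_all = summarise(base_ref + extra_ref, base_diff + extra_diff)
print(round(r_base, 3), round(sd_base, 2))
print(round(r_all, 3), round(sd_all, 2))
```

In this constructed case r changes by less than 0.01 while the standard deviation of differences nearly doubles, which is the same qualitative behaviour as described for Fig. 2.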

[FIGURE 2 OMITTED]

The information from difference plots and x-y plots (when using the line of identity) is the same, but it is easier to expand the differences in the difference plot and the calculations of variances are simpler, whereas the 45[degrees] angle in the x-y plot makes comparable calculations more difficult. This is also seen from Westgard et al. [2, 3], where the simple calculation of (e.g.) bias is easier to interpret in combination with figures.

ACCEPTANCE LIMITS DEFINED BY GOALS FOR ANALYTICAL QUALITY

A more relevant approach for comparing a field method with a Reference Method is to use the analytical goals (analytical quality specifications) as acceptance limits. These specifications may be related to the clinical use of laboratory data [17, 18] or more generally to the application of common reference intervals [19, 20] and monitoring patients [21, 22]. Two European groups, one under the auspices of EGE-Lab (European Group for the Evaluation of Reagents and Analytical Systems in Laboratory Medicine) [23, 24], and another group under the auspices of European EQA-Organizers (External Quality Assessment Organizers) [25], have given recommendations for analytical quality specifications based on the same biological concepts and with identical criteria for acceptable analytical bias and imprecision, but with a different concept for combining these--the first (EGE-Lab) defining the recommendations for analytical bias and imprecision separately and the latter (EQA-Organizers) combining these two aspects. In both European recommendations, the analytical quality specification for imprecision is [CV.sub.analytical] [less than or equal to]0.5 [CV.sub.within-subject variation] as proposed by Cotlove et al. [21], and that for analytical bias is [absolute value of ([B.sub.analytical])] [less than or equal to]0.25 [CV.sub.total biological variation] [19]. The EGE-Lab concept accepts both a maximum bias and a maximum imprecision simultaneously; the EQA-Organizer concept describes a functional relationship between the two in the form of a maximum allowable combination of imprecision and bias.

The latter is close to the original concept of Gowans et al. [19], defining the acceptable analytical percentage bias for S-creatinine as 2.8% [25] when imprecision is negligible. According to the EGE-Lab concept [23, 24], however, both a bias of 2.8% and an imprecision of 2.2% are acceptable simultaneously. This means that for single determinations 0 [+ or -] (bias + 1.65 x imprecision) is acceptable [26]; i.e., 95% of the single points must lie within the limits of 0 [+ or -] (2.8% + 1.65 x 2.2%) = 0 [+ or -] 6.4%, as illustrated in Fig. 3 (left). This criterion is fulfilled as shown in Fig. 3 (middle). The concept, however, is one-sided, i.e., is valid only in one direction. Therefore, the standard deviation of differences should be judged against the imprecision criterion separately. The standard deviation of the field method is 3.1 [micro]mol/L and the CV = 3.9%, which exceeds the imprecision specification of 2.2%.
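The combined acceptance band for single determinations is simple arithmetic, sketched here in Python. The function names are hypothetical, and expressing each difference as a percentage of the Reference Method value is an assumption made for this illustration.

```python
def combined_limit(bias_pct, cv_pct, z=1.65):
    """Half-width (in %) of the acceptance band 0 +/- (bias + z * CV)."""
    return bias_pct + z * cv_pct

def fraction_within(ref, field, limit_pct):
    """Fraction of points whose percentage difference lies inside the band.

    Assumption: differences are expressed relative to the reference value.
    """
    pct = [100.0 * (f - r) / r for r, f in zip(ref, field)]
    return sum(abs(p) <= limit_pct for p in pct) / len(pct)

# S-creatinine specifications quoted in the text
limit = combined_limit(2.8, 2.2)
print(round(limit, 1))
```

A returned fraction of at least 0.95 from `fraction_within` would correspond to the requirement that 95% of the single points lie within the band; the imprecision still has to be checked separately, as noted above.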

Figure 3 (right) illustrates an example of acceptable analytical imprecision and bias. The CV is 2.1% and the estimated bias (mean difference) is +1.3 [micro]mol/L (95% confidence interval, 0.6-2.0 [micro]mol/L). Note that the mean difference is different from 0 but is acceptable according to the criterion.

For the purpose of method comparison, the value for maximum allowable bias might be expanded because of the uncertainty (confidence interval) of the Reference Method. This cannot be seen from the actual comparison, but because the Reference Method is allowed to have some uncertainty, then this must be allowed also for the field method. We propose using the factor 1.2, in light of a recent concept that requires for Reference Methods a total error of <0.2 times that of the routine method [27]. In the present case, the acceptance limits for bias should thus be 0 [+ or -] 3.36%, as also used in the example below.

TWO EXAMPLES UTILIZING COMPARISON DATA

The data used are from a paper on a candidate reference method for determining S-creatinine, which was used for comparison of four field methods [28]. Data from two of these comparisons are used, but only for concentration values <150 [micro]mol/L, the only region in which there are sufficient data for our presentation.

Practical example 1. Based on the duplicate analyses performed, analytical imprecision is calculated within the interval 50-150 [micro]mol/L for both the Reference Method (HPLC) and the field method, giving [[sigma].sub.R] = 0.568 [micro]mol/L and [[sigma].sub.F] = 0.791 [micro]mol/L. When means of duplicates are used for the comparison, the calculated theoretical [sigma]([delta]) should be divided by [square root of (2)], giving an expected distribution of 95% (the 95% prediction interval) of the points within 0 [+ or -] 2[sigma]([delta]) = 0 [+ or -] 1.4 [micro]mol/L. Further, the expanded allowable bias of 3.36% is used for the analytical quality specifications according to both EGE-Lab and EQA-Organizers concepts.
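The two acceptance figures used in this example can be checked with a few lines of Python. This is a worked calculation only, using the imprecision values and the factor 1.2 quoted in the text; the variable names are illustrative.

```python
import math

sigma_r = 0.568   # Reference Method (HPLC), micromol/L
sigma_f = 0.791   # field method, micromol/L

# Means of duplicates on both methods: divide the combined SD by sqrt(2)
sigma_d = math.sqrt(sigma_f ** 2 + sigma_r ** 2) / math.sqrt(2)
prediction_half_width = 2 * sigma_d          # 0 +/- 2 sigma(delta)

# Allowable bias of 2.8% expanded by the proposed factor 1.2
expanded_bias_pct = 2.8 * 1.2

print(round(prediction_half_width, 1), round(expanded_bias_pct, 2))
```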

[FIGURE 3 OMITTED]

The graphical evaluations are shown in Fig. 4 (top). Here, the calculated distribution is considerably broader than the one assumed from the analytical point of view, whereas the measured mean bias is within the limits of acceptance of the bias because the mean (-1.6) and 95% confidence interval (-2.7 to -0.5 [micro]mol/L) are within the 3.36% allowable bias. Further, the points are distributed within the total EGE-Lab criteria with only one real outlier. The difference between the present concept of 95% prediction interval and the Bland and Altman description of the actual points is clear from the Figure.

At first glance, the field method should be acceptable, with CV = 1% and bias <3.36%, but the distribution of points (Fig. 4, top) reveals a much broader distribution (CV = 5%), which emphasizes an unknown uncertainty. The problem is further underlined by adding single determinations to the measurements in the Figure, illustrating the reproducibility of the individual differences. This uncertainty is close to 5% and may originate not only from vial-to-vial variation but also from aberrant-sample bias, whether from nonspecific reactions or interference in the field method. Thus the difference plot and calculation of the standard deviation of differences are tools to disclose aberrant-sample bias in field methods.

Fuentes-Arderiu and Fraser [30] have proposed that the combined effects of imprecision and interference should be used in the concept of specifications for imprecision, but the problem has not been dealt with in either the EGE-Lab or EQA-Organizer concepts.

Practical example 2. In the other example (Fig. 4, bottom), all points are displaced from the acceptance area, showing a considerable mean bias (~20 [micro]mol/L) and also considerable uncertainty from aberrant-sample bias. The calculated imprecision, based on assays of duplicates, was ~1%, but the difference plot reveals the errors clearly.

[FIGURE 4 OMITTED]

The CLIA criteria are much wider than the European recommendations but, being total error specifications, are easy to apply in the plot. For concentrations <175 [micro]mol/L the acceptable deviation is [+ or -]26.5 [micro]mol/L ([+ or -]0.3 mg/dL); at higher concentrations it is [+ or -]15%. The upper line of the CLIA criterion is shown in Fig. 4 (bottom). Because 7 of the 54 points are outside the line, the method is also unacceptable from the standpoint of proficiency testing.
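Because the CLIA criterion is a total error limit, it translates directly into a limit line and a count of points outside it. The Python sketch below uses the S-creatinine limits quoted in the text; the function names are hypothetical, and treating a concentration of exactly 175 [micro]mol/L under the percentage rule is an assumption.

```python
def clia_limit(conc_umol_l):
    """Allowable total error (micromol/L) for S-creatinine as quoted in the text.

    Assumption: the percentage rule applies from 175 micromol/L upward.
    """
    if conc_umol_l < 175.0:
        return 26.5
    return 0.15 * conc_umol_l

def count_outside(ref, field):
    """Number of points whose absolute difference exceeds the CLIA limit."""
    return sum(abs(f - r) > clia_limit(r) for r, f in zip(ref, field))
```

Applied to the paired results of a comparison, `count_outside(ref, field)` gives the kind of tally reported above (7 of 54 points).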

GENERAL DISCUSSION

In the current discussion of difference plot vs x-y plots and the application of regression lines or functions, evaluations of data (vis-a-vis best relationship and comparison of a field method with a Reference Method) have often been mixed, as pointed out recently by Stockl [10]. Stockl emphasizes that graphical presentation of data (x-y plot or difference plot) should not be mixed with statistical interpretation of data.

As long as the purpose is to find the best functional relationship between two methods in order to correct one with the other, then an x-y plot and calculation of the regression line with an estimation of the scatter via [s.sub.y|x] (the standard deviation of the residuals about the regression line) may be most relevant, with visual inspection of the scatter and a residual plot. The present task, however, is not to find a functional relationship between two methods, but to judge a field method in relation to a Reference Method.

When a field method is compared with a Reference Method for acceptability according to certain criteria, whether analytical or biological, then the visual inspection of all data is essential. Whether one uses a simple x-y plot or a difference plot is not critical, as long as the area of interest is expanded and the single points are assessed according to the hypothesis of identity between the two methods. The hypothesis is that the measured values are identical (and not that they are unrelated, which is the basic hypothesis for correlation studies), which means that the hypothesis is described in an x-y plot as the line of identity (y = x) and in the difference plot as the line y = 0. When a ratio plot is more appropriate than a difference plot, e.g., when analytical CV is close to being constant, the same evaluations can be performed.

The advantages of difference plots (and ratio plots) are that they keep the hypothesis of identity in view and that the difference ordinate is easy to expand according to the investigator's purpose. The power of the graphical illustration in Figs. 1A and 3 (left) lies in its simplicity and the clear definition of the hypotheses.

For the experienced interpreter, most situations can be evaluated by visual inspection of the plots, whether a difference plot or an x-y plot. If more objective criteria are wanted, calculations of mean difference with confidence intervals are a powerful tool, as is a table of [k.sub.7]-values (5) for estimation of tolerance intervals.

Most important, however, is the inspection of the distribution of the difference points, especially when samples expected to have matrix effects are marked, and calculation of the standard deviation of differences. When this exceeds the estimated analytical imprecision, it is an indication of aberrant-sample bias. In principle, the r-value from correlation between x and y reflects this, but in practice the r-value is insensitive, as illustrated in Fig. 2. Further, calculation of the standard deviation of differences gives an estimate of aberrant-sample bias compared with the theoretical imprecision. The same information can be obtained from the [s.sub.y|x] estimate from regression analysis.
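One way to put a number on this comparison is to subtract the theoretical variance from the observed variance of the differences. The paper does not give this formula; the sketch below is an assumption for illustration only, valid if aberrant-sample bias and analytical imprecision are independent so that their variances add. The function name is hypothetical.

```python
import math

def excess_sd(s_observed, sigma_expected):
    """SD component not explained by analytical imprecision (0 if none).

    Assumption: independent components, so variances are additive.
    """
    excess_var = s_observed ** 2 - sigma_expected ** 2
    return math.sqrt(excess_var) if excess_var > 0 else 0.0

# Practical example 1: observed spread about 5%, analytical CV about 1%
print(round(excess_sd(5.0, 1.0), 1))
```

With the figures of practical example 1, nearly all of the observed spread remains after subtraction, consistent with the statement there that the unexplained uncertainty is close to 5%.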

The plotting of single determinations can give information about imprecision and aberrant-sample bias as well and, if needed, a functional curve can be calculated and drawn. In this context we mention that replicate measurements always should be performed in this type of comparison, and that specimens should be stored for evaluation of possible outliers.

Krouwer and Monti [29] presented a graphical method for evaluation of laboratory assays (a mountain plot). They computed the percentile for each ranked difference between the two methods, and by "turning" at the 50th percentile produced a histogram-like function (the mountain). This method is relevant for detecting large infrequent errors (differences) but lacks the aspect of concentration relationship. These investigators, therefore, recommend use of their plot together with difference plots. Introduction of analytical quality specifications in the mountain plots may be useful in method evaluations.
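The folding step of the mountain plot can be sketched as follows. This is a minimal illustration, not the authors' implementation [29]: the percentile is taken here as rank/n, which is one of several possible conventions, and the function name is hypothetical.

```python
def mountain_plot_points(differences):
    """Return (difference, folded percentile) pairs for a mountain plot.

    Percentiles above 50 are "turned" to 100 - percentile, so the curve
    peaks near the median difference and the tails show large errors.
    """
    ranked = sorted(differences)
    n = len(ranked)
    points = []
    for rank, d in enumerate(ranked, start=1):
        pct = 100.0 * rank / n          # assumed percentile convention
        folded = pct if pct <= 50.0 else 100.0 - pct
        points.append((d, folded))
    return points
```

Plotting the folded percentile against the difference gives the mountain; vertical lines at the analytical quality specifications could then be added in the same way as in the difference plot.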

Bland and Altman have pointed to the presentation of difference plots, where they recommend mean values of both methods on the abscissa [4]; however, the risk of regression towards the mean is negligible in studies where field methods are compared with Reference Methods, because the results from the Reference Method (used on the abscissa) are assumed to have negligible error. They further calculate and present the standard deviation of the measured data, which is relevant information but not related to the hypothesis of identity.

This form of graphical testing of the hypothesis of identity has been used for biological data [8], where the standard deviation of measured differences was compared with the analytical imprecision. A more stringent method of testing measured values from field methods against target values has been applied for plasma proteins in The Nordic Protein Project [9]. Here, the target values of serum pools were assigned from the European Community Bureau of Certified Reference Material, CRM 470 [31], by the method recommended by IFCC [32], and were used for the abscissa in a plot with acceptance lines according to acceptable bias [9] and with the measurements illustrated by the mean difference with a 90% confidence interval.

Another graphical method of prediction of single measurements of clinical data has been published for serial measurements of International Normalized Ratios of prothrombin times [33]. Here, the differences between consecutive measurements in patients were compared with the expected variation estimated from patients under steady-state conditions. The abscissa was used for the latest result, resulting in a regression towards the mean, which was considered acceptable for the purpose (i.e., not to investigate correlations). The usefulness of the nomogram was improved by adding vertical lines for the therapeutic interval.

The goals used for acceptance are relevant for the use of common reference intervals [19, 20] and have been recommended by two European groups [23-25], with different consequences for the acceptance. This problem is related to the phenomenon of aberrant-sample bias (matrix effects) and has not been fully clarified by the proposed goal from either EGE-Lab or EQA-Organizers; Fuentes-Arderiu and Fraser [30], however, have proposed that the aberrant-sample bias be included in the precision goal. Other analytical goals have been postulated [17, 18, 34] and may be relevant for other evaluations. Goals based on biological data, however, are general in nature (related to common reference intervals and monitoring of patients) and are not restricted to a specific clinical application.

The CLIA criteria for S-creatinine are [+ or -]0.3 mg/dL (3 mg/L, or 26.5 [micro]mol/L) or [+ or -]15% for concentrations >175 [micro]mol/L [28]. These are criteria for total error, but they are very wide compared with the European recommendations. The CLIA criteria are intended for use with proficiency testing (and therefore need to be wide), whereas the European recommendations are so-called educational criteria, which relate directly to the desirable performance criteria for optimum monitoring of patients and the sharing of common reference intervals within geographical areas with populations that are homogeneous for the quantity of analyte. The criteria proposed by Ehrmeyer et al. [35] for minimum intralaboratory performance characteristics to pass CLIA--CV <33% and bias <20%, proficiency testing criteria--can be applied as well as the European criteria, but for the latter the total error will be determining. The CLIA criteria may be applied even better in difference plots, given the total error concept.

The validity of the analytical conclusions of this type of evaluation relies on the Reference Method chosen. It must be correct and specific and so forth, with negligible imprecision; otherwise, the conclusions of the comparison will be weakened according to any possible flaws of the Reference Method.

In conclusion, we find the visual inspection of plots to be essential for method comparison. When a field method is compared with a Reference Method, the hypothesis of identity within analytical imprecision or within stated analytical quality specifications should be applied. Both x-y plots and difference plots are useful, but we find the difference plot is easier to handle and interpret and facilitates the calculations of uncertainty.

Received December 10, 1996; revision accepted June 17, 1997.

References

[1.] Westgard JO, de Vos DJ, Hunt M, Quam EG, Carey RN, Garber CC. Method evaluation. American Society for Medical Technology, Houston, 1978.

[2.] Westgard JO, Hunt MR. Use and interpretation of common statistical tests in method-comparison studies. Clin Chem 1973;19:49-57.

[3.] Westgard JO, Carey RN, Wold S. Criteria for judging precision and accuracy in method development and evaluation. Clin Chem 1974;20:825-33.

[4.] Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307-10.

[5.] Hollis S. Analysis of method comparison studies [Guest Editorial]. Ann Clin Biochem 1996;33:1-4.

[6.] Pollock MA, Jefferson SG, Kane JW, Lomax K, MacKinnin G, Winnard CB. Method comparison--a different approach. Ann Clin Biochem 1992;29:556-60.

[7.] Svendsen A, Holmskov U, Hyltoft Petersen P, Jensenius JC. Difference and ratio plots: simple tools for improved presentation and interpretation of data. Unexpected possibilities for the use of conglutinin binding assay in inflammatory rheumatic diseases. J Immunol Methods 1995;178:211-8.

[8.] Hyltoft Petersen P, Felding P, Horder M, Tryding N. Effect of posture on concentrations of serum proteins in healthy adults. Dependency of the molecular size of proteins. Scand J Clin Lab Invest 1980;40:623-8.

[9.] Hyltoft Petersen P, Blaabjerg O, Irjala K, Icen A, Bjoro K. The Nordic Protein Project. In: Hyltoft Petersen P, Blaabjerg O, Irjala K, eds. Assessing quality in measurements of plasma proteins. The Nordic Protein Project and related projects. Helsinki, Finland: NORDKEM (Nordic Clinical Chemistry Project), 1994:87-116.

[10.] Stockl D. Beyond the myths of difference plots [Letter]. Ann Clin Biochem 1996;33:575-7.

[11.] NCCLS. Method comparison and bias estimation using patient samples; approved guidelines. Document EP9-A (ISBN 1-56238-283-7). Wayne, PA: NCCLS, 1995.

[12.] Houbouyan L, Boutiere B, Contant G, Dautzenberg MD, Fievet P, Potron G, et al. Validation protocol of analytical hemostasis systems: measurement of anti-Xa activity of low-molecular-weight heparins. Clin Chem 1996;42:1223-30.

[13.] Dybkaer R. Vocabulary for describing the metrological quality of a measurement procedure. Upsala J Med Sci 1993;98:445-86.

[14.] International Organization for Standardization. Terms and definitions used in connection with reference materials. ISO Guide 30, 2nd ed. Geneva: ISO, 1992.

[15.] Bliss CI. Statistics in biology. New York: McGraw-Hill, 1967:558 pp.

[16.] Documenta Geigy. Mathematics and statistics. Basle, Switzerland: Ciba-Geigy, 1975.

[17.] Horder M, ed. Assessing quality requirements in clinical chemistry. Scand J Clin Lab Invest 1980;40(Suppl 155):144pp.

[18.] de Verdier C-H, ed. Medical need for quality specifications in laboratory medicine. Upsala J Med Sci 1990;95:161-309.

[19.] Gowans EMS, Hyltoft Petersen P, Blaabjerg O, Horder M. Analytical goals for the acceptance of common reference intervals for laboratories throughout a geographical area. Scand J Clin Lab Invest 1988;48:757-64.

[20.] Hyltoft Petersen P, Gowans EMS, Blaabjerg O, Horder M. Analytical goals for the estimation of non-Gaussian reference intervals. Scand J Clin Lab Invest 1989;49:727-37.

[21.] Cotlove E, Harris EK, Williams GZ. Biological and analytical components of variation in long-term studies of serum constituents in normal subjects. III. Physiological and medical implications. Clin Chem 1970;16:1028-32.

[22.] Elevitch R, ed. College of American Pathologists Conference Report. Conference on analytical goals in clinical chemistry, at Aspen, Colorado. Skokie, IL: CAP, 1976:129pp.

[23.] Fraser CG, Hyltoft Petersen P, Ricos C, Haeckel R. Proposed quality specifications for the imprecision and inaccuracy of analytical systems for clinical chemistry. Eur J Clin Chem Clin Biochem 1992;30:311-7.

[24.] Fraser CG, Hyltoft Petersen P, Ricos C, Haeckel R. Quality specifications. In: Haeckel R, ed. Evaluation methods in laboratory medicine. Weinheim: VCH, 1992:87-99.

[25.] Stockl D, Baadenhuijsen H, Fraser CG, Libeer J-C, Hyltoft Petersen P, Ricos C. Desirable routine analytical goals for quantities assayed in serum. Discussion paper from the members of the external quality assessment (EQA) working group A on analytical goals in laboratory medicine. Eur J Clin Chem Clin Biochem 1995;33:157-69.

[26.] Fraser CG, Hyltoft Petersen P. Quality goals in external quality assessment are best based on biology. Scand J Clin Lab Invest 1993;53(Suppl 212):8-9.

[27.] Stockl D, Reinauer H. Development of criteria for the evaluation of reference method values. Scand J Clin Lab Invest 1993;53(Suppl 212):16-8.

[28.] Thienpont LM, van Landuyt KG, Stockl D, de Leenheer AP. Candidate reference method for determining serum creatinine by isocratic HPLC: validation with isotope dilution gas chromatography-mass spectrometry and application for accuracy assessment of routine test kits. Clin Chem 1995;41:995-1003.

[29.] Krouwer JS, Monti KL. A simple, graphical method to evaluate laboratory assays. Eur J Clin Chem Clin Biochem 1995;33:525-7.

[30.] Fuentes-Arderiu X, Fraser CG. Analytical goals for interference. Ann Clin Biochem 1991;28:383-5.

[31.] Whicher JT, Ritchie RF, Johnson AM, Baudner S, Bienvenu J, Blirup-Jensen S, et al. New international reference preparation for proteins in human serum (RPPHS). Clin Chem 1994;40:934-8.

[32.] Blirup-Jensen S, Just Svendsen P. A new reference preparation for proteins in human serum. Upsala J Med Sci 1994;99:251-8.

[33.] Flensted Lassen J, Kjeldsen J, Antonsen S, Hyltoft Petersen P, Brandslund I. Interpretation of serial measurements of International Normalized Ratio (INR) in monitoring oral anticoagulant treatment. A graphical combination of therapeutic interval and reference change. Clin Chem 1995;41:1171-6.

[34.] US Department of Health and Human Services. Medicare, Medicaid, and CLIA programs; regulations implementing the Clinical Laboratory Improvement Amendments of 1988 (CLIA). Final rule. Fed Regist 1992;57:7002-186.

[35.] Ehrmeyer SS, Laessig RH, Leinweber JE, Oryall JJ. 1990 Medicare/CLIA final rules for proficiency testing: minimum intralaboratory performance characteristics (CV and bias) needed to pass. Clin Chem 1990;36:1736-40.

PER HYLTOFT PETERSEN, (1) * DIETMAR STOCKL, (2) OLE BLAABJERG, (1) BENT PEDERSEN, (1) ERLING BIRKEMOSE, (1) LINDA THIENPONT, (2) JENS FLENSTED LASSEN (3) and JENS KJELDSEN (4)

Departments of (1) Clinical Chemistry and (4) Medical Gastroenterology, Odense University Hospital, DK-5000 Odense C, Denmark. (2) Laboratorium voor Analytische Chemie, Faculteit Farmaceutische Wetenschappen, Universiteit Gent, Harelbekestraat 72, B-9000 Gent, Belgium. (3) Department of Clinical Chemistry, Vejle Sygehus, DK-7100 Vejle, Denmark.

(5) Some k_7-values for 95% tolerance factors for the 95% interval: 3.38 for n = 10; 2.75 for n = 20; 2.55 for n = 30; 2.38 for n = 50; 2.30 for n = 70; and 2.23 for n = 100.

* Author for correspondence. Fax +45 65 41 19 11; e-mail phy@imbmed.ou.dk.

Title Annotation: Opinion

Author: Petersen, Per Hyltoft; Stockl, Dietmar; Blaabjerg, Ole; Pedersen, Bent; Birkemose, Erling; Thienpont

Publication: Clinical Chemistry

Date: Nov 1, 1997
