# Quality control in the new environment: statistics.

Of all the statistical tools at their disposal, laboratorians are most comfortable with the theory and application of the standard deviation. Yet it is often misapplied, giving rise to erroneous or misleading conclusions. This article will discuss some common misapplications of the SD in routine lab practice, especially in daily quality assurance programs.

Gaussian distribution required. The standard deviation is a measure of the degree of dispersion (or error) about a mean value. Limits of 1 SD on each side of the mean encompass about 68 per cent of all possible values; with limits of 2 SD, about 95 per cent of the possible values are encompassed. However, these approximations hold only for populations that have a "normal" or Gaussian distribution. Population distributions that deviate significantly from Gaussian do not relate to the SD in the same way.
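The 68 and 95 per cent figures follow directly from the Gaussian curve. As a minimal sketch (the function name `gaussian_coverage` is illustrative, not from the article), the fraction of a truly Gaussian population lying within k SD of the mean can be computed from the error function:

```python
import math

def gaussian_coverage(k):
    """Fraction of a Gaussian population lying within k SD of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    # Prints roughly 68.3%, 95.4%, and 99.7% for k = 1, 2, 3.
    print(f"within {k} SD of the mean: {gaussian_coverage(k):.1%}")
```

These percentages apply only when the population really is Gaussian; for any other distribution the same multiples of the SD enclose different fractions.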

A Gaussian distribution occurs in a population of values randomly dispersed about a point of central tendency. Visually, this translates into the familiar symmetric bell-shaped curve having an identical mean, median, and mode and conforming to a specific mathematical expression. In the clinical laboratory, the point of central tendency is commonly the mean concentration of an analyte in solution as determined by repeat testing. The random dispersion is the random error of an analytic procedure.

Although many populations dealt with in the laboratory are distributed normally, this is not always the case. Assuming normality can lead to incorrect, even bizarre conclusions. For example, treating several normally distributed populations as a single population may result in a distribution with several individual peaks and variable degrees of overlap (Figure I). The SD calculated from such combined data could be devoid of its usual meaning. That is, 2 SD from the calculated combined mean would probably encompass more than 95 per cent of all possible values in this situation (Figure II).
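The effect of pooling can be sketched numerically. The example below assumes six hypothetical methods with equal numbers of results, the same within-method SD, and invented mean values; none of these numbers come from the article. It computes the SD of the pooled data and the fraction of all results that fall within 2 pooled SDs of the pooled mean:

```python
import math
from statistics import mean, pvariance

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def pooled_limits(method_means, within_sd):
    """Mean +/- 2 SD limits when equal-sized Gaussian populations are pooled."""
    grand_mean = mean(method_means)
    # Pooled variance = within-method variance + variance between method means.
    pooled_sd = math.sqrt(within_sd ** 2 + pvariance(method_means))
    return grand_mean - 2 * pooled_sd, grand_mean + 2 * pooled_sd

def coverage(method_means, within_sd, lo, hi):
    """Fraction of all pooled results lying between lo and hi."""
    parts = [phi((hi - m) / within_sd) - phi((lo - m) / within_sd)
             for m in method_means]
    return mean(parts)

# Hypothetical method means for one control material, within-method SD of 1.0.
method_means = [90.0, 94.0, 98.0, 102.0, 106.0, 110.0]
within_sd = 1.0

lo, hi = pooled_limits(method_means, within_sd)
print(f"pooled 2 SD limits: {lo:.1f} to {hi:.1f}")
print(f"fraction of results inside: {coverage(method_means, within_sd, lo, hi):.4f}")
```

Under these assumptions the pooled limits are several times wider than any single method's own 2 SD limits, and they enclose well over 95 per cent of the results.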

Another way of combining data incorrectly is to lump together control results generated by several technologists performing a test that is highly operator-dependent. Here, each technologist's results represent an independent population. Differential WBC counts, red cell morphology, reticulocyte counts, and coagulation tests by the tilt tube method fall into this category.

In these cases, operators must be evaluated individually to determine whether their performance is adequate relative to that of their peers. For example, a quality control plan for reticulocyte counts might include periodically having all operators count the same specimen. A review of only the aggregate results may overlook individual operators with significant biases--their data are diluted by the entire pool of results. On the other hand, tracking each individual's record over several such control challenges should reveal any significant interoperator biases.

Establishing a "normal" or reference range is another area where one may draw erroneous conclusions by incorrectly assuming a population distribution is Gaussian. The upper and lower limits of such ranges are customarily derived from 2 SD above and below the mean value of a representative sample of the particular population. White blood cell counts, for example, have a positively skewed distribution--i.e., deviations from the mean value are more extreme above the mean than below. Consequently, a reference value study for WBC counts done on 15 to 20 normal individuals, taking 2 SD on either side of the mean, may yield a 95 per cent confidence range that is patently absurd, such as a lower normal limit of 500 or even a negative value.
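A small worked example shows how this happens. The 15 WBC counts below are invented for illustration (cells per microliter, deliberately skewed toward high values) and are not data from the article:

```python
from statistics import mean, median, stdev

# Hypothetical WBC counts (cells/uL) from 15 individuals, positively skewed.
wbc = [4200, 4500, 4800, 5000, 5200, 5500, 5800, 6000,
       6300, 6800, 7500, 8500, 10000, 12500, 16000]

m = mean(wbc)
s = stdev(wbc)            # sample SD (n - 1 in the denominator)
lower = m - 2 * s
upper = m + 2 * s

print(f"mean = {m:.0f}, median = {median(wbc):.0f}, SD = {s:.0f}")
print(f"mean - 2 SD = {lower:.0f}, mean + 2 SD = {upper:.0f}")
print(f"lowest observed count = {min(wbc)}")
```

With these made-up values the computed lower limit comes out at roughly 600, far below the lowest count actually observed, because the few high values inflate the SD symmetrically on both sides of the mean.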

Establishing operational limits. Conventionally, operational limits are set at levels that indicate a defined probability of some change in the analytic system. Control values reaching these limits should trigger rejection of an analytic run or at least require some response.

When several methodologies are lumped together as one, there is a risk that the derived confidence limits may be meaningless. In Figure I, each narrow peak represents the population distribution of the results of a specific instrument-reagent combination for a given control material. The broad bell-shaped curve represents the distribution obtained if all the individuals of all six populations are treated as a single normally distributed population. The extremities of this curve range well beyond those of the others.

Although a variety of statistical rules can be used to define them [1], all limits should be based on data accumulated from routine analytic runs by the individual laboratory. Limits that come from other sources--quality assurance programs or vendors, for example--are not as relevant to the performance of the individual lab. In fact, these limits are often inappropriately wide because they have been constructed to include the divergent results of several methods. Such artificially wide control limits may convey a false sense of QC security (Figure II).
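As a rough sketch of the difference, the example below derives mean ± 2 SD limits from a laboratory's own control results and compares them with wider limits from an outside source. All values, including the external limits, are hypothetical, and ten runs are used only to keep the example short:

```python
from statistics import mean, stdev

def control_limits(values, k=2.0):
    """Return (lower, upper) limits at mean +/- k SD of the lab's own results."""
    m = mean(values)
    s = stdev(values)
    return m - k * s, m + k * s

def out_of_limits(value, limits):
    """True when a control value falls outside the given (lower, upper) limits."""
    lower, upper = limits
    return value < lower or value > upper

# Hypothetical control results from the lab's own routine runs.
lab_runs = [99.0, 101.0, 100.0, 102.0, 98.0, 100.0, 101.0, 99.0, 100.0, 100.0]
own_limits = control_limits(lab_runs)

# Hypothetical limits supplied by an outside source for the same material.
external_limits = (90.0, 110.0)

new_value = 104.0
print("own limits:", tuple(round(x, 2) for x in own_limits))
print("flagged by own limits:", out_of_limits(new_value, own_limits))
print("flagged by external limits:", out_of_limits(new_value, external_limits))
```

Here the new control value is flagged by the laboratory's own limits but passes the wider external limits, which is the false sense of security described above.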

Ensuring clinical relevance. In statistical analysis, we attempt to make generalizations and draw inferences about specific characteristics of a population by mathematically analyzing a sample. To be considered adequate, the sample must be representative of the population in question. In the same vein, laboratory quality control programs are designed to reflect what is occurring in the population of patient specimens during testing. Since this information is obtained by observing serial values of a control material, the medical relevance of our QC analysis rests on the assumption that the control material is in all ways similar to patient specimens.

The two most important functions of a QC product are its ability to detect system malfunctions that can affect patient specimens and its ability to quantitate that portion of test result variability due to random error in the analysis, as opposed to, say, pre-test variables or physiologic variation. In other words, the control material must adequately reflect what is happening to patient specimens. For that reason, materials selected must be, as much as possible, like actual patient specimens in composition.

A control sensitive to minor instrument malfunctions that have no impact on patient values is likely to cause a good deal of unnecessary troubleshooting. Conversely, an insensitive control material may allow significant changes in patient results to go unnoticed.

Many regional quality assurance programs perform extensive field testing on candidate controls to make sure the materials they use are appropriately sensitive. In addition, if a control specimen is to adequately reflect variability in results due to random analytic error, it must be handled in every way possible like an actual patient specimen. The control must be randomly placed in a test run just as patient specimens are--not always at the beginning or end or after a calibration step.

The logic used in drawing conclusions. When drawing conclusions from statistical analyses, we first accept several assumptions. We customarily assume that control preparations reflect what occurs with patient specimens. Another assumption is that of the "null hypothesis," upon which we build the logic for our conclusions [2].

The null hypothesis states that there is no difference between our observed data and the population in question. To test our assumption that the hypothesis is correct, we calculate the probability level at which our observed data fall in relation to the population. If this level indicates it is sufficiently improbable that the observed data are part of the population in question, we may conclude the data do not belong to the population. Thus our assumption is false, and we reject the null hypothesis. Conventionally, a probability level greater than 0.95 (a significance level of less than 0.05) results in rejection.

Clearly understanding this concept permits us to know our limitations. This approach allows us to draw only the conclusion that our data are different or have changed relative to the population of interest. It does not provide a logical basis to conclude that our data are the same or unchanged.

If results fall outside the mean ± 2 SD on a Shewhart (Levey-Jennings) chart, we conventionally conclude (at about the 95 per cent confidence level) that a real change has occurred in our testing procedure. On the other hand, if a result falls within these operational limits, we cannot automatically infer stability. We can conclude only that we have not convincingly demonstrated occurrence of a change. For example, if a test method experiences a sudden shift of exactly 2 SD, the probability that the next control value will fall out of conventional operational limits is only 50 per cent; the other 50 per cent of the time, we may falsely conclude that there has been no change.
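The 50 per cent figure can be checked with a short calculation. This sketch assumes Gaussian random error whose SD is unchanged by the shift; the function names are illustrative:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def p_outside(shift_sd, k=2.0):
    """Probability that one control value falls outside mean +/- k SD
    after the method's mean has shifted by shift_sd standard deviations."""
    inside = phi(k - shift_sd) - phi(-k - shift_sd)
    return 1.0 - inside

for shift in (0.0, 1.0, 2.0, 3.0):
    print(f"shift of {shift:.0f} SD: P(out of limits) = {p_outside(shift):.3f}")
```

With no shift the probability is about 0.05 (the false-alarm rate); with a 2 SD shift it is about 0.50, so a single in-range control gives little assurance that nothing has changed.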

This point must be kept in mind constantly: We cannot depend on control values alone to tell us that a real change has not occurred. We must be attuned to other signs of system malfunction and not ignore them when control results fall within acceptable limits. If we suspect a problem and wish to increase our certainty of detecting a real change, we must look at a greater number of controls [3]. We can either place more controls into a run or closely observe several control values from consecutive runs for adverse trends. Another option is to narrow operational limits to less than the mean ± 2 SD.

There is a trade-off, however, since these actions increase the probability that a control value may fall out of range when, in fact, no analytic error has occurred. It should also be clear that a change in results may have occurred well before the analytic run in which the control first falls out of range. Thus a real change may have gone undetected for several runs.
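The trade-off can be made concrete with a simple model. The sketch below assumes Gaussian error and statistically independent control values, and rejects a run if any one of n controls falls outside mean ± k SD; this single rule is a simplification chosen for illustration, not a rule set prescribed by the article:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def p_any_outside(shift_sd, k=2.0, n=1):
    """Probability that at least one of n independent controls falls outside
    mean +/- k SD when the method's mean has shifted by shift_sd SDs."""
    p_inside = phi(k - shift_sd) - phi(-k - shift_sd)
    return 1.0 - p_inside ** n

print(" k    n   false rejection   detects 2 SD shift")
for k, n in [(2.0, 1), (2.0, 2), (2.0, 4), (1.5, 1)]:
    false_rej = p_any_outside(0.0, k, n)   # no real change has occurred
    detect = p_any_outside(2.0, k, n)      # real 2 SD shift is present
    print(f"{k:.1f}  {n:>3}   {false_rej:15.3f}   {detect:18.3f}")
```

Adding controls or narrowing the limits raises the chance of catching a real shift, but the false rejection rate rises along with it.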

Making peer group comparisons. The ability to develop very large lots of stable and homogeneous control materials was a major breakthrough in clinical lab quality assurance. It provided the means to compare the performance of laboratories participating in a quality assurance program and to evaluate instruments, reagents, and methodologies.

When making such comparisons, we naturally depend on some form of statistical analysis to give us a perspective on the data submitted by laboratories. The standard deviation interval (SDI) is a popular tool for comparing the monthly mean values of the various participants and is usually supplied to each subscriber as part of the QA program.

The concept of the SDI is straightforward. Peer laboratories are grouped on the basis of common methodologies. Each month, the overall mean value is calculated from the individual mean values of each lab in the group. The interlaboratory SD of the individual means is also computed. The SDI of each laboratory is the deviation of its monthly mean value from the group's overall mean in terms of the interlaboratory SD; it shows how many interlaboratory SDs the laboratory's mean value lies from the group mean. SDIs that lump all peer groups together may also be provided for additional information.
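The calculation just described can be sketched as follows. The laboratory names and monthly means are hypothetical, and the use of the sample SD (n - 1 denominator) for the interlaboratory SD is an assumption, since the article does not specify it:

```python
from statistics import mean, stdev

def sdi_by_lab(monthly_means):
    """Return each lab's SDI: (lab mean - group mean) / interlaboratory SD."""
    values = list(monthly_means.values())
    group_mean = mean(values)
    interlab_sd = stdev(values)   # SD of the labs' monthly means
    return {lab: (m - group_mean) / interlab_sd
            for lab, m in monthly_means.items()}

# Hypothetical monthly mean values for five labs in one peer group.
peer_group = {"A": 98.0, "B": 100.0, "C": 102.0, "D": 100.0, "E": 100.0}

for lab, sdi in sdi_by_lab(peer_group).items():
    print(f"lab {lab}: SDI = {sdi:+.2f}")
```

A lab whose monthly mean equals the group mean has an SDI of zero; positive and negative values show the direction of its deviation from the peer group.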

Since the SDI is quantitated in SDs, the distribution of individual results must be close to normal in order to convey the appropriate conceptual meaning. To accomplish this, quality assurance programs must make sure their peer groups consist of laboratories with very similar, if not identical, methodologies.

Some programs repeatedly ignore this important issue and mix many disparate methodologies in a single peer group to enhance the database. Unfortunately, the SDI may then lose its ability to help laboratory directors pinpoint biases relevant to their labs and specific methodologies.

Observing SDIs for several months or years is an excellent way to monitor quality control over the long term; the interlaboratory mean of the peer group serves as an external standard for monthly comparisons with the lab's means. The SDIs for each month may be plotted on a Shewhart type of chart or on a Youden plot (2 SDI values are needed) to look for long-term shifts away from the peer group, which is assumed to be a constant. In many situations, these observations may make it unnecessary to conduct time-consuming normal value studies, which have much the same purpose.

The peer group's SD is an interlab measure and should not be confused with the intralaboratory SD for each individual participant. It cannot be used as the basis for operational limits in daily QC; as we already emphasized, daily operational limits should be based on the lab's performance, not set from control manufacturers' directions or peer group reports.

In addition to interlaboratory SDs, the better quality assurance programs periodically provide the current range of intralaboratory SDs or the participating laboratories' coefficients of variation. Against this guideline, the individual lab can judge its precision. There are also published guidelines listing the acceptable CVs for commonly performed chemical analytes [4].

Many laboratorians believe they can obtain sufficient information about their lab's performance relative to that of its peers from the data reported by quarterly proficiency surveys. Seeing no reason to participate in a second group program, they use day-to-day control material as an internal control only, not as part of a group.

There are, however, important differences between the peer comparisons of quality assurance programs and those of surveys. In quality assurance programs, we compare mean values of each individual lab, computed from several (25 or more) individual determinations. In surveys, comparisons are made from a single analysis. By comparing means instead of single values, the quality assurance programs provide statistically more powerful analyses than surveys. They are more sensitive to detection of real differences, both between peer groups and between individuals within a peer group.

Surveys are really just spot checks designed more to ensure a laboratory's compliance with minimal standards than as a tool to monitor and troubleshoot analytic methods. Survey results may fall outside acceptable limits (usually an SDI of 2) because of poor accuracy or precision, a sporadic outlier, improper specimen handling, or clerical errors. An out-of-limits SDI on a quality assurance program is a more reliable and sensitive indicator of a problem with accuracy, while precision problems are detected by comparing a laboratory's SD or CV with that of the peer group.

Confusing the SD of individual measurements with the SD of mean values is a common misapplication among laboratorians. The SD of a mean--usually called the standard error of the mean--is calculated from the following equation:

$$ SE(\bar{x}) = \frac{SD(x_i)}{\sqrt{n}} $$

Here, $SE(\bar{x})$ is the standard error of a population of mean values, each of which is calculated from $n$ individuals of a population $X$. $SD(x_i)$ is the SD of the individual members of the population $X$. As can be readily seen, the greater the number of values that go into making up a mean, the smaller the SD of those means. Smaller, real differences between mean values are therefore more readily detectable than differences between individual measurements.
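A brief numerical illustration, using the standard relationship that the standard error equals the SD of individual values divided by the square root of n. The SD of 2.0 is a hypothetical value chosen only to make the arithmetic easy to follow:

```python
import math

def standard_error(sd_individual, n):
    """Standard error of a mean computed from n individual values."""
    return sd_individual / math.sqrt(n)

sd_individual = 2.0   # hypothetical SD of single control measurements
for n in (1, 4, 25, 100):
    print(f"n = {n:>3}: standard error of the mean = "
          f"{standard_error(sd_individual, n):.2f}")
```

With 25 determinations per monthly mean, the scatter of the means is one-fifth that of single results, which is why comparisons of means can detect much smaller real differences.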

The standard deviation is by far the most useful and widely used statistical tool in the clinical lab. To employ it properly, however, one must be fully aware of the conditions under which its use is valid, the assumptions necessary to obtain meaningful results, and its inherent limitations.

1. Westgard, J.O.; Groth, T.; Aronsson, T.; et al. Performance characteristics of rules for internal quality control: Probabilities for false rejection and error detection. Clin. Chem. 23: 1857-1867, 1977.

2. Colton, T. "Statistics in Medicine," 1st ed., pp. 115-120. Boston, Little, Brown, 1974.

3. Arkin, C.F. Quality control: What are our goals? How much is necessary? Pathologist 8: 19-25, 1985.

4. Ross, J.W.; Fraser, M.D.; and Moore, T.D. Analytical clinical laboratory precision--State of the art for thirty-one analytes. Am. J. Clin. Pathol. 74: 521-530, 1980.

Figure II: All values pass when limits are too broad

When sources other than the individual laboratory establish operational or action limits (dotted lines), these limits may bear no relationship to the lab's actual performance. An "overly optimistic" Shewhart chart may result, as illustrated here.

Title Annotation: part 4

Author: Arkin, Charles F.

Publication: Medical Laboratory Observer

Date: Dec 1, 1986
