# Screening cervical smears.

In September 1987 Liverpool Health Authority revealed that their
former pathologist, who had been employed as Locum Consultant from March
1983 to December 1985, but was by that time retired, had been discovered
to have issued a large number of false negatives during her period in
office. In all, 911 smears which she had inspected had been reviewed and
it had been found necessary to recall some 487 women either for further
tests or for treatment. The treatments included 133 biopsies, 20 laser
treatments, and four hysterectomies (The Times, 24 September 1987, p.
3). An internal investigation concluded: 'The incorrect diagnoses
made during the period point to a massive error of professional judgment
for which there is no logical explanation' (Fitzpatrick, Utting,
Kenyon & Burston, 1987, p. 3).

This paper sets out a plausible explanation of how those cervical smears came to be misclassified. Some doubt must remain because other explanations of a more obvious and reprehensible kind are logically possible, and the control observations which one would record in a well-designed experiment, and which could confirm the present thesis, are not available. Whatever the facts of this particular failure of diagnosis, a psychological analysis of the task facing a cytopathologist engaged in the classification of cervical smears, and of the procedures at present adopted, indicates that this episode was an accident waiting to happen.

Cervical cancer

More than 2000 women in Britain die each year from cancer of the cervix (Sharp et al., 1987, p. 5). In principle, these deaths are avoidable because this form of cancer is easy to detect in its early stages. Women between the ages of 20 and 64 are advised to have a smear taken every five (ideally three) years. A sample of superficial cells is removed from the cervix by the patient's GP, fixed on a slide, and sent for laboratory examination. In the laboratory the cells are stained, and examined under a microscope. Healthy cells are characterized by a small nucleus and abundant cytoplasm, while abnormal cells generally show the contrary aspect, often being nearly all nucleus. There are, moreover, a number of different pathological processes which might occur, not all of them cancerous; they are described by Evans et al. (1986).

Viewed side by side, the difference between healthy and abnormal cervical cells is easy to see. But, in practice, diagnosis may be difficult. In the first place, abnormal cells, when they are found, typically comprise small groups amongst a much larger number of healthy cells; there are frequently more than 100 000 cells in a cervical smear and a small number of abnormal cells are easy to miss. In the second place, the cellular abnormalities may vary in degree from 'mild' to 'severe'. How abnormal does a cell have to be and what proportion of cells have to be abnormal before some medical action is justified? It will not do to 'err on the safe side', not only because that creates additional and unnecessary work in the form of repeated smears, but also because it causes unnecessary anxiety to some women with no real likelihood of developing cervical cancer.

Inspection by a qualified cytopathologist is necessary. However, it takes five to seven minutes to check and report on a healthy smear, and about 10 minutes for a doubtful one. So, in order to economize on the use of scarce professional time, the following procedure is adopted:

The smears are first examined by a Medical Laboratory Scientific Officer (MSLO) or a cytoscreener. These scientific officers have been trained to pick out any possible abnormalities (which are marked with a dot), but they do not make management diagnoses. Diagnosis is the responsibility of the cytopathologist to whom any doubtful smear is passed for further examination. At the time when the Locum Consultant in Liverpool made her misdiagnoses, about 98 per cent of smears were classified as entirely normal, and the procedure described enabled cytopathologists' attention to be concentrated on that 2 per cent where some further action might be required. Typically 1 per cent of the total laboratory intake gets classified as 'positive', leading usually to a referral for a biopsy, though mild abnormalities would be followed merely by a request for a further smear in three or six months' time. About 0.1 per cent of smears examined lead to new diagnoses of cancer (Committee on Gynaecological Cytology, 1987).

The cytopathologist, then, examines a succession of slides, selected by the MSLOs as possibly requiring some action, and varying in degree of abnormality. She gets no feedback on the accuracy of her diagnoses (except for reports on biopsies received some time after the smear was initially examined). It is to be expected that the screening of cervical smears will exhibit the vagaries of judgement that characterize inspection tasks generally.

Between March 1983 and December 1985 the cervical cytology service at the Women's Hospital in Liverpool processed some 45 000 smears. About 2 per cent of that 45 000 would have been referred to the consultant pathologist for diagnosis. So the 911 smears reviewed by her successor in post must be all or nearly all of the smears personally examined by the (by then retired) pathologist. That means that during her period of office the unfortunate pathologist must have passed as 'negative' nearly every smear submitted to her. That circumstance points to an experiment by Howarth & Bulmer (1956) as a model for the pathologist's failure of judgement.

Detecting flashes of light

The subject in Howarth & Bulmer's experiment watched a dimly lit screen. Every 4 1/2 seconds there was a faint 1/2-second flash of light, accompanied by a bell. The subject was instructed to press a Morse key every time he saw a flash. In an initial series of trials the intensity of the flash was adjusted until the subject was reporting about half of them. Catch trials were introduced and the subject warned if he issued too many false positives. There followed, in the main series of trials, 600 further flashes at the unchanging 50 per cent intensity, with no further catch trials and no feedback to the subject of any kind.

This experiment was intended to study fluctuations in the probability of detection. It has been known since Fernberger (1920) that successive judgements of stimuli are not independent. But the nature of the interaction shows up most clearly when a subject is induced to make judgements about each of a series of identical stimuli (as in the quantal experiment; Neisser, 1957; Stevens, Morgan & Volkmann, 1941), eliminating any contribution from varying stimulus magnitude. Although the pattern of fluctuation varied somewhat from one subject to another, detections were typically bunched together, as were misses, much more so than in a purely random sequence of responses. Successive judgements were positively correlated, and the probability of detection varied locally by as much as ±0.2. Even more significant, of the 10 subjects run, the authors reported that

The experiment was stopped on two occasions after the probability of responding had dropped almost to zero (Howarth & Bulmer, 1956, p. 164).

Those two subjects provide a model for the failure of the pathologist in Liverpool.

Two explanations

Howarth & Bulmer (1956) examined two generic explanations for the fluctuations in the probability of detection:

(a) Spontaneous fluctuations in sensitivity

Suppose the subject's sensitivity to the flashes fluctuates spontaneously. If $p_n$ denotes the probability of detection on trial $n$, the probability on trial $n + 1$ can be modelled as

$$p_{n+1} = \alpha p_n + (1 - \alpha) x_n, \tag{1}$$

where $0 < \alpha < 1$, and the $x_n$ ($0 \le x_n \le 1$) are a succession of independent and identically distributed random variables. A large $x_n$ (the $x_n$ are unobservable) would raise the value of $p$, and the linear character of Equation (1) would preserve that increase for several trials thereafter.

(b) Assimilation to the previous response

Let $R_n$ be the response on trial $n$, taking the value 1 if the flash is reported and 0 otherwise. Assimilation to the previous response can be modelled as

$$p_{n+1} = \alpha p_n + (1 - \alpha) R_n, \tag{2}$$

simply writing $R_n$ for $x_n$ in Equation (1).

Notwithstanding the formal similarity between Equations (1) and (2), these two models have quite different mathematical properties, as we shall see (see also Laming, 1974). The reason is set out in Fig. 1. The structure of each model is indicated there by solid arrows; what is observed experimentally is the correlation between successive responses, shown by broken arrows. Response assimilation provides the more direct and therefore the stronger linkage. If both processes are operative, response assimilation will dominate.

Response assimilation

A second experiment by Howarth & Bulmer (1956) showed response assimilation to be operative. They repeated the experiment described above with just one difference: blocks of three blank trials (no light flash, but all auditory cues remaining in place) were occasionally inserted into the main series.

(1) After a sufficient number of trials the expectation of $p_n$, $E\{p_n\}$, in Model (a) tends to $E\{x_n\}$; this follows from taking expectations in Equation (1),

$$E\{p_{n+1}\} = \alpha E\{p_n\} + (1 - \alpha) E\{x_n\}, \tag{3}$$

and putting $E\{p_{n+1}\} = E\{p_n\}$. Thereafter, $p_n$ varies randomly about this mean in a stable dynamic equilibrium. Putting $E\{x_n\} = \mu$ and subtracting from each side of Equation (1),

$$(p_{n+1} - \mu) = \alpha (p_n - \mu) + (1 - \alpha)(x_n - \mu),$$

whence

$$\mathrm{Var}(p_{n+1}) = \alpha^2 \, \mathrm{Var}(p_n) + (1 - \alpha)^2 \, \mathrm{Var}(x_n). \tag{4}$$

So, again after a sufficient number of trials,

$$\mathrm{Var}(p_{n+1}) \to \frac{1 - \alpha}{1 + \alpha} \, \mathrm{Var}(x_n), \tag{5}$$

at which point the process is stationary and would have the same statistical properties if run backwards in time. As a consequence, the probability of detection on the trial immediately following a block of three blanks [P('Yes' after BBB) = 0.348] ought to be the same as on the trial immediately before [P('Yes' before BBB) = 0.537]. But this latter probability is much closer to the unconditional probability of detection [P('Yes') = 0.516]. It is clear that, contrary to Model (a), a block of three blank trials depresses the probability of detection.

(2) Model (b) is driven by the responses. 'No' is the same response, whether resulting from a missed flash or forced by a blank trial. According to this model the effect of three forced 'No's [P('Yes' after BBB) = 0.348] should be comparable to the effect of three missed flashes [P('Yes' after NNN) = 0.301] and much less than the unconditional probability of detection (0.516). Indeed, it is.

(3) Unlike Model (a), the expected probability of detection in Model (b) does not evolve. Taking expectations in Equation (2),

$$E\{p_{n+1}\} = \alpha p_n + (1 - \alpha) E\{R_n\} = p_n; \tag{6}$$

so $E\{p_{n+1}\}$ does not vary from $E\{p_n\}$ or, indeed, from its initial value $p_1$. But the variance of $p_n$ about its mean increases continually, trial by trial. Rewriting Equation (2),

$$(p_{n+1} - p_1) = (p_n - p_1) + (1 - \alpha)(R_n - p_n),$$

whence

$$\mathrm{Var}(p_{n+1} - p_1) = \mathrm{Var}(p_n - p_1) + (1 - \alpha)^2 \, \mathrm{Var}(R_n - p_n). \tag{7}$$

A limit is reached only when $\mathrm{Var}(R_n - p_n) = 0$, that is, when $p_n$ is permanently either 0 or 1. The process (2) is a martingale and 'runs away'. Its behaviour matches that of the two of Howarth & Bulmer's subjects who 'dried up' or of the unfortunate pathologist in Liverpool.
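The contrast between the two models can be checked numerically. The following is a minimal sketch, not Howarth & Bulmer's analysis: it simply iterates Equations (1) and (2), and the values of $\alpha$ (0.8 and 0.9), the uniform distribution assumed for $x_n$, the run lengths, and the function names are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_model_a(alpha, n_trials, rng, burn_in=500):
    """Equation (1): p[n+1] = alpha*p[n] + (1-alpha)*x[n].

    x[n] is drawn from Uniform(0, 1); that distribution is an assumption
    made for illustration, not something specified in the text.
    """
    p = 0.5
    trace = []
    for n in range(burn_in + n_trials):
        x = rng.random()
        p = alpha * p + (1 - alpha) * x
        if n >= burn_in:          # discard the approach to equilibrium
            trace.append(p)
    return trace


def simulate_model_b(alpha, p1, n_trials, rng):
    """Equation (2): p[n+1] = alpha*p[n] + (1-alpha)*R[n], R[n] ~ Bernoulli(p[n])."""
    p = p1
    for _ in range(n_trials):
        r = 1 if rng.random() < p else 0
        p = alpha * p + (1 - alpha) * r
    return p


rng = random.Random(1)

# Model (a): compare the simulated variance with Equation (5).
alpha_a = 0.8
trace_a = simulate_model_a(alpha_a, 200_000, rng)
mean_a = statistics.fmean(trace_a)
var_a = statistics.pvariance(trace_a)
var_a_predicted = (1 - alpha_a) / (1 + alpha_a) * (1 / 12)  # Var(Uniform(0,1)) = 1/12

# Model (b): many independent runs, all starting at p1 = 0.5.
finals_b = [simulate_model_b(0.9, 0.5, 2000, rng) for _ in range(500)]
stuck_at_zero = sum(p < 0.01 for p in finals_b)
stuck_at_one = sum(p > 0.99 for p in finals_b)
mean_b = statistics.fmean(finals_b)

print(f"Model (a): mean {mean_a:.3f}, variance {var_a:.5f} (Equation 5: {var_a_predicted:.5f})")
print(f"Model (b): {stuck_at_zero} runs near 0, {stuck_at_one} runs near 1, mean {mean_b:.3f}")
```

Under these assumptions Model (a) should settle to a variance close to that given by Equation (5), whereas in Model (b) the mean across runs stays near $p_1$ while the individual runs end close to 0 or to 1.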

Prior expectations

It should be noted that Equation (2) always 'runs away' in this manner, while most subjects and most cytopathologists do not. The problem is how the performance of the other eight subjects in Howarth & Bulmer's experiment might be accommodated.

That experiment made no sense unless the subject was expected, somehow, to continue to detect some, but not all, of the flashes. In an auditory replication (Treisman & Williams, 1984, p. 79) subjects 'were told they should expect to make a detection on about one half of the trials'. And in an earlier experiment, judging whether two stimuli were, or were not, simultaneous, Senders & Sowards (1952, p. 373) reported that 'The Ss tended to make the proportions of "yes" and "no" responses correspond to the proportions they had been told to expect'. Moreover, the two of Howarth & Bulmer's subjects who 'dried up' on their first run completed the experiment satisfactorily on a second occasion. If a subject has the idea that he is meant to continue responding occasionally throughout the session, he will do so.

A prior expectation can be written into Equation (2) as follows:

$$p_{n+1} = \alpha p_n + (1 - \alpha)\left[\beta R_n + (1 - \beta)\pi\right], \tag{8}$$

where $\pi$ is the proportion of trials on which the subject expects to detect a signal. Taking expectations in Equation (8),

$$E\{p_{n+1}\} = \alpha E\{p_n\} + (1 - \alpha)\left[\beta E\{p_n\} + (1 - \beta)\pi\right], \tag{9}$$

so that $E\{p_{n+1}\}$ tends to $\pi$. The prior expectation acts as an attractor.
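The convergence can be made explicit by a short rearrangement of Equation (9); the symbol $\gamma$ is introduced here only for convenience and does not appear in the original. Subtracting $\pi$ from each side and collecting terms,

```latex
E\{p_{n+1}\} - \pi = \gamma \left( E\{p_n\} - \pi \right),
\qquad \gamma = \alpha + (1 - \alpha)\beta .
```

For $0 < \alpha < 1$ and $0 \le \beta < 1$ this gives $0 < \gamma < 1$, so the deviation of $E\{p_n\}$ from $\pi$ shrinks geometrically, by the factor $\gamma$ on each trial.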

If $\beta$ is equal to 1 in Equation (8), that equation reduces to Equation (2). But if $\beta$ is less than 1, the asymptotic variance of $p_n$ has a limit which is less than the maximum possible, $\pi(1 - \pi)$, and the process no longer runs away (Laming, 1974). At the same time, $p_n$ does have an asymptotic variance ($> 0$) and the variability of the responses is greater than one would expect from a strictly binomial process with constant $p_n$.
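A small simulation illustrates how the prior expectation in Equation (8) arrests the runaway. This is a sketch under assumed parameter values ($\alpha = 0.9$, $\beta = 0.8$, $\pi = 0.5$) and a hypothetical function name; none of these values is taken from the experiments cited.

```python
import random
import statistics


def simulate_with_prior(alpha, beta, pi, p1, n_trials, rng):
    """Equation (8): p[n+1] = alpha*p[n] + (1-alpha)*(beta*R[n] + (1-beta)*pi)."""
    p = p1
    for _ in range(n_trials):
        r = 1 if rng.random() < p else 0      # response on this trial
        p = alpha * p + (1 - alpha) * (beta * r + (1 - beta) * pi)
    return p


rng = random.Random(3)
alpha, beta, pi = 0.9, 0.8, 0.5               # illustrative values only

# Start every run well below pi to show the pull of the attractor.
finals = [simulate_with_prior(alpha, beta, pi, 0.1, 1000, rng) for _ in range(500)]

mean_final = statistics.fmean(finals)
var_final = statistics.pvariance(finals)
lowest, highest = min(finals), max(finals)

print(f"mean {mean_final:.3f}, variance {var_final:.4f}, range {lowest:.3f} to {highest:.3f}")
```

With $\beta < 1$ every $p_n$ remains inside the interval from $(1 - \beta)\pi$ to $\beta + (1 - \beta)\pi$, so no run can lock on to 0 or 1; the spread of the final values nevertheless remains well above zero.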

The systematic failure of diagnosis in Liverpool came to light when the MSLOs expressed their anxieties to the consultant pathologist's successor. He re-examined all of his predecessor's work and discovered 487 new 'positives'. Some of those positive diagnoses were confirmed by outside experts. But one might nevertheless ask whether this re-examination of the former pathologist's work was itself without bias.

Nationwide, in the three years 1983 to 1985, the National Health Service examined 10 514 000 smears of which 90 919 were classified as 'positive' (Committee on Gynaecological Cytology, 1987). If a constant criterion of assessment equal to that average had been applied consistently during the re-examination in Liverpool, the expected number of positive diagnoses from the original 45 000 smears would have been 389. The likelihood of finding as many as 487 positives is about $10^{-6}$. A similar calculation on the number of biopsies ordered (133) gives a likelihood of less than $10^{-7}$; nationwide, there were 18 787 biopsies during the period in question, and on that basis one would expect only 80 from the 45 000 smears examined in Liverpool. On the other hand, the proportion of smears classified as positive in some degree varied from 0.57 per cent in one administrative region of the NHS to 1.24 per cent in another (Committee on Gynaecological Cytology, 1987) and the variation between individual cytopathologists would have been even greater. The proportion (1.08 per cent) of the 45 000 smears classified as positive by the unfortunate pathologist's successor falls well within these limits. What this analysis shows is that the diagnostic criteria are variable, as between one cytopathologist and another, and this is to be expected on the basis of Model (b).
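The expected counts and tail probabilities quoted above can be reproduced approximately as follows. The text does not say which distribution was used; treating each Liverpool count as a Poisson variable with the nationwide rate is an assumption made here, and the function names are illustrative.

```python
import math

NATIONAL_SMEARS = 10_514_000
NATIONAL_POSITIVES = 90_919
NATIONAL_BIOPSIES = 18_787
LIVERPOOL_SMEARS = 45_000


def expected_count(national_events):
    """Expected Liverpool count if the nationwide rate had applied."""
    return LIVERPOOL_SMEARS * national_events / NATIONAL_SMEARS


def poisson_upper_tail(k, lam, n_terms=2000):
    """P(X >= k) for X ~ Poisson(lam), summing terms computed in log space."""
    log_lam = math.log(lam)
    return sum(
        math.exp(i * log_lam - lam - math.lgamma(i + 1))
        for i in range(k, k + n_terms)
    )


expected_positives = expected_count(NATIONAL_POSITIVES)
expected_biopsies = expected_count(NATIONAL_BIOPSIES)

p_positives = poisson_upper_tail(487, expected_positives)
p_biopsies = poisson_upper_tail(133, expected_biopsies)
liverpool_rate_percent = 100 * 487 / LIVERPOOL_SMEARS

print(f"expected positives {expected_positives:.0f}, P(>= 487) = {p_positives:.1e}")
print(f"expected biopsies  {expected_biopsies:.0f}, P(>= 133) = {p_biopsies:.1e}")
print(f"Liverpool re-examination rate {liverpool_rate_percent:.2f} per cent")
```

A different choice of distribution (binomial, or a normal approximation) would change the tail probabilities somewhat but not the conclusion that the Liverpool counts lie far above the nationwide average.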

Finally, if $\beta$ is less than 0 in Equation (8), negative correlation is predicted between successive responses. Although a positive correlation is the most common experimental finding, Treisman & Williams (1984) observed negative correlation in two of four subjects trying to detect signals in 'noise-only' trials.

Why 'response assimilation'?

I have used a linear model to relate detection performance during blocks of signal-only trials to the screening of cervical smears. Others (e.g. Dorfman & Biderman, 1971; Thomas, 1973; Treisman & Williams, 1984) have analysed analogous experiments in terms of signal detection theory. That more complicated mathematics permits the expression of more detailed intuitions. But the discourse here is limited to what can be inferred from the natural experiment in Liverpool; and the simple linear model, though only approximate, is actually to be preferred because it enables the basic relationships between observables to be exhibited more clearly. It is now time to see how that linear model translates into psychological process.

Elsewhere (Laming, 1984) I have shown that if (a) 'absolute' judgement is actually mediated by a comparison of the present stimulus with its immediate predecessor and (b) that comparison is no better than ordinal, then a wide variety of results from magnitude estimation and absolute identification paradigms become explicable. Especially compelling is the finding that if a stimulus value happens to be repeated on successive trials in a magnitude estimation experiment, the two successive log estimates are highly correlated (about +0.8; Baird, Green & Luce, 1980). In that case about two-thirds of the variability in the second judgement is inherited from its predecessor (since $0.8^2 = 0.64$). That result and others related to it are properties of the person making the judgement, not of the experimental paradigm, and ought to transpose to a cytopathologist screening cervical smears. I therefore suppose that the cytopathologist, notwithstanding her professional training, does not actually have any immutable internal criterion with which to compare each smear and can do no better than compare each with its predecessor. (To be more precise, a professionally qualified cytopathologist is well equipped to identify the different kinds of abnormality; I am suggesting here the absence of any internal criterion for assessing degree.)

Suppose now that one smear, call it $S_n$, is diagnosed as positive. The next smear, $S_{n+1}$, is compared with $S_n$ and, if it is judged to be more abnormal than $S_n$ or about the same, is also diagnosed as positive. This means that the mere fact of $S_n$ being called 'positive' increases the likelihood that $S_{n+1}$ will be classified 'positive' too. That is the idea expressed by Equation (2). The subsequent mathematical argument shows that, in the absence of any other relevant factor, the judgements will 'run away'. They run away because any error in the classification of $S_n$ is passed on to the classification of $S_{n+1}$ and so to all subsequent classifications. Errors accumulate; but this is not at all obvious from a purely verbal expression of the idea.

Suppose, now, that a cytopathologist has learned, possibly by comparing her diagnoses with biopsy reports, that in the long run about half of the smears submitted to her for her personal inspection turn out to be positive. In that case her assessment of $S_{n+1}$ will be moderated by the number of positive reports she has recently issued. If there has been a rather large proportion of positives, maybe she has been overready to say 'Yes' and needs to be more cautious. That idea, of a prior expectation moderating the diagnoses, is expressed in Equation (8). The judgements no longer run away; but, again, that is far from obvious from a merely verbal statement of the revised idea.

Nevertheless, regard for long-run frequencies notwithstanding, the probability of detecting a positive smear will still fluctuate locally, leading to positive correlation between successive diagnoses. Because the classification assigned to one smear is used in the evaluation of the next, that positive correlation has the character of response assimilation.

How to prevent a recurrence of the systematic misdiagnosis in Liverpool

Envisage a library of cervical smears for which the correct diagnosis is known - known from the subsequent medical histories of the women from whom the smears were taken. A suitable proportion of these library smears are inserted at random in the sequence passed to the cytopathologist for her personal examination - perhaps one library smear for every nine to be examined, so that approximately every 10th smear which the cytopathologist sees is one for which the correct diagnosis is already known - but in such a manner that the cytopathologist cannot identify the library smears in advance. The cytopathologist types her assessment of each smear directly into a computer and, when the assessment of a library smear is typed in, the cytopathologist is immediately presented with the correct classification. For these 'library' trials Equation (2) becomes

$$p_{n+1} = \alpha p_n + (1 - \alpha) s_n, \tag{10}$$

where $s_n$ is the correct classification of the smear. It is known from the work of Tanner, Rauk & Atkinson (1970, especially p. 268, Fig. 2) that the knowledge that one has just made an error in a signal-detection experiment effects a large correction in the probabilities of subsequent detections and false positives.
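The effect of Equation (10) can be illustrated by letting every 10th trial be a library trial. This is a sketch only: it assumes that half of the library smears are truly positive, that $\alpha = 0.9$, and that the pathologist has already drifted to a starting probability of 0.02; the function name and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_screening(alpha, p1, n_trials, library_every, true_rate, rng):
    """Mix Equation (2) on ordinary trials with Equation (10) on library trials.

    library_every = 0 disables the library smears altogether.
    """
    p = p1
    for n in range(1, n_trials + 1):
        if library_every and n % library_every == 0:
            # Library smear: the known correct classification s[n] is fed back.
            s = 1 if rng.random() < true_rate else 0
            p = alpha * p + (1 - alpha) * s
        else:
            # Ordinary smear: only the pathologist's own response R[n] is available.
            r = 1 if rng.random() < p else 0
            p = alpha * p + (1 - alpha) * r
    return p


rng = random.Random(4)
common = dict(alpha=0.9, p1=0.02, n_trials=1000, true_rate=0.5, rng=rng)

without_library = [simulate_screening(library_every=0, **common) for _ in range(400)]
with_library = [simulate_screening(library_every=10, **common) for _ in range(400)]

mean_without = statistics.fmean(without_library)
mean_with = statistics.fmean(with_library)

print(f"mean final p without library smears: {mean_without:.3f}")
print(f"mean final p with every 10th smear from the library: {mean_with:.3f}")
```

In this sketch the mean probability of a positive report stays near its depressed starting value without the library smears and returns towards the true proportion with them. Whether real cytopathologists respond to feedback in this linear fashion is a separate, empirical question.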

Such a corrective procedure has already been shown to work in simulated sonar watchkeeping by Wilkinson (1964). Naval ratings were set to listen in 75 dB noise to a series of 1/2 s tones occurring every 3 s. The subjects were required to report the occasional slightly shorter (3/8 s) tone, there being just eight such shorter tones in an hour's watch (1200 tones altogether). The proportion of signals reported fell from 61 per cent in the first 15 minutes of the watch to about 30 per cent thereafter. But including 40 additional signals, identical to the real ones, and giving immediate knowledge of results on those 40 additional signals (though not on the 8 original ones) raised the detection rate on the eight signals to about 90 per cent throughout the watch (Wilkinson, 1964, p. 65, Fig. 1).

Whether immediate knowledge of results on a small proportion of cervical smears would work must await a field trial. Wilkinson (1964) compared a variety of different procedures, and showed that it was important that the additional signals be identical to the real ones (the library smears must not be identifiable in advance of their examination) and that immediate knowledge of results be provided. But these benefits are in prospect: in return for, perhaps, an 11 per cent increase in the number of smears inspected, the criteria of judgement would be automatically aligned with the state of nature (that is, with the actual medical condition of the women from whom the library smears were taken rather than with a merely consensual diagnosis). Any substantial deviation from those criteria would be detected typically within 10 smears. Moreover, when such a deviation is detected, those smears which need to be reassessed - the last 10 or so - are immediately identified.

The question whether the increased reliability of the inspection process is adequate compensation for the increased inspection time is again a matter for field trial. But it might even happen that the increased confidence which will come from immediate knowledge of results on the library smears will enable cytopathologists to inspect generally at a faster rate and so lead to a net saving of time, notwithstanding the additional smears to be inspected.

At present, recommended practice in maintaining accuracy in the screening of cervical smears is for everybody involved, MSLOs as well as cytopathologists, to be tested on 10 selected smears once a year, but with a more detailed and accurate classification expected from the more senior personnel (Hudson et al., 1988). In addition, there is some rechecking of previous diagnoses. A random 10 per cent of negative smears are routinely re-examined, as are previous negative smears when a subsequent one is 'positive'. But this does not address the real problem:

Cancer tests on 20,000 women to be checked

A FULL inquiry was launched in Scotland yesterday when it was disclosed that an unknown number of women were wrongly given an 'all clear' on their smear test for cervical cancer. Now 20,000 smears taken over a five-year period are to be re-examined.

The errors came to light ... when a woman developed cervical cancer although her test had earlier been judged clear (The Independent, 29 April 1993, p. 1).

Acknowledgement

I thank Dr Pauline Cooper, consultant cytopathologist at Addenbrooke's Hospital, Cambridge, for her comments on an earlier version of this paper.

References

Baird, J. C., Green, D. M. & Luce, R. D. (1980). Variability and sequential effects in cross-modality matching of area and loudness. Journal of Experimental Psychology: Human Perception and Performance, 6, 277-289.

Committee on Gynaecological Cytology (1987). Statistics, 1975 to 1985. London: DHSS.

Dorfman, D. D. & Biderman, M. (1971). A learning model for a continuum of sensory states. Journal of Mathematical Psychology, 8, 264-284.

Evans, D. M. D. & six others (1986). Terminology in gynaecological cytopathology: Report of the working party of the British Society for Clinical Cytology. Journal of Clinical Pathology, 39, 933-944.

Fernberger, S. W. (1920). Interdependence of judgments within the series for the method of constant stimuli. Journal of Experimental Psychology, 3, 126-150.

Fitzpatrick, J., Utting, J., Kenyon, E. & Burston, J. (1987). Internal review into the laboratory at the Women's Hospital, Liverpool. Liverpool Health Authority.

Howarth, C. I. & Bulmer, M. G. (1956). Non-random sequences in visual threshold experiments. Quarterly Journal of Experimental Psychology, 8, 163-171.

Hudson, E. & eight others (1988). Protocol for a proficiency testing scheme in gynaecological cytopathology, Department of Health.

The Independent, 29 April 1993.

Laming, D. (1974). The sequential structure of the quantal experiment. Journal of Mathematical Psychology, 11, 453-472.

Laming, D. (1984). The relativity of 'absolute' judgements. British Journal of Mathematical and Statistical Psychology, 37, 152-183.

Neisser, U. (1957). Response-sequences and the hypothesis of the neural quantum. American Journal of Psychology, 70, 512-527.

Senders, V. L. & Sowards, A. (1952). Analysis of response sequences in the setting of a psychophysical experiment. American Journal of Psychology, 65, 358-374.

Sharp, F. & eight others (1987). Report of the Intercollegiate Working Party on Cervical Cytology Screening. The Royal College of Obstetricians and Gynaecologists.

Stevens, S. S., Morgan, C. T. & Volkmann, J. (1941). Theory of the neural quantum in the discrimination of loudness and pitch. American Journal of Psychology, 54, 315-335.

Tanner, T. A., Rauk, J. A. & Atkinson, R. C. (1970). Signal recognition as influenced by information feedback. Journal of Mathematical Psychology, 7, 259-274.

Thomas, E. A. C. (1973). On a class of additive learning models: Error-correcting and probability matching. Journal of Mathematical Psychology, 10, 241-264.

The Times, 24 September 1987.

Treisman, M. & Williams, T. C. (1984). A theory of criterion setting with an application to sequential dependencies. Psychological Review, 91, 68-111.

Wilkinson, R. T. (1964). Artificial 'signals' as an aid to an inspection task. Ergonomics, 7, 63-72.

This paper sets out a plausible explanation how those cervical smears came to be misclassified. Some doubt must remain because other explanations of a more obvious and reprehensible kind are logically possible, and the control observations which one would record in a well-designed experiment, and which could confirm the present thesis, are not available. Whatever the facts of this particular failure of diagnosis, a psychological analysis of the task facing a cytopathologist engaged in the classification of cervical smears, and of the procedures at present adopted, indicates that this episode was an accident waiting to happen.

Cervical cancer

More than 2000 women in Britain die each year from cancer of the cervix (Sharp et al., 1987, p. 5). In principle, these deaths are avoidable because this form of cancer is easy to detect in its early stages. Women between the ages of 20 and 64 are advised to have a smear taken every five (ideally three) years. A sample of superficial cells is removed from the cervix by the patient's GP, fixed on a slide, and sent for laboratory examination. In the laboratory the cells are stained, and examined under a microscope. Healthy cells are characterized by a small nucleus and abundant cytoplasm, while abnormal cells generally show the contrary aspect, often being nearly all nucleus. There are, moreover, a number of different pathological processes which might occur, not all of them cancerous; they are described by Evans et al. (1986).

Viewed side by side, the difference between healthy and abnormal cervical cells is easy to see. But, in practice, diagnosis may be difficult. In the first place, abnormal cells, when they are found, typically comprise small groups amongst a much larger number of healthy cells; there are frequently more than 100 000 cells in a cervical smear and a small number of abnormal cells are easy to miss. In the second place, the cellular abnormalities may vary in degree from 'mild' to 'severe'. How abnormal does a cell have to be and what proportion of cells have to be abnormal before some medical action is justified? It will not do to 'err on the safe side', not only because that creates additional and unnecessary work in the form of repeated smears, but also because it causes unnecessary anxiety to some women with no real likelihood of developing cervical cancer.

Inspection by a qualified cytopathologist is necessary. However, it takes five to seven minutes to check and report on a healthy smear, and about 10 minutes for a doubtful one. So, in order to economize on the use of scarce professional time, the following procedure is adopted:

The smears are first examined by a Medical Laboratory Scientific Officer (MSLO) or a cytoscreener. These scientific officers have been trained to pick out any possible abnormalities (which are marked with a dot), but they do not make management diagnoses. Diagnosis is the responsibility of the cytopathologist to whom any doubtful smear is passed for further examination. As at the time when the Locum Consultant in Liverpool made her misdiagnoses, about 98 per cent of smears were classified as entirely normal, and the procedure described enabled cytopathologists' attention to be concentrated on that 2 per cent where some further action might be required. Typically 1 per cent of the total laboratory intake gets classified as 'positive', leading usually to a referral for a biopsy, though mild abnormalities would be followed merely by a request for a further smear in three or six months' time. About 0.1 per cent of smears examined lead to new diagnoses of cancer (Committee on Gynaecological Cytology, 1987).

The cytopathologist, then, examines a succession of slides, selected by the MSLOs as possibly requiring some action, and varying in degree of abnormality. She gets no feedback on the accuracy of her diagnoses (except for reports on biopsies received some time after the smear was initially examined). It is to be expected that the screening of cervical smears will exhibit the vagaries of judgement that characterize inspection tasks generally.

Between March 1983 and December 1985 the cervical cytology service at the Women's Hospital in Liverpool processed some 45 000 smears. About 2 per cent of that 45 000 would have been referred to the consultant pathologist for diagnosis. So the 911 smears reviewed by her successor in post must be all or nearly all of the smears personally examined by the (by then retired) pathologist. That means that during her period of office the unfortunate pathologist must have passed as 'negative' nearly every smear submitted to her. That circumstance points to an experiment by Howarth & Bulmer (1956) as a model for the pathologist's failure of judgement.

Detecting flashes of light

The subject in Howarth & Bulmer's experiment watched a dimly lit screen. Every 4 1/2 seconds there was a faint 1/2-second flash of light, accompanied by a bell. The subject was instructed to press a Morse key every time he saw a flash. In an initial series of trials the intensity of the flash was adjusted until the subject was reporting about half of them. Catch trials were introduced and the subject warned if he issued too many false positives. There followed, in the main series of trials, 600 further flashes at the unchanging 50 per cent intensity, with no further catch trials and no feedback to the subject of any kind.

This experiment was intended to study fluctuations in the probability of detection. It has been known since Fernberger (1920) that successive judgements of stimuli are not independent. But the nature of the interaction shows up most clearly when a subject is induced to make judgements about each of a series of identical stimuli (as in the quantal experiment; Neisser, 1957; Stevens, Morgan & Volkmann, 1941), eliminating any contribution from varying stimulus magnitude. Although the pattern of fluctuation varied somewhat from one subject to another, detections were typically bunched together, as were misses, much more so than in a purely random sequence of responses. Successive judgements were positively correlated, and the probability of detection varied locally by as much as ±0.2. Even more significant, out of 10 subjects run the authors reported that

The experiment was stopped on two occasions after the probability of responding had dropped almost to zero (Howarth & Bulmer, 1956, p. 164).

Those two subjects provide a model for the failure of the pathologist in Liverpool.

Two explanations

Howarth & Bulmer (1956) examined two generic explanations for the fluctuations in the probability of detection:

(a) Spontaneous fluctuations in sensitivity

Suppose the subject's sensitivity to the flashes fluctuates spontaneously. If $p_n$ denotes the probability of detection on trial $n$, the probability on trial $n + 1$ can be modelled as

$$p_{n+1} = \alpha p_n + (1 - \alpha) x_n, \tag{1}$$

where $0 < \alpha < 1$, and the $x_n$ ($0 \le x_n \le 1$) are a succession of independent and identically distributed random variables. A large $x_n$ (the $x_n$ are unobservable) would raise the value of $p$, and the linear character of Equation (1) would preserve that increase for several trials thereafter.

(b) Assimilation to the previous response

Let $R_n$ be the response on trial $n$, taking the value 1 if the flash is reported and 0 otherwise. Assimilation to the previous response can be modelled as

$$p_{n+1} = \alpha p_n + (1 - \alpha) R_n, \tag{2}$$

simply writing $R_n$ for $x_n$ in Equation (1).

Notwithstanding the formal similarity between Equations (1) and (2), these two models have quite different mathematical properties, as we shall see (see also Laming, 1974). The reason is set out in Fig. 1. The structure of each model is indicated there by solid arrows; what is observed experimentally is the correlation between successive responses, shown by broken arrows. Response assimilation provides the more direct and therefore the stronger linkage. If both processes are operative, response assimilation will dominate.

Response assimilation

A second experiment by Howarth & Bulmer (1956) showed response assimilation to be operative. They repeated the experiment described above with just one difference: blocks of three blank trials (no light flash, but all auditory cues remaining in place) were occasionally inserted into the main series.

(1) After a sufficient number of trials the expectation of $p_n$, $E\{p_n\}$, in Model (a) tends to $E\{x_n\}$; this follows from taking expectations in Equation (1),

$$E\{p_{n+1}\} = \alpha E\{p_n\} + (1 - \alpha) E\{x_n\}, \tag{3}$$

and putting $E\{p_{n+1}\} = E\{p_n\}$. Thereafter, $p_n$ varies randomly about this mean in a stable dynamic equilibrium. Putting $E\{x_n\} = \mu$ and subtracting from each side of Equation (1),

$$(p_{n+1} - \mu) = \alpha (p_n - \mu) + (1 - \alpha)(x_n - \mu),$$

whence

$$\mathrm{Var}(p_{n+1}) = \alpha^2 \, \mathrm{Var}(p_n) + (1 - \alpha)^2 \, \mathrm{Var}(x_n). \tag{4}$$

So, again after a sufficient number of trials,

$$\mathrm{Var}(p_{n+1}) \to \frac{1 - \alpha}{1 + \alpha} \, \mathrm{Var}(x_n), \tag{5}$$

at which point the process is stationary and would have the same statistical properties if run backwards in time. As a consequence, the probability of detection on the trial immediately following a block of three blanks [P('Yes' after BBB) = 0.348] ought to be the same as on the trial immediately before [P('Yes' before BBB) = 0.537]. But this latter probability is much closer to the unconditional probability of detection [P('Yes') = 0.516]. It is clear that, contrary to Model (a), a block of three blank trials depresses the probability of detection.

(2) Model (b) is driven by the responses. 'No' is the same response, whether resulting from a missed flash or forced by a blank trial. According to this model the effect of three forced 'No's [P('Yes' after BBB) = 0.348] should be comparable to the effect of three missed flashes [P('Yes' after NNN) = 0.301] and much less than the unconditional probability of detection (0.516). Indeed, it is.

(3) Unlike Model (a), the expected probability of detection in Model (b) does not evolve. Taking expectations in Equation (2),

$$E\{p_{n+1}\} = \alpha p_n + (1 - \alpha) E\{R_n\} = p_n; \tag{6}$$

so $E\{p_{n+1}\}$ does not vary from $E\{p_n\}$ or, indeed, from its initial value $p_1$. But the variance of $p_n$ about its mean increases continually, trial by trial. Rewriting Equation (2),

$$(p_{n+1} - p_1) = (p_n - p_1) + (1 - \alpha)(R_n - p_n),$$

whence

$$\mathrm{Var}(p_{n+1} - p_1) = \mathrm{Var}(p_n - p_1) + (1 - \alpha)^2 \, \mathrm{Var}(R_n - p_n). \tag{7}$$

A limit is reached only when $\mathrm{Var}(R_n - p_n) = 0$, that is, when $p_n$ is permanently either 0 or 1. The process (2) is a martingale and 'runs away'. Its behaviour matches that of the two of Howarth & Bulmer's subjects who 'dried up' or of the unfortunate pathologist in Liverpool.
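The contrast between the two models can be illustrated by simulation. The sketch below is not taken from Howarth & Bulmer; it assumes $\alpha = 0.9$, a starting probability of 0.5, 600 trials per run (as in their main series) and, for Model (a), $x_n$ drawn uniformly from [0, 1]. All of those choices, and the function names, are assumptions made for illustration.

```python
import random
import statistics


def step_model_a(p, alpha, rng):
    """Equation (1): x_n drawn uniformly from [0, 1] (an assumed distribution)."""
    return alpha * p + (1 - alpha) * rng.random()


def step_model_b(p, alpha, rng):
    """Equation (2): R_n is 1 with probability p_n, otherwise 0."""
    response = 1 if rng.random() < p else 0
    return alpha * p + (1 - alpha) * response


def final_values(step, alpha=0.9, p1=0.5, n_trials=600, n_runs=2000, seed=1):
    """Final p_n from n_runs independent sequences of n_trials updates."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_runs):
        p = p1
        for _ in range(n_trials):
            p = step(p, alpha, rng)
        finals.append(p)
    return finals


def limit_variance_a(alpha, var_x=1 / 12):
    """Right-hand side of Equation (5); 1/12 is the variance of a uniform x_n."""
    return (1 - alpha) / (1 + alpha) * var_x


if __name__ == "__main__":
    for name, step in (("(a)", step_model_a), ("(b)", step_model_b)):
        finals = final_values(step)
        extreme = sum(p < 0.01 or p > 0.99 for p in finals) / len(finals)
        print("Model", name,
              "mean", round(statistics.mean(finals), 3),
              "variance", round(statistics.pvariance(finals), 4),
              "fraction near 0 or 1", round(extreme, 3))
    print("Equation (5) limit for Model (a):", round(limit_variance_a(0.9), 4))
```

Under these assumptions the variance across runs of Model (a) should settle near the limit in Equation (5), while most runs of Model (b) should finish close to 0 or 1 even though the mean across runs stays near the starting value.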

Prior expectations

It should be noted that Equation (2) always 'runs away' in this manner, while most subjects and most cytopathologists do not. The problem is how to accommodate the performance of the other eight subjects in Howarth & Bulmer's experiment.

That experiment made no sense unless the subject was expected, somehow, to continue to detect some, but not all, of the flashes. In an auditory replication (Treisman & Williams, 1984, p. 79) subjects 'were told they should expect to make a detection on about one half of the trials'. And in an earlier experiment, judging whether two simultaneous stimuli were, or were not, simultaneous, Senders & Sowards (1952, p. 373) reported that 'The Ss tended to make the proportions of "yes" and "no" responses correspond to the proportions they had been told to expect'. Moreover, the two of Howarth & Bulmer's subjects who 'dried up' on their first run, completed the experiment satisfactorily on a second occasion. If a subject has the idea that he is meant to continue responding occasionally throughout the session, he will do so.

A prior expectation can be written into Equation (2) as follows:

$$p_{n+1} = \alpha p_n + (1 - \alpha)[\beta R_n + (1 - \beta)\pi], \tag{8}$$

where $\pi$ is the proportion of trials on which the subject expects to detect a signal. Taking expectations in Equation (8),

$$E\{p_{n+1}\} = \alpha E\{p_n\} + (1 - \alpha)[\beta E\{p_n\} + (1 - \beta)\pi], \tag{9}$$

so that $E\{p_{n+1}\}$ tends to $\pi$. The prior expectation acts as an attractor.

If $\beta$ is equal to 1 in Equation (8), that equation reduces to Equation (2). But if $\beta$ is less than 1, the asymptotic variance of $p_n$ has a limit which is less than the maximum possible, $\pi(1 - \pi)$, and the process no longer runs away (Laming, 1974). At the same time, $p_n$ does have an asymptotic variance ($> 0$) and the variability of the responses is greater than one would expect from a strictly binomial process with constant $p_n$.
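The attracting effect of the prior expectation can be checked numerically. The following is a minimal sketch, assuming the illustrative values $\alpha = 0.9$, $\beta = 0.5$ and $\pi = 0.5$; none of these values, nor the function names, come from the sources cited.

```python
import random


def expected_path(alpha, beta, pi, e1, n_trials):
    """Iterate Equation (9) for the expectation E{p_n}, starting from e1."""
    e = e1
    for _ in range(n_trials):
        e = alpha * e + (1 - alpha) * (beta * e + (1 - beta) * pi)
    return e


def simulate_eq8(alpha, beta, pi, p1, n_trials, seed=0):
    """Iterate Equation (8), drawing R_n = 1 with probability p_n."""
    rng = random.Random(seed)
    p, path = p1, []
    for _ in range(n_trials):
        response = 1 if rng.random() < p else 0
        p = alpha * p + (1 - alpha) * (beta * response + (1 - beta) * pi)
        path.append(p)
    return path


if __name__ == "__main__":
    alpha, beta, pi = 0.9, 0.5, 0.5   # illustrative values only
    print("E{p_n} after 600 trials, starting from 0.9:",
          round(expected_path(alpha, beta, pi, 0.9, 600), 6))
    path = simulate_eq8(alpha, beta, pi, 0.5, 600)
    print("range of p_n:", round(min(path), 3), "to", round(max(path), 3))
```

With these values $p_n$ cannot leave the interval from 0.25 to 0.75: an unbroken run of 'No' responses drives it towards $(1 - \beta)\pi$ and an unbroken run of 'Yes' responses towards $\beta + (1 - \beta)\pi$, so the process fluctuates but cannot reach 0 or 1.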

The systematic failure of diagnosis in Liverpool came to light when the MLSOs expressed their anxieties to the consultant pathologist's successor. He re-examined all of his predecessor's work and discovered 487 new 'positives'. Some of those positive diagnoses were confirmed by outside experts. But one might nevertheless ask whether this re-examination of the former pathologist's work was itself without bias.

Nationwide, in the three years 1983 to 1985, the National Health Service examined 10 514 000 smears, of which 90 919 were classified as 'positive' (Committee on Gynaecological Cytology, 1987). If a constant criterion of assessment equal to that average had been applied consistently during the re-examination in Liverpool, the expected number of positive diagnoses from the original 45 000 smears would have been 389. The likelihood of finding as many as 487 positives is about $10^{-6}$. A similar calculation on the number of biopsies ordered (133) gives a likelihood of less than $10^{-7}$; nationwide, there were 18 787 biopsies during the period in question, and on that basis one would expect only 80 from the 45 000 smears examined in Liverpool. On the other hand, the proportion of smears classified as positive in some degree varied from 0.57 per cent in one administrative region of the NHS to 1.24 per cent in another (Committee on Gynaecological Cytology, 1987) and the variation between individual cytopathologists would have been even greater. The proportion (1.08 per cent) of the 45 000 smears classified as positive by the unfortunate pathologist's successor falls well within these limits. What this analysis shows is that the diagnostic criteria are variable, as between one cytopathologist and another, and this is to be expected on the basis of Model (b).
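The expected counts and tail probabilities can be reproduced approximately as follows. The text does not say which distribution was used for the likelihoods; this sketch assumes that each of the 45 000 smears is independently classified at the national rate, so that the count is binomial. The function and constant names are introduced for illustration.

```python
import math

# National figures for 1983 to 1985 and the Liverpool intake, as quoted in the text.
NATIONAL_SMEARS = 10_514_000
NATIONAL_POSITIVES = 90_919
NATIONAL_BIOPSIES = 18_787
LIVERPOOL_SMEARS = 45_000


def expected_count(national_events: int, n: int = LIVERPOOL_SMEARS) -> float:
    """Expected number of events among n smears at the national rate."""
    return n * national_events / NATIONAL_SMEARS


def binomial_upper_tail(n: int, p: float, k_min: int) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p), with each term computed in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for k in range(k_min, n + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                   + k * log_p + (n - k) * log_q)
        total += math.exp(log_pmf)   # negligible terms underflow harmlessly to 0.0
    return total


if __name__ == "__main__":
    for label, national, observed in (("positives", NATIONAL_POSITIVES, 487),
                                      ("biopsies", NATIONAL_BIOPSIES, 133)):
        rate = national / NATIONAL_SMEARS
        tail = binomial_upper_tail(LIVERPOOL_SMEARS, rate, observed)
        print(f"{label}: expected {expected_count(national):.1f}, "
              f"P(at least {observed}) = {tail:.1e}")
```

The binomial assumption is only one way of formalizing 'a constant criterion of assessment'; a Poisson approximation would serve equally well at these rates.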

Finally, if $\beta$ is less than 0 in Equation (8), negative correlation is predicted between successive responses. Although a positive correlation is the most common experimental finding, Treisman & Williams (1984) observed negative correlation in two of four subjects trying to detect signals in 'noise-only' trials.

Why 'response assimilation'?

I have used a linear model to relate detection performance during blocks of signal-only trials to the screening of cervical smears. Others (e.g. Dorfman & Biderman, 1971; Thomas, 1973; Treisman & Williams, 1984) have analysed analogous experiments in terms of signal detection theory. That more complicated mathematics permits the expression of more detailed intuitions. But the discourse here is limited to what can be inferred from the natural experiment in Liverpool; and the simple linear model, though only approximate, is actually to be preferred because it enables the basic relationships between observables to be exhibited more clearly. It is now time to see how that linear model translates into psychological process.

Elsewhere (Laming, 1984) I have shown that if (a) 'absolute' judgement is actually mediated by a comparison of the present stimulus with its immediate predecessor and (b) that comparison is no better than ordinal, then a wide variety of results from magnitude estimation and absolute identification paradigms become explicable. Especially compelling is the finding that if a stimulus value happens to be repeated on successive trials in a magnitude estimation experiment, the two successive log estimates are highly correlated (about +0.8; Baird, Green & Luce, 1980). In that case about two-thirds of the variability in the second judgement is inherited from its predecessor. That result and others related to it are properties of the person making the judgement, not of the experimental paradigm, and ought to transpose to a cytopathologist screening cervical smears. I therefore suppose that the cytopathologist, notwithstanding her professional training, does not actually have any immutable internal criterion with which to compare each smear and can do no better than compare each with its predecessor. (To be more precise, a professionally qualified cytopathologist is well equipped to identify the different kinds of abnormality; I am suggesting here the absence of any internal criterion for assessing degree.)

Suppose now that one smear, call it $S_n$, is diagnosed as positive. The next smear, $S_{n+1}$, is compared with $S_n$ and, if it is judged to be more abnormal than $S_n$ or about the same, is also diagnosed as positive. This means that the mere fact of $S_n$ being called 'positive' increases the likelihood that $S_{n+1}$ will be classified 'positive' too. That is the idea expressed by Equation (2). The subsequent mathematical argument shows that, in the absence of any other relevant factor, the judgements will 'run away'. They run away because any error in the classification of $S_n$ is passed on to the classification of $S_{n+1}$ and so to all subsequent classifications. Errors accumulate; but this is not at all obvious from a purely verbal expression of the idea.

Suppose, now, that a cytopathologist has learned, possibly by comparing her diagnoses with biopsy reports, that in the long run about half of the smears submitted to her for her personal inspection turn out to be positive. In that case her assessment of $S_{n+1}$ will be moderated by the number of positive reports she has recently issued. If there has been a rather large proportion of positives, maybe she has been overready to say 'Yes' and needs to be more cautious. That idea, of a prior expectation moderating the diagnoses, is expressed in Equation (8). The judgements no longer run away; but, again, that is far from obvious from a merely verbal statement of the revised idea.

Nevertheless, regard for long-run frequencies notwithstanding, the probability of detecting a positive smear will still fluctuate locally, leading to positive correlation between successive diagnoses. Because the classification assigned to one smear is used in the evaluation of the next, that positive correlation has the character of response assimilation.

How to prevent a recurrence of the systematic misdiagnosis in Liverpool

Envisage a library of cervical smears for which the correct diagnosis is known - known from the subsequent medical histories of the women from whom the smears were taken. A suitable proportion of these library smears are inserted at random in the sequence passed to the cytopathologist for her personal examination - perhaps one library smear for every nine to be examined, so that approximately every 10th smear which the cytopathologist sees is one for which the correct diagnosis is already known - but in such a manner that the cytopathologist cannot identify the library smears in advance. The cytopathologist types her assessment of each smear directly into a computer and, when the assessment of a library smear is typed in, the cytopathologist is immediately presented with the correct classification. For these 'library' trials Equation (2) becomes

$$p_{n+1} = \alpha p_n + (1 - \alpha) s_n, \tag{10}$$

where $s_n$ is the correct classification of the smear. It is known from the work of Tanner, Rauk & Atkinson (1970, especially p. 268, Fig. 2) that the knowledge that one has just made an error in a signal-detection experiment effects a large correction in the probabilities of subsequent detections and false positives.

Such a corrective procedure has already been shown to work in simulated sonar watchkeeping by Wilkinson (1964). Naval ratings were set to listen in 75 dB noise to a series of 1/2 s tones occurring every 3 s. The subjects were required to report the occasional slightly shorter (3/8 s) tone, there being just eight such shorter tones in an hour's watch (1200 tones altogether). The proportion of signals reported fell from 61 per cent in the first 15 minutes of the watch to about 30 per cent thereafter. But including 40 additional signals, identical to the real ones, and giving immediate knowledge of results on those 40 additional signals (though not on the 8 original ones) raised the detection rate on the eight signals to about 90 per cent throughout the watch (Wilkinson, 1964, p. 65, Fig. 1).

Whether immediate knowledge of results on a small proportion of cervical smears would work must await a field trial. Wilkinson (1964) compared a variety of different procedures, and showed that it was important that the additional signals be identical to the real ones (the library smears must not be identifiable in advance of their examination) and that immediate knowledge of results be provided. But these benefits are in prospect: in return for, perhaps, an 11 per cent increase in the number of smears inspected, the criteria of judgement would be automatically aligned with the state of nature (that is, with the actual medical condition of the women from whom the library smears were taken rather than with a merely consensual diagnosis). Any substantial deviation from those criteria would be detected typically within 10 smears. Moreover, when such a deviation is detected, those smears which need to be reassessed - the last 10 or so - are immediately identified.
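How the library smears might be mixed into the working sequence can be sketched in a few lines. This is an illustration under stated assumptions, not a specification of any existing laboratory system: one library smear is placed at a random position within each block of nine routine smears, and the smear identifiers and function names are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def interleave(routine: List[str], library: List[str], per_block: int = 9,
               seed: Optional[int] = None) -> List[Tuple[str, bool]]:
    """Insert one library smear at a random position in each block of routine smears.

    Returns (smear_id, is_library) pairs. The flag is for the computer only and
    would not be shown to the cytopathologist before she enters her assessment.
    """
    rng = random.Random(seed)
    blocks = [routine[i:i + per_block] for i in range(0, len(routine), per_block)]
    if len(library) < len(blocks):
        raise ValueError("not enough library smears for this batch")
    chosen = rng.sample(library, len(blocks))   # no library smear is used twice
    sequence: List[Tuple[str, bool]] = []
    for block, library_id in zip(blocks, chosen):
        items = [(smear_id, False) for smear_id in block]
        items.insert(rng.randint(0, len(items)), (library_id, True))
        sequence.extend(items)
    return sequence


def workload_increase(per_block: int = 9) -> float:
    """Extra smears inspected, as a fraction of the routine workload."""
    return 1 / per_block


if __name__ == "__main__":
    routine = [f"R{i:03d}" for i in range(18)]     # hypothetical identifiers
    library = [f"LIB{i:02d}" for i in range(5)]
    for smear_id, is_library in interleave(routine, library, seed=42):
        print(smear_id, "(library)" if is_library else "")
    print(f"workload increase: {workload_increase():.1%}")
```

One library smear for every nine routine smears adds one-ninth, about 11 per cent, to the number inspected, and places a smear of known diagnosis within every run of 10, which is what limits any reassessment to the last 10 or so.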

The question whether the increased reliability of the inspection process is adequate compensation for the increased inspection time is again a matter for field trial. But it might even happen that the increased confidence which will come from immediate knowledge of results on the library smears will enable cytopathologists to inspect generally at a faster rate and so lead to a net saving of time, notwithstanding the additional smears to be inspected.

At present, recommended practice in maintaining accuracy in the screening of cervical smears is for everybody involved, MLSOs as well as cytopathologists, to be tested on 10 selected smears once a year, but with a more detailed and accurate classification expected from the more senior personnel (Hudson et al., 1988). In addition, there is some rechecking of previous diagnoses. A random 10 per cent of negative smears are routinely re-examined, as are previous negative smears when a subsequent one is 'positive'. But this does not address the real problem:

Cancer tests on 20,000 women to be checked

A FULL inquiry was launched in Scotland yesterday when it was disclosed that an unknown number of women were wrongly given an 'all clear' on their smear test for cervical cancer. Now 20,000 smears taken over a five-year period are to be re-examined.

The errors came to light ... when a woman developed cervical cancer although her test had earlier been judged clear (The Independent, 29 April 1993, p. 1).

Acknowledgement

I thank Dr Pauline Cooper, consultant cytopathologist at Addenbrooke's Hospital, Cambridge, for her comments on an earlier version of this paper.

References

Baird, J. C., Green, D. M. & Luce, R. D. (1980). Variability and sequential effects in cross-modality matching of area and loudness. Journal of Experimental Psychology: Human Perception and Performance, 6, 277-289.

Committee on Gynaecological Cytology (1987). Statistics, 1975 to 1985. London: DHSS.

Dorfman, D. D. & Biderman, M. (1971). A learning model for a continuum of sensory states. Journal of Mathematical Psychology, 8, 264-284.

Evans, D. M. D. & six others (1986). Terminology in gynaecological cytopathology: Report of the working party of the British Society for Clinical Cytology. Journal of Clinical Pathology, 39, 933-944.

Fernberger, S. W. (1920). Interdependence of judgments within the series for the method of constant stimuli. Journal of Experimental Psychology, 3, 126-150.

Fitzpatrick, J., Utting, J., Kenyon, E. & Burston, J. (1987). Internal review into the laboratory at the Women's Hospital, Liverpool. Liverpool Health Authority.

Howarth, C. I. & Bulmer, M. G. (1956). Non-random sequences in visual threshold experiments. Quarterly Journal of Experimental Psychology, 8, 163-171.

Hudson, E. & eight others (1988). Protocol for a proficiency testing scheme in gynaecological cytopathology, Department of Health.

The Independent, 29 April 1993.

Laming, D. (1974). The sequential structure of the quantal experiment. Journal of Mathematical Psychology, 11, 453-472.

Laming, D. (1984). The relativity of 'absolute' judgements. British Journal of Mathematical and Statistical Psychology, 37, 152-183.

Neisser, U. (1957). Response-sequences and the hypothesis of the neural quantum. American Journal of Psychology, 70, 512-527.

Senders, V. L. & Sowards, A. (1952). Analysis of response sequences in the setting of a psychophysical experiment. American Journal of Psychology, 65, 358-374.

Sharp, F. & eight others (1987). Report of the Intercollegiate Working Party on Cervical Cytology Screening. The Royal College of Obstetricians and Gynaecologists.

Stevens, S. S., Morgan, C. T. & Volkmann, J. (1941). Theory of the neural quantum in the discrimination of loudness and pitch. American Journal of Psychology, 54, 315-335.

Tanner, T. A., Rauk, J. A. & Atkinson, R. C. (1970). Signal recognition as influenced by information feedback. Journal of Mathematical Psychology, 7, 259-274.

Thomas, E. A. C. (1973). On a class of additive learning models: Error-correcting and probability matching. Journal of Mathematical Psychology, 10, 241-264.

The Times, 24 September 1987.

Treisman, M. & Williams, T. C. (1984). A theory of criterion setting with an application to sequential dependencies. Psychological Review, 91, 68-111.

Wilkinson, R. T. (1964). Artificial 'signals' as an aid to an inspection task. Ergonomics, 7, 63-72.


Author: Laming, Donald

Publication: British Journal of Psychology

Date: Nov 1, 1995
