Methodology in diagnostic laboratory test research in Clinical Chemistry and Clinical Chemistry and Laboratory Medicine.

The recently published STARD initiative concerning the publication of studies on diagnostic accuracy is an important step forward in improving the quality of diagnostic investigation. Its relevance is even greater if we consider the many challenges of research on diagnostic tests and that the application of epidemiologic principles to the field of clinical diagnosis has been less developed than in other clinical areas, or perhaps less effective (1, 2).

In 1996, Reid et al. (3) pointed out serious methodologic limitations of research on diagnostic tests published in the most prestigious international scientific clinical journals. More recently, serious methodologic limitations have been brought to light in reports on genetic testing (4, 5).

Clinical Chemistry has been paying special attention to this field of research. Since 1996, Clinical Chemistry has included in its instructions for authors the seven criteria used by Reid et al. (3) in their study of the methodologic quality of studies of diagnostic accuracy published in general medical journals (Annals of Internal Medicine, JAMA, Lancet, and New England Journal of Medicine). In 1997 and 2000, Clinical Chemistry published drafts of a checklist for reporting of studies of diagnostic accuracy (6, 7), and in 2001, a 28-item checklist became part of the Information for Authors. Furthermore, Clinical Chemistry has published methodologic reviews on diagnostic research and evidence-based laboratory medicine (8, 9). In 2003, Clinical Chemistry published the STARD Initiative for reporting studies of diagnostic accuracy (1, 2, 10), a guideline that was also published or promoted in several other journals, including Academic Radiology, American Journal of Clinical Pathology, Annals of Internal Medicine, British Medical Journal, Clinical Biochemistry, Clinical Chemistry and Laboratory Medicine, JAMA, Journal of Clinical Microbiology, Lancet, and Radiology.

Despite outstanding efforts to improve the quality of methodologic aspects of diagnostic investigations, there is a lack of specific information concerning the quality of the investigation of tests carried out in the laboratory. Knowledge of particular weaknesses in this research field could guide progress toward evidence-based laboratory diagnosis.

The objective of our study was to analyze the reporting and methodologic quality of published studies of diagnostic accuracy of laboratory tests. We considered diagnostic studies published in two journals dealing with clinical chemistry that are included in the Journal Citation Reports (Clinical Chemistry and Clinical Chemistry and Laboratory Medicine), and applied the criteria proposed by Reid et al. (3).

Materials and Methods

Clinical Chemistry and Clinical Chemistry and Laboratory Medicine (previously called European Journal of Clinical Chemistry and Clinical Biochemistry) are journals specializing in clinical laboratory studies that, in addition to having a major impact, frequently publish articles on the evaluation of diagnostic tests in the clinical chemistry laboratory. We reviewed the articles on diagnostic tests published in the two journals in 1996, 2001, and 2002. To select the articles, we carried out a search in PubMed using the most accurate strategy described by Deville et al. (11). The strategy combined the MeSH term "sensitivity and specificity" (exploded) with the text words "specificity", "false negative", and "accuracy". To improve sensitivity, we expanded the search to include the MeSH term "area under the curve" and the text words "diagnostic odds ratio" and "likelihood ratios".
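The search strategy described above can be sketched as a query-string builder. This is only an illustration: the text does not give the literal query, so the OR grouping of terms, the field tags chosen, and the journal and year filters are assumptions, and the function name build_search_query is hypothetical.

```python
# Illustrative reconstruction of a PubMed-style query string.
# ASSUMPTIONS (not stated in the article): terms are combined with OR,
# journals and years are applied as AND filters, and these field tags are used.

def build_search_query(journals, years):
    """Return a boolean query string combining topic, journal, and year filters."""
    base_terms = [
        # PubMed explodes MeSH terms by default, matching "(exploded)" in the text.
        '"sensitivity and specificity"[MeSH Terms]',
        'specificity[Text Word]',
        '"false negative"[Text Word]',
        'accuracy[Text Word]',
    ]
    # Terms added by the authors to improve the sensitivity of the search.
    expansion_terms = [
        '"area under the curve"[MeSH Terms]',
        '"diagnostic odds ratio"[Text Word]',
        '"likelihood ratios"[Text Word]',
    ]
    topic = " OR ".join(base_terms + expansion_terms)
    journal_filter = " OR ".join(f'"{name}"[Journal]' for name in journals)
    year_filter = " OR ".join(f"{year}[dp]" for year in years)
    return f"({topic}) AND ({journal_filter}) AND ({year_filter})"


query = build_search_query(
    ["Clinical Chemistry", "Clinical Chemistry and Laboratory Medicine"],
    [1996, 2001, 2002],
)
print(query)
```

Keeping the base and expansion terms in separate lists makes it easy to compare the yield of the original strategy with that of the expanded one.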

Articles were accepted for further review if they fulfilled the inclusion criteria described by Reid et al. (3): humans were tested, the test was intended for clinical use, and indexes of accuracy were provided with both sensitivity and specificity or with the counterpart likelihood ratio. Additionally, we reviewed papers that provided ROC curves. Only original articles showing an abstract in Medline were finally reviewed. Two authors checked the abstracts for eligibility criteria, and in case of doubt the full article was reviewed.
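For readers less familiar with the indexes of accuracy named in the inclusion criteria, the following minimal sketch computes sensitivity, specificity, and the corresponding likelihood ratios from a 2 x 2 table. The counts in the usage example are hypothetical and do not come from any of the reviewed articles.

```python
import math


def accuracy_indexes(tp, fp, fn, tn):
    """Compute standard accuracy indexes from the four cells of a 2 x 2 table.

    tp/fn: diseased subjects with positive/negative test results
    fp/tn: nondiseased subjects with positive/negative test results
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Likelihood ratios are undefined when the denominator is zero;
    # infinity is returned in that case as a simple convention.
    lr_positive = sensitivity / (1 - specificity) if specificity < 1 else math.inf
    lr_negative = (1 - sensitivity) / specificity if specificity > 0 else math.inf
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "lr_positive": lr_positive,
        "lr_negative": lr_negative,
    }


# Hypothetical counts, for illustration only.
example = accuracy_indexes(tp=90, fp=20, fn=10, tn=80)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```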


We applied the seven methodologic criteria recommended by Reid et al. (3), which we reproduce below literally:

(1) Spectrum composition: This standard was met if at least three of the following four descriptors were provided: sex distribution, age distribution, summary of presenting clinical symptoms and/or disease stage, and eligibility criteria for study subjects.

(2) Analysis of pertinent subgroups: This standard was fulfilled when results for indexes of accuracy were cited for any pertinent demographic or clinical subgroup of the investigated population.

(3) Avoidance of workup bias: For cohort studies, this standard was met if all subjects were assigned to receive both diagnostic testing and gold standard verification, either by direct procedure or by suitable clinical follow-up. In case-control studies, credit depends on whether the diagnostic test preceded or followed the gold standard procedure. If the diagnostic test preceded, credit was given if disease verification was obtained for a consecutive series of study subjects regardless of their diagnostic test result. If the diagnostic test followed, credit was given if the results were stratified according to the clinical factor that evoked the gold standard procedure.

(4) Avoidance of review bias: For prospective cohort studies in which patients always receive the diagnostic test first, credit was given if the gold standard procedures were evaluated independently. A statement about independence in interpreting both the test and the gold standard procedure was required for prospective studies in which the gold standard procedure was sometimes done before the diagnostic test and for case-control studies in which the test preceded the gold standard procedure. In case-control studies in which the diagnostic test followed disease verification, a statement was required to indicate an independent evaluation of the diagnostic test.

(5) Precision of results for test accuracy: This standard was met if standard error or confidence intervals, regardless of magnitude, were reported for test sensitivity and specificity or likelihood ratios.

(6) Presentation of indeterminate test results: To meet this standard, a study had to report all of the appropriate positive, negative, and indeterminate results generated during evaluation of the diagnostic test and whether indeterminate results had been included or excluded when indexes of accuracy were calculated.

(7) Test reproducibility: For tests requiring observer interpretation, at least some of the test subjects should have been evaluated for a summary measure of observer variability. For tests performed without observer interpretation, credit was given for a summary measure of instrument variability.
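The seven criteria above lend themselves to a simple scoring scheme, sketched below. The data layout and all identifiers are assumptions made for illustration; only the rule that spectrum composition requires at least three of four descriptors, and the idea of counting the criteria satisfied per article, come from the text.

```python
# Hypothetical encoding of the seven criteria of Reid et al. as a checklist.
SPECTRUM_DESCRIPTORS = frozenset({
    "sex_distribution",
    "age_distribution",
    "symptoms_or_disease_stage",
    "eligibility_criteria",
})

# Criteria 2-7 are recorded as plain yes/no judgments by the reviewer.
OTHER_CRITERIA = (
    "subgroup_analysis",
    "workup_bias_avoided",
    "review_bias_avoided",
    "precision_reported",
    "indeterminate_results_reported",
    "reproducibility_reported",
)


def spectrum_composition_met(descriptors_reported):
    """Criterion 1: at least three of the four descriptors were provided."""
    return len(SPECTRUM_DESCRIPTORS & set(descriptors_reported)) >= 3


def criteria_satisfied(descriptors_reported, judgments):
    """Return the number of criteria (0-7) satisfied by one article."""
    score = int(spectrum_composition_met(descriptors_reported))
    score += sum(bool(judgments.get(name, False)) for name in OTHER_CRITERIA)
    return score


# Hypothetical article: reports age, sex, and eligibility criteria,
# gives confidence intervals and reproducibility data, nothing else.
score = criteria_satisfied(
    {"age_distribution", "sex_distribution", "eligibility_criteria"},
    {"precision_reported": True, "reproducibility_reported": True},
)
print(score)
```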

Each article was evaluated independently by at least two observers. The observer agreement in this phase was 87%. In a second step, all disagreements were evaluated by the third author and resolved by consensus. Data analyses were carried out with Epi Info 2000.
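An agreement figure of this kind can be obtained as a simple proportion of identical judgments. The sketch below assumes that agreement is counted over individual yes/no judgments from two observers; the text does not say whether the unit was the criterion or the article, and the ratings shown are hypothetical.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percentage of paired judgments on which two observers agree."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100 * matches / len(ratings_a)


# Hypothetical judgments (criterion met or not) from two observers.
observer_1 = [True, True, False, True]
observer_2 = [True, False, False, True]
agreement = percent_agreement(observer_1, observer_2)
print(f"{agreement:.0f}%")
```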


Results

Clinical Chemistry published 440 articles in 1996, 436 in 2001, and 417 in 2002. Clinical Chemistry and Laboratory Medicine published 162 articles in 1996, 210 in 2001, and 232 in 2002. Of the 1897 articles published in both journals, PubMed searches identified 460 on test evaluation. Of those 460 articles, only 358 were original reports. Finally, 79 papers fulfilled the eligibility criteria: 52 were from the journal Clinical Chemistry (11 were published in 1996, 17 in 2001, and 24 in 2002) (12-63) and 27 from Clinical Chemistry and Laboratory Medicine (7 published in 1996, 10 in 2001, and 10 in 2002) (64-90).
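The counts in this paragraph can be tallied as follows. The figures are taken from the text; the variable names and the abbreviation CCLM are introduced here only for illustration.

```python
# Articles published per journal and year (from the text).
published = {
    "Clinical Chemistry": {1996: 440, 2001: 436, 2002: 417},
    "CCLM": {1996: 162, 2001: 210, 2002: 232},
}

# Articles that fulfilled the eligibility criteria (from the text).
eligible = {
    "Clinical Chemistry": {1996: 11, 2001: 17, 2002: 24},
    "CCLM": {1996: 7, 2001: 10, 2002: 10},
}

total_published = sum(sum(by_year.values()) for by_year in published.values())
eligible_per_journal = {name: sum(by_year.values()) for name, by_year in eligible.items()}
eligible_per_year = {
    year: sum(by_year[year] for by_year in eligible.values())
    for year in (1996, 2001, 2002)
}
total_eligible = sum(eligible_per_journal.values())

print(total_published, eligible_per_journal, eligible_per_year, total_eligible)
```

The per-year totals correspond to the sample sizes shown in the column headings of Table 1.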

Regarding the diagnostic procedures evaluated in these studies, biochemistry was the field most frequently referred to (55%), followed by immunology (22%), microbiology (8%), hormones (9%), genetics (4%), and hematology (2%). The mean number of patients/samples included in the studies was 331 (range, 10-2971 patients/samples).

The mean number of methodologic criteria satisfied was 2.3 in 1996, 2.7 in 2001, and 3.4 in 2002 (P = 0.047). For Clinical Chemistry, the means were 2.0 in 1996, 2.9 in 2001, and 4.0 in 2002 (P = 0.001). For Clinical Chemistry and Laboratory Medicine, the means were 2.7 in 1996, 2.4 in 2001, and 1.9 in 2002 (P = 0.66).

Fulfillment of individual methodologic criteria ranged from 0% to 83% for studies published in 1996, from 15% to 81% for studies published in 2001, and from 6% to 82% for studies published in 2002 (Table 1). Articles published in 2002 showed better fulfillment than those published in 1996 or 2001 in all but two criteria. Presentation of data for test reproducibility was met by most studies in 1996 (83%); this value was essentially unchanged in 2001 (81%) but fell to 68% in 2002. Forty-four percent of the studies in 1996, 26% in 2001, and 38% in 2002 satisfied the second standard, estimation of accuracy in pertinent subgroups. Among the criteria that appeared to have improved in 2002, reporting of the statistical uncertainty of test indexes rose from 22% in articles published in 1996 to 44% in 2001 and to 65% in 2002. Description of the spectrum composition rose from 22% in articles published in 1996 to 37% in 2001 and 71% in 2002. Unexpectedly, the incidence and handling of indeterminate test results were hardly discussed.

We compared our results in 1996 with those observed by Reid et al. (3) in the subsample of articles published between 1990 and 1993 from a group of relevant clinical journals (Table 1). In spite of the reduced sample sizes, some statistically significant differences were observed. The articles published in 1996 in Clinical Chemistry more frequently met the second (accuracy in subgroups) and seventh standards (reproducibility). The articles studied by Reid et al. (3) did better in the presentation of indeterminate results.
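A comparison of two proportions of this kind can be sketched with a two-sided Fisher exact test. The implementation below is not the Epi Info routine used by the authors: it uses one common two-sided definition (summing the probabilities of all tables no more likely than the observed one), and the Reid et al. sample size of 34 is an assumption inferred from the percentages in Table 1, not a figure stated in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2 x 2 table [[a, b], [c, d]].

    Rows are the two groups; columns are "criterion met" and "not met".
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = probability(a)
    # Sum all tables that are no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(
        p for p in (probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


REID_N = 34  # ASSUMPTION: inferred from Table 1 percentages (e.g., 11/34 = 32%).

# Test reproducibility: 15 of 18 articles (1996) vs 11 of 34 (Reid et al.).
p_reproducibility = fisher_exact_two_sided(15, 18 - 15, 11, REID_N - 11)
# Accuracy in subgroups: 8 of 18 (1996) vs 4 of 34 (Reid et al.).
p_subgroups = fisher_exact_two_sided(8, 18 - 8, 4, REID_N - 4)
# Avoidance of workup bias: 6 of 18 (1996) vs 21 of 34 (Reid et al.).
p_workup = fisher_exact_two_sided(6, 18 - 6, 21, REID_N - 21)

print(p_reproducibility < 0.05, p_subgroups < 0.05, p_workup < 0.05)
```

Under these assumptions, the sketch agrees with the pattern of significance described in the text for the reproducibility and subgroup criteria.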


Discussion

Judging by the articles published in Clinical Chemistry and Clinical Chemistry and Laboratory Medicine, both with a high impact factor, the quality of the methodology used in research on diagnostic laboratory tests is comparable to that in the most important international clinical journals. In certain aspects of the methodology, specifically the reproducibility of test results, the articles published in laboratory journals appear to be of higher quality. On the other hand, there are some methodologic flaws that would be easy to correct and whose correction would substantially improve the clinical applicability of the results of the investigations. On average the quality of the articles published in laboratory journals is high, but very few articles comply with all or almost all of the methodologic standards.

It is difficult to evaluate whether our analysis was more or less strict than that performed by Reid et al. (3). Although the criteria for application of some of the methodologic standards were not clearly specified by Reid et al., we did not observe any great interobserver variation in the results of the evaluation, and this gives us a certain degree of confidence in our findings.

One of the main discrepancies with the results obtained by Reid et al. (3) is the frequency with which the authors described the reproducibility of the results of the diagnostic test they were evaluating. Our results show that the laboratory studies report this characteristic much more frequently [83% in 1996, 81% in 2001, and 68% in 2002 vs 32% in Reid et al. (3)]. This result may have been expected because the experts in diagnostic laboratory tests usually pay more attention to analytical imprecision than do others.

A key shortcoming of the diagnostic studies done until 2001 in the laboratory is the lack of a full description of the spectrum of patients or samples studied (22% in 1996, 37% in 2001, and 71% in 2002). The improvement observed in 2002, particularly in Clinical Chemistry, has two important positive consequences. On the one hand, readers can better judge whether the population studied is comparable to other populations, which favors the applicability of results. On the other hand, the description of the clinical spectrum allows the presentation of findings by strata so that the reader can judge whether the accuracy of the test evaluated may change depending on the clinical or socio-demographic characteristics of the patients studied. Not only do we believe that authors should describe the spectrum of patients or samples studied and present the analysis by strata, but also that the STARD criteria (1, 2) should be applied, and it should be necessary to provide a more detailed explanation of the scope of the study, the way of presenting the patients, and the type of sampling. This would be a decisive step forward in improving the applicability of the results.

Regarding the presence of workup and review biases, it seems that in our series they have been prevented to an extent similar to that in the series studied by Reid et al. (3). It should be pointed out that many of the publications evaluated appeared to be free of those biases, such that the potential for them might be said not to exist. The lack of explicit information, however, makes it impossible to guarantee that the study was indeed free of such biases.

Few articles took indeterminate results into account when evaluating the diagnostic test (none in 1996, 15% in 2001, and 6% in 2002) compared with 38% in the study by Reid et al. (3). It is possible that many of the articles analyzed in our study did not have any indeterminate results, but even so, they do not comply with the criterion because they do not say so explicitly, as required by the criteria of Reid et al.

Undoubtedly, the recommendations of the STARD initiative published in Clinical Chemistry (1, 2) include more exhaustive requirements than those proposed by Reid et al. (3). Use of the latter enabled us to compare the papers from clinical laboratory journals with those described in the most important international clinical journals. In addition, it allowed us to analyze the trend. For example, we studied the articles published a year before the appearance in Clinical Chemistry of the first article dealing with the importance of methodology in evaluation of diagnostic tests (1996) (5) and 5-6 years later (2001-2002). This enabled us to study the change produced in the literature and, therefore, the effect of publication in the journal Clinical Chemistry of the recommendations to be followed in the study of laboratory tests.

In 1996, the journal Clinical Chemistry included criteria similar to those of Reid et al. (3) in its instructions to authors, and in 1997 and in 2000 it published preliminary versions of a 28-item checklist, whereas Clinical Chemistry and Laboratory Medicine began to advocate the use of the STARD criteria only in 2003. It therefore seems logical that Clinical Chemistry improved between 1996, 2001, and 2002, whereas Clinical Chemistry and Laboratory Medicine did not. Similar results were observed by Moher et al. (91) and Altman et al. (92), who demonstrated that the quality of reports in journals that promoted the Consolidated Standards of Reporting Trials (CONSORT) (British Medical Journal, JAMA, and Lancet) showed greater improvement than in a journal that did not advocate its use (New England Journal of Medicine). The results presented in these two reports, as well as ours, support the conclusion that editors and reviewers, in adopting criteria such as STARD or CONSORT, can play a key role in improving the quality of published reports of studies of this sort.

Both the STARD initiative and the appearance of the first-ever publication dedicated specifically to diagnostic investigation (93) may lead to future improvement of the quality of this type of investigation. Indeed, in the case of Clinical Chemistry, this is already evident. In the future, improvement in the quality of the methodology should be monitored to confirm the possible beneficial effects. Furthermore, there are other aspects of diagnostic investigation, with objectives other than accuracy, that will require development and new standards (94, 95). These objectives include both the use of clinical trials of diagnostic tests and investigation of the undesirable effects on health of unwanted information given by these tests.

We thank Judith Williams for help in preparing the manuscript; we also thank the two anonymous reviewers and Dr. Joseph Watine for their useful comments during the peer review of the manuscript.


[1] Department of Clinic Analysis, General University Hospital of Alicante, Alicante, Spain.

[2] Department of Internal Medicine, General University Hospital of Elche, Elche, Spain.

[3] Department of Public Health, University of Miguel Hernandez, San Juan de Alicante, Spain.

* Address correspondence to this author at: Department of Public Health, Facultad de Medicina, University of Miguel Hernandez, Carretera de Valencia Km. 8.7, 03550 San Juan de Alicante, Spain. Fax 34-96-5919551; e-mail

Received April 4, 2003; accepted December 12, 2003.

Previously published online at DOI: 10.1373/clinchem.2003.019786

Table 1. Fulfillment of the methodologic criteria used by Reid et al.
(3) in the articles published in 1996, 2001, and 2002 in Clinical
Chemistry and Clinical Chemistry and Laboratory Medicine and in
articles published in general medical journals and reviewed by Reid et
al. (3).

                                       Number (%) of articles

                              Year 1996 (a)  Year 2001  Year 2002  Reid et al.
                              (n = 18)       (n = 27)   (n = 34)   (1990-1993) (b)

Spectrum composition          4 (22)         10 (37)    24 (71)    11 (32)
 Age distribution             9 (50)         18 (67)    28 (82)
 Sex distribution             9 (50)         19 (70)    26 (76)
 Clinical symptoms and/or
 disease stage                6 (33)         11 (41)    12 (35)
 Study eligibility criteria   4 (22)         10 (37)    18 (53)
Accuracy in subgroups         8 (44)         7 (26)     13 (38)    4 (12) (c)
Avoidance of workup bias      6 (33)         13 (48)    24 (71)    21 (62)
Avoidance of review bias      4 (22)         10 (37)    15 (44)    16 (47)
Test accuracy precision       4 (22)         12 (44)    22 (65)    8 (24)
Indeterminate test results    0              4 (15)     2 (6)      13 (38) (c)
Test reproducibility          15 (83)        22 (81)    21 (68)    11 (32) (c)

(a) Differences between 1996 and 2001 were not statistically
significant.

(b) Articles from Reid et al. (3) were restricted to those published
in 1990-1993, the closest dates to 1996.

(c) P <0.05 in Epi Info Fisher exact test comparing proportions between
1996 and articles reviewed by Reid et al. (3).