A reply to Loerke, Jones, and Chow (1999) on the "Psychometric benefits" of linked items.
The article "Psychometric benefits of soft-linked scoring algorithms in achievement testing" by Loerke, Jones, and Chow (1999) has some fundamental problems that education professionals should be aware of. The main problem is that the authors do not take into consideration basic assumptions of classical test theory when concluding that any type of "linked item" has psychometric benefits. The authors used item and test statistics inappropriately and then made inaccurate claims that they had demonstrated empirically the "reliability superiority" of "soft-linked items" over "hard-linked items". We believe it is important to point out the problems with the article so that readers do not take home the wrong conclusions about so-called "linked items".
Linked items are groups of multiple-choice items that are related to each other on a test (e.g., "Use your answer from question 20 to answer question 21"). The authors make the distinction between "soft" and "hard" linked items, a distinction that is less important than the basic psychometric assumptions that linked items violate. A significant problem with the Loerke, Jones, and Chow (1999) article is that its results could be anticipated without conducting any analyses. Having two or more soft-linked items on a test means that the related items are more likely to be scored as correct, because an incorrect response to the first question does not mean the other questions will be scored as incorrect. Additionally, selection of the keyed response to the second, third, or fourth item in the link chain will also result in the item being scored as correct. It is therefore expected that the average of the point-biserial correlation coefficients of the soft-linked items will be higher than that of the hard-linked items, because more items in the soft-linked group will correlate positively with the total test score. This result is common sense and does not need to be empirically tested.
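The difference between the two scoring rules can be sketched for a two-item chain. This is only an illustration of the description above, not the algorithm used by Loerke, Jones, and Chow (1999): the function name `score_linked_pair`, the `derive` rule, and the reading of "hard" linking as crediting only the keyed response are all assumptions.

```python
def score_linked_pair(resp1, resp2, key1, derive, soft=True):
    """Score a two-item link chain and return (score1, score2).

    derive(x) is a hypothetical rule giving the answer to the second item
    that follows from answer x to the first item.
    Assumption: hard linking credits only the keyed second response;
    soft linking also credits a second response that is consistent with
    the examinee's own (possibly wrong) first response.
    """
    score1 = int(resp1 == key1)
    keyed2 = derive(key1)  # keyed response to the second item
    if resp2 == keyed2:
        return score1, 1
    if soft and resp2 == derive(resp1):
        return score1, 1
    return score1, 0


def double(x):
    # Hypothetical link: the second answer is twice the first.
    return 2 * x


# An examinee answers 4 (key is 5) and then carries the error forward to 8.
hard_scores = score_linked_pair(4, 8, key1=5, derive=double, soft=False)
soft_scores = score_linked_pair(4, 8, key1=5, derive=double, soft=True)
```

Under these assumptions the same response pattern earns no credit on the second item with hard linking and full credit with soft linking, which is why more soft-linked items end up scored as correct.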
The authors also misinterpret the function of a point-biserial correlation, drawing conclusions about item and test reliability rather than discrimination. Point-biserial correlation coefficients are discrimination indices (i.e., they indicate how well an item discriminates between examinees of different ability levels), not measures of reliability. As a result, statements made in the paper about the "reliability superiority" of soft-linked items over hard-linked items are completely unfounded. As for the treatment of reliability in the paper, the authors did not calculate the KR-20 reliability indices appropriately, given that the linked items would need to be treated as testlets.
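The distinction between the two statistics can be made concrete with a minimal sketch. The score matrix below is hypothetical, the function names are illustrative, population variances are used throughout, and the point-biserial is computed against a total score that includes the item itself (an uncorrected index).

```python
import numpy as np


def point_biserial(item, total):
    """Item discrimination: Pearson correlation of a 0/1 item with the total score."""
    item = np.asarray(item, dtype=float)
    total = np.asarray(total, dtype=float)
    return float(np.corrcoef(item, total)[0, 1])


def kr20(scores):
    """Test reliability: KR-20 for an examinees-by-items matrix of 0/1 scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                      # number of items
    p = scores.mean(axis=0)                  # proportion correct per item
    total_var = scores.sum(axis=1).var()     # population variance of total scores
    return float((k / (k - 1)) * (1.0 - (p * (1.0 - p)).sum() / total_var))


# Hypothetical data: 5 examinees (rows) by 4 independent items (columns).
scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
totals = scores.sum(axis=1)

# One discrimination index per item, and one reliability index for the whole test.
discrimination = [point_biserial(scores[:, j], totals) for j in range(scores.shape[1])]
reliability = kr20(scores)
```

The point-biserial yields one value per item and describes how that item relates to the total score; KR-20 yields a single value for the test. Averaging the former does not produce the latter.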
Having outlined some of the specific problems with the Loerke, Jones, and Chow (1999) paper, we can now describe its theoretical problems. The basis of classical test theory is that an observed score equals a true score plus error (O = T + E). One of the fundamental assumptions of classical test theory is that the error terms of the test items are uncorrelated (Nunnally & Bernstein, 1994). This is achieved by making the items locally independent of one another (i.e., every item on a test must be an independent observed measurement of ability). Items that are linked together have correlated error terms and thus violate the assumption. When this assumption is violated, one must carefully consider the implications for the statistics that describe the items and tests. Unless a group of linked items is treated as a separate test within a test (i.e., a testlet), classical statistics such as the KR-20 reliability index and the point-biserial correlations between items and test scores will be calculated incorrectly. This is shown mathematically in most introductory psychometrics textbooks (Nunnally & Bernstein, 1994). The impact of violating this assumption on point-biserial correlations between items and the total test score will be to artificially inflate the correlation coefficients, because a group of directly related items is contributing to the total test score. The impact on the reliability coefficient is likely to be similar. These basic issues were not dealt with at all in the Loerke, Jones, and Chow (1999) article; in fact, the conclusions and methods used in the article clearly show that these basic assumptions were overlooked when the study was conducted.
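The effect of ignoring the dependence can be illustrated with a small sketch. The data are hypothetical and deliberately extreme (the two linked items are scored identically), the function names are illustrative, and collapsing a chain into a single summed testlet score with coefficient alpha computed over the resulting units is one common way to handle testlets, not a procedure taken from either paper.

```python
import numpy as np


def coefficient_alpha(units):
    """Coefficient alpha over scoring units (equals KR-20 when every unit is 0/1)."""
    units = np.asarray(units, dtype=float)
    k = units.shape[1]
    unit_var = units.var(axis=0).sum()       # sum of population variances of the units
    total_var = units.sum(axis=1).var()      # population variance of total scores
    return float((k / (k - 1)) * (1.0 - unit_var / total_var))


def collapse_testlet(scores, linked):
    """Replace the linked item columns with a single summed testlet score."""
    scores = np.asarray(scores, dtype=float)
    keep = [j for j in range(scores.shape[1]) if j not in linked]
    testlet = scores[:, linked].sum(axis=1, keepdims=True)
    return np.hstack([scores[:, keep], testlet])


# Hypothetical data: 6 examinees by 4 items; items 2 and 3 form a link chain
# and, in this extreme case, are always scored the same way.
scores = np.array([
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])

item_level = coefficient_alpha(scores)                               # linked items treated as independent
testlet_level = coefficient_alpha(collapse_testlet(scores, [2, 3]))  # chain treated as one unit
```

For this constructed example the item-level coefficient is about 0.67, while the testlet-level coefficient is 0.375. The gap reflects the dependence built into the linked pair being counted as if it were consistency across independent measurements.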
The authors suggest in their article that linked items would be beneficial in "computerized testing". It is unclear whether the authors mean computer-administered tests or computer adaptive testing (CAT). Regardless, linked items create problems for both types of testing. Computer-administered tests would be plagued with the same psychometric problems as the conventional paper-and-pencil tests described previously. Linked items in computer adaptive tests would also be littered with psychometric problems. CAT requires item response theory (IRT) parameter estimates for each item in order to select items adaptively based on the student's performance on previously administered items. The most fundamental assumption of IRT is that of local independence, along with the related assumption of unidimensionality. Essentially, local independence requires that test items be independent of one another when measuring theta (i.e., the latent variable being measured, for example mathematics ability). If local independence holds, then the related assumption of unidimensionality will also hold, in that all items are measuring only one latent variable or dimension (Lord, 1980; Hambleton, Swaminathan, & Rogers, 1991). If linked items appear on a test and are not dealt with in a special manner (i.e., treated as a testlet), it is not possible to obtain accurate IRT parameter estimates. As a result, it would be inappropriate for linked items, as described by Loerke, Jones, and Chow (1999), to be used in a computer adaptive environment.
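For dichotomously scored items, the local independence assumption can be written out as follows. The notation is the standard textbook form rather than anything taken from the Loerke, Jones, and Chow (1999) article: u_i is the 0/1 score on item i, P_i(theta) is the probability of a correct response to item i at ability theta, and the indices j and k stand for any two items in a link chain.

```latex
% Local independence: given theta, the probability of a response pattern
% is the product of the individual item response probabilities.
P(U_1 = u_1, \ldots, U_n = u_n \mid \theta)
  = \prod_{i=1}^{n} P_i(\theta)^{u_i} \, [1 - P_i(\theta)]^{1 - u_i}

% Linked items j and k break this factorization, because the response to
% item k depends on the response to item j even after conditioning on theta:
P(U_k = 1 \mid \theta, U_j = u_j) \neq P(U_k = 1 \mid \theta)
```

Item parameter estimation relies on the first expression as the likelihood, so when the second condition holds for a linked pair, that likelihood is misspecified.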
Although linked items have in the past been used on two of the eleven grade twelve examinations (diploma examinations) produced by the Learner Assessment Branch (formerly the Student Evaluation Branch), they are being removed from all future diploma examinations because of the fundamental problems associated with them. Linked items were initially introduced in response to budget restrictions that resulted in the removal of a constructed-response question from each of the Chemistry 30 and Physics 30 diploma examinations. Linked items were incorporated into these two diploma examinations, without appropriate psychometric consideration, in an attempt to assess multi-step processes in place of the constructed-response question. The decision of the Learner Assessment Branch was to remove linked items and return the constructed-response questions to the two examinations. Achievement tests, which are also produced by the Learner Assessment Branch and administered to students in grades 3, 6, and 9, have never included linked items.
In summary, the Loerke, Jones, and Chow (1999) article contains methodological and theoretical problems that call into question the validity of their results and conclusions. Education professionals should be mindful of these problems when referencing the article.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Loerke, D. R. B., Jones, M. N., & Chow, P. (1999). Psychometric benefits of soft-linked scoring algorithms in achievement testing. Education, 120, 273-280.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Gregory A. Pope and Dwight D. Harley, Psychometricians, Analytic Services Unit, Learner Assessment Branch, Alberta Learning.
Correspondence concerning this article should be addressed to Dwight D. Harley, email@example.com or Gregory A. Pope at firstname.lastname@example.org.
Author: Harley, Dwight D.
Publication: Journal of Instructional Psychology
Date: Sep 1, 2002