A reply to Loerke, Jones, and Chow (1999) on the "Psychometric benefits" of linked items.

The Loerke, Jones, and Chow (1999) article entitled "Psychometric benefits of soft-linked scoring algorithms in achievement testing" contains many methodological and psychometric errors. These include (a) ignoring basic assumptions of classical test theory and item response theory, and (b) incorrectly using and interpreting item and test statistics such as point-biserial correlation coefficients and KR-20 reliability coefficients. It is essential that education professionals are made aware of these problems and appreciate that the findings of the study cannot be used as "evidence" that linked items have psychometric benefits.

**********

The article entitled "Psychometric benefits of soft-linked scoring algorithms in achievement testing" by Loerke, Jones, and Chow (1999) has some fundamental problems that education professionals should be aware of.
The main problem is that the authors do not take into consideration basic assumptions of classical test theory when concluding that any types of "linked items" have psychometric benefits. The authors used item and test statistics inappropriately, and then made inaccurate claims that they had demonstrated empirically the "reliability superiority" of "soft-linked items" over "hard-linked items". We believe it is important to point out the problems with the article so that readers do not take home the wrong conclusions about so-called "linked items".

Linked items are groups of multiple-choice items that are related to each other on a test (e.g., "Use your answer from question 20 to answer question 21"). The authors make the distinction between "soft" and "hard" linked items, a distinction that is less important than the obvious psychometric assumptions that linked items violate.

A significant problem with the Loerke, Jones, and Chow (1999) article is that the results of the paper are obvious without needing to conduct any analyses. Having two or more soft-linked items on a test will mean that the related items are more likely to be scored as correct, because an incorrect response to the first question does not mean the other questions will be scored as incorrect. Additionally, selection of the keyed response to the second, third, or fourth item in the link chain will also result in the item being scored as correct. It is therefore expected that an average of the point-biserial correlation coefficients of the soft-linked items will be higher than the correlations for the hard-linked items, because more items in the soft-linked group will positively correlate with the total test score. This result is common sense and does not need to be empirically tested.

The authors also misinterpret
the function of a point-biserial correlation, drawing conclusions about item and test reliability rather than discrimination. Point-biserial correlation coefficients are discrimination indices (i.e., how well the item discriminates between examinees of different ability levels), not measures of reliability. As a result, statements made in the paper discussing the "reliability superiority" of soft-linked items over hard-linked items are completely unfounded. In terms of the treatment of reliability in the paper, the authors did not calculate the KR-20 reliability indices appropriately, given that the linked items would need to be treated as testlets.

Having outlined some of the specific problems with the Loerke, Jones, and Chow (1999) paper, we now describe its theoretical problems. The basis of classical test theory is that an observed score equals a true score plus error (O = T + E). One of the fundamental assumptions of classical test theory is that the error terms of each test item are uncorrelated (Nunnally & Bernstein, 1994). This is achieved by making the items locally independent of one another (i.e., every item on a test must be an independent observed measurement of ability). Items that are linked together result in error terms that are correlated, thus violating the assumption. Once this assumption is violated, one must carefully consider the implications for the statistics that describe the items and tests. Unless a group of linked items is treated as a separate test within a test (i.e., a testlet), classical item statistics, such as the KR-20 reliability index and point-biserial correlations between items and test scores, will be calculated incorrectly.
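
As a rough illustration of the statistics discussed above, the following Python sketch computes an (uncorrected) point-biserial correlation and a KR-20 coefficient from a matrix of 0/1 item scores, and shows one way a group of linked items might be collapsed into a single testlet score before reliability is estimated. The function names, the toy response matrix, and the use of population variances are our assumptions for illustration only; they are not taken from Loerke, Jones, and Chow (1999). Because a summed testlet score is no longer dichotomous, the sketch uses coefficient alpha, which reduces to KR-20 when every item is scored 0/1.

```python
from statistics import mean, pvariance


def point_biserial(item, totals):
    """Pearson correlation between a 0/1 item score and the total test score.

    Uncorrected: the item itself is still included in the total.
    """
    mi, mt = mean(item), mean(totals)
    cov = mean((x - mi) * (t - mt) for x, t in zip(item, totals))
    return cov / (pvariance(item) ** 0.5 * pvariance(totals) ** 0.5)


def kr20(responses):
    """KR-20 for rows of 0/1 item scores (one row per examinee)."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    p = [mean(row[j] for row in responses) for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))


def coefficient_alpha(scores):
    """Coefficient alpha; equals KR-20 when all parts are scored 0/1."""
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    part_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    return (k / (k - 1)) * (1 - sum(part_vars) / pvariance(totals))


def collapse_testlet(responses, linked):
    """Replace the linked item columns with one summed testlet score."""
    linked = set(linked)
    return [
        [x for j, x in enumerate(row) if j not in linked]
        + [sum(row[j] for j in linked)]
        for row in responses
    ]


# Hypothetical toy data: 5 examinees by 4 items; items 2 and 3 are
# assumed to be linked. These numbers are invented, not real test data.
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

totals = [sum(row) for row in toy]
r_pb_item0 = point_biserial([row[0] for row in toy], totals)
kr20_all_items = kr20(toy)                       # linked items treated as separate
alpha_testlet = coefficient_alpha(collapse_testlet(toy, [2, 3]))
```

The toy matrix is far too small to demonstrate anything about linked items; it only shows where the testlet adjustment enters the calculation. In practice the two reliability estimates would be compared on real response data.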
The effect of correlated errors on these statistics is shown mathematically in most introductory psychometrics textbooks (e.g., Nunnally & Bernstein, 1994). The impact of violating this assumption on point-biserial correlations between items and the total test score will be to artificially inflate the correlation coefficients. This is because a group of directly related items is contributing to the total test score. The impact on the reliability coefficient is likely to be similar. These basic issues were not dealt with at all in the Loerke, Jones, and Chow (1999) article; in fact, the conclusions and methods used in the article clearly show that these basic assumptions were overlooked when conducting their study.

The authors suggest in their article that linked items would be beneficial in "computerized testing". It is unclear in the article whether the authors mean computer-administered tests or computer adaptive testing (CAT). Regardless, linked items create problems for both types of testing. Computer-administered tests would be plagued with the same psychometric problems as conventional paper-and-pencil tests. CAT requires item response theory (IRT) parameter estimates of each item in order to select items adaptively based on the students' performance on previously administered items. The most fundamental assumption of IRT is that of local independence and the related assumption of unidimensionality. Essentially, local independence requires that responses to test items be independent of one another once theta (i.e., the latent variable being measured, for example mathematics ability) is held constant. If local independence holds true, then the related assumption of unidimensionality will also hold true, in that all items are measuring only one latent variable or dimension (Lord, 1980; Hambleton, Swaminathan, & Rogers, 1991). If linked items appear on a test and are not dealt with in a special manner (i.e., treated as a testlet), it is not possible to obtain accurate IRT parameter estimates. As a result, it would be inappropriate for linked items, as described by Loerke, Jones, and Chow (1999), to be used in a computer adaptive environment.
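
For reference, the local independence assumption described above is commonly written as follows. The notation is our assumption for illustration: U_i is the scored response to item i, n is the number of items, and theta is the latent trait. The second line sketches why a hard-linked pair, such as questions 20 and 21 in the earlier example, would generally fail the assumption: the response to question 21 depends on the response to question 20 even after theta is taken into account.

```latex
% Local independence: given theta, the joint probability of a response
% pattern factors into the product of the item-level probabilities.
P(U_1 = u_1, \ldots, U_n = u_n \mid \theta)
  = \prod_{i=1}^{n} P(U_i = u_i \mid \theta)

% Linked items (illustrative): conditioning on the earlier response
% still changes the probability of the later one, so in general
P(U_{21} = 1 \mid \theta,\, U_{20} = u_{20})
  \neq P(U_{21} = 1 \mid \theta)
```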
Although in the past linked items have been used on two of eleven grade twelve examinations (diploma examinations) produced by the Learner Assessment Branch (formerly Student Evaluation Branch), due to the fundamental problems associated with linked items they are being removed from all future diploma examinations. The initial introduction of linked items was in response to budget restrictions resulting in the removal of a constructed response question from each of the Chemistry 30 and Physics 30 diploma examinations. Linked items were incorporated into these two diploma examinations, without appropriate psychometric considerations, to attempt to assess multi-step processes in place of the constructed response question. The decision of the Learner Assessment Branch was to remove linked items and return the constructed response questions on the two examinations. Achievement tests, which are also produced by the Learner Assessment Branch and administered to grades 3, 6, and 9 students, have never included linked items.

In summary, the Loerke, Jones, and Chow (1999) article contains methodological and theoretical problems that call into question the validity of their results and conclusions. Education professionals should be mindful of these problems when referencing the article.

References

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Loerke, D. R. B., Jones, M. N., & Chow, P. (1999). Psychometric benefits of soft-linked scoring algorithms in achievement testing. Education, 120, 273-280.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Gregory A. Pope and Dwight D. Harley, Psychometricians, Analytic Services Unit, Learner Assessment Branch, Alberta Learning.

Correspondence concerning this article should be addressed to Dwight D. Harley, dwight.harley@gov.ab.ca or Gregory A. Pope at greg.pope@gov.ab.ca.