# The psychometric benefits of soft-linked items: a reply to Pope and Harley.

In this issue, Pope and Harley criticize our recent work on soft-linked items (Loerke, Jones, & Chow, 1999), claiming that soft-linked items are not independent and thus violate a basic assumption of classical test theory. They further claim that our finding that soft-linked items had better point-biserial correlation coefficients (PBCCs) than hard-linked items could have been predicted by "common sense," in that it simply reflects a higher proportion correct for soft-linked items. Because an examinee's response to an initial item has no effect on the scoring of the second item in a soft-linked pair, soft-linked items clearly meet the independence assumption and pose no problem for classical test theory. Moreover, since the scoring outcomes of hard-linked items are more likely to be consistent (both correct or both incorrect) than those of soft-linked items, common sense could just as easily suggest that hard-linked items would produce the higher PBCCs. Finally, we point out that Pope and Harley's misunderstanding of local independence and unidimensionality within the framework of item response theory rests on nebulous logic that may confuse readers.

**********

Background

We recently presented evidence that soft-linked items have better psychometric properties than hard-linked items in achievement tests (Loerke, Jones, & Chow, 1999). Pope and Harley (in the current issue of this journal) have criticized this work, claiming that the findings are "common sense" and that differences between hard-linked and soft-linked items are due to an increased probability that soft-linked items will be scored as correct. Pope and Harley also claimed that soft-linked items are not independent. We acknowledge that there is indeed a link artifact, but this artifact exists only for hard-linked items, not for soft-linked items. We would like to point out that soft-linked items are indeed independent (this is their major appeal); it is hard-linked items that are not.

Linked items are items in which the examinee uses his/her answer from one item to compute an answer for a second item, typically using the numerical response format. These items are often used in multi-step calculations, allowing hierarchical computer scoring of complex reasoning. Linked items are a computer-scorable alternative to constructed response questions.

Hard-linked items require the examinee to get the first linked item correct before any of the subsequent linked items may be answered correctly. Thus, hard-linked items have a fixed key. Conversely, soft-linked items do not require the examinee to answer the first linked item correctly to have his/her response scored as correct on the subsequent linked item; that is, soft-linked items have an adaptive key. One method of accomplishing this is by a computer algorithm that generates the appropriate keys for the soft-linked items on the basis of the response given to the previous item. Using this method, examinees are not penalized twice for a single incorrect response if the soft-linked item is answered correctly using the initial incorrect answer. Although several items may be linked together (nested linking), we will limit our discussion to the simple case of a single item linked with an initial item.
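The adaptive-key idea can be sketched in a few lines of code. The sketch below is illustrative only: the item pair, the `double` relation between the two answers, and all function names are our own assumptions, not the algorithm from the original study.

```python
# A two-item linked pair: item 2 asks the examinee to transform
# (here, double) his or her item-1 answer.

def score_hard_linked(resp1, resp2, key1, compute):
    """Hard link: item 2 is scored against a fixed key derived
    from the *correct* item-1 answer."""
    s1 = int(resp1 == key1)
    s2 = int(resp2 == compute(key1))   # fixed key
    return s1, s2

def score_soft_linked(resp1, resp2, key1, compute):
    """Soft link: the item-2 key is regenerated from whatever the
    examinee actually answered on item 1 (adaptive key)."""
    s1 = int(resp1 == key1)
    s2 = int(resp2 == compute(resp1))  # adaptive key
    return s1, s2

double = lambda x: 2 * x

# Examinee answers item 1 incorrectly (8 instead of 10) but doubles
# that incorrect answer correctly on item 2.
print(score_hard_linked(8, 16, 10, double))  # (0, 0): penalized twice
print(score_soft_linked(8, 16, 10, double))  # (0, 1): carried error forgiven
```

Under soft linking, the single error on item 1 costs one point, not two; the item-2 score reflects only whether the second computational step was performed correctly.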

Linked Items and Independence

Pope and Harley claimed that linked items in general violate independence. However, they failed to distinguish between the two types of linked items. Item independence is a fundamental assumption of classical test theory stating that item responses are randomly related when ability (θ) is held constant (Nunnally & Bernstein, 1994). Thus, all shared variance between items that is not explained by θ must be due to pure random error. Alternatively phrased, the score on an item must be a function of θ only, not a function of the score on any other item or of any other trait.

Hard-linked items violate independence because an incorrect response to the initial item automatically makes the second one incorrect (the items have correlated error terms). However, soft-linked items meet the independence assumption because the response given to the initial item has no effect on the scoring of the second item. Thus, soft-linked items have independent error terms, like any two other items selected at random from a test.

Further, Pope and Harley erroneously proposed that it is "common sense" that soft-linked items would on average have higher point-biserial correlation coefficients (PBCCs) than hard-linked items because soft-linked items are more likely to be scored as correct compared to their hard-linked counterparts. Since proportion correct plays an important role in the calculation of PBCC, Pope and Harley's intuition that soft-linked items would have higher PBCCs than hard-linked items is a possibility, but it is not the only possibility.

The PBCC is essentially a correlation between item score (dichotomously scored as "correct" or "incorrect") and total test score. An item with a high PBCC is one that examinees with high test scores tend to answer correctly and examinees with low test scores tend to answer incorrectly. Conversely, items with negative PBCCs tend to be answered correctly primarily by examinees with low test scores. An alternative common-sense approach to Pope and Harley's prediction runs as follows. The outcomes of two hard-linked items are more likely to be the same (both correct or both incorrect) than the outcomes of two soft-linked items, because for an examinee to get the second hard-linked item correct, he or she must have answered the initial item correctly. Hard-linked items are therefore expected to correlate more highly with total test score than soft-linked items. An examinee scored as correct on the second soft-linked item, by contrast, may or may not have answered the initial item correctly. Thus, even though more examinees tend to get soft-linked items correct, examinees who get hard-linked items correct tend to have higher test scores, because they must also have answered the initial item correctly. This yields the prediction that PBCCs should be higher for hard-linked items than for soft-linked items.

Point-Biserial Correlation

Pope and Harley were also critical of our interpretation of the PBCC, particularly in drawing conclusions about item reliability. Their discomfort with our use of the PBCC as an index of item reliability reveals a lack of breadth and depth in their literature review in this area. Although typically used as an item discrimination index, the PBCC is an item-total correlation; hence its square (PBCC-squared) measures the proportion of variance in total test score that is predictable from an item. Alternatively phrased, it is an index of consistency between an item and the total test score. This is why a strong relationship exists between the PBCC and coefficient alpha or KR-20 (Traub, 1994, pp. 101-107). This approach has been used elsewhere (Chow, Russell, & Traub, 2000).
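As a concrete illustration, the PBCC and its square can be computed directly from item and total scores. The sketch below uses the standard point-biserial formula with made-up data for five examinees; nothing here comes from the original study's scoring.

```python
def pbcc(item, total):
    """Point-biserial correlation between a 0/1 item-score vector and
    total test scores (Pearson r with a dichotomous variable).
    Uses the population standard deviation."""
    n = len(item)
    p = sum(item) / n                                  # proportion correct
    mean_t = sum(total) / n
    sd_t = (sum((t - mean_t) ** 2 for t in total) / n) ** 0.5
    mean_correct = sum(t for s, t in zip(item, total) if s == 1) / sum(item)
    return (mean_correct - mean_t) / sd_t * (p / (1 - p)) ** 0.5

item  = [1, 1, 1, 0, 0]   # hypothetical item scores
total = [9, 8, 7, 5, 4]   # hypothetical total test scores

r = pbcc(item, total)
print(round(r ** 2, 3))   # proportion of total-score variance predictable
                          # from this item; prints 0.855
```

High scorers answered this item correctly and low scorers did not, so both the PBCC and its square are large.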

Soft-Linked Items, Computerized Adaptive Testing, and Item Response Theory (IRT)

Another confusion in Pope and Harley's paper concerns how soft-linked items and computers fit together. It was fairly obvious in the original article that we were referring to computerized scoring of soft-linked items, not computer-administered testing or computerized adaptive testing (CAT); indeed, the title of the paper was "soft-linked scoring algorithms." After all, it is the computer scoring algorithm that makes linked items soft-linked; hence the name. Since we have established that soft-linked items are independent, it makes no difference whether the test is administered by paper and pencil or by computer; the only stipulation is that the scoring be done by computer. Whether soft-linked items might fit into a CAT framework is determined entirely by the dimensionality of the test itself. We did factor analyze the data, and the assumption of unidimensionality could not be satisfied for these chemistry tests.

Pope and Harley mentioned that "The most fundamental assumption of item response theory (IRT) is that of local independence and the related assumption of unidimensionality .... If local independence holds true then the related assumption of unidimensionality will also hold true in that all items are measuring only one latent variable or dimension (Lord, 1980; Hambleton, Swaminathan, & Rogers, 1991)." It seems that Pope and Harley simply copied these statements verbatim from textbooks without understanding what they put down. The concepts of unidimensionality and local independence beg for clarification here, lest the reader be led down the path of confusion.

Since 1968 most adaptive testing research has used IRT as its psychometric basis. Within the context of IRT, different models for dichotomously scored items have been proposed and implemented. These models specify the mathematical form of the regression of Pi(θ), the probability of responding correctly to item i given ability θ, on θ, the latent ability continuum (see Lord & Novick, 1968; Birnbaum, 1968). Different parameter and ability estimation procedures have also been proposed to correct for guessing (e.g., Chow, 1987).
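For readers unfamiliar with these models, a typical item response function can be written down directly. The sketch below uses the standard three-parameter logistic form (Birnbaum, 1968); the parameter values in the example are invented for illustration and are not estimates from any real test.

```python
import math

def p_correct(theta, a, b, c=0.0):
    """3PL item response function: probability of a correct response
    at ability theta, for an item with discrimination a, difficulty b,
    and pseudo-guessing parameter c. The 1.7 is the conventional
    scaling constant relating the logistic to the normal ogive."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# At theta equal to the difficulty b, P is halfway between c and 1.
print(p_correct(0.0, a=1.0, b=0.0))         # 0.5 when c = 0
print(p_correct(0.0, a=1.0, b=0.0, c=0.2))  # 0.6 when c = 0.2
```

Setting c above zero raises the lower asymptote, which is how the model corrects for guessing on multiple-choice items.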

The basic principle of IRT (also called latent trait theory) is that for any fixed values of the latent traits the observed variables are mutually statistically independent; this has come to be known as the principle of local independence (Lazarsfeld, 1950; Lazarsfeld & Henry, 1968). Local independence requires that any two items be uncorrelated when θ is fixed. However, as Lord (1980) pointed out, "local independence follows automatically from unidimensionality. It is not an additional assumption" (p. 19), contrary to Pope and Harley's understanding.

In most existing research on computerized adaptive testing, it is assumed that the test is unidimensional; that is, the item pool is designed to measure only a single common dimension of ability. Unidimensionality is the prerequisite for the use of IRT in adaptive testing. As Lord (1980, p. 21) lamented, "there is no generally valid statistical test to determine whether a set of test items is strictly unidimensional." Lord proposed a "rough" procedure in which the sizes of the latent roots of the tetrachoric item intercorrelation matrix are compared to see whether there is one dominant factor. Green et al. (1984) suggested that "a single factor that accounts for 70% of the total common variance is probably strong enough evidence for unidimensionality; one that accounts for less than 50% probably signals the use of subtests ..." (p. 351). Our factor analyses of the data from those chemistry tests yielded many factors with eigenvalues larger than one, and none of these factors could account for more than 50% of the variance in the data. We therefore abandoned IRT for analyzing these data and turned to good old classical test theory.
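Lord's rough check can be sketched as follows: estimate the largest latent root of the item intercorrelation matrix and compare it with the total variance (the trace, which equals the number of items for a correlation matrix). The 3 × 3 matrix below is fabricated for illustration; a real application would use the tetrachoric intercorrelations of the actual items.

```python
def dominant_eigenvalue(m, iters=500):
    """Largest eigenvalue of a symmetric matrix by power iteration,
    finished with a Rayleigh-quotient estimate."""
    n = len(m)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(x * y for x, y in zip(mv, v)) / sum(x * x for x in v)

# Hypothetical item intercorrelation matrix for three items.
r = [[1.0, 0.6, 0.5],
     [0.6, 1.0, 0.4],
     [0.5, 0.4, 1.0]]

lam = dominant_eigenvalue(r)
share = lam / len(r)   # first factor's share of total variance
print(round(share, 2))
```

By the Green et al. (1984) rule of thumb, a first-factor share near 70% would support unidimensionality, while a share under 50% would argue against it.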

It is reassuring to know that Lord (1980), the founding father of modern IRT and CAT, pointed out in no uncertain terms that "We can easily imagine tests that are not [unidimensional]. An achievement test in chemistry might in part require mathematical training or arithmetic skill and in part require knowledge of nonmathematical facts" (p. 20).

Conclusion

Contrary to Pope and Harley's belief, soft-linked items are independent, and thus, are not a problem for either classical test theory or IRT. Further, given the nature of hard- and soft-linked items, it is not simply "common sense" to predict the magnitudes of their PBCCs.

The main purpose of creating soft-linked items was that they can be scored independently. This method of scoring linked items makes for an attractive item format for assessing higher cognitive skills without the statistical pitfalls inherent in the conventional scoring of linked items. Compared to hard-linked items, soft-linked items are a better measure of achievement status. We stand behind the soft-link approach and believe it offers substantial benefits for assessment.

References

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley.

Chow, P. (1987). A simulation study of respondent and parameter-induced bias in computerized adaptive testing. Unpublished doctoral dissertation, University of Toronto.

Chow, P., Russell, H., & Traub, R. E. (2000). Expertise sensitive item selection. Psychological Reports, 87, 791-801.

Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21(4), 347-360.

Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent structure analysis. In S. A. Stouffer et al. (Eds.), Measurement and prediction. Princeton, NJ: Princeton University Press.

Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston, MA: Houghton Mifflin.

Loerke, D. R. B., Jones, M. N., & Chow, P. (1999). Psychometric benefits of soft-linked scoring algorithms in achievement testing. Education, 120, 273-280.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, N. J.: Lawrence Erlbaum Associates.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Pope, G. A., & Harley, D. (in press). A reply to Loerke, Jones, and Chow (1999) on the "psychometric benefits" of linked items. Journal of Instructional Psychology.

Traub, R. E. (1994). Reliability for the social sciences: Theory and applications. Thousand Oaks, CA: Sage.

Dr. Chow, Faculty, Nipissing University, North Bay, Ontario, Canada. Dr. Jones, Department of Psychology, Queen's University, Kingston, Ontario. Mr. Loerke, Spruce Grove Composite High School, Spruce Grove, Alberta, Canada.

Correspondence concerning this article should be addressed to Dr. Chow, 100 College Drive, Box 5002, North Bay, Ontario Canada P1B 8L7.

Author: Loerke, Donald R. B.

Publication: Journal of Instructional Psychology

Date: September 1, 2002