# Fitting Rasch model using appropriateness measure statistics.

Psychological measurement has moved from the widespread application of the classical test model (Gulliksen, 1950; Lord & Novick, 1968) to the use of mathematical models that impose severe restrictions on the data in order to justify that a test measures the attribute it is meant to measure. One of the models that has generated the most research since its publication in 1960 is the Rasch model (Fischer & Molenaar, 1995; Rasch, 1960; Van der Linden & Hambleton, 1997; Wright & Stone, 1979), which shares with the other item response theory (IRT) models the assumptions of unidimensionality and local independence. Unidimensionality means that the items essentially measure one and only one attribute (Stout, 1987), whereas local independence means that the responses to an item are not influenced by the responses to prior or subsequent items, or that the responses of a group of subjects of the same ability are not related to each other (Hambleton & Swaminathan, 1985). If these assumptions are met, then Rasch's probabilistic model guarantees a unidimensional scale of the attribute measured, where the separability of item parameters and examinee abilities is a reality instead of a mere hypothesis assumed in the model (Bond & Fox, 2001). But, as occurs with the rest of the models in the IRT framework, under Rasch's model (Rasch, 1960; Wright & Stone, 1979) unidimensionality and local independence are not mere hypotheses to be taken for granted once the model has been fitted; they must be verified empirically. That is, before stating that a set of items meets the Rasch model's expectations, studies of fit must be performed, both on the items and on examinee response patterns, in order to determine the extent to which the responses obtained follow the pattern expected under the model. Several statistics have been proposed to test whether item response patterns and/or examinee response patterns conform to the model.
Some fit statistics have been specifically developed for Rasch's model, such as the residual statistics of Wright and Stone (1979), whereas statistics based on the likelihood function (Drasgow & Levine, 1986; Levine & Rubin, 1979) and on the comparison of item characteristic curves (ICCs; Harnish & Tatsuoka, 1983; Tatsuoka, 1984) can be used in any of the dichotomous item response models: Rasch's model (Rasch, 1960; Wright & Stone, 1979), the 2-p logistic model (Birnbaum, 1968; Lord, 1980), and the 3-p model (Birnbaum, 1968; Lord, 1980). Generally, these statistics have standardized versions under the normal curve that allow decisions about fit to be made at specific significance levels.

The residual statistics were developed to study the fit of the items and of examinee response patterns to the model (Wright & Masters, 1982; Wright & Stone, 1979), whereas the statistics based on the likelihood function (Drasgow & Levine, 1986) and those that use the comparison of ICCs (Harnish & Tatsuoka, 1983) were developed exclusively to study the fit of the examinee response pattern to the proposed model, generating a research field known as appropriateness measure (Hulin, Drasgow, & Parsons, 1983). However, as with the residual statistics, appropriateness measure statistics can be applied as item fit statistics and vice versa, item fit statistics as statistics to study the degree of aberration of examinee response patterns (Reise, 1990).

Outfit Statistic

The unweighted total fit statistic (Outfit) is based on the residual obtained from subtracting the probability predicted by the model as a function of the estimated parameters from the observed response. It is calculated as:

$$\mathrm{MS(UT)} = \frac{1}{N}\sum_{i=1}^{N}\frac{(U_{ij} - P_{ij})^{2}}{w_{ij}} \qquad (1)$$

where $U_{ij}$ is the observed response of subject $i$ to item $j$, $P_{ij}$ is the probability of a correct response according to the Rasch model, $w_{ij} = P_{ij}(1 - P_{ij})$, and $N$ is the sample size. Note that $w_{ij}$ is the information function of the item defined in the model. The standard deviation of this statistic can be estimated by:

$$\sigma(\mathrm{UT}) = \left[\sum_{i=1}^{N}\frac{1}{w_{ij}} - 4N\right]^{1/2} \qquad (2)$$

The unweighted total fit statistic, MS(UT), follows a $\chi^2$ distribution with one degree of freedom. Its mathematical expectation is 1 and its standard deviation is obtained by Equation 2. Smith (1991) and Smith, Schumacker, and Bush (1998) showed that there is no unique critical value to study item fit; instead, it depends on sample size and on the information function. Moreover, this statistic is highly affected both by high-ability subjects' unexpected incorrect responses to easy items and by low-ability subjects' unexpected correct responses to difficult items.
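As an illustration of Equation 1, the following minimal sketch computes the unweighted mean square for a single item. The function names and the five-examinee data set are hypothetical and serve only to show how unexpected responses at the extremes of the ability range inflate the statistic.

```python
import math


def rasch_probability(theta, b):
    """Rasch probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def outfit_mean_square(responses, thetas, b):
    """Unweighted mean square (Equation 1) for one item.

    responses: 0/1 answers of the N examinees to the item
    thetas:    ability values of the same examinees
    b:         difficulty of the item
    """
    total = 0.0
    for u, theta in zip(responses, thetas):
        p = rasch_probability(theta, b)
        w = p * (1.0 - p)            # item information at this ability
        total += (u - p) ** 2 / w    # squared standardized residual
    return total / len(responses)


# Hypothetical data: five examinees answering one item of difficulty 0.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
guttman_like = [0, 0, 1, 1, 1]   # low abilities fail, high abilities pass
surprising = [1, 0, 1, 1, 0]     # correct at theta = -2, incorrect at theta = +2

print(outfit_mean_square(guttman_like, thetas, 0.0))
print(outfit_mean_square(surprising, thetas, 0.0))
```

The second pattern differs from the first only in the two most extreme examinees, yet its mean square is several times larger, which is the sensitivity to extreme unexpected responses described above.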

Infit Statistic

The weighted total fit statistic (Wright & Masters, 1982) has the following form:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (3)

where $U_{ij}$, $P_{ij}$, and $w_{ij}$ are interpreted as in Equation 1. In this statistic, the residual is weighted by the information function, which reduces the influence of extreme values. Its standard deviation is:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (4)
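A minimal sketch of an information-weighted mean square follows. It assumes the form in which Infit is commonly presented, namely the sum of squared residuals divided by the sum of the information values $w_{ij}$; that form, the function names, and the data are assumptions made for illustration and are not a transcription of Equation 3.

```python
import math


def rasch_probability(theta, b):
    """Rasch probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def infit_mean_square(responses, thetas, b):
    """Information-weighted mean square for one item (assumed standard form)."""
    squared_residuals = 0.0
    information = 0.0
    for u, theta in zip(responses, thetas):
        p = rasch_probability(theta, b)
        squared_residuals += (u - p) ** 2   # raw squared residual
        information += p * (1.0 - p)        # w_ij, small far from the item
    return squared_residuals / information


# Hypothetical data: unexpected responses only at the extremes of ability.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
surprising = [1, 0, 1, 1, 0]

print(infit_mean_square(surprising, thetas, 0.0))
```

Because residuals from examinees far from the item difficulty enter with small information values, the weighted mean square for this pattern is smaller than the unweighted mean square of Equation 1 would be for the same data.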

Both statistics, Outfit and Infit, have been standardized under the normal distribution by the following transformation:

$$t = \left(\mathrm{MS}^{1/3} - 1\right)\frac{3}{\sigma} + \frac{\sigma}{3} \qquad (5)$$

where MS is the mean square of Equation 1 or 3, and $\sigma$ is the standard deviation of Equation 2 or 4. As Lz and the indexes ECI2z and ECI4z of Tatsuoka (1984) are also standardized under the normal distribution, in this study we will use the t transformation of the Outfit and Infit mean squares, which we shall call T-outfit and T-infit. Values of T-outfit and T-infit lower than -2 indicate less variation than expected by the model, which means that the response pattern is fairly close to the expected Guttman pattern, whereas values of T-outfit and T-infit higher than +2 indicate that the response pattern obtained has more randomness than expected by the model.

Lz Statistic

The Lz statistic is calculated by:

$$l_z = \frac{l(\theta) - E[l(\theta)]}{\{\mathrm{Var}[l(\theta)]\}^{1/2}} \qquad (6)$$

where

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII].

where $U_{ij}$ and $P_{ij}$ are defined as in the Outfit and Infit statistics, and $Q_{ij} = 1 - P_{ij}$. This statistic follows a standard normal distribution when calculated with true item and subject parameters (Reise, 1990). Negative Lz values are associated with unlikely response patterns, whereas positive values are associated with more consistent response patterns than expected by the model.
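A minimal sketch of Equation 6 is given below. It assumes the usual expressions for the components of the standardized log-likelihood: $l = \sum [U \ln P + (1 - U)\ln Q]$, $E(l) = \sum [P \ln P + Q \ln Q]$, and $\mathrm{Var}(l) = \sum P\,Q\,[\ln(P/Q)]^2$. These expressions, the function name, and the probabilities are assumptions made for illustration.

```python
import math


def lz_statistic(responses, probabilities):
    """Standardized log-likelihood of a 0/1 response vector (assumed standard form).

    responses:     observed 0/1 responses
    probabilities: model probabilities of a correct response, strictly
                   between 0 and 1 and not all equal to .5
    """
    log_lik = 0.0
    expected = 0.0
    variance = 0.0
    for u, p in zip(responses, probabilities):
        q = 1.0 - p
        log_lik += u * math.log(p) + (1 - u) * math.log(q)
        expected += p * math.log(p) + q * math.log(q)
        variance += p * q * math.log(p / q) ** 2
    return (log_lik - expected) / math.sqrt(variance)


# Hypothetical probabilities for four responses, from likely to unlikely correct.
probabilities = [0.9, 0.7, 0.3, 0.1]
print(lz_statistic([1, 1, 0, 0], probabilities))  # pattern agrees with the model
print(lz_statistic([0, 0, 1, 1], probabilities))  # pattern contradicts the model
```

The first call returns a positive value and the second a negative one, matching the interpretation of the sign of Lz given above.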

Eci2z and Eci4z Statistics

Tatsuoka and Linn (1983) developed six caution statistics (ECI1 to ECI6) to detect aberrant response patterns. These statistics were standardized and adapted under the IRT by Tatsuoka (1984). In this study, we will use two statistics (ECI2z and ECI4z) out of the six original ones, because Tatsuoka (1984) suggested that ECI4z and ECI6z have identical standardized forms, and the correlation between ECI1z and ECI2z was very close to 1.

Caution statistics are based on the ratio between two covariances. The numerator is the covariance between the observed item patterns and the test response patterns, whereas the denominator is the covariance between the pattern expected by the model and the Guttman pattern. The mathematical expression of the statistic ECI2z is:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (7)

and of the ECI4z statistic:

[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (8)

where $U_{ij}$, $P_{ij}$, and $Q_{ij} = 1 - P_{ij}$ are defined as in the previous statistics, $G_i = \frac{1}{n}\sum_{j=1}^{n} P_{ij}$, $\mu_G = \frac{1}{N}\sum_{i=1}^{N} G_i$, $\mu_P = \frac{1}{N}\sum_{i=1}^{N} P_{ij}$, $n$ is the number of items, and $N$ is the number of persons.

Thus, ECI2z compares the pattern of item scores with the mean probability through the test items, whereas ECI4z compares the pattern of item scores with the expected probability according to the Rasch model. Low values, around 0, of these statistics represent a good fit of the data to the proposed model (Birenbaum, 1986).

Previous Studies

When a test is being developed, it is important to decide what kind of items to use as indicators of the latent construct. In psychometric tests, closed-answer items are generally used, that is, items with several options from which examinees must select one. If the test is an achievement, skill, or ability test, there will only be one correct choice, which means that some examinees, usually low-ability ones, will try to guess the correct answer randomly, thus artificially altering the parameters of these items (Meijer, 1996). In the Rasch model, the probability of a minimum-ability examinee guessing an item correctly is 0 (Rasch, 1960; Wright & Stone, 1979), so the item fit statistics employed in this model must detect the items that have received a higher than expected percentage of random responses. Moreover, the fit statistics should maintain their distributional characteristics even when the conditions under which the test is administered are not optimal.

Research carried out till now, however, has revealed that the fit statistics show problems in their distributions when the conditions of response-pattern evaluation are not optimal, that is, when these response patterns do not clearly fulfill the assumptions of the model. Thus, Meijer and Sijtsma (2001) stated that it is doubtful whether the t transformation of the Outfit and Infit mean squares follows a normal distribution, although Rogers and Hattie (1987) found that they were sensitive to guessing. Smith (1991) found that the t transformations only followed a normal standardized distribution when they were calculated from true parameters, but when calculated from item or examinee estimated parameters, or from both, this produced severe restrictions in the means and standard deviations, which affected Type I error rate. However, Smith (1991) found that these transformations were sensitive to random guessing of items.

Regarding the statistics used in the field of measurement of aberrant patterns, Molenaar and Hoijtink (1990, 1996) found that the statistic Lz only followed a normal standardized distribution when calculated from true parameters, and its variance, calculated from estimated parameters, was smaller than the one expected under normal distribution (Molenaar & Hoijtink, 1990; Nering, 1995, 1997; Reise, 1995). Noonan, Boss, and Gessaroli (1992) also found that the distribution of Lz was negatively skewed.

Drasgow, Levine, and McLaughlin (1987) stated that ECI4z was better standardized (Li & Olejnik, 1997) and had a higher detection rate than ECI2z. On the other hand, Noonan et al. (1992) said that ECI4z had means and standard deviations close to the normal distribution, although the distributions were positively skewed, and this skewness was less than one half of that of the other statistics (ECI2z and T-Infit) and, moreover, was less affected by test length.

The object of this investigation is to study the power of three statistics normally used in the field of appropriateness measure: Lz of Drasgow and Levine (1986), and Eci2z and Eci4z (Tatsuoka, 1984) as item fit statistics under the Rasch model, and to compare them with the statistics (Outfit and Infit), and their corresponding t transformations, generally used to study item fit in the context of this model. The distributional properties of these five statistics in the context of random item guessing will also be studied.

Method

Experimental Conditions

An item with k response options has a 1/k probability of being randomly guessed correctly. In IRT, this probability is expressed in the pseudo-guessing parameter, defined as the probability of a low-ability examinee correctly guessing an item randomly (Hambleton & Swaminathan, 1985; Lord, 1980). To evaluate the extent to which the five statistics can detect this alteration with regard to the Rasch model, four sample sizes were selected: 100, 250, 500, and 1000 subjects. Each sample size was simulated under two types of distribution, uniform and normal, with mean 0 and standard deviation 1. In order to increase random guessing, we subtracted one unit from each value in the original samples, resulting in two new distributions (uniform and normal) for each sample size, with mean -1 and standard deviation 1.

Two test lengths were selected, 15 and 30 items, to observe whether any differential effect was produced as a function of test length. The discrimination parameters (a) of the items in all tests were 1.00 (as expected in the Rasch model), whereas for the difficulty parameters (b), two uniform distributions were employed: ±1 logits and ±2 logits. Finally, the pseudo-guessing parameter for all items was fixed at 0.25. In summary, a total of 64 conditions, 4 (Sample Size) x 4 (Type of Distribution) x 2 (Test Length) x 2 (Distribution of b), were examined. Each experimental condition was replicated 50 times.
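The crossing of factors can be written out explicitly. The sketch below enumerates the design; the variable names and the tuple encoding of each condition are hypothetical.

```python
from itertools import product

sample_sizes = [100, 250, 500, 1000]
# Shape of the ability distribution and its mean (standard deviation 1 in all cases).
ability_distributions = [("normal", 0), ("uniform", 0), ("normal", -1), ("uniform", -1)]
test_lengths = [15, 30]
difficulty_half_ranges = [1, 2]   # b uniform on +/-1 or +/-2 logits
replications = 50

conditions = list(product(sample_sizes, ability_distributions,
                          test_lengths, difficulty_half_ranges))

print(len(conditions))                  # 64 conditions
print(len(conditions) * replications)   # 3200 simulated data sets
```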

Generation of Item-Response Data

The item-response data were generated under the 3-p model with all the discrimination parameters set at 1. The item responses were generated with a computer program that works as follows: Using the examinees' true ability parameters and the true discrimination, difficulty, and pseudo-guessing parameters of the items, it calculates the probability of responding correctly to the item ($P_{ij}$). The program subsequently generates a random number R in the range [0, 1] and compares it with $P_{ij}$. If R < $P_{ij}$, the response is 1; otherwise, the response is 0.
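A minimal sketch of such a generator is shown below. It assumes the logistic 3-p form $P = c + (1 - c)/(1 + e^{-a(\theta - b)})$ without a scaling constant, and it scores a response as correct when the random number falls below the model probability. The function names, the seed, and the small set of abilities and difficulties are hypothetical and do not reproduce the original program.

```python
import math
import random


def three_pl_probability(theta, a, b, c):
    """Probability of a correct response under the 3-p logistic model (assumed form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_responses(thetas, difficulties, a=1.0, c=0.25, seed=0):
    """Return a persons-by-items matrix of 0/1 responses."""
    rng = random.Random(seed)
    data = []
    for theta in thetas:
        row = []
        for b in difficulties:
            p = three_pl_probability(theta, a, b, c)
            r = rng.random()                 # uniform random number in [0, 1)
            row.append(1 if r < p else 0)    # correct with probability p
        data.append(row)
    return data


# Hypothetical example: 4 examinees and 5 items with difficulties in +/-1 logits.
thetas = [-1.5, -0.5, 0.5, 1.5]
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
for row in simulate_responses(thetas, difficulties):
    print(row)
```

With c = .25, the probability of a correct response never falls below .25 however low the ability, which is the feature the Rasch model cannot represent.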

Fit Evaluation

To examine the behavior of the fit statistics, the ability and item parameters were estimated in each replication using the incorrect model, that is, the Rasch model. Parameter estimation was performed with the ConQuest program (Wu, Adams, & Wilson, 1998). These parameters, together with the original response matrices, were the input to a new computer program that calculated the five fit statistics: T-outfit, T-infit, Lz, ECI2z, and ECI4z. Subsequently, with SYSTAT 10.0, the basic statistics (means and standard deviations) and the power of each statistic to detect items that did not fit the model were determined. To determine the power, a cutoff score of ±2 was employed, which corresponds roughly to the nominal rate $\alpha$ = .05.
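The power calculation amounts to counting the items whose standardized statistic falls beyond the cutoff. The sketch below assumes a two-sided rule with cutoff 2; the function name and the values are hypothetical.

```python
def detection_rate(standardized_values, cutoff=2.0):
    """Proportion of items whose standardized fit statistic exceeds the cutoff in absolute value."""
    flagged = sum(1 for value in standardized_values if abs(value) > cutoff)
    return flagged / len(standardized_values)


# Hypothetical T-outfit values for the 15 items of one replication.
t_outfit = [-2.4, -0.3, 0.8, 2.1, 1.9, 0.0, -1.2, 3.0, 0.4, -0.7,
            1.1, -2.0, 0.6, 2.6, -0.1]
print(detection_rate(t_outfit))
```

Averaging this proportion over the replications of a condition gives the percentage of items detected in that condition.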

Results

Basic Statistics

Tables 1 and 2 display the means and standard deviations obtained for each of the fit statistics studied in each of the manipulated conditions. As expected, the fit statistics revealed considerable differences in their standardization when calculated from the estimated parameters. Thus, in this study, Lz means were higher than 0, indicating that the response patterns obtained for the items were more consistent than those expected by Rasch's model, and they increased systematically with sample size, group ability level, and amplitude of the test-difficulty interval. Thus, in a 15-item test, the Lz mean changed from .075 (N = 100) to .199 (N = 1000) when the distribution of group ability was normal and equal to the test difficulty mean, and the test-difficulty interval was ±1 logits (see Table 1). If the difficulty interval was increased to ±2 logits, then the Lz mean was .218 for N = 100, increasing to .699 for N = 1000. When the fit statistics were calculated in a lower ability group, N(-1, 1), they were affected by the same conditions. Thus, if the test-difficulty interval was ±1 logits, the mean was .112 for N = 100, which increased to .304 for N = 1000; and if the difficulty interval was ±2 logits, then the mean changed from .232 for N = 100 to .715 for N = 1000. This pattern occurred regardless of whether the ability distribution was normal or uniform.

If the test length was increased to 30 items, a similar pattern was observed, although with a difficulty interval of ±1 logits the mean of the Lz statistic was approximately 0 whether the ability distribution was normal or uniform. Thus, the Lz mean was between .044 (N = 100) and -.002 (N = 500) when the ability distribution was normal, and between -.011 (N = 1000) and .008 (N = 500) when the distribution was uniform. However, in a lower ability group, the Lz mean was between .061 (N = 100) and .181 (N = 1000) when the ability distribution was normal, and it increased from .051 (N = 100) to .169 (N = 1000) when the ability distribution was uniform. When the difficulty interval was increased to ±2 logits, an increase in the Lz mean was observed in all experimental conditions, from a small sample size (N = 100) to a large one (N = 1000).

The statistics based on residuals (T-outfit and T-infit) presented a pattern similar to that obtained by the Lz statistic. In some conditions, the means were negative, indicating that the simulated response patterns have less variation than expected by the model, approaching the Guttman pattern. Thus, in short tests (n = 15) with normal ability distribution and difficulty interval of ±1 logits, the T-outfit mean was -.050 (N = 100), which decreased to -.200 (N = 1000), and the T-infit mean was -.068 (N = 100), which decreased to -.128 (N = 1000). Again, the same effect was observed when the ability distribution was uniform, that is, the T-outfit mean was -.063 (N = 100), decreasing to -.220 (N = 1000), and the T-infit mean was -.047 (N = 100), decreasing to -.165 (N = 1000). When the difficulty interval increased, the T-outfit and T-infit means in all experimental conditions decreased. The same occurred when the fit statistics were calculated in a lower ability group (see Table 1).

When test length was increased to 30 items, the distribution was centered, and the test difficulty interval was ±1 logits, an effect similar to that observed in Lz was obtained. The T-outfit mean was between .001 (N = 100) and -.037 (N = 500), whereas the T-infit mean was -.025 (N = 500), which changed to .023 (N = 1000). In a lower ability group, the T-outfit mean was between -.069 (N = 100) and -.150 (N = 1000), whereas the T-infit mean was between -.062 (N = 100) and -.199 (N = 1000). If the distribution was uniform, the T-outfit mean was between -.027 (N = 100) and -.164 (N = 1000), and the T-infit mean was between -.058 (N = 100) and -.203 (N = 1000). If the test difficulty interval was increased to ±2 logits, all T-outfit and T-infit means were considerably reduced.

However, the means of the statistics Eci2z and Eci4z were similar in the experimental conditions manipulated in this study. Thus, when using a short test (n = 15), difficulty interval of logits, and normal ability distribution, the Eci2z mean was between -.002 (N = 100) and -.007 (N = 1000; see Table 1), and the Eci4z mean was between -.024 (N = 100) and -.046 (N = 1000). The Eci4z mean took an unexpected value only in a few conditions. Thus, with a 30-item test, difficulty interval of ±2 logits, and normal ability distribution (see Table 2), the Eci4z mean was -.122 (N = 500) and -.163 (N = 1000). In the same experimental conditions, but with uniform distribution, the Eci4z mean was -.141 (N = 1000).

Regarding the standard deviations, a more or less common pattern in all five fit statistics was observed. Thus, the standard deviations only maintained their expected value of 1 when the sample size was relatively small (100 and 250), but they increased considerably when the sample size was 500 or higher. That is, with a short test (n = 15), difficulty interval of ±1 logits, and normal and centered ability distribution (see Table 1), the Lz standard deviation was .978 (N = 100) and .988 (N = 250), but it increased to 1.442 (N = 1000). In the lower ability group, the standard deviation was between .959 (N = 100) and 1.573 (N = 1000).

The variability of the fit statistics increased considerably when the test difficulty interval was ±2 logits, especially when the sample size was 500 or more. Thus, with a short test (n = 15) and normal ability distribution, the standard deviation of Lz was 1.235 (N = 500) and 1.605 (N = 1000), slightly higher than those obtained in the same sample sizes at the difficulty interval of ±1 logits (see Table 1). If the ability distribution was uniform, then a small increase in the standard deviations of all statistics was observed in the various experimental conditions. This was systematically repeated in all fit statistics, with high standard deviations observed in the Eci2z and Eci4z statistics when the sample size was 500 or more persons. A similar effect was observed in the standard deviations of the Lz, T-outfit, and T-infit statistics when increasing test length to 30 items, but very few relevant differences from the short tests (n = 15) were observed. That is, the standard deviations of these statistics changed as a function of the type of ability distribution (normal vs. uniform), the mean group ability, and the sample size, but no appreciable changes were observed due to the increase in test length. Thus, with n = 15 items, normal ability distribution, and difficulty interval of ±1 logits (see Table 1), the standard deviation of T-outfit was 1.435 (N = 1000), and in the same conditions but with a 30-item test (see Table 2), the standard deviation of T-outfit was 1.431 (N = 1000). Only the standard deviations of Eci2z and Eci4z increased slightly because of the increase in test length. Thus, with a 15-item test, normal ability distribution, and difficulty interval of logits (see Table 1), the standard deviation of Eci2z was 1.458 (N = 1000), but if the test length was increased to 30 items, the standard deviation of Eci2z increased to 1.612 (N = 1000).

Power of Fit Statistics

Tables 3 and 4 show the power of each of the fit statistics examined in each of the manipulated conditions. As all the items were simulated under the modified 3-p model with the probability of randomly correct guessing at c = .25, it was expected that none of the items would be detected as fitting the model, given that, for the Rasch model, the probability of random correct guessing is 0. However, Table 3 (n = 15) and Table 4 (n = 30) provide very different results from those expected. Generally, the power was low or very low, especially in the Lz, T-outfit, and T-infit statistics, and somewhat higher in Eci2z and Eci4z. When the ability distribution was normal, regardless of whether the test length was short (n = 15) or longer (n = 30), the power of the fit statistics was very low when sample size was 500 or more. For example, when the test difficulty interval was ±1 logits, at the sample size of 500 and normal ability distribution, the power of the five fit statistics was between 9% (Lz and T-infit) and 11% (Eci2z and Eci4z). If the fit statistics were calculated in the lower ability group, the power was between 12% (T-infit) and 18% (Eci2z and Eci4z).

As expected, the power increased with sample size in all experimental conditions, and at sample size N = 1000, using a lower ability group than the mean test difficulty, uniform distribution, and item difficulty interval of ±2 logits, the power of the fit statistics was close to 70%. Thus, when n = 30 and the test difficulty interval was ±1 logits, the power of Eci2z and Eci4z was 50% and 49%, respectively, which increased to 71% and 70% when the test difficulty interval was increased to ±2 logits. The increase was also significant for the Lz, T-outfit, and T-infit statistics, which, at the difficulty interval of ±1 logits, obtained power values of 44%, 41%, and 43%, respectively; and they increased to 61%, 65%, and 56% when the difficulty interval was ±2 logits.

Conclusions

In this simulation study, we examined whether three fit statistics that are habitually used in the area of detection of aberrant response patterns (appropriateness measure) can also be used to detect items that do not fulfill the assumptions of the Rasch model.

In view of the results obtained, it seems that the usefulness of the item fit statistics (T-outfit and T-infit) and of the statistics of appropriateness measure (Lz, Eci2z, and Eci4z) is very limited because when using estimated parameters, low or very low detection rates, usually with sample sizes of less than 500 examinees, were detected.

In any case, all the computer programs for parameter estimation with IRT models include one or more item fit statistics, so psychologists should decide whether these fit statistics are useful when making decisions to select items. If they decide that they are useful, the following information should be taken into account. First, the Lz, T-outfit, and T-infit statistics do not tend toward the expected values when they are calculated using estimated parameters. These results are in accordance with those reported by Smith (1991) on the evaluation of item fit and by Molenaar and Hoijtink (1990, 1996) on the evaluation of person fit. As occurs in person-fit evaluation (Li & Olejnik, 1997; Noonan et al., 1992), the distributions of these statistics depend on sample size, test-difficulty interval amplitude, group ability level, and test length. Nevertheless, if the item difficulty interval is relatively narrow, as the test length increases, the properties of the distributions of the Lz, T-outfit, and T-infit statistics improve considerably.

Second, despite the fact that the Lz, T-outfit, and T-infit statistics are based on different concepts to evaluate the fit of the items to the model, no important differences were observed in the distributional properties of these statistics. In fact, their behavior was observed to be similar independently of the factors manipulated in this study.

Third, the behavior of the statistics Eci2z and Eci4z (Tatsuoka, 1984) is satisfactory in all experimental conditions, except for some isolated cases of Eci4z. Therefore, it seems that the mean of their distribution is relatively stable, independently of test length, sample size, type of ability distribution (normal vs. uniform), and group ability level. The same cannot be said about their variability, which increased as a function of sample size, test length, difficulty interval amplitude, and type of ability distribution. These results partially contrast with those found when these statistics are applied to evaluate person fit. Thus, Noonan et al. (1992) found that the ECI4z statistic fit a normal distribution better, in both the mean and the standard deviation, than the ECI2z statistic and the Lz, T-outfit, and T-infit statistics, showing a less skewed distribution and less influence of sample size. However, the Noonan et al. (1992) study used true parameters, not estimated ones.

Fourth, a small sample size (N = 250 or less) may be sufficient (Lord, 1983) to obtain estimations that are consistent with item difficulty parameters, but it will not help to decide whether or not the items fit the Rasch model, because the power of the five fit statistics was relatively low or very low.

Fifth, the power of the fit statistics drops drastically when the item is a multiple-choice item and can be guessed correctly at random. In this case, not even a large sample size (N = 1000) ensures sufficient power of any of the five statistics to guarantee that an item does not follow the assumption of the Rasch model, where the probability of correct guessing at random is 0. In any case, if N = 250 or higher, the ECI2z and ECI4z statistics present higher power to detect these items than the statistics based on the likelihood function (Lz) or on T-outfit and T-infit residuals.

Finally, in practically all the experimental conditions manipulated, the power of the Eci2z and Eci4z statistics was between 5% and 10% higher than that of Lz, T-outfit, and T-infit, and these differences were even greater when the ability distribution was normal, its mean was the same as the test mean, and the sample size was less than 1000 cases.

In view of these results, perhaps some of the well-known computer programs used to examine the fit of the Rasch model should include the Eci2z and Eci4z statistics as item and person fit statistics.

Received: November 18, 2004

Revision received: January 1, 2005

Accepted: January 26, 2005

References

Birenbaum, M. (1986). Effect of dissimulation motivation and anxiety on response pattern appropriateness measures. Applied Psychological Measurement, 10, 167-174.

Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-472). Reading, MA: Addison-Wesley.

Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.

Drasgow, F., & Levine, M. V. (1986). Optimal detection of certain forms of inappropriate test scores. Applied Psychological Measurement, 10, 59-67.

Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11, 59-79.

Fischer, G. H., & Molenaar, I. W. (Eds.) (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.

Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff.

Harnish, D. L., & Tatsuoka, K. K. (1983). A comparison of appropriateness indices based on item response theory. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 104-122). Vancouver, Canada: Educational Research Institute of British Columbia.

Hulin, Ch. L., Drasgow, F., & Parsons, Ch. K. (1983). Item response theory: Application to psychological measurement. Homewood, IL: Dow-Jones Irwin.

Levine, M. V., & Rubin, D. F. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4, 269-290.

Li, M. F., & Olejnik, S. (1997). The power of Rasch person-fit statistics in detecting unusual response patterns. Applied Psychological Measurement, 21, 215-231.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Lord, F. M. (1983). Small N justifies Rasch model. In D. J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing (pp. 51-61). New York: Academic Press.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Meijer, R. R. (1996). The influence of the presence of deviant item score patterns on the power of a person-fit statistic. Applied Psychological Measurement, 20, 141-154.

Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.

Molenaar, I. W., & Hoijtink, H. (1990). The many null distributions of person-fit indices. Psychometrika, 55, 75-106.

Molenaar, I. W., & Hoijtink, H. (1996). Person-fit and the Rasch model, with an application to knowledge of logical quantors. Applied Measurement in Education, 9, 27-45.

Nering, M. L. (1995). The distribution of person fit using true and estimated person parameters. Applied Psychological Measurement, 19, 121-129.

Nering, M. L. (1997). The distribution of indexes of person-fit within the computerized adaptive testing environment. Applied Psychological Measurement, 21, 115-127.

Noonan, B. W., Boss, M. W., & Gessaroli, M. E. (1992). The effect of test length and IRT model on the distribution and stability of three appropriateness indexes. Applied Psychological Measurement, 16, 345-352.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute of Educational Research. (Expanded edition, 1980. Chicago: The University of Chicago Press.)

Reise, S. P. (1990). A comparison of item- and person-fit methods of assessing model-data fit in IRT. Applied Psychological Measurement, 14, 127-137.

Reise, S. P. (1995). Scoring method and the detection of person misfit in a personality assessment context. Applied Psychological Measurement, 19, 213-229.

Rogers, H. J., & Hattie, J. A. (1987). A Monte Carlo investigation of several person and item fit statistics for item response models. Applied Psychological Measurement, 11, 47-57.

Smith, R. M. (1991). The distributional properties of Rasch item-fit statistics. Educational and Psychological Measurement, 51, 541-565.

Smith, R. M., Schumacker, R. E., & Bush, M. J. (1998). Using item mean squares to evaluate fit to the Rasch model. Journal of Outcome Measurement, 2, 66-78.

Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52, 589-617.

SYSTAT (v. 10.0) (2000). The system for statistics. SPSS Inc.

Tatsuoka, K. K. (1984). Caution indices based on item response theory. Psychometrika, 49, 95-110.

Tatsuoka, K. K., & Linn, R. L. (1983). Indices for detecting unusual response patterns: Links between two general approaches and potential applications. Applied Psychological Measurement, 7, 81-96.

Van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer-Verlag.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.

Wright, B. D., & Stone, M. (1979). Best test design. Chicago: MESA Press.

Wu, M. L., Adams, R. J., & Wilson, M. R. (1998). ACER ConQuest: Generalised item response modelling software. Melbourne, Australia: Australian Council for Educational Research.

Jose Antonio Lopez Pina and M. Dolores Hidalgo Montesinos

University of Murcia

Correspondence should be addressed to: Jose A. Lopez Pina, Depto. de Psicologia Basica y Metodologia, Facultad de Psicologia, Campus de Espinardo, 30100-Murcia (Spain). Phone: 968-363478. Fax: 968-364115. E-mail: jlpina@um.es

Translation: Virginia Navascues Howard

Table 1. Means and Standard Deviations (in parentheses) of the Five Fit Statistics in the 15-Item Test with Normal and Uniform Distributions, Two Difficulty Intervals, and Four Sample Sizes

| Difficulty interval | Distribution | Statistic | N = 100 | N = 250 | N = 500 | N = 1000 |
|---|---|---|---|---|---|---|
| [-1, +1] | Normal (0, 1) | Lz | 0.075 (0.978) | 0.108 (0.988) | 0.141 (1.184) | 0.199 (1.442) |
| [-1, +1] | Normal (0, 1) | T-outfit | -0.050 (0.958) | -0.097 (1.035) | -0.152 (1.191) | -0.200 (1.435) |
| [-1, +1] | Normal (0, 1) | T-infit | -0.068 (0.900) | -0.099 (0.966) | -0.128 (1.165) | -0.190 (1.412) |
| [-1, +1] | Normal (0, 1) | Eci2z | -0.002 (1.034) | -0.003 (1.057) | -0.005 (1.241) | -0.007 (1.458) |
| [-1, +1] | Normal (0, 1) | Eci4z | -0.024 (1.036) | -0.022 (1.055) | -0.031 (1.235) | -0.046 (1.444) |
| [-1, +1] | Normal (-1, 1) | Lz | 0.112 (0.959) | 0.161 (1.104) | 0.223 (1.263) | 0.304 (1.573) |
| [-1, +1] | Normal (-1, 1) | T-outfit | -0.087 (0.983) | -0.151 (1.100) | -0.195 (1.262) | -0.283 (1.552) |
| [-1, +1] | Normal (-1, 1) | T-infit | -0.113 (0.958) | -0.168 (1.101) | -0.235 (1.253) | -0.319 (1.560) |
| [-1, +1] | Normal (-1, 1) | Eci2z | -0.002 (1.090) | -0.002 (1.237) | -0.003 (1.438) | -0.003 (1.802) |
| [-1, +1] | Normal (-1, 1) | Eci4z | -0.008 (1.091) | -0.008 (1.236) | -0.014 (1.426) | -0.012 (1.788) |
| [-1, +1] | Uniform (0, 1) | Lz | 0.061 (1.014) | 0.098 (1.187) | 0.135 (1.316) | 0.183 (1.728) |
| [-1, +1] | Uniform (0, 1) | T-outfit | -0.063 (1.008) | -0.110 (1.164) | 0.148 (1.304) | -0.220 (1.677) |
| [-1, +1] | Uniform (0, 1) | T-infit | -0.047 (1.001) | -0.089 (1.169) | -0.123 (1.295) | -0.165 (1.703) |
| [-1, +1] | Uniform (0, 1) | Eci2z | 0.002 (1.105) | 0.001 (1.232) | -0.001 (1.340) | -0.002 (1.723) |
| [-1, +1] | Uniform (0, 1) | Eci4z | -0.012 (1.111) | -0.019 (1.230) | -0.025 (1.327) | -0.030 (1.704) |
| [-1, +1] | Uniform (-1, 1) | Lz | 0.098 (1.030) | 0.139 (1.235) | 0.200 (1.516) | 0.280 (1.950) |
| [-1, +1] | Uniform (-1, 1) | T-outfit | -0.067 (1.029) | -0.141 (1.200) | -0.190 (1.472) | -0.281 (1.864) |
| [-1, +1] | Uniform (-1, 1) | T-infit | -0.111 (1.028) | -0.146 (1.241) | -0.218 (1.515) | -0.307 (1.948) |
| [-1, +1] | Uniform (-1, 1) | Eci2z | 0.000 (1.133) | -0.003 (1.357) | -0.005 (1.677) | -0.005 (2.178) |
| [-1, +1] | Uniform (-1, 1) | Eci4z | -0.017 (1.132) | -0.006 (1.352) | -0.014 (1.663) | -0.016 (2.150) |
| [-2, +2] | Normal (0, 1) | Lz | 0.218 (0.860) | 0.359 (0.988) | 0.481 (1.235) | 0.699 (1.605) |
| [-2, +2] | Normal (0, 1) | T-outfit | -0.145 (0.961) | -0.331 (1.086) | -0.457 (1.378) | -0.650 (1.769) |
| [-2, +2] | Normal (0, 1) | T-infit | -0.208 (0.829) | -0.333 (0.945) | -0.474 (1.165) | -0.708 (1.519) |
| [-2, +2] | Normal (0, 1) | Eci2z | -0.004 (1.039) | -0.016 (1.208) | -0.019 (1.486) | -0.019 (1.959) |
| [-2, +2] | Normal (0, 1) | Eci4z | -0.052 (1.045) | -0.066 (1.197) | -0.095 (1.442) | -0.130 (1.872) |
| [-2, +2] | Normal (-1, 1) | Lz | 0.232 (0.936) | 0.354 (1.058) | 0.490 (1.284) | 0.715 (1.638) |
| [-2, +2] | Normal (-1, 1) | T-outfit | -0.190 (0.995) | -0.323 (1.135) | -0.413 (1.415) | -0.638 (1.819) |
| [-2, +2] | Normal (-1, 1) | T-infit | -0.237 (0.910) | -0.366 (1.027) | -0.524 (1.230) | -0.761 (1.551) |
| [-2, +2] | Normal (-1, 1) | Eci2z | -0.006 (1.194) | -0.002 (1.382) | -0.011 (1.701) | -0.009 (2.227) |
| [-2, +2] | Normal (-1, 1) | Eci4z | -0.032 (1.189) | -0.023 (1.373) | -0.063 (1.674) | -0.067 (2.159) |
| [-2, +2] | Uniform (0, 1) | Lz | 0.203 (0.990) | 0.329 (1.204) | 0.460 (1.483) | 0.642 (1.995) |
| [-2, +2] | Uniform (0, 1) | T-outfit | -0.186 (1.084) | -0.340 (1.249) | -0.494 (1.561) | -0.693 (2.113) |
| [-2, +2] | Uniform (0, 1) | T-infit | -0.184 (0.944) | -0.311 (1.163) | -0.439 (1.417) | -0.625 (1.893) |
| [-2, +2] | Uniform (0, 1) | Eci2z | -0.003 (1.153) | -0.004 (1.406) | -0.011 (1.718) | -0.018 (2.206) |
| [-2, +2] | Uniform (0, 1) | Eci4z | -0.044 (1.151) | -0.042 (1.392) | -0.061 (1.663) | -0.098 (2.202) |
| [-2, +2] | Uniform (-1, 1) | Lz | 0.214 (0.989) | 0.335 (1.279) | 0.456 (1.613) | 0.654 (2.168) |
| [-2, +2] | Uniform (-1, 1) | T-outfit | -0.172 (1.097) | -0.309 (1.369) | -0.442 (1.715) | -0.614 (2.314) |
| [-2, +2] | Uniform (-1, 1) | T-infit | -0.221 (0.957) | -0.358 (1.232) | -0.489 (1.541) | -0.720 (2.062) |
| [-2, +2] | Uniform (-1, 1) | Eci2z | 0.000 (1.226) | -0.004 (1.616) | -0.005 (2.080) | -0.011 (2.816) |
| [-2, +2] | Uniform (-1, 1) | Eci4z | -0.023 (1.216) | -0.034 (1.579) | -0.045 (2.008) | -0.074 (2.720) |

Table 2. Means and Standard Deviations (in parentheses) of the Five Fit Statistics in the 30-Item Test with Normal and Uniform Distributions, Two Difficulty Intervals, and Four Sample Sizes

| Difficulty interval | Distribution | Statistic | N = 100 | N = 250 | N = 500 | N = 1000 |
|---|---|---|---|---|---|---|
| [-1, +1] | Normal (0, 1) | Lz | 44 [sic] (0.948) | 0.000 (1.021) | -0.002 (1.204) | -0.001 (1.434) |
| [-1, +1] | Normal (0, 1) | T-outfit | 0.001 (0.968) | -0.025 (1.040) | -0.037 (1.211) | -0.035 (1.431) |
| [-1, +1] | Normal (0, 1) | T-infit | -0.011 (0.919) | 0.021 (0.993) | -0.025 (1.169) | 0.023 (1.393) |
| [-1, +1] | Normal (0, 1) | Eci2z | -0.005 (1.095) | -0.011 (1.169) | -0.017 (1.364) | -0.026 (1.612) |
| [-1, +1] | Normal (0, 1) | Eci4z | -0.019 (1.094) | -0.024 (1.168) | -0.036 (1.354) | -0.062 (1.593) |
| [-1, +1] | Normal (-1, 1) | Lz | 0.061 (1.968) | 0.090 (1.094) | 0.132 (1.306) | 0.181 (1.585) |
| [-1, +1] | Normal (-1, 1) | T-outfit | -0.039 (0.975) | -0.064 (1.098) | -0.116 (1.309) | -0.150 (1.574) |
| [-1, +1] | Normal (-1, 1) | T-infit | -0.062 (0.963) | -0.100 (1.085) | -0.143 (1.293) | -0.199 (1.567) |
| [-1, +1] | Normal (-1, 1) | Eci2z | -0.005 (1.131) | -0.007 (1.297) | -0.013 (1.555) | -0.013 (1.927) |
| [-1, +1] | Normal (-1, 1) | Eci4z | -0.012 (1.133) | -0.019 (1.292) | -0.021 (1.546) | -0.026 (1.910) |
| [-1, +1] | Uniform (0, 1) | Lz | 0.001 (0.990) | 0.012 (1.100) | 0.008 (1.317) | -0.011 (1.719) |
| [-1, +1] | Uniform (0, 1) | T-outfit | -0.004 (0.979) | -0.037 (1.095) | -0.041 (1.287) | -0.066 (1.653) |
| [-1, +1] | Uniform (0, 1) | T-infit | 0.016 (0.962) | 0.009 (1.078) | 0.013 (1.290) | -0.015 (1.683) |
| [-1, +1] | Uniform (0, 1) | Eci2z | -0.003 (1.115) | -0.007 (1.201) | -0.012 (1.437) | -0.013 (1.850) |
| [-1, +1] | Uniform (0, 1) | Eci4z | -0.018 (1.117) | -0.018 (1.200) | -0.034 (1.425) | -0.040 (1.827) |
| [-1, +1] | Uniform (-1, 1) | Lz | 0.051 (1.046) | 0.078 (1.288) | 0.116 (1.629) | 0.169 (2.031) |
| [-1, +1] | Uniform (-1, 1) | T-outfit | -0.037 (1.048) | -0.081 (1.286) | -0.118 (1.594) | -0.164 (1.988) |
| [-1, +1] | Uniform (-1, 1) | T-infit | -0.058 (1.039) | -0.090 (1.277) | -0.140 (1.626) | -0.203 (2.022) |
| [-1, +1] | Uniform (-1, 1) | Eci2z | -0.003 (1.192) | -0.010 (1.447) | -0.012 (1.834) | -0.015 (2.334) |
| [-1, +1] | Uniform (-1, 1) | Eci4z | -0.013 (1.190) | -0.016 (1.436) | -0.023 (1.840) | -0.031 (2.313) |
| [-2, +2] | Normal (0, 1) | Lz | 0.090 (0.858) | 0.126 (1.052) | 0.182 (1.278) | 0.249 (1.654) |
| [-2, +2] | Normal (0, 1) | T-outfit | -0.076 (0.933) | -0.155 (1.148) | -0.216 (1.432) | -0.306 (1.860) |
| [-2, +2] | Normal (0, 1) | T-infit | -0.056 (0.816) | -0.094 (0.992) | -0.154 (1.190) | -0.216 (1.531) |
| [-2, +2] | Normal (0, 1) | Eci2z | -0.021 (1.134) | -0.041 (1.384) | -0.063 (1.699) | -0.082 (2.216) |
| [-2, +2] | Normal (0, 1) | Eci4z | -0.050 (1.135) | -0.077 (1.364) | -0.122 (1.643) | -0.163 (2.133) |
| [-2, +2] | Normal (-1, 1) | Lz | 0.104 (0.879) | 0.175 (1.064) | 0.249 (1.322) | 0.343 (1.706) |
| [-2, +2] | Normal (-1, 1) | T-outfit | -0.085 (0.975) | -0.155 (1.196) | -0.249 (1.485) | -0.341 (1.945) |
| [-2, +2] | Normal (-1, 1) | T-infit | -0.096 (0.837) | -0.181 (1.000) | -0.261 (1.240) | -0.365 (1.582) |
| [-2, +2] | Normal (-1, 1) | Eci2z | -0.011 (1.240) | -0.014 (1.565) | -0.029 (1.981) | -0.038 (2.635) |
| [-2, +2] | Normal (-1, 1) | Eci4z | -0.028 (1.223) | -0.039 (1.534) | -0.059 (1.937) | -0.081 (2.563) |
| [-2, +2] | Uniform (0, 1) | Lz | 0.063 (0.970) | 0.107 (1.273) | 0.162 (1.612) | 0.209 (2.143) |
| [-2, +2] | Uniform (0, 1) | T-outfit | -0.076 (1.032) | -0.185 (1.357) | -0.262 (1.710) | -0.365 (2.290) |
| [-2, +2] | Uniform (0, 1) | T-infit | -0.027 (0.916) | -0.066 (1.201) | -0.124 (1.527) | -0.154 (2.008) |
| [-2, +2] | Uniform (0, 1) | Eci2z | -0.024 (1.211) | -0.040 (1.556) | -0.053 (1.980) | -0.083 (2.653) |
| [-2, +2] | Uniform (0, 1) | Eci4z | -0.052 (1.199) | -0.066 (1.521) | -0.090 (1.917) | -0.141 (2.545) |
| [-2, +2] | Uniform (-1, 1) | Lz | 0.088 (0.977) | 0.156 (1.317) | 0.222 (1.682) | 0.325 (2.283) |
| [-2, +2] | Uniform (-1, 1) | T-outfit | -0.069 (1.063) | -0.185 (1.442) | -0.267 (1.682) | -0.369 (2.498) |
| [-2, +2] | Uniform (-1, 1) | T-infit | -0.089 (0.931) | -0.163 (1.243) | -0.236 (1.585) | -0.366 (2.135) |
| [-2, +2] | Uniform (-1, 1) | Eci2z | -0.007 (1.235) | -0.018 (1.823) | -0.026 (2.372) | -0.032 (3.238) |
| [-2, +2] | Uniform (-1, 1) | Eci4z | -0.030 (1.322) | -0.034 (1.782) | -0.046 (2.312) | -0.074 (3.142) |

Table 3. Power of the Five Fit Statistics in the 15-Item Test with Normal and Uniform Distributions, Two Difficulty Intervals, and Four Sample Sizes

| Difficulty interval | Distribution | Statistic | N = 100 | N = 250 | N = 500 | N = 1000 |
|---|---|---|---|---|---|---|
| [-1, +1] | Normal (0, 1) | Lz | .03 | .05 | .09 | .14 |
| [-1, +1] | Normal (0, 1) | T-outfit | .04 | .05 | .10 | .15 |
| [-1, +1] | Normal (0, 1) | T-infit | .03 | .04 | .09 | .14 |
| [-1, +1] | Normal (0, 1) | Eci2z | .06 | .06 | .11 | .16 |
| [-1, +1] | Normal (0, 1) | Eci4z | .06 | .07 | .11 | .16 |
| [-1, +1] | Normal (-1, 1) | Lz | .04 | .08 | .13 | .26 |
| [-1, +1] | Normal (-1, 1) | T-infit | .05 | .08 | .13 | .24 |
| [-1, +1] | Normal (-1, 1) | T-outfit | .05 | .07 | .12 | .25 |
| [-1, +1] | Normal (-1, 1) | Eci2z | .08 | .13 | .18 | .32 |
| [-1, +1] | Normal (-1, 1) | Eci4z | .08 | .12 | .18 | .31 |
| [-1, +1] | Uniform (0, 1) | Lz | .06 | .09 | .11 | .30 |
| [-1, +1] | Uniform (0, 1) | T-outfit | .05 | .09 | .13 | .29 |
| [-1, +1] | Uniform (0, 1) | T-infit | .06 | .09 | .11 | .29 |
| [-1, +1] | Uniform (0, 1) | Eci2z | .08 | .12 | .13 | .29 |
| [-1, +1] | Uniform (0, 1) | Eci4z | .08 | .11 | .13 | .27 |
| [-1, +1] | Uniform (-1, 1) | Lz | .05 | .12 | .22 | .40 |
| [-1, +1] | Uniform (-1, 1) | T-outfit | .05 | .09 | .20 | .37 |
| [-1, +1] | Uniform (-1, 1) | T-infit | .05 | .13 | .22 | .39 |
| [-1, +1] | Uniform (-1, 1) | Eci2z | .07 | .16 | .27 | .45 |
| [-1, +1] | Uniform (-1, 1) | Eci4z | .08 | .16 | .27 | .44 |
| [-2, +2] | Normal (0, 1) | Lz | .03 | .04 | .10 | .30 |
| [-2, +2] | Normal (0, 1) | T-outfit | .04 | .05 | .18 | .38 |
| [-2, +2] | Normal (0, 1) | T-infit | .03 | .04 | .09 | .23 |
| [-2, +2] | Normal (0, 1) | Eci2z | .06 | .10 | .20 | .36 |
| [-2, +2] | Normal (0, 1) | Eci4z | .06 | .09 | .17 | .34 |
| [-2, +2] | Normal (-1, 1) | Lz | .03 | .06 | .15 | .35 |
| [-2, +2] | Normal (-1, 1) | T-infit | .04 | .07 | .21 | .42 |
| [-2, +2] | Normal (-1, 1) | T-outfit | .03 | .06 | .14 | .32 |
| [-2, +2] | Normal (-1, 1) | Eci2z | .10 | .15 | .27 | .44 |
| [-2, +2] | Normal (-1, 1) | Eci4z | .09 | .15 | .25 | .44 |
| [-2, +2] | Uniform (0, 1) | Lz | .04 | .10 | .18 | .53 |
| [-2, +2] | Uniform (0, 1) | T-outfit | .06 | .12 | .24 | .57 |
| [-2, +2] | Uniform (0, 1) | T-infit | .04 | .09 | .15 | .43 |
| [-2, +2] | Uniform (0, 1) | Eci2z | .08 | .16 | .26 | .46 |
| [-2, +2] | Uniform (0, 1) | Eci4z | .07 | .16 | .25 | .44 |
| [-2, +2] | Uniform (-1, 1) | Lz | .04 | .11 | .24 | .55 |
| [-2, +2] | Uniform (-1, 1) | T-outfit | .05 | .16 | .34 | .56 |
| [-2, +2] | Uniform (-1, 1) | T-infit | .04 | .11 | .27 | .52 |
| [-2, +2] | Uniform (-1, 1) | Eci2z | .11 | .24 | .43 | .63 |
| [-2, +2] | Uniform (-1, 1) | Eci4z | .11 | .24 | .40 | .61 |

Table 4. Power of the Five Fit Statistics in the 30-Item Test with Normal and Uniform Distributions, Two Difficulty Intervals, and Four Sample Sizes

| Difficulty interval | Distribution | Statistic | N = 100 | N = 250 | N = 500 | N = 1000 |
|---|---|---|---|---|---|---|
| [-1, +1] | Normal (0, 1) | Lz | .04 | .05 | .10 | .15 |
| [-1, +1] | Normal (0, 1) | T-outfit | .05 | .05 | .09 | .16 |
| [-1, +1] | Normal (0, 1) | T-infit | .04 | .05 | .09 | .14 |
| [-1, +1] | Normal (0, 1) | Eci2z | .07 | .10 | .16 | .23 |
| [-1, +1] | Normal (0, 1) | Eci4z | .07 | .09 | .15 | .23 |
| [-1, +1] | Normal (-1, 1) | Lz | .04 | .07 | .13 | .27 |
| [-1, +1] | Normal (-1, 1) | T-outfit | .04 | .07 | .13 | .24 |
| [-1, +1] | Normal (-1, 1) | T-infit | .05 | .07 | .12 | .25 |
| [-1, +1] | Normal (-1, 1) | Eci2z | .08 | .13 | .21 | .38 |
| [-1, +1] | Normal (-1, 1) | Eci4z | .08 | .13 | .21 | .37 |
| [-1, +1] | Uniform (0, 1) | Lz | .05 | .07 | .13 | .25 |
| [-1, +1] | Uniform (0, 1) | T-outfit | .05 | .07 | .12 | .25 |
| [-1, +1] | Uniform (0, 1) | T-infit | .05 | .07 | .12 | .25 |
| [-1, +1] | Uniform (0, 1) | Eci2z | .07 | .10 | .18 | .32 |
| [-1, +1] | Uniform (0, 1) | Eci4z | .08 | .10 | .17 | .32 |
| [-1, +1] | Uniform (-1, 1) | Lz | .06 | .12 | .26 | .44 |
| [-1, +1] | Uniform (-1, 1) | T-outfit | .06 | .12 | .25 | .41 |
| [-1, +1] | Uniform (-1, 1) | T-infit | .06 | .12 | .26 | .43 |
| [-1, +1] | Uniform (-1, 1) | Eci2z | .10 | .17 | .34 | .50 |
| [-1, +1] | Uniform (-1, 1) | Eci4z | .10 | .17 | .33 | .49 |
| [-2, +2] | Normal (0, 1) | Lz | .03 | .06 | .11 | .23 |
| [-2, +2] | Normal (0, 1) | T-outfit | .04 | .08 | .15 | .35 |
| [-2, +2] | Normal (0, 1) | T-infit | .02 | .05 | .09 | .19 |
| [-2, +2] | Normal (0, 1) | Eci2z | .08 | .15 | .26 | .48 |
| [-2, +2] | Normal (0, 1) | Eci4z | .08 | .15 | .25 | .46 |
| [-2, +2] | Normal (-1, 1) | Lz | .03 | .05 | .10 | .33 |
| [-2, +2] | Normal (-1, 1) | T-outfit | .04 | .08 | .19 | .45 |
| [-2, +2] | Normal (-1, 1) | T-infit | .02 | .04 | .08 | .23 |
| [-2, +2] | Normal (-1, 1) | Eci2z | .11 | .22 | .37 | .60 |
| [-2, +2] | Normal (-1, 1) | Eci4z | .10 | .21 | .35 | .58 |
| [-2, +2] | Uniform (0, 1) | Lz | .05 | .11 | .24 | .42 |
| [-2, +2] | Uniform (0, 1) | T-outfit | .05 | .13 | .28 | .56 |
| [-2, +2] | Uniform (0, 1) | T-infit | .05 | .10 | .22 | .37 |
| [-2, +2] | Uniform (0, 1) | Eci2z | .11 | .21 | .37 | .60 |
| [-2, +2] | Uniform (0, 1) | Eci4z | .10 | .21 | .36 | .58 |
| [-2, +2] | Uniform (-1, 1) | Lz | .04 | .12 | .29 | .61 |
| [-2, +2] | Uniform (-1, 1) | T-outfit | .05 | .17 | .37 | .65 |
| [-2, +2] | Uniform (-1, 1) | T-infit | .03 | .09 | .24 | .56 |
| [-2, +2] | Uniform (-1, 1) | Eci2z | .14 | .32 | .52 | .71 |
| [-2, +2] | Uniform (-1, 1) | Eci4z | .13 | .30 | .49 | .70 |


Author: Jose Antonio Lopez Pina; M. Dolores Hidalgo Montesinos

Publication: Spanish Journal of Psychology

Article Type: Report

Date: May 1, 2005
