Reliability generalization: an HLM approach.
Reliability is an important index in educational and psychological measurement. According to a joint committee of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education (1985), "Reliability refers to the degree to which test scores are free from errors of measurement" (p. 19). Because a decrease in measurement error is generally accompanied by an increase in measurement consistency across circumstances, "reliability generalization may provide an important tool for characterizing score quality" (Vacha-Haase, 1998, p. 16). The purpose of this study is to discuss conditional and unconditional hierarchical models that account for measurement errors at different levels, a consideration essential to the generalization of reliability. Because test score reliability depends on many conditions of the test and the subjects, empirical factors need to be introduced at multiple levels to describe these conditions and to facilitate generalization of the reliability computation across settings.
The classical test theory represents one of the cornerstones of educational and psychological measurement (Lord & Novick, 1968; Pedhazur & Schmelkin, 1991). Pedhazur and Schmelkin (1991) recalled: "Since it was proposed by Spearman (1904), the true-score model, or what has come to be known as classical test theory, has been the dominant theory guiding estimation of reliability" (p. 83). Specifically, Novick, Jackson, and Thayer (1971) elaborated,
In the classical test theory model, the observed score x of a person is taken to have expectation τ, the true score for that person. The error score is defined by e = x − τ. The corresponding random variables defined over persons are related by the equation (1.1) X = T + E with ℰ(E|τ) = 0. (p. 261)
Regarding the computation of reliability, Novick, Jackson, and Thayer (1971) added,
The reliability (intraclass correlation) of a test is defined as (1.3) ρ²_XT = σ²_T/σ²_X = σ²_T/(σ²_T + σ²_E) = ρ_XX′ where X and X′ are parallel measurements. (p. 261)
In a test containing multiple items, student responses to each item can be treated as indicators of the true score; the responses to a set of items thus comprise multiple indicators of individual performance. The hierarchical data structure arises because item responses are nested within each student. In addition, factors at the student level can be employed to reflect different test conditions, such as differences in student demographics, past experiences, and the instructional coverage of the test content. Hence, consideration of these multilevel factors is essential to a proper generalization of the reliability assessment across learning and testing environments.
Vacha-Haase (1998) searched the PsycINFO database for articles published from 1984 to July 1997, and conducted a meta-analysis on issues of reliability generalization. She noted,
Of the articles reviewed for the present study, 65.76% made absolutely no reference to reliability. At the other extreme, authors of only 13.06% of the articles reported reliability coefficients for the data analyzed in the respective studies. (p. 12)
While this research synthesis revealed the lack of reliability reporting in the existing literature, the meta-analysis approach was based primarily on past records and cannot by itself yield a new method of reliability generalization. In essence, to generalize an empirical index across applications, the statistical computation must be flexible enough to represent the specific conditions of a test setting. Therefore, reliability generalization hinges on the advancement of statistical methods that cover important factors of the test conditions (Pedhazur & Schmelkin, 1991).
Over the last four decades, many researchers have attempted to analyze multiple sources of variation in educational assessment. Cronbach and his colleagues were among the first to highlight the need to identify different sources of score variation and to decide which specific sources contribute to errors of measurement (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Cronbach, Rajaratnam, & Gleser, 1963). More recently, Linacre (1989) suggested incorporating different sources of variation in models of individual examinee-item outcomes. However, limited by the computing capacity available before the 1990s, few researchers considered hierarchical structures among the multilevel factors (Bryk & Raudenbush, 1992). Goldstein (1995) adduced examples showing that ignoring the hierarchical structure can cause substantial mistakes in statistical findings. In this article, the hierarchical structure is considered through a method entitled hierarchical linear modeling (HLM).
Reliability Estimation in HLM
Researchers have noted that reliability is not an isolated feature of a test instrument; factors at the individual level can substantially alter the interpretation of a reliability index. Thompson (1994) pointed out, "The same measure, when administered to more heterogeneous or more homogeneous sets of subjects, will yield scores with differing reliability" (p. 839). Because multiple item scores are nested under each student, a hierarchical linear model (HLM) can be employed to partition the score variance at the levels of students and item responses. In addition, factors of learning and instruction can be introduced in the HLM model to explain the score variation under different conditions (Raudenbush, 1988). Thus, given pertinent condition factors, reliability can be generalized to similar or dissimilar circumstances. For simplicity, an unconditional model provides an initial assessment of the variance partition before conditional factors are included in the HLM computation (Singer, 1999).
The Unconditional HLM Model
The unconditional model describes the item responses (Y_ij) in terms of the true score of the subject (β_0j) and a random error (r_ij):

(1) Y_ij = β_0j + r_ij

where r_ij, the random error in the jth subject's response to the ith item, is assumed to be normally distributed with a mean of zero and a constant variance σ². Thus, the true score (β_0j) is the expected average performance of the jth subject (Lord & Novick, 1968).
In addition, the true score (β_0j) may vary among subjects, i.e.,

(2) β_0j = γ_00 + u_0j

where γ_00 is the grand mean score of the population, and u_0j is the random effect associated with subject j (j = 1, 2, ..., m), assumed to have a mean of zero and variance τ_00. Combining (1) and (2) yields:

(3) Y_ij = γ_00 + u_0j + r_ij
The score variance can be partitioned as:
(4) Var(Y_ij) = Var(u_0j + r_ij) = τ_00 + σ²

Hence, the variance of individual item scores (Y_ij) not only depends on the variability of the item responses (σ²), but also reflects the degree of heterogeneity in the subject grouping (τ_00). This model is categorized as a "fully unconditional model" because no other factors have been introduced in equation (2) to describe specific patterns of individual variation (Bryk & Raudenbush, 1992).
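The variance partition in equation (4) can be recovered from simulated data with a one-way random-effects ANOVA (method of moments): the within-subject mean square estimates σ², and the between-subject mean square estimates σ² + nτ_00. The values γ_00 = 50, τ_00 = 9, and σ² = 4 below are hypothetical.

```python
import random

random.seed(0)

gamma00, tau00, sigma2 = 50.0, 9.0, 4.0   # assumed population values
m, n = 5_000, 10                          # subjects, items per subject

# Generate Y_ij = gamma00 + u_0j + r_ij per equation (3)
scores = []
for _ in range(m):
    u = random.gauss(0, tau00 ** 0.5)
    scores.append([gamma00 + u + random.gauss(0, sigma2 ** 0.5) for _ in range(n)])

# Method-of-moments (one-way random-effects ANOVA) estimators:
means = [sum(row) / n for row in scores]
grand = sum(means) / m
ms_within = sum((y - mu) ** 2 for row, mu in zip(scores, means) for y in row) / (m * (n - 1))
ms_between = n * sum((mu - grand) ** 2 for mu in means) / (m - 1)

sigma2_hat = ms_within                     # estimates sigma^2
tau00_hat = (ms_between - ms_within) / n   # estimates tau_00

print(round(sigma2_hat, 2), round(tau00_hat, 2))
```

The two estimates converge to the assumed σ² and τ_00, recovering the partition Var(Y_ij) = τ_00 + σ².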
Despite the simplicity of the unconditional model, the variance partition demonstrates characteristics of the reliability configuration. Specifically, equation (4) concurs with an observation of Dawis (1987): "Because reliability is a function of sample as well as of instrument, it should be evaluated on a sample from the intended target population--an obvious but sometimes overlooked point" (p. 486).
Novick, Jackson, and Thayer (1971) pointed out that the reliability of item responses can be represented by an intraclass correlation coefficient. Bryk and Raudenbush (1992) added that "This coefficient is given by the formula ρ = τ_00/(τ_00 + σ²)" (p. 18). In practice, however, the reliability of a single item measure is usually less useful than the reliability of an average score over multiple test items. On the basis of equation (1), the mean response score over n test items can be modeled as:
(5) Ȳ_*j = β_0j + r̄_*j

where r̄_*j = Σ r_ij/n, which has a variance

(6) Var(r̄_*j) = σ²/n = V_j

In general, the number of test items (n) is greater than one, and thus V_j = σ²/n < σ². The smaller variance (V_j) indicates that the mean response score is a better estimate of the student's true score. The reliability of Ȳ_*j is:

(7) λ_j = Var(β_0j)/Var(Ȳ_*j) = τ_00/(τ_00 + V_j)
This result is used in the computation of the intraclass correlation coefficient, or the mean score reliability, in the HLM software (Bryk, Raudenbush, & Congdon, 1996, p. 81). In fact, the partition of variance (τ_00 and V_j) at different levels represents a special feature of hierarchical linear modeling, and the model configuration adds the flexibility of including multilevel factors to facilitate the generalization of reliability under various test conditions.
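Equation (7) implies that the mean-score reliability grows with the number of items, and it agrees algebraically with the classical Spearman-Brown prophecy applied to the single-item intraclass correlation ρ = τ_00/(τ_00 + σ²). The variance components below (τ_00 = 9, σ² = 4) are hypothetical.

```python
tau00, sigma2 = 9.0, 4.0              # assumed variance components
rho = tau00 / (tau00 + sigma2)        # single-item reliability (intraclass correlation)

for n in (1, 5, 10, 40):
    V_j = sigma2 / n                  # equation (6): variance of the mean error
    lam = tau00 / (tau00 + V_j)       # equation (7): mean-score reliability
    spearman_brown = n * rho / (1 + (n - 1) * rho)
    assert abs(lam - spearman_brown) < 1e-12   # the two formulas coincide
    print(n, round(lam, 4))
```

Lengthening the test from 1 to 40 items raises the reliability from about 0.69 to about 0.99 under these assumed components, which is the practical reason for reporting the reliability of the mean score rather than of a single item.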
The Conditional HLM Model
Reliability of test scores may also depend on the circumstances in which the test takes place. For instance, while SAT scores provide an important measure for justifying the award of academic scholarships in college, score fluctuation may result from the test preparation courses taken by different students (College Board, 1999). Thus, differences in student coaching should be considered in generalizations of score reliability. A subject-level factor (W_j) can be added to the conditional HLM model, and the expected average performance of the jth individual over a set of n test items can be modeled as

(8) β_0j = γ_00 + γ_01 W_j + u_0j

Compared to equation (2), the factor W_j is included in equation (8) to describe individual differences in the expected test performance (β_0j).
In contrast to u_0j in equation (2), which represents the random deviation of student j's mean performance from the grand mean of the student group, u_0j in equation (8) represents the conditional deviation after controlling for the W_j effect. Whereas more factors like W_j can be included in equation (8), the conditional model discussed here is limited to a single W_j for simplicity. Substituting (8) into (5) yields a combined model for the mean score of the jth student:

(9) Ȳ_*j = γ_00 + γ_01 W_j + u_0j + r̄_*j

where u_0j represents a random effect associated with subject j with a mean of zero and variance τ_00, and r̄_*j = Σ r_ij/n has a mean of zero and variance Var(r̄_*j) = σ²/n = V_j.
The reliability of Ȳ_*j as an estimate of β_0j is:

(10) λ_j = Var(β_0j)/Var(Ȳ_*j) = τ_00/(τ_00 + V_j)

Or, in the notation of Bryk and Raudenbush (1992, p. 40), λ_j = V_j⁻¹/(V_j⁻¹ + τ_00⁻¹). Since τ_00 is now the variance of u_0j after removing the effect of W_j, the reliability generalization incorporates a proper consideration of the specific test conditions, such as the individual test preparation described by W_j.
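The variance form and the precision form of λ_j are algebraically identical: dividing the numerator and denominator of τ_00/(τ_00 + V_j) by τ_00·V_j gives V_j⁻¹/(V_j⁻¹ + τ_00⁻¹). A quick check with hypothetical values:

```python
tau00, V_j = 9.0, 0.4   # assumed residual variance and mean-error variance

lam_variance = tau00 / (tau00 + V_j)                # variance form of lambda_j
lam_precision = (1 / V_j) / (1 / V_j + 1 / tau00)   # precision (inverse-variance) form

assert abs(lam_variance - lam_precision) < 1e-12    # the two forms agree
print(round(lam_variance, 4))
```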
It should be noted that while equation (5) is unchanged in the conditional HLM model and can be employed to estimate the true score β_0j from Ȳ_*j, equation (8) adds an alternative estimate of β_0j using the level-2 information (i.e., γ_00 + γ_01 W_j). Thus, the true score can be estimated not only by the average performance of the jth student (Ȳ_*j), but also by the performance of similar students (γ_00 + γ_01 W_j) under the influence of the condition factor W_j. Novick, Jackson, and Thayer (1971) combined the two sources of information to produce an empirical Bayes estimator (β*_0j) with a smaller mean squared error of prediction for β_0j:

(11) β*_0j = [τ_00/(τ_00 + V_j)]Ȳ_*j + [V_j/(τ_00 + V_j)](γ_00 + γ_01 W_j)
Compared to the unconditional model (2), the effects of the confounding conditions (W_j) are considered in equation (8), and thus the variance of u_0j (i.e., τ_00) is reduced after controlling for the factor W_j. Accordingly, V_j/(τ_00 + V_j) increases, which gives more weight to the level-2 information (γ_00 + γ_01 W_j) in equation (11). Substituting (10) into (11), one may get

(12) β*_0j = λ_j Ȳ_*j + (1 − λ_j)(γ_00 + γ_01 W_j)

Hence, when the reliability (λ_j) of the measurement for the jth individual (Ȳ_*j) is small, the Bayes estimate (β*_0j) relies more on the relevant information from other individuals under a similar circumstance (W_j). Thus, the reliability computation plays an important role in improving true score estimation in educational and psychological measurement (Pedhazur & Schmelkin, 1991, p. 110).
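Equation (12) describes a shrinkage estimator: the lower the reliability, the more the estimate is pulled toward the level-2 prediction. The sketch below uses hypothetical coefficients (γ_00 = 50, γ_01 = 2, τ_00 = 4) and compares two students with the same observed mean but different measurement precision.

```python
gamma00, gamma01 = 50.0, 2.0   # assumed level-2 coefficients
tau00 = 4.0                    # assumed residual true-score variance

def eb_estimate(y_bar, w, v_j):
    """Empirical Bayes estimate per equation (12): shrink the observed
    mean y_bar toward the level-2 prediction gamma00 + gamma01 * W_j."""
    lam = tau00 / (tau00 + v_j)                     # reliability, equation (10)
    return lam * y_bar + (1 - lam) * (gamma00 + gamma01 * w)

# Same observed mean and W_j, but different measurement precision:
precise = eb_estimate(y_bar=60.0, w=1.0, v_j=0.5)   # many items, high reliability
noisy = eb_estimate(y_bar=60.0, w=1.0, v_j=8.0)     # few items, low reliability

print(round(precise, 2), round(noisy, 2))
```

With a reliable mean (V_j = 0.5, λ_j ≈ 0.89) the estimate stays close to the observed 60; with an unreliable mean (V_j = 8, λ_j ≈ 0.33) it is pulled most of the way toward the level-2 prediction of 52.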
In a typical test setting, item scores are combined to measure individual performance, and generalization of the measurement reliability depends on the degree of homogeneity of the student population. In the unconditional model, the jth person's response to the ith test item is described by Y_ij = β_0j + r_ij and β_0j = γ_00 + u_0j. As Raudenbush (1988) noted, "True scores [β_0j] are then assumed to vary randomly over the population of persons around a grand mean [γ_00]. This is the simplest application of HLM [hierarchical linear model]" (p. 103). The reliability index (λ_j) in equation (7) depends on the measurement precision for the jth individual (V_j) and the random variation of the student population (τ_00). A small value of τ_00 implies less heterogeneity in the population, so the individual true scores (β_0j) are centered around the population grand mean (γ_00). In this case, no factors (e.g., W_j) need to be introduced at the student level, and the unconditional model provides a good description of the true score (β_0j) variation at different levels of the data hierarchy. In other words, generalization of the score reliability may not require conditional factors (e.g., W_j) to describe the little variability in a homogeneous population.
On the other hand, for a heterogeneous population, the true score may depend on individual factors (W_j) besides the random variation u_0j (see equation 8). The improvement from the conditional model depends on a proper selection of W_j to account for the variation in u_0j. Bryk and Raudenbush (1992) pointed out, "if a substantial proportion of the variation in [β_0j] is explained by [W_j], the residual variance around the regression line, [τ_00], will be small" (pp. 41-42). Hence, generalization of the reliability can be made under the conditional HLM model after controlling for the heterogeneity effect of W_j.
While a general description of score reliability hinges on a thorough consideration of the real test conditions (Crocker & Algina, 1986), the unconditional and conditional models are relevant statistical tools for incorporating those considerations into the reliability computation. Because score variation is distributed across the item and subject levels, the hierarchical linear model (HLM) presents one statistical method to facilitate reliability generalization. According to classical test theory and the HLM literature, the reliability index can be computed in terms of the intraclass correlation (Bryk & Raudenbush, 1992; Novick, Jackson, & Thayer, 1971), which may differ between unconditional and conditional HLM models, depending on the level of heterogeneity in the student population. Although the advancement of statistical methodology may facilitate generalization of the reliability computation, researchers and practitioners are urged to use their expertise in selecting proper condition factors (W_j) and in determining whether a substantial portion of the variance in a heterogeneous true score distribution can be accounted for. Hence, a practical approach to reliability generalization may demand collaborative efforts of measurement specialists and education practitioners.
References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.
Bryk, A. S., Raudenbush, S. W., & Congdon, R. T. (1996). Hierarchical linear and nonlinear modeling with the HLM/2L and HLM/ 3L programs. Chicago, IL: Scientific Software.
College Board (1999). SAT program: Will taking a test preparation course (August 28, 1999).
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York, NY: Holt, Rinehart, & Winston.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: Wiley.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137-163.
Dawis, R. V. (1987). Scale construction. Journal of Counseling Psychology, 34, 481-489.
Goldstein, H. (1995). Multilevel statistical models (2nd ed.). London: Edward Arnold.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago, IL: MESA Press.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Menlo Park, CA: Addison-Wesley.
Novick, M. R., Jackson, P. H., & Thayer, D. T. (1971). Bayesian inference and the classical test theory model: Reliability and true scores. Psychometrika, 36(3), 261-288.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum.
Raudenbush, S. W. (1988). Educational applications of hierarchical linear models: A review. Journal of Educational Statistics, 13(2), 85-116.
Singer, J. (1999). Using SAS PROC MIXED to fit multilevel models, hierarchical models, and individual growth models. The Journal of Educational and Behavioral Statistics, 24, 323-355.
Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837-847.
Vacha-Haase, T. (1998). Reliability generalization: Exploring variance in measurement error affecting score reliability across studies. Educational and Psychological Measurement, 58(1), 6-20.
Jianjun Wang, Professor of Educational Statistics and Research Design, California State University.
Correspondence concerning this article should be addressed to Dr. Jianjun Wang, Department of Advanced Educational Studies, School of Education, California State University, 9001 Stockdale Highway, Bakersfield, CA 93311-1099. E-mail: email@example.com.
Publication: Journal of Instructional Psychology, September 1, 2002.