
Effects of examiner familiarity on Black, Caucasian, and Hispanic children: a meta-analysis.

This article presents a quantitative synthesis of examiner familiarity effects on Caucasian and minority students' test performance. Fourteen controlled studies were coded in terms of methodological quality (high vs. low) and race-ethnicity (Caucasian vs. Black and Hispanic). An analogue to analysis of variance conducted on weighted unbiased effect sizes indicated that examiner familiarity produced a significant effect, with Caucasian and minority examinees' test performance raised by .05 and .72 standard deviations, respectively. Examiner familiarity's differential effect on Caucasian and minority examinees did not interact with the methodological quality of the studies. Nevertheless, limitations of the extant data base require caution in drawing implications for assessment practice.

Two decades ago, Dunn (1968) observed that minority groups were over-identified as handicapped. He believed this overrepresentation was caused by discriminatory intelligence and achievement tests. Dunn and others (e.g., Cole & Bruner, 1972) have contended that these tests are biased primarily because they are ethnocentric: Test content is drawn exclusively from White middle class experience. Evidence for minority disproportionality in special education and claims of biased testing were influential in heralded court cases in the 1970s (e.g., Larry P. v. Wilson Riles, 1971), which curtailed intelligence testing in many school districts (see Bersoff, 1981).

Nevertheless, many measurement specialists, school psychologists, and others are increasingly skeptical that many well-known and widely used intelligence and achievement tests are biased against minorities. Reschly (1981), for example, has pointed out that subjective judgment, rather than data, often has been the basis for charges that these tests are ethnocentric. Recent empirical investigations of such tests' content, construct, and criterion validity have failed to show bias (e.g., Oakland, 1983).

In this intense, sustained, and well-publicized debate, both sides tend to focus narrowly on the test instrument and virtually ignore the context in which assessment occurs. Research has infrequently addressed contextual factors such as (a) examinees' interpretation of the purpose of testing and comprehension of test instructions and (b) examiners' personality, pretest information on examinees, attitudes about the legitimacy of testing, and choice of test location. This paucity of research on context is not surprising, given that we tend to conceptualize the test situation as decontextualized; that is, a setting in which extra-test factors can be controlled and their effects on performance neutralized (see Sigel, 1974). Surprising or not, our lack of interest in test context prevents us from knowing whether typical situational factors in testing affect minority and nonminority children differently.

One exception to the foregoing is the specific question of whether Black children achieve higher scores when tested by Black, rather than by White, examiners, an issue receiving moderate attention by researchers (see Sattler & Gwynne, 1982). Another contextual variable explored with relative frequency, although not with respect to minority assessment, is examiner familiarity.

D. Fuchs and associates have demonstrated that language handicapped children obtain higher scores when tested by familiar, rather than by unfamiliar, examiners and that this performance pattern appears robust (see Fuchs, D., Featherstone, Garwick, & Fuchs, 1984; Fuchs, D., Fuchs, Dailey, & Power, 1985; Fuchs, D., Fuchs, Garwick, & Featherstone, 1983; Fuchs, L. S., & Fuchs, 1984). Moreover, it appears that unfamiliar examiners depress the performance of language handicapped, but not nonhandicapped, children (Fuchs, D., Fuchs, Power, & Dailey, 1985), indicating that examiner unfamiliarity is a source of systematic error or bias in the assessment of language handicapped children. The importance of this finding is underscored by the fact that most examiners are strangers to their examinees (Fuchs, D., 1981).

Since examiner unfamiliarity is part of the test procedure, rather than the test instrument per se, we choose to refer to this systematic error as "test procedure bias." This is similar to "situational bias," which Jensen (1981) defined as "conditions in the test situation, such as the race, language, or manner of the tester, that could differentially affect the test performance of persons of different races or cultural backgrounds" (p. 137). Jensen properly distinguishes this type of bias from (a) external indicators of bias, whereby test scores are related to other variables external to the test or test situation (such as in a test's predictive validity); and (b) internal indicators of bias, or psychometric properties of the test (such as a test's reliability and rank order of item difficulty).

Given that unfamiliar examiners appear to negatively bias the test procedure with certain handicapped children, one may ask whether examiner unfamiliarity constitutes a similar bias against minority pupils. If so, then the ubiquitous procedure of employing unfamiliar examiners contributes to a spuriously low performance of minority children and increases the likelihood that they will be identified inaccurately as handicapped. Such a possibility should be of concern to examiners, those who set professional standards for testing, and parents and teachers of minority students. Thus, a quantitative synthesis was conducted of the examiner unfamiliarity literature to determine the importance of this contextual factor to minority (i.e., Black and Hispanic) and Caucasian students.

METHOD

Search Procedure

The search for pertinent studies included:

1. A computer search of three on-line data bases: ERIC (from 1966), PsycINFO (from 1967), and Dissertation Abstracts (from 1927).

2. A manual search of American Journal of Mental Deficiency; Child Development; Developmental Psychology; Exceptional Children; Journal of Abnormal and Social Psychology; Journal of Consulting and Clinical Psychology; Journal of Experimental Child Psychology; Journal of Genetic Psychology; Journal of Speech and Hearing Disorders; Language, Speech, and Hearing in the Schools; Merrill-Palmer Quarterly; and Psychology in the Schools (1965-1982, inclusive).

3. Identification of references within selected psychological and educational assessment textbooks as well as in all identified investigations.

A study was considered for inclusion if it compared examiner familiarity to unfamiliarity in terms of effects on examinees' performance during individualized testing. Familiarity was defined broadly, including either children's long-term or experimentally induced acquaintance with an examiner. Long-term acquaintance denotes a relatively intimate relationship enduring over weeks or months (e.g., a teacher-pupil alliance), if not years (e.g., a mother-child relationship). Experimentally induced acquaintance typically refers to an examiner's comparatively brief interaction with an examinee prior to testing. Examiner unfamiliarity signifies a condition in which examiner and examinee are virtual strangers, but one in which the examiner has exercised typical procedures for establishing rapport.

The search yielded 22 studies, of which 14 provided unambiguous data on Caucasian and/or minority examinees' performance in familiar and unfamiliar examiner conditions. (See Fuchs, D., & Fuchs, 1986a, for references.) Of these studies, 6 involved only Caucasian children, 6 included only minority (Black and/or Hispanic) children, and 2 employed both Caucasian and minority subjects. Thus, an equal number of studies (N = 8) provided data on minority and Caucasian pupils' performance in the two examiner conditions.

Of the investigations, 11 were published and 3 were unpublished. A total of 989 subjects participated; 426 were Black or Hispanic and 563 were Caucasian. The sex of 442 subjects (45%) was not reported. Among the remaining 547 participants, 235 (43%) were female and 312 (57%) were male. Across the 12 of 14 studies providing pertinent information, 162 examiners were used. Tests administered in the investigations were identified as intelligence (7 studies), speech/language (5 studies), or educational achievement (2 studies) measures. (See Table 1 of Fuchs, D., & Fuchs, 1986a, which describes each study's test participants, major substantive variables, methodological quality, and unbiased effect sizes.)

Data Obtained from Studies

Results were transformed to a common metric, effect size. Effect sizes were derived by determining the mean difference between examinees' scores in familiar and unfamiliar examiner conditions and dividing this difference by the standard deviation of examinees' scores in the unfamiliar condition (Glass, McGaw, & Smith, 1981). Some of the studies reported more than one effect. In all but two instances, a median effect size of examiner familiarity/unfamiliarity was calculated for each study. Exceptions were the two investigations incorporating separate groups of Caucasian and minority examinees in the same experiment. In each of these studies two effect sizes were reported, one for minority and one for Caucasian examinees. Thus, 16 effect sizes (8 for Caucasian and 8 for minority children) were derived. Each effect size was converted to an unbiased effect size (UES), correcting for the inconsistency in estimating true from observed effect sizes (see Hedges, 1981). In combining UESs, weighted averages were calculated to account for the variance associated with this metric (see Hedges, 1984).
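The effect-size pipeline just described can be sketched in a few lines. This is a minimal illustration, not the authors' code: the sample means, standard deviation, and group sizes below are hypothetical, and the small-sample correction shown is the standard Hedges (1981) approximation J = 1 - 3/(4m - 1), with m the pooled degrees of freedom.

```python
# Sketch of the effect-size computations described above (Glass, McGaw,
# & Smith, 1981; Hedges, 1981, 1984). All numeric inputs are illustrative.

def effect_size(mean_familiar, mean_unfamiliar, sd_unfamiliar):
    """Glass-style effect size: mean difference between conditions,
    scaled by the unfamiliar-condition standard deviation."""
    return (mean_familiar - mean_unfamiliar) / sd_unfamiliar

def unbiased_effect_size(d, n_familiar, n_unfamiliar):
    """Hedges' (1981) small-sample correction: d * (1 - 3 / (4m - 1)),
    where m = n1 + n2 - 2."""
    m = n_familiar + n_unfamiliar - 2
    return d * (1 - 3 / (4 * m - 1))

def weighted_mean_ues(ues_list, variances):
    """Combine UESs by inverse-variance weighting (Hedges, 1984)."""
    weights = [1 / v for v in variances]
    return sum(w * g for w, g in zip(weights, ues_list)) / sum(weights)

d = effect_size(105.0, 100.0, 15.0)   # raw effect size of 1/3 SD
g = unbiased_effect_size(d, 20, 20)   # corrected slightly toward zero
print(round(d, 3), round(g, 3))
```

The correction matters most for the small samples common in this literature; with two groups of 20, the raw effect size shrinks by about 2%.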
Methodological Quality of Studies

Effects of examiner familiarity/unfamiliarity were related to a composite procedural variable, indicating the overall methodological quality of each investigation. Derivation of this variable was based on an analysis of nine design-related features: (a) assignment of examinees to examiners; (b) assignment of examinees to treatments; (c) examiner expectancy; (d) fidelity of treatment; (e) multiple treatment effects; (f) number of examiners; (g) order of testing; (h) scoring; and (i) technical adequacy of dependent measure. (See Fuchs, D., & Fuchs, 1986a, for standards associated with each methodological feature.)

One of the authors and a colleague, blind to the purposes of this study, independently scored six (43%) randomly selected investigations. Average agreement across all methodological features was .89, ranging from .67 to 1.00. Interrater agreement was calculated using the following formula (Coulter, cited in Thompson, White, & Morgan, 1982): Percentage of agreement = agreements between raters A and B / (agreements and disagreements between raters A and B + omissions by rater A + omissions by rater B). Disagreements were resolved through discussion between the raters.
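The agreement formula reduces to a one-line function; the counts in the example are hypothetical, chosen only to show the arithmetic.

```python
def percent_agreement(agreements, disagreements, omissions_a, omissions_b):
    """Coulter's formula: agreements divided by the total of agreements,
    disagreements, and omissions by either rater."""
    return agreements / (agreements + disagreements + omissions_a + omissions_b)

# 8 agreements, 1 disagreement, 1 omission by rater B (hypothetical counts)
print(percent_agreement(8, 1, 0, 1))  # 0.8
```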

Since one study provided insufficient information to determine methodological quality, we evaluated the quality of 13 studies using a 4-step procedure. First, the investigations were coded "acceptable" (1) or "unacceptable" (0) on each design dimension. Second, as a means of indicating relative importance, a weight of 1 or 2 was assigned each design feature. A composite score then was generated for each study by multiplying the coded values (1 or 0) by the assigned weights (1 or 2), summing these products, and dividing the sum by the number of applicable study characteristics. Finally, we developed a frequency distribution of these composite scores, which facilitated identification of 7 high- and 6 low-quality studies. (These steps to determine study quality are described in greater detail in Fuchs, D., & Fuchs, 1986a.)

RESULTS

A test for the homogeneity of effect size (Hedges, 1982), undertaken to determine whether the population effect size was constant across Caucasian and minority UESs, yielded a significant value, X2(15, N = 16) = 89.22, p < .01. Therefore, additional analyses were conducted to explain variations in UESs by examinees' Caucasian/minority status. To compare the magnitude of UESs of Caucasian and minority examinees, Hedges's (1984) chi-square analogue to analysis of variance was employed.
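The composite-score step of the quality rating can be sketched as follows. The particular codes and weights are hypothetical (the actual standards and weights are given in Fuchs, D., & Fuchs, 1986a); the sketch only shows the arithmetic of the procedure.

```python
# Composite methodological-quality score: sum of code * weight over
# applicable features, divided by the number of applicable features.
# None marks a feature that could not be evaluated for a given study.

def composite_quality(codes, weights):
    applicable = [(c, w) for c, w in zip(codes, weights) if c is not None]
    total = sum(c * w for c, w in applicable)
    return total / len(applicable)

# Nine design features; here two are weighted 2 and one is inapplicable
# (all values hypothetical).
codes   = [1, 1, 0, 1, None, 1, 0, 1, 1]
weights = [2, 2, 1, 1, 1,    1, 1, 1, 1]
print(composite_quality(codes, weights))
```

A frequency distribution of such scores across the 13 rated studies then separates the high- from the low-quality group.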

The mean quality rating for the 8 studies involving Caucasian examinees was .99 (SD = .40); the average quality rating for 7 studies associated with minority examinees was .91 (SD = .40). This difference was not statistically significant, t(13) = .39, ns.

For Caucasian examinees, the average weighted UES was .05 (v = .073), z = .72, ns. The average weighted UES for Black and Hispanic examinees was .72 (v = .096), z = 7.47, p < .001. A chi-square analogue to analysis of variance indicated that this difference was statistically significant, X2(1, N = 16) = 30.35, p < .001. The minority group's UES indicates that, given a normative test (such as an intelligence measure) with a population mean of 100 and a standard deviation of 15, the use of a familiar examiner would raise the typical minority student's score from 100 to 111. In contrast, the Caucasian group's UES suggests virtually no change in score as a function of examiners' familiarity/unfamiliarity. In terms of Cohen's (1977) well-known U (or percentage of nonoverlap) statistic, the upper 50% of the minority students' distribution of scores in the familiar examiner condition exceeded 76% of the distribution of scores in the unfamiliar examiner condition.

DISCUSSION

Whereas Caucasian students performed similarly in familiar and unfamiliar examiner conditions, Black and Hispanic children scored significantly and dramatically higher with familiar examiners. This Caucasian versus minority dissimilarity represents a difference between differences (see Kaufman, Dudley-Marling, & Serlin, 1986); that is, minority examinees' differential performance across the two examiner conditions was greater than Caucasian examinees' differential performance. Thus, the disparity between the two groups does not represent a simple mean difference. Rather, it is conceptually similar to an interaction, whereby the independent variable, examiner familiarity, has a different effect on various aspects or expressions of another independent variable, race-ethnicity.
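The score-change and nonoverlap arithmetic in the results can be verified directly, assuming (as the text does) a normative test with mean 100 and SD 15 and the reported average UESs of .72 and .05:

```python
# Quick check of the arithmetic reported above, assuming normality.
from statistics import NormalDist

mean, sd = 100.0, 15.0
ues_minority, ues_caucasian = 0.72, 0.05

# Expected score with a familiar examiner, per group.
print(mean + ues_minority * sd)    # 110.8, i.e., roughly 100 -> 111
print(mean + ues_caucasian * sd)   # 100.75, virtually no change

# Share of the unfamiliar-condition distribution falling below the
# median of the familiar-condition distribution.
print(round(NormalDist().cdf(ues_minority) * 100))  # about 76 (%)
```

The 76% figure is simply the standard normal cumulative probability at z = .72, matching the nonoverlap statistic reported in the text.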

This difference between the two groups is described succinctly by the aforementioned average UESs of minority and Caucasian subjects. It also emerges as patterns in the data. Five of 8 investigations involving minority subjects were associated with UESs ranging from .58 to 1.44. Cohen's (1977) well-known rule of thumb indicates these UESs range in magnitude from "moderately strong" to "strong." Contrastingly, 5 of 8 studies with Caucasian examinees were associated with UESs ranging from .01 to .23. Cohen's guidelines suggest such UESs are "weak" to nonexistent. (See Table 1 in Fuchs, D., & Fuchs, 1986a.) Thus, the reported average UESs of minority (.72) and Caucasian (.05) subjects were not produced by the results of one or two discrepant investigations; rather, these divergent UESs reflect equally dissimilar patterns of UESs among studies involving minority and Caucasian examinees.

Internal Validity

Does examiner familiarity, then, selectively bias the performance of Black and Hispanic examinees, and represent test procedure bias in the assessment of minority children? In short, are findings from our synthesis true? Truth, here, has two important meanings. The first, often described as internal validity, refers to whether race-ethnicity was principally and causally related to minority examinees' poorer performance with unfamiliar examiners. At this point, we are unsure. Our uncertainty is dictated by two aspects of the extant data base.

First, families of minority subjects were described consistently in terms of low SES, whereas Caucasian subjects came from low and middle SES backgrounds. Since all Black and Hispanic examinees were of low SES, it is difficult to determine which of the two characteristics, race-ethnicity or SES, may be more important in explaining examiner familiarity effects. Second, in 12 investigations, either Caucasian or minority children were subjects, leaving only two studies that compared the differential performance of Caucasian and minority examinees within the same experimental design. In one of these two salient studies, Caucasian children demonstrated greater performance with familiar examiners. This finding, of course, contradicts the overall pattern of results and, we believe, introduces another important note of caution in any interpretation of the meta-analysis.

External Validity

A second sort of truth, frequently discussed as external validity, addresses the issue of generalizability. Applied to our meta-analysis, this concern boils down to whether Caucasian and minority subjects' performances are representative of Caucasian and minority children. This question, too, is currently unanswerable and, again, it is due to the nature of the data. An obvious limitation of the data base in this regard is that it comprises a small group of 14 studies. Moreover, most examinees were preschoolers and elementary school children; many examiners were not professionally trained; and only three investigations involved Hispanic subjects (see Table 1 in Fuchs, D., & Fuchs, 1986a). Thus, the data provide minimal evidence on the importance of examiner familiarity to older and Hispanic children and to trained, experienced examiners.

Despite these serious constraints and necessary caveats, the pattern of findings is clear and intriguing. At the very least, the data compel us to ask, "What if examiner unfamiliarity biases the assessment process against minority examinees?" If such were the case, there would be very important implications for practice. For example, test developers' use of unfamiliar examiners to generate normative data and indices of validity (see Fuchs, D., Fuchs, Benowitz, & Barringer, 1987) would be problematic for minority pupils. Comparing minority students' suboptimal performance with unfamiliar examiners to the more maximal performance of largely Caucasian normative populations could result in spuriously low and improperly restrictive educational placements of minority children. In such an event, examiner unfamiliarity would be a partial explanation for the frequently noted overrepresentation of minorities in special education classrooms. It also would represent a condition under which disproportionality of placement constitutes inequity of treatment, as defined by the National Research Council's Panel on Selection and Placement of Students in Programs for the Mentally Retarded (see Messick, 1984).

If use of unfamiliar examiners selectively biases assessment against minority examinees, an apparent remedy would be to require examiners to become familiar with such children prior to testing. Less clear is how much pretest contact is necessary. Following a review of pertinent literature, we have estimated a minimum of 1 hour of examiner-examinee interaction is required to obtain reliable familiarity effects (see Fuchs, D., & Fuchs, 1986b). We believe this estimate may contrast sharply with conventional practice. In a study of user manuals of 20 widely used preschool IQ and speech/language tests, we found that 13 manuals encourage examiner friendliness, typically defined as demonstrating warmth, maximizing comfort, and reducing anxiety or suspicion. However, only two manuals prescribe pretest contact, described as establishing rapport gradually by meeting with the examinee on one or more occasions prior to testing (Fuchs, D., 1987). If, as we suspect, examiners take their cues from user manuals and if the manuals in our study are representative, one could expect few examiners to engage regularly in prior contact with examinees.

These implications are presented to underscore the importance of determining whether examiner unfamiliarity as well as other typical factors in the test situation negatively bias assessment against minority examinees. If future research corroborates results of our meta-analysis, an important related task will be to explore why examiner unfamiliarity affects minority and Caucasian children differently.

It has been suggested that many minority examinees (a) are relatively unmotivated to perform well (e.g., Katz, 1968); (b) are more likely to experience test anxiety (e.g., Hawkes & Furst, 1971) because of fear of failure, low self-concept, and unfamiliarity with test procedures (Samuda, 1975); and (c) are hostile toward White examiners and, as a result, concerned about controlling this feeling rather than concentrating on task demands (e.g., Shade, 1982). This purported hostility may be connected to reports that minority examinees tend to misconstrue (a) information and rhetorical questions as demands for accountability (e.g., Goody, 1978) and (b) direct question-answer sequences as punishment (e.g., Philips, 1983) or feedback that an incorrect choice has been made (e.g., Goodnow, 1984).

Studies supporting such suggestions, however, have been sporadic; their validity and importance in explaining minority examinees' possibly poorer performance with unfamiliar examiners remain open to question. Moreover, the bulk of this research assumes that examinee characteristics, like presumed test anxiety, are responsible for observed performance. Investigators infrequently have explored possible effects of examiner characteristics, or quality of interaction between test participants, as explanations for minority children's performance.

Assuming the test situation to be bidirectional, we (Fuchs, L. S., & Fuchs, 1984) explored whether pretest contact influences examinees' behavior and/or examiners' inaccuracy of scoring. Language impaired children were tested by familiar and unfamiliar examiners on a comprehensive language measure. Then certified speech clinicians, who did not know the study's purpose, examinees, or examiners, scored all performances from videotaped recordings of the test sessions. Results demonstrated that examinees performed consistently and significantly higher with familiar examiners, regardless of whether the scorer was the actual examiner or independent rater. At the same time, however, familiar examiners evidenced greater inaccuracy (overestimation) in scoring than did unfamiliar examiners, suggesting that, for certain handicapped examinees, familiarity influences both examinee and examiner behavior.

Although perhaps trite, it nevertheless is true that more research is sorely needed to determine whether, and if so how, unfamiliar examiners selectively depress the performance of minority children. Until such determination is made, we believe it is precipitous, if not incorrect, to claim testing is unbiased toward minority children.
COPYRIGHT 1989 Council for Exceptional Children

Article Details
Title Annotation: analysis of developmental testing and examiner bias
Author: Fuchs, Douglas; Fuchs, Lynn S.
Publication: Exceptional Children
Date: Jan 1, 1989
