Empirical analysis of the relationship between student examiners' learning with deliberate test practice and examinees' intelligence test performance.
To evaluate the implications of deliberate practice when teaching test administration skills, novice, but trained, graduate student examiners administered intelligence tests to a convenience sample of volunteer school-age examinees assigned to a first test session. A second, different convenience sample of volunteer school-age examinees was administered a final test session by the same graduate student examiners, who had acquired more experience and deliberate test administration practice. IQs obtained by examinees in the final test session, when tested by the more experienced examiners, were significantly higher than IQs obtained by examinees in the first test session when tested by novice examiners. These findings highlight the importance of deliberate practice when teaching and learning testing skills. In addition, the findings are consistent with cognitive load theory and have implications for educational data-based decision-making.
Keywords: Deliberate Practice, Cognitive Load Theory, Intelligence Testing, Scholarship of Teaching and Learning
Intelligence tests utilize standard procedures to measure multiple characteristics of examinees. These tests are used in virtually all schools to describe cognitive functioning, assist in educational programming, screen for special needs, diagnose disabling disorders, and estimate future behaviors (Edwards & Oakland, 2006). In addition, intelligence and other high-stakes standardized tests have important implications for school funding and accountability. Obtaining the most accurate assessment of an individual is always the goal of examiners who administer intelligence and other types of standardized tests (Edwards & Paulin, 2007).
The scores students obtain when administered an intelligence test are likely to have an enduring impact on their schooling and life. Consequently, it is critical that systematic errors in the testing process do not affect examinee scores and inappropriately influence program placement, treatment, services, and outcomes. "First do no harm" is an important principle stated frequently by faculty who train student test examiners (Edwards, 2009). In light of the periodic revision of intelligence tests, coupled with examiners' requirement to learn multiple tests, examiners are sometimes presented with a steep learning curve that may result in systematic errors adversely affecting examinees' scores as well as their educational and life outcomes (Loe, Kadlubek, & Marks, 2007).
As intelligence tests are revised and renormed, examiners must learn to administer, score, and interpret these new versions of the tests. When learning a new iteration of an intelligence test, the examiner is required to utilize appropriate testing skills (AERA, APA, & NCME, 1999). That is, they are required to administer and score tests using standard procedures described in the test manual. Using the precise standard procedures permits examiners to employ test scores in the manner recommended in the test manual (Edwards, 2009). However, learning to administer, score, and interpret multiple iterations of intelligence tests and learning new assessment techniques can be a daunting task due to the myriad important competencies associated with test administration.
Examiners must also learn to administer, score, and interpret new iterations of tests and acquire familiarity with their strengths and limitations (Edwards, 2009). These competencies are necessary in order to ensure that accurate administration, scoring, and interpretation facilitate positive, rather than pejorative, student outcomes. Complying with these numerous competencies may overload inexperienced examiners' cognitive capacity to fluently and efficiently administer and score intelligence tests.
Cognitive Load Theory
According to cognitive load theory (CLT), the human cognitive structure is comprised of general-purpose working memory that is limited to the short-term storage of approximately seven, plus or minus two, chunks of information, the capacity to simultaneously process about two or three chunks of information, and practically unlimited long-term memory that holds information stored in schemas or knowledge structures (van Gog, Ericsson, Rikers, & Paas, 2005). Humans' limited working memory and our difficulty in processing more than three chunks of information simultaneously are considered important variables that impact the effectiveness of teaching, learning, and expert performance (Kalyuga & Sweller, 2005). These important variables influence novice examiners as their limited working memory capacity could easily be overloaded as they attempt to process multiple chunks of information simultaneously during test administration (see Baddeley, 1986).
It is likely that novice examiners will be more adversely affected by the limitations of working memory than experienced examiners because they have yet to develop automaticity in applying the necessary test administration competencies. Automaticity refers to the ability to perform tasks with minimal mental awareness or conscious effort, and automaticity helps circumvent the limitations of working memory (Kalyuga & Sweller, 2005).
In light of experienced examiners' expertise and familiarity with specific tests, it is likely they have developed greater automaticity with regard to the administration and scoring of these specific tests. They need to depend less on their working memory. Their cognitive capacity is less susceptible to capacity overload during the testing process and it is likely they are better able to appropriately maximize examinee test performance. That is because research suggests expert professionals often develop new schemas or knowledge structures that overcome limitations often encountered by novices (van Gog et al., 2005). For example, the limitations of working memory are mitigated by transfer to long-term memory and experienced professionals have more opportunities to shift information to long-term memory (Ericsson & Kintsch, 1995). In addition, more fluent or efficient performance can be attained by experienced professionals because of their ability to anticipate client or environmental behaviors and respond appropriately (Ericsson, 2002). CLT research suggests novice examiners require substantial instructional support, time, effort, and deliberate practice before they can construct new schemas to function as expert examiners who are able to administer tests without errors (Kalyuga & Sweller, 2005).
Literature Regarding Examiner Errors
Recent research found that 17 graduate student test examiners committed errors on 98% of the 51 protocols examined (Loe et al., 2007). More than 1300 errors were committed across all of the protocols. Errors per record form ranged from 0 to 60 with an average of 25.8. Of particular importance and relevance, these graduate student examiners also underestimated the overall IQ on 51% of the protocols and overestimated the overall IQ on only 16% of the protocols.
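The figures cited from Loe et al. (2007) can be cross-checked with simple arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, and the implied protocol counts are approximations because the published percentages were themselves rounded.

```python
# Consistency check on the error statistics quoted from Loe et al. (2007).
# All names here are illustrative; only the input numbers come from the text.

PROTOCOLS = 51                    # protocols examined
MEAN_ERRORS_PER_PROTOCOL = 25.8   # reported average errors per record form


def implied_count(share: float, total: int = PROTOCOLS) -> int:
    """Approximate number of protocols implied by a reported percentage."""
    return round(share * total)


def implied_total_errors() -> float:
    """Total errors implied by the reported mean and the protocol count."""
    return round(MEAN_ERRORS_PER_PROTOCOL * PROTOCOLS, 1)


if __name__ == "__main__":
    print("Protocols with errors:", implied_count(0.98))
    print("IQ underestimated on:", implied_count(0.51))
    print("IQ overestimated on:", implied_count(0.16))
    print("Implied total errors:", implied_total_errors())
```

The implied total is consistent with the statement that more than 1300 errors were committed, and underestimates outnumber overestimates by roughly three to one.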
Practitioners also make errors during their administration and scoring of intelligence tests (Rottmann, 2006). Practitioners who, on average, had eight years of experience and had administered the Wechsler Adult Intelligence Scale--Revised (WAIS-R; Wechsler, 1981) approximately 160 times still made errors that affected overall IQs on 27 of 50 protocols (Slate, Jones, Murray, & Coulter, 1993). These errors affected the correct IQ by as much as five points. Taken together, these studies suggest student examiners and practitioners make multiple errors during the testing process that can spuriously impact examinees' IQs.
Only one fairly old study was found that directly compared the effects of experienced and inexperienced examiners' administration of standardized tests on examinees' test scores (Fuchs, Fuchs, Dailey, & Power, 1985). However, the study focused on examiner familiarity with the examinee and only investigated experience tangentially. In this study, experienced and inexperienced examiners were trained to administer a standardized speech and language scale. The examiners administered the scale to 22 preschool students who had a mean age of 58.32 months. The researchers did not analyze their data for statistical significance between experienced and inexperienced examiners. Nonetheless, the findings indicated that examinees tested by experienced examiners obtained substantially higher scores (approximately 5.5 standard score points) than examinees tested by inexperienced examiners (Fuchs et al., 1985).
In light of the literature reviewed, if examinees evaluated by more experienced examiners obtain higher scores on intelligence tests than examinees evaluated by less experienced examiners, then the latter examinees are disadvantaged and potentially harmed. Inaccurately low scores will influence students' educational programming and placement due to systematic error or factors other than the students' cognitive functioning.
Purpose of the Study
The purpose of this study is to investigate whether examinees will obtain higher intelligence test scores after being tested by examiners who obtained substantive test administration practice. Scores earned by examinees evaluated by examiners who have acquired additional deliberate practice with intelligence tests are compared to scores obtained by examinees evaluated by examiners with less experience and practice with the instruments. It is hypothesized that a relationship will be found between examiner experience and deliberate practice with a testing instrument and examinees' test scores. Specifically, it is hypothesized that the average overall IQ will be higher in cases where examiners have more experience and deliberate practice administering intelligence tests.
The participants in this study consist of 14 female graduate students working towards an Educational Specialist Degree in School Psychology. The ethnic composition of the graduate student population is as follows: 10 Caucasians, 3 Hispanics, and 1 African American. Examinee participants included 74 volunteer school-age examinees recruited by the graduate school psychology students to participate in practice evaluations required for an intellectual assessment course. Examinee participants were frequently acquaintances of the examiners. Of these participants, 39 were male and 35 female. Examinee participants ranged in age from 5 to 18 years old with a mean age of 9.90 and a standard deviation of 3.82. Data regarding examinees' socioeconomic status and level of parental education were unavailable. Examinee participants were required to be enrolled in general education school programs.
The instruments used in this study include the Wechsler Intelligence Scale for Children--Fourth Edition (WISC-IV), the Stanford-Binet Intelligence Scales--Fifth Edition (SB5), and the Differential Ability Scales (DAS). These instruments were selected due to their research history, rich tradition, frequent use in the psychoeducational evaluations of school-age children, and strong psychometric properties.
The WISC-IV is an individually administered instrument for assessing cognitive abilities of children aged 6 years through 16 years 11 months. This version contains 10 to 15 subtests and composite scores that represent intellectual functioning in specific cognitive domains and general intellectual ability--the Full Scale IQ (Wechsler, 2003). The reliability coefficient for the WISC-IV Full Scale IQ is approximately .97 for all age groups (Wechsler, 2003).
The SB5 is an individually administered assessment of intelligence and cognitive abilities. It is administered to persons ranging in age from 2 years through greater than 85 years. The Full Scale IQ is a composite scale and it consists of all ten subtests. The reliability coefficients for the SB5 Full Scale IQ are quite high (.97 to .98) and relatively consistent across age groups (Roid, 2003).
The DAS is an individually administered battery of cognitive and achievement tests for children and adolescents aged 2 1/2 through 17 years. The DAS overall IQ or General Conceptual Ability (GCA) score has reliability coefficients ranging from .89 to .95 based on the age range utilized (Elliott, 1990).
The 14 school psychology graduate students obtained a convenience sample of participants to serve as volunteers as the graduate student examiners practiced the administration, scoring, and interpretation of the different intelligence tests. The examinee participants consist of a convenience sample of volunteer school-age students in a first test session examinee group and a second, different convenience sample of volunteer school-age students in a final test session examinee group. Examinee participants were enrolled and assigned to test session conditions sequentially and purposively to obtain similar groups with respect to gender and ethnicity. Parental informed consent and child assent were obtained for all examinees.
All examiner participants had taken and passed courses in the assessment of academic achievement and social-emotional assessment prior to enrolling in their intellectual assessment course. The graduate students received training in the administration and scoring of intelligence tests and completed and received a passing score on at least one practice administration before they administered the first test session.
Novice, but trained, graduate student examiners administered intelligence tests to the convenience sample of volunteer school-age examinees assigned to the first test session. The second and different convenience sample of volunteer school-age examinees was administered a final test session by the same graduate students, who had become more experienced graduate student examiners. The same novice examiners were provided with deliberate test administration practice to bring them to the level of more experienced examiners by the final test session.
Each examiner administered the same number of assessments in the first test session and the final test session. An important procedural step was implemented to provide the novice examiners with deliberate test practice. Thus, before they administered the final test session, the graduate student examiners also administered the WISC-IV three times, the SB5 two times, and the DAS two times. Each of these administrations was graded by their instructor. This procedural step was included to ensure the students received substantial experience, deliberate practice, and instructive feedback in test administration.
Data were obtained for 14 administrations of the WISC-IV and 14 of the DAS in the first test session and an equivalent number in the final test session. Due to the requirements of sequential and purposive assignment, comparable examinee test data were only available for nine administrations of the SB5 from the first test session and an equivalent number for the final test session. This is because five of the initial examinee participants who were administered the SB5 were above the age range for the study group during the first test session. All 14 examiner participants administered the WISC-IV and DAS in the first and final test sessions. To ensure comparable groups, only nine examiner participants administered the SB5 in the final test session.
A mixed between-within analysis of variance was conducted to examine the impact of examiner deliberate practice and test administration experience on examinee overall mean IQs. The results reveal a statistically significant difference for mean examinee IQs at the p < .05 level for the main effect of examiner level of experience [Wilks' Lambda = .812, F(1, 34) = 7.851, p = .008]. For the three tests combined, the mean IQ for the first test session is 104.43 and the mean IQ for the final test session is 111.19 (ΔM = -6.76). The magnitude of the difference is large (η² = .188). Generally, .01 is considered a small effect, .06 a moderate effect, and .14 a large effect (Cohen, 1988). No significant interaction effect is evident between examiner level of experience and the specific type of intelligence test administered [Wilks' Lambda = .954, F(2, 34) = .817, p = .450]. That is, there is no statistically significant effect for change in scores from the first to the final test session on the WISC-IV compared to the SB5 and compared to the DAS. In addition, the main effect comparing examinee mean IQs on the three intelligence tests is not significant, F(2, 34) = .020, p = .98. That is, the mean IQs for the three tests are not significantly different. Assumption testing indicated that based on the variability of scores there are no significant differences between the groups. In light of the sample, the assumption of normality is not assured. Nonetheless, because 37 protocols were obtained in the first test session and 37 in the final session, and with a relatively large overall N of 74, violations of normality are unlikely to affect the findings (Tabachnick & Fidell, 1996).
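The reported effect size can be approximately recovered from the reported F statistic. The sketch below assumes the reported value is partial eta squared, which the text does not state explicitly, and all function names are hypothetical. The "negligible" label for values under .01 is likewise an assumption added for completeness, since the text only names the small, moderate, and large benchmarks.

```python
# Illustrative check of the reported statistics. Assumption: the effect size
# in the text is partial eta squared, computed from F and its degrees of freedom.


def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared = (F * df_effect) / (F * df_effect + df_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cohen_label(eta_sq: float) -> str:
    """Classify an effect with the benchmarks cited in the text (Cohen, 1988)."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "moderate"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumed label; not named in the text


if __name__ == "__main__":
    # Main effect of examiner experience: F(1, 34) = 7.851
    main_effect = partial_eta_squared(7.851, 1, 34)
    print(round(main_effect, 3), cohen_label(main_effect))

    # Under the same assumption, 1 - eta squared matches the reported Wilks' Lambda
    print(round(1 - main_effect, 3))

    # Difference between the first and final session means
    print(round(104.43 - 111.19, 2))
```

Under this assumption the main-effect values agree with the reported η² of .188 and Wilks' Lambda of .812, and the same calculation applied to the interaction term, F(2, 34) = .817, yields the reported Lambda of .954.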
The results of this study support the hypothesis that final test session IQs will be higher than first test session IQs for groups of different examinees tested by relatively experienced and inexperienced examiners, respectively. These results suggest examinees evaluated by examiners who are provided deliberate test practice and have more experience with the administration and scoring of an intelligence test will tend to obtain higher scores than examinees evaluated by examiners who have less experience and deliberate practice with test administration. As noted earlier, proficient performers develop the ability to anticipate examinee or environmental behaviors and respond appropriately (Ericsson, 2002). It is conceivable that experienced examiners may provide examinees subtle, nuanced test directions that do not violate standardized administration, but clarify the nature of the task. For example, when assembling geometric design models that examinees must replicate, examiners may exaggerate fashioning the angles of their models to help examinees clearly understand nuances of the task. Additionally, when testing auditory recall by stating words or numbers examinees must repeat either forward or backward, examiners can ensure that examinees not only hear their words but see the movement of their lips as a means of providing multisensory input to aid recall. Given the relatively small increase in DAS mean IQ from the first to the final test session, it is conceivable that tests such as the DAS that include numerous teaching items will moderate (but not eliminate) the effect of examiner experience as less experienced examiners have additional opportunities to ensure examinees more fully understand test directions and task requirements.
Overall, these research findings have important practical implications for evaluations conducted in myriad settings. In schools and clinical settings, many intern examiners and less experienced examiners conduct evaluations of students. As a function of CLT, clinical supervisors and administrative supervisors need to ensure that interns and new practitioners have sufficient experience and deliberate practice with the instruments they administer. Although this research did not identify a minimum number of practice test sessions, given that five test sessions were conducted with the WISC-IV, it appears that at least double that number of practice test sessions may be necessary.
Further, research regarding CLT and expert performance reveals that factors other than experience are important in developing expertise. Deliberate practice is considered crucial in efforts to improve performance because individual differences in performance are largely accounted for by variance in the amount of deliberate practice that is specifically defined (van Gog et al., 2005). Due to inexperienced examiners' potential for cognitive overload, supervisors can ensure that "[d]eliberate practice activities are at an appropriate, challenging level of difficulty, and enable successive refinement by allowing for repetition, giving room to make and correct errors, and providing informative feedback to the learner" (van Gog et al., 2005, p. 75). Without ample opportunities for deliberate practice, including monitoring of numerous test administrations, checking for and correcting errors, and providing instructive feedback to inexperienced test examiners, obtained examinee test scores may be invalid for use in educational and other decision-making.
Limitations, Future Research, and Conclusion
The implications of these findings are moderated by specific limitations. A fairly small and geographically homogeneous sample participated in the study. In addition, a random sample and random assignment were not utilized. Thus, these results cannot be generalized because they are susceptible to confounding by order of arrival and because the participants were not representative of the general population. Therefore, the findings are considered preliminary and additional studies are needed.
Future research should use a random sample of examinees comprised of students from various educational, ethnic, and socioeconomic backgrounds. In addition, the experienced and inexperienced examiners should administer a single intelligence test to participants in a counterbalanced manner. Finally, future research should also be conducted with adult examinees. Despite its limitations, this first-step study with intelligence tests is consistent with research findings that indicate graduate student examiners tend to substantially underestimate examinees' overall IQs (Loe et al., 2007). Overall, the results of the present study indicate that instructors of test examiners should utilize systematic, deliberate practice with specific test instruments as a teaching and learning model in order to obtain accurate examinee scores.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: Author.
Baddeley, A. D. (1986). Working memory. New York: Oxford University Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Edwards, O.W., & Oakland, T.D. (2006). Factorial invariance of Woodcock-Johnson III scores for Caucasian Americans and African Americans. Journal of Psychoeducational Assessment, 24, 358-366. doi: 10.1177/0734282906289595
Edwards, O.W., & Paulin, R. (2007). Referred students scores on the Reynolds Intellectual Assessment Scales and the Wechsler Intelligence Scale for Children-IV. Journal of Psychoeducational Assessment, 27, 334-340. doi: 10.1177/0734282907300453
Elliott, C.D. (1990). DAS Administration and Scoring Manual. New York: Psych Corp.
Ericsson, K.A. (2002). Attaining excellence through deliberate practice: Insights from the study of expert performance. In M. Ferrari (Ed.), The pursuit of excellence through education (pp. 21-55). Hillsdale, NJ: Erlbaum. doi: 10.4219/jeg-2005-335
Ericsson, K.A., & Kintsch, W. (1995). Long-term working memory. Psychological Review, 102, 211-245. doi: 10.1037/0033-295X.102.2.211
Fuchs, D., Fuchs, L., Dailey, A., & Power, M. (1985). The effect of examiner's personal familiarity and professional experience on handicapped children's test performance. Journal of Educational Research, 78, 141-146. doi: 10.1177/074193258600700508
Kalyuga, S., & Sweller, J. (2005). Rapid dynamic assessment of expertise to improve the efficiency of adaptive e-learning. Educational Technology, Research and Development, 53, 83-93. doi: 10.1007/BF02504800
Loe, S., Kadlubek, R., & Marks, W. (2007). Administration and scoring errors on the WISC-IV among graduate student examiners. Journal of Psychoeducational Assessment, 25, 237-247. doi: 10.1177/0734282906296505
Roid, G.H. (2003). Stanford-Binet Intelligence Scales, Fifth Edition. Itasca, IL: Riverside.
Slate, J.R., Jones, C.H., Murray, R.A., & Coulter, C. (1993). Evidence that practitioners err in administering and scoring the WAIS-R. Measurement and Evaluation in Counseling and Development, 25, 156-161.
Tabachnick, B.G., & Fidell, L.S. (1996). Using multivariate statistics (3rd ed.). Mahwah, New Jersey: Lawrence Erlbaum.
van Gog, T., Ericsson, K.A., Rikers, R.M.J.P., & Paas, F. (2005). Instructional design for advanced learners: Establishing connections between the theoretical frameworks of cognitive load and deliberate practice. Educational Technology Research and Development, 53, 73-81. doi: 10.1007/BF02504799
Wechsler, D. (1974). Wechsler Intelligence Scale for Children-Revised. New York: Psych Corp.
Wechsler, D. (1991). Wechsler Intelligence Scale for Children-Third Edition. New York: Psych Corp.
Wechsler, D. (1994). Wechsler Adult Intelligence Scale-Revised. New York: Psych Corp.
Wechsler, D. (2003). Wechsler Intelligence Scale for Children-Fourth Edition. New York: Psych Corp.
Correspondence concerning this article should be addressed to Dr. Oliver W. Edwards, Associate Professor and Coordinator, School Psychology Program, University of Central Florida. Email: email@example.com
Oliver W. Edwards, Ph.D., NCSP and Amy Rottman, University of Central Florida.