
Assessing growth in young children: a comparison of raw, age-equivalent, and standard scores using the Peabody Picture Vocabulary Test.

Many tests provide users with several different types of scores to facilitate interpretation and description of students' performance. Common examples include raw scores, age- and grade-equivalent scores, and standard scores. However, when used within the context of assessing growth among young children, these scores should not be interchangeable because they provide different information. To examine how raw, age-equivalent, and standard scores function when assessing growth among young children, this article uses scores on the Peabody Picture Vocabulary Test-Third Edition to compare the use of these scores for the purpose of measuring growth in receptive vocabulary skills among a sample of 259 low-income, predominantly Hispanic preschoolers age 3 to 5 years. Results suggest a notable floor effect in the distribution of age-equivalent scores that was not observed in the raw score or standard score distributions. This floor effect may significantly affect the results of correlational data analyses conducted with these scores. In light of these findings and combined with a trend in the literature in which researchers often do not provide a clear rationale for choosing which test scores to use in statistical analyses, the authors offer suggestions for researchers when using test scores as dependent variables.

Keywords: assessment, psychological tests, measurement, young children


Measuring growth or change among young children is a common goal of educational and psychological assessment, especially within the context of measuring academic progress over time or intervention studies attempting to investigate treatment effects using pre-post designs. Within this context, constructs of interest include cognitive functioning, early academic skills, motor functioning, language, social skills, and adaptive behaviors (McConnell, Priest, Davis, & McEvoy, 2002; Spector, 1999). Psychological and educational tests designed for these purposes typically provide the user with several different types of scores to use to facilitate interpretation and description of students' performance, and which scores are most appropriate to interpret will depend on the purpose of the assessment. For example, raw scores and percentage-correct scores can be used to describe the student's current level of mastery, whereas norm-referenced scores, such as age- and grade-equivalent scores, standard scores, and percentiles, can be used to describe the student's performance relative to her same-age peers. Given this variety of scores from which to choose, test users may wonder which score is the best for reliably and validly measuring change over time. Further, intervention studies using standardized tests to assess growth sometimes neglect to report which test scores were used as measures of the dependent variables, which may influence the results of statistical analyses and therefore influence the conclusions reached. The purpose of this article is to briefly define raw, age-equivalent, and standard scores; review some of the psychometric limitations associated with these different scores; and empirically compare these scores for the purpose of assessing growth among young children in particular, using scores on the Peabody Picture Vocabulary Test-Third Edition (PPVT-III; Dunn & Dunn, 1997) from a sample of preschoolers. 
The PPVT-III is an ideal measure for this purpose because it is widely used for clinical and research purposes and provides all three scores under investigation. Through our review and empirical findings, we hope to demonstrate that these scores should not be used interchangeably, because they provide different pieces of information about the student, and that the psychometric limitations of some of these scores suggest the need for cautious interpretation.


Raw Scores

Within the context of cognitive and achievement assessment, raw scores are typically obtained by simply counting the number of test items answered correctly by the student (Angoff, 1984). Some tests (e.g., processing speed subtests) employ more complex scoring procedures to obtain raw scores, such as subtracting the number of errors from the number of correct responses. Further, on tests of affective constructs, such as emotional and behavioral functioning, raw scores are not determined by item correctness because there are no "correct" or "incorrect" responses to these items. Rather, raw scores are determined by adding together the student's (or parent's, or teacher's) responses to items employing a Likert-type scale.

For the purposes of criterion-referenced assessment, in which test users are interested in the individual student's mastery without regard to comparisons with other children, raw scores may be sufficient in describing the student's performance. For the purposes of norm-referenced assessment, however, raw scores by themselves are often less informative (Urbina, 2004). Instead, they must be converted into a norm-based score to describe performance relative to other children the student's age. Further, raw scores for different tests (even different tests of the same construct) cannot be compared with one another, because the same raw score can mean different things for different tests based on factors such as the number of items, type of items, minimum and maximum scores, item difficulty, time limits, and process for calculating raw scores. At the same time, raw scores (such as those obtained via curriculum-based assessment) may be more sensitive than norm-based scores to smaller changes in psychological or educational functioning over time, and thus may be especially useful in measuring growth within individual students (Riccio, Sullivan, & Cohen, 2010).

Age-Equivalent Scores

Although the review that follows will include discussion of age-equivalent (AE) and grade-equivalent (GE) scores (due to their conceptual similarities and shared limitations), the study itself employed only AE scores, because the PPVT-III does not provide GEs. The use of AE and GE scores for clinical and educational decision-making has a long history, particularly in identifying students with learning disabilities (Hishinuma & Tadaki, 1997; Reynolds, 1981), diagnosing speech problems and specific language impairment (Lawrence, 1992; McCauley & Demetras, 1990; Plante, 1998), and measuring the development of adaptive behaviors over time among children with developmental disabilities (Chadwick, Cuddy, Kusel, & Taylor, 2005). In spite of their widespread use, AE and GE scores raise a number of concerns that limit their clinical utility and restrict the interpretations and decisions that should be made on the basis of these scores. These limitations have been well articulated in the literature (Angoff, 1984; Bracken, 1988; Pearson Assessments, 2010; Reynolds, 1981), and will be reviewed only briefly here.

AE scores can be defined as "the chronological ages for which the given test performances are average" (Angoff, 1984, p. 20). Angoff (1984) described the process used to develop AE scores: children in the norm group are divided into subgroups based on age (e.g., using intervals of 3, 6, or 12 months, as is often done with norm-referenced tests of cognitive ability); either the mean or median test score is identified for each age-defined subgroup; and this score (either the mean or median) then becomes the AE score for each age-defined subgroup. Thus, if a raw score of 30 converts to an AE score of 8-9, this means that the raw score of 30 was the average score for the group of children age 8 years 9 months in the norm sample.
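The derivation Angoff describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the norm sample, the age bands, and the helper names `build_ae_table` and `age_equivalent` are all invented here and are not taken from any published test manual. The sketch uses the median of each age band and maps a raw score to the band whose median is closest.

```python
from statistics import median

# Hypothetical norm sample: age band (years-months) -> raw scores observed in that band.
# These values are invented for illustration; they are NOT PPVT-III norms.
NORM_SAMPLE = {
    "8-3": [24, 26, 27, 29, 30],
    "8-6": [26, 28, 29, 31, 33],
    "8-9": [27, 29, 30, 32, 34],
    "9-0": [29, 31, 32, 34, 35],
}


def build_ae_table(norm_sample):
    """Map the median raw score of each age band to that age band."""
    return {median(scores): band for band, scores in norm_sample.items()}


def age_equivalent(raw_score, ae_table):
    """Return the age band whose median raw score is closest to raw_score.

    Raw scores below the lowest median all map to the youngest band,
    which is how a floor arises at the bottom of the scale.
    """
    nearest_median = min(ae_table, key=lambda m: abs(m - raw_score))
    return ae_table[nearest_median]


AE_TABLE = build_ae_table(NORM_SAMPLE)
print(age_equivalent(30, AE_TABLE))  # 8-9: a raw score of 30 is the median for age 8-9
```

In this toy table, every raw score well below the youngest band's median receives the same AE, a property that matters for the floor effects discussed later in the article.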

Similar to AE scores, GE scores are defined as "the grades for which these test performances are average" (Angoff, 1984, p. 22). Thus, GEs are obtained in the same way as AE scores, only grade is used as the basis for creating subgroups instead of age, with the score representing the mean or median score for each grade level (Angoff, 1984). For example, if a raw score of 30 converts to a GE score of 3.4, this means that the raw score of 30 was the average score for the group of children in grade 3.4 (i.e., 3rd grade, fourth month).

In sum, then, AEs and GEs represent the mean or median raw score obtained by a particular age group or grade level. The limitations associated with this type of score are many. First, interpretation of AEs and GEs depends on the unique distribution and variance of scores around the mean at each age/grade (Bishop, 1997; Schulz & Nicewander, 1997), and on the correlation between age/grade and test performance (Angoff, 1984); these distributions and correlations change with age/grade, even among different age/grade subgroups on the same test. For example, reading skills and language skills do not develop at a stable or constant rate; rather, they typically develop rapidly during young childhood but then level off into adolescence and adulthood, so the score distributions will vary with different ages and grade levels (Bishop, 1997; Bracken, 1988; Reynolds, 1981). The implication of this issue is that, similar to raw scores, AEs and GEs are not equivalent or comparable across different tests, so the same AE or GE score may indicate a substantial deficit on a test of reading comprehension but not on another test of a different construct. Thus, a GE score of 10.3 on a test of mathematics achievement and a GE score of 10.3 on a test of reading achievement cannot be directly compared, due to grade-related differences in the development of these constructs and differences in score distributions for these constructs (Angoff, 1984). To be able to accurately interpret an AE or GE score, the test user must have access to information about the sample and relationships among variables. For the practitioner, this requirement makes AEs and GEs cumbersome. 
Another limitation specifically in the interpretation of GEs is that curricula and instruction are not constant across different schools and school districts (e.g., different states emphasize different knowledge and skills at different grade levels), so these scores do not represent uniform measures of achievement in reading, writing, mathematics, and so on (Urbina, 2004).

In light of these limitations of AE and GE scores, why do researchers and clinicians continue to use them? One of the advantages of using AEs and GEs is that they are simple, intuitive, and give the appearance of being easily understood by parents and teachers. For this reason, these scores are frequently used by clinicians to facilitate score interpretation (Angoff, 1984; Hishinuma & Tadaki, 1997; McCauley & Swisher, 1984b). For example, AEs and GEs can be used to compare several different students on the same test to identify those students who are functioning at the highest and lowest levels. Alternatively, AEs and GEs can be used to identify strengths and weaknesses across multiple subject areas or constructs for an individual student (although skills in different constructs develop at different rates; Payne, 1997; Thorndike, 2005). These types of scores also are used to describe performance on constructs that change rapidly during childhood or adolescence as a result of normal developmental processes and learning (McCauley & Swisher, 1984a). However, this notion of simplicity and ease of interpretation is countered by the argument that these scores are too easily misinterpreted or overinterpreted (Lawrence, 1992). Thus, the appearance of intuitiveness and simplicity may in fact represent the most significant danger in using AEs and GEs, as they are not as simple for nonexperts to interpret as they may appear. For example, if a student in the 4th grade obtained a GE score of 5.4 on a test of reading achievement, we would not recommend that she suddenly be placed in the 5th grade on the basis of this score, nor would we presume that she possesses the reading skills of a 5th-grader. It would simply mean that the student performed better than other 4th-graders on the test, and her raw score was the average score for 5th-graders (Urbina, 2004). 
From a curricular standpoint, it would not make sense to conclude that this student would perform well with 5th-grade tasks or subject matter, because the student has not yet been taught this subject matter (Angoff, 1984). What would be more informative would be to look at her standard score to see how well she performed when compared to her same-age peers (i.e., where did she fall in the distribution of other 4th-graders, or other 9-year-olds?) to identify norm-referenced strengths and weaknesses.

Standard Scores

Standard scores are used to express the distance of the student's score from the normative mean, in terms of standard deviation units (Urbina, 2004). The most commonly used standard score within the context of cognitive and achievement assessment is the deviation IQ score, in which raw scores are converted to a score with a mean of 100 and a standard deviation of 15. Thus, if a raw score converts to a standard score of 100, the student is performing right at the mean; a standard score of 85 indicates performance one standard deviation below the mean, and a standard score of 115 indicates performance one standard deviation above it.

An advantage of using standard scores is that they are comparable across different tests. That is, a deviation IQ score of 115 can always be interpreted as one standard deviation above the mean (or the 84th percentile rank), a score of 130 can always be interpreted as two standard deviations above the mean (or the 98th percentile rank), and so on, for any norm-referenced test that uses this measurement index (assuming the test produces scores that approximate the normal distribution). If a student scores 130 on a math achievement test and scores 130 on a reading comprehension test, then we can say that the student is in the same relative position on both tests. Note that we could not make this interpretation using simple raw scores for reasons discussed above. Further, standard scores are more appropriate scores to use in diagnostic decision-making than AEs and GEs, because standard scores do take into account the distribution of scores around the mean.
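The correspondence between deviation IQ scores and percentile ranks can be checked with a short sketch. It assumes, as the text does, that scores are approximately normally distributed; the function names and the example norm values (mean 30, SD 10) are hypothetical.

```python
from math import erf, sqrt


def standard_score(raw, norm_mean, norm_sd, scale_mean=100.0, scale_sd=15.0):
    """Convert a raw score to a deviation-IQ-style standard score."""
    z = (raw - norm_mean) / norm_sd
    return scale_mean + scale_sd * z


def percentile_rank(score, scale_mean=100.0, scale_sd=15.0):
    """Percentile rank of a standard score under a normal distribution."""
    z = (score - scale_mean) / scale_sd
    return 100.0 * 0.5 * (1.0 + erf(z / sqrt(2.0)))


# Hypothetical age norms: mean raw score 30, SD 10.
print(standard_score(40, norm_mean=30, norm_sd=10))  # 115.0
print(round(percentile_rank(115)))                   # 84
print(round(percentile_rank(130)))                   # 98
```

Because the percentile depends only on the standard score, the same value of 130 marks the same relative position on any test that uses this metric, provided the normality assumption roughly holds.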

Standard scores also provide the advantage of describing a range of normal performance, allowing test users to picture students' performance or behavior along a continuum from deficient to advanced functioning, with a wide range of "typical" functioning in between (Lawrence, 1992). In contrast, students scoring either below or above their group-defined AE or GE are simply seen as deficient or advanced, respectively, even though most will score either below or above the average score (Bishop, 1997). Finally, standard scores can be manipulated statistically (i.e., added, subtracted, or averaged) because they are on an interval scale of measurement, whereas AEs and GEs cannot be manipulated mathematically because they are on an ordinal scale (Plante, 1998; Schulz & Nicewander, 1997). Due to their superior psychometric properties and norm-referenced interpretability, standard scores, rather than AEs and GEs, are considered more appropriate for high-stakes decisions, such as diagnosis, placement, and need for intervention (Bracken, 1988).

Despite their advantages over raw scores and AE scores, the major limitation associated with using standard scores for measuring growth over time is that actual increases in performance as measured by raw scores may be masked by the use of a score that is norm referenced (Fletcher et al., 1991; Lindsey & Brouwers, 1999). Thus, if a child maintains her standard score of 100 across three different data points, this does not mean that the child's development has stagnated; rather, her raw score is increasing at the same rate as her same-age peers' raw scores. Therefore, from a norm-referenced perspective, her position along the normative distribution has not changed. Her raw score, however, has increased, indicating individual growth in the skill being measured. Among children with disabilities, or who are otherwise at-risk, it may be especially important to assess change via raw scores, because these children may be developing at a slower rate than their peers who do not have disabilities. In this case, progress will be detected if raw scores are used but not if standard scores are used.
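A small numerical sketch may make this masking concrete. The age norms and the two children below are invented for illustration (they are not PPVT-III values): one child gains raw-score points at the same rate as the norm group, and the other gains more slowly.

```python
# Hypothetical age norms: age in months -> (mean raw score, SD). Invented values.
NORMS = {48: (30.0, 10.0), 54: (36.0, 10.0), 60: (42.0, 10.0)}


def standard_score(raw, age_months):
    """Deviation-IQ-style score (M = 100, SD = 15) relative to same-age norms."""
    mean, sd = NORMS[age_months]
    return 100.0 + 15.0 * (raw - mean) / sd


# Raw scores at three assessment points (age in months -> raw score).
typical_child = {48: 30, 54: 36, 60: 42}  # grows at the same rate as peers
slower_child = {48: 20, 54: 23, 60: 26}   # grows, but more slowly than peers

typical_ss = [standard_score(raw, age) for age, raw in typical_child.items()]
slower_ss = [standard_score(raw, age) for age, raw in slower_child.items()]

print(typical_ss)  # [100.0, 100.0, 100.0]
print(slower_ss)   # [85.0, 80.5, 76.0]
```

In both cases the raw scores rise at every assessment, yet the standard score is flat for the first child and declines for the second, so only the raw scores register the individual growth.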


To provide some context for this study, it is informative to explore how PPVT scores have been used in peer-reviewed published journal articles. We conducted a content analysis using PsycInfo to identify all peer-reviewed journal articles published from 2000 to 2010 (articles extracted February 15, 2010) that used any version of the PPVT with child samples. This search revealed a total of 123 articles, with most published in 2009 (n = 17), 2001 (n = 16), 2008 (n = 13), and 2003 (n = 13). Two researchers reviewed each of the 123 identified articles independently to assess inter-rater agreement in terms of which type of score was used in each study. The initial rate of agreement was 95.1% (i.e., raters agreed on 117 out of 123 articles); a third researcher reviewed the remaining six articles to resolve discrepancies among raters. Results revealed that most studies used standard scores (n = 67, 54.5%), followed by raw scores (n = 16, 13.0%) and some combination of multiple scores, such as standard and raw scores (n = 6, 4.9%). A much smaller number of articles used age equivalents (n = 3, 2.4%) or were classified as "other" (n = 2, 1.6%). One study included in the "other" category used stanine scores and the other combined scores from the PPVT and an expressive vocabulary test to create a total vocabulary score. The remaining articles (n = 29, 23.6%) did not indicate or define the type of scores employed in the study. Of these 29 articles, the type of scores used could be deduced from the researchers' statistical results in 14 studies, and standard scores were used in all 14 of these studies. The type of score used could not be determined for the remaining 15 studies. This is concerning because readers cannot tell how the results of these studies should be interpreted.
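The percentages reported above follow directly from the counts, as the sketch below shows. The counts are taken from the text; the variable and function names are illustrative only. Note that the agreement figure is simple percent agreement, which does not correct for chance agreement as an index such as Cohen's kappa would.

```python
TOTAL_ARTICLES = 123
ARTICLES_AGREED = 117

# Counts by score type, as reported in the content analysis.
SCORE_TYPE_COUNTS = {
    "standard": 67,
    "raw": 16,
    "combination": 6,
    "age equivalent": 3,
    "other": 2,
    "not reported": 29,
}


def percent(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


agreement_rate = percent(ARTICLES_AGREED, TOTAL_ARTICLES)
shares = {kind: percent(n, TOTAL_ARTICLES) for kind, n in SCORE_TYPE_COUNTS.items()}

print(agreement_rate)  # 95.1
print(shares)
```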

Two researchers also reviewed each of the 123 identified articles independently to assess interrater reliability in terms of whether the authors of each study provided a rationale or justification for their score selection in the description of their methods. The initial rate of agreement was 95.1% (i.e., raters agreed on 117 out of 123 articles); a third researcher reviewed the remaining six articles to resolve discrepancies among raters. Of those authors who clearly reported the type of score used in their analyses (n = 94, 76.4%), only 14 (14.9%) provided a rationale or justification for their score selection in the description of their methods. These justifications were most prevalent for raw scores (n = 9, 64.3%), followed by standard scores (n = 5, 35.7%). To illustrate, sample rationales for raw scores include: "In preparing the data for analysis, raw scores from the PPVT-R were used instead of the customary Standard Score Equivalent because the former have not had age factored out; thus, differences based on straightforward ability may be more apparent" (Cunningham & Graham, 2000, p. 40) and "Because age differences were the focus of the current investigation, the children's raw scores were used in all analyses" (Wolfe & Bell, 2007, p. 440). A sample justification for standard scores includes: "To obtain a standardized estimate of children's verbal intelligence, we utilized standard scores rather than raw scores in all analyses" (Lewis, Dozier, Ackerman, & Sepulveda-Kozakowski, 2007, p. 1419).

Several interesting trends can be seen in these results. First, the standard score is the most commonly used score in research studies using the PPVT. Second, almost one fourth of studies using the PPVT did not clearly describe the score used in statistical analyses. Although the type of score used could be deduced in about one half of these studies, research consumers should not be left to make these judgments. Third, approximately 85% of studies for which the score used was clearly identifiable did not include a rationale for why that score was used. This suggests that researchers may not be thinking about the implications of using certain scores and how these decisions could influence their conclusions. At the very least, researchers are not clearly articulating these decisional processes to the consumers of their results.


In light of the various types of scores being utilized in research and practice, this study posed the following research questions: (1) Do raw, AE, and standard scores produce comparable distributional characteristics with young children? and (2) How does the type of score used influence commonly employed statistical analyses when assessing young children? These questions have important implications for interpreting statistical results and may inform how scores are used and reported in research with young children. Along with the content review, these questions provide a nice illustration of the importance of considering the type of scores utilized in research studies.



Data were obtained from a larger intervention study dataset, in which data were collected from four Head Start centers in a high-poverty neighborhood in a large metropolitan city in south Texas. Nearly two thirds of the families reported average annual incomes below $20,000, and fewer than 5% of parents reported earning a college degree from a 4-year institution. From the four matched Head Start centers, 259 children (51.7% female) age 3 years to 5 years 8 months were sampled, with an average age of 4.11 years (SD = .62). Participants were predominantly of Mexican American origin (95%), with English often (67%) the preferred language spoken at home. These data were collected as part of a larger study to assess the impact of an intervention designed to increase healthy eating and physical activity (n = 122 in the intervention group) when compared to a control group (n = 137).


The PPVT-III is a standardized measure of receptive vocabulary for use with individuals from age 2-6 to 90 years. Examinees are required to point to one of four pictures that best represents the meaning of a verbally presented stimulus word. In addition to its use within the context of clinical assessment, the PPVT-III frequently has been used as a measure of vocabulary and receptive language in research studies with young children. Scores on the PPVT-III possess strong internal consistency and test-retest reliability, with coefficients consistently larger than .90 for the age group under study (Campbell, 1998; Dunn & Dunn, 1997; Williams & Wang, 1997). To meet the goals of the present study, we employed the commonly used standardized score estimates (M = 100, SD = 15), age equivalent estimates (expressed as age in years and months where a particular raw score is the median score), and raw scores (total number of correct responses) of children's receptive vocabulary. AE scores on the PPVT-III range from 1 year 9 months to 22 years, which purports to capture the range in which receptive vocabulary is most likely to increase at a relatively consistent and progressive rate (Williams & Wang, 1997).


Scores on the PPVT-III were initially collected as part of an evaluation of a 12-week psychoeducational intervention promoting young children's physical health, early academic skills, and school readiness (the Healthy & Ready to Learn program), in which the PPVT-III was used as a measure of receptive vocabulary. As part of the intervention study, data were collected at pre- and posttreatment (12 weeks apart) by researchers trained on the PPVT-III standardized administration and scoring procedures. Researchers were required to practice administration and scoring to ensure their competency prior to data collection.


Prior to statistical analyses, it is critical to assess score accuracy and distributional characteristics. As indicated by the histograms (see Figures 1 and 2), the raw, standard, and AE distributions differed considerably for the pretest and posttest data. These results demonstrate the noticeably larger skew of the AE scores due to floor effects, with cases piling up at an AE score of 1.75 (i.e., 1 year 9 months, the lowest possible AE score on the PPVT-III). Notice that these floor effects are unique to AE scores, as participants with a wide variety of raw and standard scores are given the same AE score. At pretest, the participants at the floor (n = 91) all had the same AE score of 1.75 despite having, at times, vastly different raw (M = 13.91, SD = 4.93, minimum [min.] = 3, maximum [max.] = 23) and standard (M = 65.58, SD = 8.88, min. = 46, max. = 82) scores. The same trend was revealed at posttest for the participants at the floor (n = 43), whose raw (M = 17.16, SD = 4.35, min. = 9, max. = 23) and standard (M = 65.47, SD = 8.41, min. = 45, max. = 81) scores also varied widely. Collectively, these results suggest that although raw and standard scores display considerable variability, the AE scores remain constant at 1.75; this presents a potential problem for AE scores from a distributional and accuracy perspective, especially at the lower age ranges.

Not surprisingly, these data characteristics explain the change in correlations between the AE and other scores at various age ranges at pretest using linear and quadratic models (see Table 1). As expected, the younger sample produced the smallest correlations between these scores due to the floor effect, thus suggesting these scores are not completely interchangeable when used with young children. These results were replicated at posttest, but to a lesser degree due to the smaller number of floor effect cases. Figure 3 shows the correlation between AE and standard scores at pretest and posttest, and illustrates the floor effects observed with the AE scores.
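The way a floor weakens the correlation between AE scores and other scores can be illustrated with a deterministic sketch. Everything about the conversion below is an assumption: the floor value of 1.75 comes from the text, the cutoff of 23 raw points is merely inferred from the maximum raw score reported among floor cases, and the linear slope above the floor is invented. The raw-score ranges standing in for lower- and higher-scoring groups are likewise hypothetical.

```python
from math import sqrt

AE_FLOOR = 1.75      # lowest possible AE score, per the text
FLOOR_RAW_MAX = 23   # assumption: raw scores at or below this sit at the floor


def to_age_equivalent(raw):
    """Hypothetical raw-to-AE conversion with a floor (slope is invented)."""
    if raw <= FLOOR_RAW_MAX:
        return AE_FLOOR
    return AE_FLOOR + 0.05 * (raw - FLOOR_RAW_MAX)


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def raw_ae_correlation(raw_scores):
    """Correlation between raw scores and their converted AE scores."""
    return pearson(raw_scores, [to_age_equivalent(r) for r in raw_scores])


r_lower_group = raw_ae_correlation(list(range(3, 36)))    # many floor cases
r_higher_group = raw_ae_correlation(list(range(15, 60)))  # few floor cases
r_above_floor = raw_ae_correlation(list(range(24, 60)))   # no floor cases

print(r_lower_group, r_higher_group, r_above_floor)
```

Under these assumptions the correlation is essentially perfect when no one is at the floor and drops as the share of floor cases grows, which mirrors the pattern described for the youngest children.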

To better understand how the distributional characteristics of scores influence statistical analyses conducted with these scores, data were analyzed using three multilevel models, with children's growth (i.e., time) at Level 1 and the child-level variables (i.e., treatment) at Level 2. These models were used to examine participants' growth on the PPVT-III scores from pre- to posttest and to assess whether the type of PPVT-III scores utilized moderated the perceived treatment effect. For each analysis conducted, two separate models were fit sequentially to explore the impact of the treatment variable. Model 1, an unconditional linear growth model, was fit to examine the amount of dependency in the outcome variables and establish baseline statistics related to changes in participants' growth (i.e., slopes) and starting points (i.e., intercepts, or initial status). Model 2 evaluated whether treatment status significantly predicted student growth on each outcome variable.

Results from each PPVT variable are presented in Table 2. The average intercept ($\beta_{00}$) and slope ($\beta_{10}$) across participants were of less interest, as they estimate the average initial status (or pretest score) across groups and the overall average amount of change from pretest to posttest, respectively. However, it is worth noting that no group differences emerged at pretest ($\beta_{01}$). The parameter estimates ($\beta_{11}$) of primary interest tested whether the treatment group experienced significantly more growth (or change) compared to the control group from pretest to posttest. These parameter estimates are also accompanied by an effect size using the equations provided by Feingold (2009).
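For readers less familiar with this notation, a conventional two-level linear growth model consistent with the parameters named above can be written as follows. This is a sketch of the standard textbook form, not a specification confirmed by the article: the symbols $\pi_{0i}$, $\pi_{1i}$, $e_{ti}$, $r_{0i}$, and $r_{1i}$ are assumed here, as is the labeling of the pretest group difference as $\beta_{01}$. With only two time points, the random slope term $r_{1i}$ typically cannot be estimated separately from the residual and may have been fixed.

```latex
\begin{align}
\text{Level 1:}\quad Y_{ti} &= \pi_{0i} + \pi_{1i}\,\mathrm{Time}_{ti} + e_{ti} \\
\text{Level 2:}\quad \pi_{0i} &= \beta_{00} + \beta_{01}\,\mathrm{Treatment}_{i} + r_{0i} \\
\pi_{1i} &= \beta_{10} + \beta_{11}\,\mathrm{Treatment}_{i} + r_{1i}
\end{align}
```

Here $Y_{ti}$ is the PPVT-III score of child $i$ at time $t$. The effect size approach of Feingold (2009) is commonly summarized as $d = \beta_{11} \times \text{duration} / SD_{\text{raw}}$, where duration is the length of the study in the time units of the model and $SD_{\text{raw}}$ is the pooled within-group standard deviation of the outcome; whether the authors applied exactly this form is not stated in the text.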

The results revealed relatively consistent findings across the three PPVT-III variables from a statistical significance standpoint (see $\beta_{11}$ in Table 2) using our sample, although the parameter estimates are considerably different due to the differences in measurement scales. Collectively, these results indicated no differences at pretest, and the growth rates were relatively consistent across the three PPVT-III variables. This was further supported by the relatively consistent effect sizes (see $ES_{\beta_{11}}$), which contradicts the notion that raw scores are more appropriate for measuring change given that scores are not adjusted for age.

It was interesting that growth rates were not more biased for AE scores, given the large number of participants exposed to floor effects at pretest. However, these results cannot be generalized to all datasets, nor can it be inferred that the score used does not influence the result, given the differences in how these scores are derived. In fact, it might be advantageous for researchers to test the sensitivity of their results to the type of score utilized, perhaps by analyzing data with different scores to assess the degree to which results are influenced by choice of score. At the very least, researchers should justify the type of score employed and consider the implications of using that type of score.


Perhaps the most significant finding from the data analysis is the notable floor effect with AE scores, which was not observed in either the raw or standard score distributions (see Figures 1 and 2). This pattern is significant because it suggests that AEs lack the precision of raw and standard scores in that many children with different raw and standard scores obtained the exact same AE score. Thus, using AEs may mask true differences in ability level that are seen with raw and standard scores, which makes it difficult to make distinctions among children at the low range of ability. This finding holds important implications for both practitioners and researchers and is especially salient within the context of assessing young children. For many constructs (e.g., reading, language skills), floor effects are most likely to occur at the younger end of the age spectrum due to less variability in scores at young ages (Bracken, 1988; Catts, Petscher, Schatschneider, Bridges, & Mendoza, 2009) and among children at lower ability levels, such as those with developmental disabilities (e.g., Dickson, Wang, Lombard, & Dube, 2006). To illustrate, Dickson et al. (2006) used PPVT-III AE scores to assess the receptive language skills of children, adolescents, and young adults with developmental disabilities. These scores demonstrated a marked floor effect, in which almost half of the sample obtained the lowest possible AE score, either by earning the lowest possible score or by failing to establish a basal score. Similar floor effects were observed with adolescents with Down syndrome on the Stanford-Binet, Fourth Edition (SB-IV) (Couzens, Cuskelly, & Jobling, 2004). Thus, our findings may generalize to other tests of ability and academic achievement and other samples of young children and children and adolescents at the lower end of the score distribution.

Although not well demonstrated with our example, the AE floor effect is especially relevant within the context of assessing growth over time, because growth in raw or standard scores may not be large enough to be seen with AE scores. In other words, the group of children scoring at the floor at pretest may again score at the floor upon posttest even though their raw scores and standard scores have increased from pretest to posttest, because the AE scale is not sensitive enough to detect this change. The use of AEs may therefore mask true pre-post changes in ability levels, and the resulting pile-up of scores at the floor also violates the assumption of normality in most cases.
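A minimal sketch of this masking follows. The conversion function is hypothetical: the floor of 1.75 is taken from the text, the raw-score cutoff of 23 is inferred from the maximum raw score reported among floor cases, and the slope above the floor, along with the three example children, is invented.

```python
AE_FLOOR = 1.75      # lowest possible AE score, per the text
FLOOR_RAW_MAX = 23   # assumption: raw scores at or below this sit at the floor


def to_age_equivalent(raw):
    """Hypothetical raw-to-AE conversion with a floor (slope is invented)."""
    if raw <= FLOOR_RAW_MAX:
        return AE_FLOOR
    return AE_FLOOR + 0.05 * (raw - FLOOR_RAW_MAX)


def gains(pre_raw, post_raw):
    """Return (raw-score gain, AE gain) from pretest to posttest."""
    raw_gain = post_raw - pre_raw
    ae_gain = round(to_age_equivalent(post_raw) - to_age_equivalent(pre_raw), 2)
    return raw_gain, ae_gain


# Hypothetical children: (pretest raw score, posttest raw score).
children = {"A": (8, 15), "B": (14, 22), "C": (25, 33)}

for name, (pre, post) in children.items():
    print(name, gains(pre, post))
# A (7, 0.0)
# B (8, 0.0)
# C (8, 0.4)
```

Children A and B improve by 7 and 8 raw-score points yet show no AE change, because both of their scores fall at the floor; child C, who starts above the floor, shows a similar raw gain that does register on the AE scale.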

Also of note is the impact of floor effects on any statistical analyses or comparisons based on the distribution of AE scores. For example, predictive validity analyses between test scores and some later outcome may be compromised by floor effects and skewed distributions, thus reducing correlation coefficients due to a lack of differentiation among children who score low on the test (see Catts et al., 2009). Similar difficulties will likely be observed with other correlational analyses, such as test-retest reliability and convergent validity analyses (e.g., correlations with scores on other measures of cognitive functioning and achievement), due to limited variability of the AE score distribution.

With that said, a surprising finding was that growth rates were not more severely underestimated with AE scores, given the large number of participants subject to floor effects. This finding may be due to unique characteristics of our study, such as the sample and treatment used, the length of the interval between pre- and posttest, and, perhaps most importantly, relatively small overall treatment effects. Thus, we cannot assume that this finding will generalize to other studies, as AE scores may be more biased under different conditions. For example, it is feasible that larger treatment effects would be more attenuated by floor effects, that samples with higher functioning participants would be less influenced by floor effects, or that floor effects would have more influence when comparing children with a diagnosis to children without a diagnosis. Thus, the impact on estimated effect sizes may be either increased or reduced depending on these factors. More research will be necessary to examine the influences of these factors on the utility of AE scores for measuring growth.

Aside from the notable floor effect, results also suggest that raw scores, AEs, and standard scores should not be used interchangeably or interpreted as alternative expressions of one another, because they clearly provide different information by measuring children's performance in different ways. Indeed, one score may indicate performance slightly below average while another score suggests serious weaknesses. This is consistent with other studies that found low and/or widely variant correlations between AEs/GEs and standard scores (e.g., Hishinuma & Tadaki, 1997; Plante, 1998). For example, Hishinuma and Tadaki (1997) demonstrated that among some of the subtests on the Wechsler Individual Achievement Test (WIAT), students in lower grade levels could obtain a standard score only slightly below the mean of 100, but a GE score significantly lower than their actual grade placement. Similarly, Couzens et al. (2004) demonstrated that among a sample of children with Down syndrome, AE scores on subtests of the SB-IV increased over time, while children's standard scores (i.e., IQ scores) on the same measure decreased over time. Thus, interpreting results based on AE scores would suggest that these children were making progress over time, but interpreting results based on standard scores would suggest progressively wider discrepancies between these children and their same-age peers. Likewise, Gabriels, Ivers, Hill, Agnew, and McNeill (2007) found different patterns of change in adaptive behaviors (assessed with the Vineland Adaptive Behavior Scales) over time among a sample of children with autism spectrum disorders, depending on whether raw or standard scores were used. Children in both the high and low cognitive ability groups showed significant decreases in standard scores over time, but when raw scores were used, children in the high cognitive ability group showed an increase in adaptive behaviors over time while children in the low cognitive ability group stayed the same. Thus, the conclusions reached depended on which scores were used in the analyses.
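How raw scores can rise while standard scores fall may be easier to see with a small worked sketch. It assumes a simplified linear conversion to the conventional standard-score metric (mean of 100, standard deviation of 15); published tests derive standard scores from age-based norm tables, and the norm means, norm standard deviations, and the child's raw scores below are hypothetical.

```python
SS_MEAN, SS_SD = 100, 15  # conventional standard-score metric


def standard_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a standard score via a z-score (simplified)."""
    z = (raw - norm_mean) / norm_sd
    return SS_MEAN + SS_SD * z


# Hypothetical age-group norms: age in years -> (mean raw score, SD).
HYPOTHETICAL_NORMS = {3: (40.0, 10.0), 4: (55.0, 10.0)}

# Invented raw scores for one child tested at ages 3 and 4.
raw_at_3, raw_at_4 = 30, 38

ss_at_3 = standard_score(raw_at_3, *HYPOTHETICAL_NORMS[3])
ss_at_4 = standard_score(raw_at_4, *HYPOTHETICAL_NORMS[4])

print("raw change:", raw_at_4 - raw_at_3)
print("standard score change:", round(ss_at_4 - ss_at_3, 1))
```

In this sketch the child's raw score grows by 8 points, but the normative mean grows by 15 points over the same interval, so the child falls further behind same-age peers and the standard score drops. The raw score answers a criterion-referenced question (did the child learn more?), whereas the standard score answers a norm-referenced one (did the child keep pace with peers?).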

Which, then, is the best score to use to measure growth among young children? The susceptibility of AE scores to floor effects (in addition to the other limitations of these scores discussed previously) makes them the least useful. The choice between raw and standard scores depends on what we want to know. If we want to assess criterion-related change, raw scores are most appropriate. The use of raw scores is especially important when assessing change among groups of young children who may be at risk for atypical levels or rates of development, including English language learners, children with disabilities, and children of low socioeconomic status (Vagh, Pan, & Mancilla-Martinez, 2009), because true change in performance may be masked when using standard scores that compare these children to their typically developing peers. On the other hand, if we want to assess growth from a norm-referenced perspective, then standard scores are most appropriate. Hammer, Lawrence, and Miccio (2008) advocated the use of raw and standard scores when assessing growth among young children, as each score informs us in different ways: change in individual children's knowledge or abilities (raw scores) and change in knowledge or abilities in comparison to other children's knowledge or abilities (standard scores).

In light of (1) the results of our content analysis suggesting many researchers do not clearly describe which test scores they use in data analyses, (2) our statistical analyses showing how floor effects may attenuate correlation coefficients, and (3) previous studies illustrating that conclusions are strongly influenced by which scores researchers choose to use, we urge researchers to report which scores they use in their analyses and to provide a sound rationale for this decision. This rationale should be based on factors such as whether criterion-referenced or norm-referenced interpretation is more appropriate given the research question, the nature of the sample (e.g., age distribution, clinical or at-risk sample vs. "normal" sample), and the distribution of scores (e.g., presence of floor effects). Recall that our content analysis of studies using PPVT scores revealed that all of the studies that did not clearly indicate which scores were used, and for which we were able to deduce which scores were used, employed standard scores. This suggests that standard scores may be the presumptive "default" score. But researchers must consider the type of score when evaluating growth and how that decision influences the interpretation of their data.

DOI: 10.1080/02568543.2014.883453


A preliminary version of this article was presented at the annual meeting of the National Association of School Psychologists, February 2011, San Francisco, California.


Angoff, W. H. (1984). Scales, norms, and equivalent scores. Princeton, NJ: Educational Testing Service.

Bishop, D. V. M. (1997). Uncommon understanding: Development and disorders of language comprehension in children. Hove, UK: Psychology Press.

Bracken, B. A. (1988). Ten psychometric reasons why similar tests produce dissimilar results. Journal of School Psychology, 26, 155-166. doi: 10.1016/0022-4405(88)90017-9

Campbell, J. (1998). Test review: Peabody Picture Vocabulary Test-Third edition. Journal of Psychoeducational Assessment, 16, 334-338. doi:10.1177/073428299801600405

Catts, H. W., Petscher, Y., Schatschneider, C., Bridges, M. S., & Mendoza, K. (2009). Floor effects associated with universal screening and their impact on the early identification of reading disabilities. Journal of Learning Disabilities, 42, 163-176. doi: 10.1177/0022219408326219

Chadwick, O., Cuddy, M., Kusel, Y., & Taylor, E. (2005). Handicaps and the development of skills between childhood and early adolescence in young people with severe intellectual disabilities. Journal of Intellectual Disability Research, 49, 877-888. doi: 10.1111/j.1365-2788.2005.00716.x

Couzens, D., Cuskelly, M., & Jobling, A. (2004). The Stanford Binet Fourth Edition and its use with individuals with Down syndrome: Cautions for clinicians. International Journal of Disability, Development and Education, 51, 39-56. doi: 10.1080/1034912042000182193

Cunningham, T. H., & Graham, C. R. (2000). Increasing native English vocabulary recognition through Spanish immersion: Cognate transfer from foreign to first language. Journal of Educational Psychology, 92, 37-49. doi: 10.1037/0022-0663.92.1.37

Dickson, C. A., Wang, S. S., Lombard, K. M., & Dube, W. V. (2006). Overselective stimulus control in residential school students with intellectual disabilities. Research in Developmental Disabilities, 27, 618-631.

Dunn, L. M., & Dunn, L. M. (1997). Peabody Picture Vocabulary Test-Third Edition. Circle Pines, MN: American Guidance Service.

Feingold, A. (2009). Effect sizes for growth-modeling analysis for controlled clinical trials in the same metric as for classical analysis. Psychological Methods, 14, 43-53. doi:10.1037/a0014699

Fletcher, J. M., Francis, D. J., Pequegnat, W., Raudenbush, S. W., Bornstein, M. H., Schmitt, F., . . . Stover, E. (1991). Neurobehavioral outcomes in diseases of childhood: Individual change models for pediatric human immunodeficiency viruses. American Psychologist, 46, 1267-1277. doi:10.1037/0003-066X.46.12.1267

Gabriels, R. L., Ivers, B. J., Hill, D. E., Agnew, J. A., & McNeill, J. (2007). Stability of adaptive behaviors in middle-school children with autism spectrum disorders. Research in Autism Spectrum Disorders, 1, 291-303.

Hammer, C. S., Lawrence, F. R., & Miccio, A. W. (2008). Exposure to English before and after entry into Head Start: Bilingual children's receptive language growth in Spanish and English. International Journal of Bilingual Education and Bilingualism, 11, 30-56.

Hishinuma, E. S., & Tadaki, S. (1997). The problem with grade and age equivalents: WIAT as a case in point. Journal of Psychoeducational Assessment, 15, 214-225. doi:10.1177/073428299701500303

Lawrence, C. W. (1992). Assessing the use of age-equivalent scores in clinical management. Language, Speech, and Hearing Services in Schools, 23, 6-8.

Lewis, E. E., Dozier, M., Ackerman, J., & Sepulveda-Kozakowski, S. (2007). The effect of placement instability on adopted children's inhibitory control abilities and oppositional behavior. Developmental Psychology, 43, 1415-1427. doi: 10.1037/0012-1649.43.6.1415

Lindsey, J. C., & Brouwers, P. (1999). Intrapolation and extrapolation of age-equivalent scores for the Bayley II: A comparison of two methods of estimation. Clinical Neuropharmacology, 22, 44-53.

McCauley, R. J., & Demetras, M. J. (1990). The identification of language impairment in the selection of specifically language-impaired subjects. Journal of Speech & Hearing Disorders, 55, 468-475.

McCauley, R. J., & Swisher, L. (1984a). Psychometric review of language and articulation tests for preschool children. Journal of Speech and Hearing Disorders, 49, 34-42.

McCauley, R. J., & Swisher, L. (1984b). Use and misuse of norm-referenced tests in clinical assessment: A hypothetical case. Journal of Speech and Hearing Disorders, 49, 338-348.

McConnell, S. R., Priest, J. S., Davis, S. D., & McEvoy, M. A. (2002). Best practices in measuring growth and development for preschool children. In A. Thomas & J. Grimes (Eds.), Best practices in school psychology-IV (pp. 1231-1246). Bethesda, MD: National Association of School Psychologists.

Payne, D. A. (1997). Applied educational assessment. Belmont, CA: Wadsworth.

Pearson Assessments. (2010). Interpretation problems of age and grade equivalents. Retrieved from http://www. terpretationAgeGradeEquivalents.htm

Plante, E. (1998). Criteria for SLI: The Stark and Tallal legacy and beyond. Journal of Speech, Language & Hearing Research, 41, 951-957.

Reynolds, C. R. (1981). The fallacy of "two years below grade level for age" as a diagnostic criterion for reading disorders. Journal of School Psychology, 19, 350-358.

Riccio, C. A., Sullivan, J. R., & Cohen, M. J. (2010). Neuropsychological assessment and intervention for childhood and adolescent disorders. Hoboken, NJ: Wiley.

Schulz, E. M., & Nicewander, W. A. (1997). Grade equivalent and IRT representations of growth. Journal of Educational Measurement, 34, 315-331.

Spector, J. E. (1999). Precision of age norms in tests used to assess preschool children. Psychology in the Schools, 36, 459-471. doi: 10.1002/(SICI) 1520-6807

Thorndike, R. M. (2005). Measurement and evaluation in psychology and education (7th ed.). Upper Saddle River, NJ: Pearson.

Urbina, S. (2004). Essentials of psychological testing. Hoboken, NJ: Wiley.

Vagh, S. B., Pan, B. A., & Mancilla-Martinez, J. (2009). Measuring growth in bilingual and monolingual children's English productive vocabulary development: The utility of combining parent and teacher report. Child Development, 80, 1545-1563.

Williams, K. T., & Wang, J. (1997). Technical references to the Peabody Picture Vocabulary Test-Third Edition (PPVT-III). Circle Pines, MN: American Guidance Service.

Wolfe, C. D., & Bell, M. A. (2007). Sources of variability in working memory in early childhood: A consideration of age, temperament, language, and brain electrical activity. Cognitive Development, 22, 431-455. doi:10.1016/j.cogdev.2007.08.007

Jeremy R. Sullivan, Suzanne M. Winter, Daniel A. Sass, and Nicole Svenkerud

University of Texas at San Antonio, San Antonio, Texas

Submitted June 8, 2012; accepted July 24, 2012.

Address correspondence to Jeremy R. Sullivan, Department of Educational Psychology, University of Texas at San Antonio, 501 West Cesar E. Chavez Boulevard, San Antonio, TX 78207-4415. E-mail:


(1.) Data were also analyzed using a 2 x 2 (Time x Treatment) ANOVA and an analysis of covariance with pretest as the covariate. Not surprisingly, given the ICCs, the results were nearly identical to those of the multilevel model.

Linear and Quadratic Correlations Between Peabody Picture Vocabulary Test-Third Edition Scores at Pretest for Each Age Level

                                                Linear         Quadratic
Age level      n_T   n_FE (%)    Score        AE     Raw      AE     Raw

3-year-olds    108   60 (56%)    Raw         .901    --      .963    --
                                 Standard    .808   .955     .829   .968
4-year-olds    127   29 (23%)    Raw         .981    --      .989    --
                                 Standard    .916   .958     .925   .964
5-year-olds     24    2 (8%)     Raw         .996    --      .996    --
                                 Standard    .981   .988     .983   .990

Note. n_T and n_FE represent the total sample size for the analysis and the number of participants with floor effect scores. The percent of participants in each sample with floor effects is shown in parentheses.

Parameter Estimates and Effect Sizes for Each of the Three MLM Analyses

                        β_00       β_10      β_01     β_11      ES_β11   ICC

PPVT (Raw score)        30.90 **   9.58 **   -0.23    3.98 **   0.24     0.09
PPVT (Standard score)   79.73 **   3.38 **   -0.33    2.97 *    0.22     0.00
PPVT (Age equivalent)    2.69 **   0.58 **   -0.06    0.31 **   0.28     0.03

Note. ES = effect size; ICC = intraclass correlation; MLM = multilevel model; PPVT = Peabody Picture Vocabulary Test-Third Edition. β_00, β_10, β_01, and β_11 are the estimated intercept, time effect, treatment effect, and time by treatment effect. Recall the time by treatment effect (β_11) is of primary interest, as it tests whether the treatment condition (treatment vs. control) differs over time. ES_β11 represents the overall effect (i.e., effect size) associated with β_11. ICC measures the percent of explainable variability in growth rates due to the treatment effect.

* p statistically significant at 0.05.
** p statistically significant at 0.0125 (.05/4).
COPYRIGHT 2014 Association for Childhood Education International
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2014 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Sullivan, Jeremy R.; Winter, Suzanne M.; Sass, Daniel A.; Svenkerud, Nicole
Publication:Journal of Research in Childhood Education
Article Type:Report
Geographic Code:1USA
Date:Apr 1, 2014