
What makes a test difficult? Exploring the effect of item content on students' performance.

Item difficulty on tests administered to 25 undergraduate students in an introductory statistics course was systematically varied by manipulating the item content using a multiple-reversal, mixed within-between design. The hypothesis that test questions containing examples that are easier to relate to will produce higher scores than those that are abstract was not supported. In fact, the relationship between intended and empirical item difficulty was found to be paradoxical (students tended to perform better on items intended to be more difficult) and influenced by several background variables. The findings of the present experiment argue for an empirical evaluation of even face-valid item writing principles.

Key Words: item difficulty, multiple choice, statistics education, academic testing, item writing, quantitative reasoning


There is limited research available on how to reliably manipulate the difficulty level of test items meant to measure quantitative reasoning skills (Daniel & Embretson, 2010). In the absence of established methods to vary conceptual and computational complexity, it is difficult to develop assessment instruments with a known difficulty gradient that allows for an efficient evaluation of mathematical knowledge. Studies on empirically validated item generation models that enable test developers to control for item difficulty typically focus on level of cognitive complexity or the extent to which item stimulus features are relevant to the measurement of the target constructs.

Mayer, Larkin and Kadane (1984) developed a cognitive model of mathematical problem solving that assumes two global stages of processing: representation and execution. Their two-stage processing model starts with the conceptual understanding of the problem and ends in a procedural execution of a mathematical solution. The overarching skill is translating a problem into an equation and then solving it through the application of mathematical formulas. Webb's (2007) notion of Depth of Knowledge proposes a similar model by introducing a hierarchy that captures the levels of processing required by an item, ranging from recall of facts, through interpretation of data and strategic thinking, to complex reasoning. In actual practice, however, item complexity is typically determined by expert rating (Moreno, Martinez & Muniz, 2006; Wyse & Viger, 2011) rather than systematic empirical investigations (Crehan & Haladyna, 1991).

There is a paucity of studies that experimentally manipulate relevant aspects of test items to isolate the components that determine their difficulty level. Besides the factors described above, the wording of items in general, the number and plausibility of distracters in multiple choice (MC) paradigms, and the complexity of calculations are other possible determinants of item difficulty. The type of examples used in either MC questions or word problems may also play a role. In their review of the literature, Erdodi and Lavicza (2008) identified an ability-anxiety-attitude triad as an explanatory model for statistics performance. They suggested that making the material relatable to students has the potential to address both anxiety and negative attitudes towards statistics.

As an extension of that concept to the assessment process, the present study proposed to evaluate the extent to which the type of examples used in statistics tests influences student performance. It was hypothesized that students would score higher on test items in which the narrative contains topics and elements that resonate with their daily experiences than on test items composed of material that is unfamiliar and hence difficult to relate to.



The sample consisted of 25 undergraduate students (10 males, 15 females, M_age = 20.7 years, SD_age = 1.9, range: 18-26) enrolled in an introductory statistics course for the social sciences. On average, they had completed 3.2 years of college at the beginning of the study (SD = 1.5, range: 2-8). As a prerequisite, students were required to have successfully completed a basic course in mathematics and an introductory psychology course.


Four tests were administered throughout the semester. Each test was divided into an MC section and a computation section. MC questions had four or five answer choices and were designed to assess the conceptual understanding of statistical principles. The computation section required students to construct frequency distributions, graph data, calculate descriptive statistics or perform hypothesis tests based on a given data set.

Two versions of the tests (A and B) were developed that differed in intended difficulty level. The wording of the tests was altered to study the effect of the content on test performance. The examples used to construct MC items or word problems (WP) were drawn either from topics close to the typical college student such as grades, partying, relationships, and life after graduation (meant to establish a sense of familiarity and easier processing of the information--labeled easy) or from abstract topics in the natural and medical sciences, as well as economics (meant to create an emotional distance and require more effortful processing--labeled hard).


Students were randomly assigned to either the A or B group on the day of the first test. They retained their group membership for the rest of the semester. Prior to each test, the class roster paired with the previously assigned group (A or B) was displayed using an overhead projector to ensure consistency across tests and thus the integrity of the research design.

The intended difficulty level varied in counterbalanced order: on Test 1, group A received the easy version and group B the hard version. On Test 2, intended difficulty level was reversed: group B received the easy version and group A the hard version, and so on. This design lends itself to a combination of between-group (A vs. B on each test) and within-group (comparing performance across tests for each group) analyses.
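The counterbalancing scheme can be sketched in a few lines of Python. This is only an illustration: it assumes that the "and so on" alternation continues unchanged through Tests 3 and 4, and the function and variable names are hypothetical rather than taken from the study materials.

```python
from itertools import combinations


def intended_difficulty(group: str, test_number: int) -> str:
    """Return 'easy' or 'hard' for group 'A' or 'B' on a 1-based test number."""
    if group not in ("A", "B"):
        raise ValueError("group must be 'A' or 'B'")
    if test_number < 1:
        raise ValueError("test_number must be 1 or greater")
    # Group A starts on the easy version; group B starts on the hard version.
    # ASSUMPTION: the versions keep alternating on every subsequent test.
    a_gets_easy = test_number % 2 == 1
    gets_easy = a_gets_easy if group == "A" else not a_gets_easy
    return "easy" if gets_easy else "hard"


# Version received by each group on Tests 1 through 4.
schedule = {
    group: [intended_difficulty(group, t) for t in range(1, 5)]
    for group in ("A", "B")
}

# Within-group pairs of tests that pit an easy version against a hard one.
predicted_contrasts = [
    (first, second)
    for first, second in combinations(range(1, 5), 2)
    if intended_difficulty("A", first) != intended_difficulty("A", second)
]

for group, versions in schedule.items():
    print(group, versions)
print("easy vs. hard test pairs:", predicted_contrasts)
```

Under this assumption, the easy-versus-hard pairs are Tests 1-2, 1-4, 2-3 and 3-4, which agrees with the predicted contrasts listed in Table 3.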

Students were permitted to use a hand calculator and a 3" x 5" handwritten flash card during the tests. They were allowed to write any information on these cards that they deemed important to know for the tests. Later in the semester, they had access to statistical software during the test. The purpose of these measures was to reduce test anxiety and the demand for rote memorization of definitions, formulas, and cumbersome computations by hand. Research shows that such accommodations can improve student performance during academic testing (Skidmore & Aagaard, 2004).

Data Analysis

The study used a mixed within-between subjects, single blind design. The independent variable (IV) was the relatability of item content, while the dependent variable (DV) was student test score. Within-group analyses compared individual performance on the easy versus hard versions of the tests. Between-group analyses compared test scores as a function of item content within each test. ANOVAs and independent and dependent t tests were used to evaluate the statistical significance of the observed differences. Effect size was expressed in partial η² and Cohen's d. Data analysis was performed using SPSS 11.5.


The two groups did not differ on the final percentage grade: t(22) = 0.35, p = .73 (M_A = 74.7, SD_A = 15.8, M_B = 76.6, SD_B = 10.5). This finding suggests that the overall effects of the IV were neutralized over the four tests. In other words, being assigned to one group versus the other did not result in an advantage or disadvantage at the end.
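As a rough plausibility check, the between-group comparison of final grades can be recomputed from the reported means and standard deviations. The sketch below uses the standard pooled-variance formula for an independent t test. The group sizes are not reported for this comparison, so n = 14 for group A and n = 10 for group B are assumptions, inferred from the 22 degrees of freedom here and from the degrees of freedom of the within-group t tests reported for Test 1.

```python
import math


def pooled_t_and_d(m1, sd1, n1, m2, sd2, n2):
    """Pooled-variance independent t, its df, and Cohen's d from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    pooled_sd = math.sqrt(pooled_var)
    standard_error = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    t = (m1 - m2) / standard_error
    d = (m1 - m2) / pooled_sd
    return t, df, d


# Reported means and SDs of the final percentage grade.
# ASSUMPTION: group sizes are inferred, not taken from the article.
n_a, n_b = 14, 10
t, df, d = pooled_t_and_d(76.6, 10.5, n_b, 74.7, 15.8, n_a)
print(f"t({df}) = {t:.2f}, d = {d:.2f}")
```

With these assumed group sizes the recomputed statistic is about 0.33, close to but not identical with the reported 0.35. The gap may reflect rounding of the summary statistics or group sizes that differ from the ones assumed here.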

None of the between-group contrasts comparing the easy vs. hard versions of the tests were significant on either the MC or WP sections (p-values ranged from .25 to .96), and neither were the contrasts between MC and WP within each test (p-values ranged from .26 to .96). These negative findings were partly due to large within-group variability and small sample size. Within group A, a repeated measures ANOVA was non-significant on MC [F(3, 11) = 0.83, p = .45; post hoc contrast marginally significant between tests 3-4], but significant on WP [F(3, 11) = 3.21, p < .05, η² = .23 (very large effect); post hoc contrasts significant between tests 1-4 and 3-4]. Table 1 provides a summary of the findings (the hard version of the test is marked in bold). Within group B, a repeated measures ANOVA was significant on MC [F(3, 7) = 3.24, p < .05, η² = .32 (very large effect); post hoc contrast significant between tests 3-4] and marginally significant on WP [F(3, 7) = 2.32, p = .10, η² = .25 (very large effect); post hoc contrast significant between tests 2-4]. Table 2 provides a summary of these results.

Table 3 reviews the number of predicted vs. observed contrasts. Only some of the expected comparisons reached significance. None of the five significant contrasts were in the predicted direction (easy > hard). Four were in the opposite direction (easy < hard), while one occurred where no difference was predicted (hard > hard). All these unpredicted significant contrasts were against Test 4, which produced the lowest mean score of all tests and the only significant post hoc contrast following the ANOVA on the average of Tests 1-4 collapsed across groups A and B: F(3, 19) = 4.85, p < .01, η² = .20 (very large effect).

Figure 1 visually displays the fluctuation of mean scores across tests, format and difficulty level. On Test 1, the only apparent difference was between test formats (MC vs. WP): within group A, the contrast reached significance [t(13) = 1.88, p < .05, d = .50 (medium effect)], while within group B it approached it [t(9) = 1.46, p = .09, d = .46 (medium effect)]. No significant contrast was found on Test 2. Within group B, the MC-WP contrast approached significance [t(8) = 1.66, p = .06, d = .55 (medium effect)] on Test 3. Finally, none of the contrasts reached significance on Test 4.
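The effect sizes reported for these dependent t tests are numerically consistent with the common conversion d = t / sqrt(n), where n = df + 1 is the number of paired observations. The article does not state which formula was used, so the sketch below is an inference from the reported numbers rather than a description of the original computation.

```python
import math


def paired_d_from_t(t: float, df: int) -> float:
    """Cohen's d for a dependent t test, computed as t / sqrt(n) with n = df + 1."""
    n = df + 1
    return t / math.sqrt(n)


# (label, t, df, reported d) for the MC vs. WP contrasts reported in the text.
reported = [
    ("Test 1, group A", 1.88, 13, 0.50),
    ("Test 1, group B", 1.46, 9, 0.46),
    ("Test 3, group B", 1.66, 8, 0.55),
]

# ASSUMPTION: d = t / sqrt(n) is the formula behind the reported values.
recomputed = [round(paired_d_from_t(t, df), 2) for _, t, df, _ in reported]

for (label, t, df, d_reported), d_calc in zip(reported, recomputed):
    print(f"{label}: reported d = {d_reported:.2f}, recomputed d = {d_calc:.2f}")
```

All three recomputed values match the reported ones to two decimals, which also implies 14, 10 and 9 paired observations for the respective contrasts.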


The present study was designed to investigate the effect of relatability of item wording on student performance in an undergraduate statistics course. The narrative content of the test items was systematically varied between familiar and abstract or complex topics that the typical college student finds difficult to relate to.

The absence of a between-group effect on all four tests in conjunction with the inconsistent within-group effects signals a discrepancy between the intended and actual item difficulty. Overall, the IV did affect student performance, but this influence was subtle, unpredictable and vulnerable to other factors. As Figure 1 (and the underlying ANOVA) suggests, the switch from one test to the next (and the related changes in the complexity of the material and students' preparedness) is a stronger predictor of test scores than intended item difficulty. Although within-group contrasts were more sensitive to the effect of the IV, they were clearly driven by the overall effect of the unusually low Test 4 scores.

In summary, despite careful manipulation of item content, none of the 16 planned contrasts showed the predicted effect. On the contrary, four out of five significant contrasts were in the opposite direction (easy < hard). Two possible explanations for these paradoxical findings are advanced.


First, the mean scores on Test 4 are noticeably and uniformly lower across groups and test formats. This likely reflects a steep increase in the conceptual complexity of the course material: at this point, students were expected to understand the differences among analysis of variance, non-parametric tests and effect size estimation, and to apply that knowledge to hypothesis testing. In other words, the unusually poor performance on Test 4 can be viewed as an artifactual finding that overrode the effect of the IV and generated spurious relationships among the DV and experimentally inactive factors of the research design. The fact that comparisons that did not involve Test 4 failed to reach statistical significance supports this interpretation.

Second, it is possible that the a priori logic of item construction was ultimately flawed. The operating assumption was that if the item content uses elements of reality that speak to the experience of young (18-22) college students, the familiarity effect will alleviate test anxiety, create a more favorable attitude towards the test and thus improve students' cognitive performance. The findings provide no support for this hypothesis whatsoever. Even after discounting the four significant contrasts in the opposite of the predicted direction as a statistical artifact, three of the remaining four contrasts still show an easy < hard pattern at face value (although these differences are not statistically significant). Therefore, these results may suggest that creating a familiarity effect in test items can in fact work against test performance: students may develop a false sense of confidence, underestimate the true difficulty level of the item, and fail to put forth the effort they otherwise dedicate to questions that appear challenging at face value.

The present study investigated the relationship between apparent item complexity (i.e., the relatability of test material) and student performance in an undergraduate statistics course. The IV had the opposite of the predicted effect on test scores: overall, students tended to perform better on versions of the tests that were designed to be more difficult--a classic example of a Type III error. Although this finding may appear counterintuitive at first, it can be accounted for by theoretical models of teaching, testing and the non-linear nature of knowledge acquisition. Cognitive variables manipulated in experimental designs are known to covary with background variables, and as such are an acknowledged threat to the validity of the findings (Daniel & Embretson, 2010). The high level of within-group variability can also be interpreted as indirect evidence of the undue influence of construct-irrelevant variance that contaminates the measurement process. Examinees are expected to differ in their ability to decipher cues embedded in the verbal texture of the test questions as a function of their reading comprehension, mastery of test taking strategy ("testmanship"), test anxiety and other test taker characteristics that are poorly understood (Martinez, 1999).

The present study has several strengths: a sound experimental design [multiple reversal (ABAB)] with a combination of between- and within-group contrasts, random assignment of participants to experimental condition, and a rationally driven systematic manipulation of the test stimuli. The study also has a number of weaknesses: a small sample size, as well as few test items and conditions.

Data analysis converges on two main conclusions. First, unmeasured variables can exert a robust influence on the DV to the point where they overpower the IV(s) in educational research. Second, a priori assumptions about the relationship between variables sometimes turn out to be in the opposite direction. This observation serves as a humbling reminder that academic teaching and assessment are stochastic processes, and creates an imperative to empirically test even face-valid conceptualizations of human behavior and cognitive processes before the proposed mechanism is widely accepted and applied. It also identifies a need for theory-based and data-driven methods to develop measurement instruments with known parameters that yield predictable and interpretable results. Further research is needed to expand the existing knowledge base on the principles of item construction and their interactions with contextual variables.


Crehan, K. D., & Haladyna, T. M. (1991). The validity of two item-writing rules. Journal of Experimental Education, 59, 183-192.

Daniel, R. C., & Embretson, S. E. (2010). Designing cognitive complexity in mathematical problem-solving items. Applied Psychological Measurement, 34 (5), 348-364.

Erdodi, L., & Lavicza, Z. (2008, August). Statistics'r'Us: From Affect Size to Effect Size -- Reshaping Students' Attitude from Aversion and Anxiety to Curiosity and Confidence. Paper presented at the Annual Convention of the Mathematical Association of America, Madison, WI.

Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34 (4), 207-218.

Mayer, R. E., Larkin, J., & Kadane, J. B. (1984). A cognitive analysis of mathematical problem solving ability. In R. Sternberg (Ed.), Advances in the psychology of human intelligence (Vol. 2, pp. 231-273). Hillsdale, NJ: Lawrence Erlbaum.

Moreno, R., Martinez, R. J., & Muniz, J. (2006). New guidelines for developing multiple-choice items. Methodology, 2 (2), 65-72.

Skidmore, R. L., & Aagaard, L. (2004). The relationship between testing condition and student test scores. Journal of Instructional Psychology, 31 (4), 304-312.

Webb, N. L. (2007). Issues related to judging the alignment of curriculum standards and assessments. Applied Measurement in Education, 20 (1), 7-25.

Wyse, A. E., & Viger, S. G. (2011). How item writers understand depth of knowledge. Educational Assessment, 16, 185-206.

Laszlo Erdodi, Department of Psychology, Eastern Michigan University.

Laszlo Erdodi is now at Dartmouth College, Geisel School of Medicine, Department of Psychiatry, Neuropsychology Service.

Correspondence concerning this article should be addressed to Laszlo Erdodi at
Table 1
Within-Subject Contrasts on Multiple Choice (MC) and Word Problem (WP) Sections across Tests 1 through 4 (Group A). Hard version marked in bold.

Group A

Section   Test   M      SD     p      η²     Significant post hoc contrasts
MC        1      58.6   26.7   0.45   0.07   3-4, p = .05
MC        2      69.2   32.5
MC        3      68.3   13.1
MC        4      53.6   23.7
WP        1      71.6   24.6   <.05   0.23   1-4, p < .01; 3-4, p < .05
WP        2      62.4   30.1
WP        3      71.1   52.5
WP        4             26.5

Table 2
Within-Subject Contrasts on Multiple Choice (MC) and Word Problem (WP) Sections across Tests 1 through 4 (Group B). Hard version marked in bold.

Group B

Section   Test   M      SD     p      η²     Significant post hoc contrasts
MC        1      58.0   27.4   <.05   .32    3-4, p < .01
MC        2      72.7   34.4
MC        3      65.3   13.7
MC        4      43.2   19.7
WP        1      70.2   17.2   .10    .25    2-4, p < .01
WP        2      74.8   22.2
WP        3      80.0   19.0
WP        4      56.5   27.2

Table 3
Predicted Versus Observed Significant Within-Subject Contrasts Across Tests 1 through 4

Predicted significant contrasts: 1-2, 1-4, 2-3, 3-4

Observed significant contrasts

Test format   Group A    Group B
MC            3-4        3-4
WP            1-4, 3-4   2-4

COPYRIGHT 2012 George Uhlig Publisher
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2012 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Erdodi, Laszlo A.
Publication:Journal of Instructional Psychology
Article Type:Report
Geographic Code:1USA
Date:Sep 1, 2012