
Assessment of thinking levels in students' answers.

Authors' Note: This research was supported in part by a grant to J.J. Pear from the Social Sciences and Humanities Research Council of Canada. D.E. Crone-Todd was supported by a fellowship from the Social Sciences and Humanities Research Council of Canada. The authors gratefully acknowledge Ms. Sabrina Berry's assistance with this research project.


Having first developed a method, based on Bloom's taxonomy (1956), for assessing the thinking levels required by study questions in computer-mediated courses (Crone-Todd, Pear & Read, 2000), we developed a method for assessing the levels at which students answer the questions. Reliability measures between two independent assessment groups were high (i.e., > 80%). The assessment procedure can serve diagnostic and research purposes in determining how to enable students to increase their thinking levels in post-secondary courses.


Assessment of Thinking Levels in Students' Answers

One of the most important goals of post-secondary education is to promote the use of critical, or higher-order, thinking skills. To this end, educators must find ways to identify, teach, and encourage the use of these skills in their courses.

One of the largest hurdles in this process is developing a precise operational definition, or set of definitions, for what is meant by "higher-order thinking", because there is a lack of consensus concerning the definition of this construct. For example, higher-order thinking may be "reasoned argumentation" (Newman, 1991a, b), comparing elements in terms of sameness (Carnine, 1991), application of concepts or principles (Hohn, Gallagher, & Byrne, 1990; Semb & Spencer, 1976), making discipline-related judgments that are effective (Paul & Heaslip, 1995), or argumentation that is systematic and active (Mayer & Goodchild, 1990). It seemed to us that all of these definitions include various components of what is considered "higher-order thinking", that is, thinking that requires combining elements in ways different from those provided in a textbook or other course materials.

A set of definitions that appears to incorporate all of the definitions above is Bloom's (1956) taxonomy of objectives in the cognitive domain. The taxonomy, which incorporates behavioral definitions of cognitive processes, has been used in a variety of educational settings. Despite its popularity, however, those using the taxonomy for research purposes have encountered problems with its reliability (e.g., Calder, 1983; Gierl, 1997; Kottke & Schuster, 1990; Roberts, 1976; Seddon, 1978; Seddon, Chokotho, & Merritt, 1981). Recently, Crone-Todd, Pear, and Read (2000) used a modified version of Bloom's (1956) taxonomy in the cognitive domain to identify the thinking levels required by study questions in a computer-aided personalized system of instruction (CAPSI) course. The purpose of the study was to begin the development of a more reliable measure of higher-order thinking in CAPSI-taught courses using guided study questions (e.g., Pear & Crone-Todd, 1999) than had been previously reported in the literature. Following the taxonomy, the thinking levels were: (a) Level 1 - Knowledge, (b) Level 2 - Comprehension, (c) Level 3 - Application, (d) Level 4 - Analysis, (e) Level 5 - Synthesis, and (f) Level 6 - Evaluation. Briefly, in the modified taxonomy, Level 1 corresponds to rote learning, Level 2 involves the ability to state an answer in one's own words, Level 3 is the ability to apply what one has learned to new problems or situations, Level 4 is the ability to break down concepts into smaller components, Level 5 is the ability to combine concepts to create new knowledge, and Level 6 is the ability to rationally argue or discuss a position with regard to a given topic. Levels 1 and 2 may be considered lower-order thinking (because they do not involve generation of new concepts or knowledge), while levels 3 through 6 may be considered higher-order thinking (see Crone-Todd et al., 2000, Table 1). Hence, the higher-order levels involve definitions similar to those explored by the researchers other than Bloom discussed above.

Following Williams' (1999) exhortation to operationally define constructs in education, Crone-Todd et al. (2000) developed a precise, step-by-step procedure for determining the level of any given question. They applied the procedure to the study questions in two CAPSI-taught courses and showed that high agreement on thinking levels could be obtained by independent groups of three raters. Independent groups, rather than individuals, were used because discussion among raters appeared to increase the reliability of the assessments. The present study undertook the next step, which is to develop a reliable method for assessing the thinking levels at which the questions are answered. Note that this assessment is different from the assessment of question levels in several important respects. First, in order to determine the level of an answer, one must determine whether the answer is correct both with respect to terminology and content. Second, because we wish to give the student the benefit of the doubt with regard to the adequacy of his or her answer, the process of determining the answer level proceeds in roughly the reverse order of the process of determining the question level. Thus, while the question is assessed at the lowest possible level at which the student can answer it, the answer is assessed at its highest possible level.

In the following we detail how we developed the answer-level assessment procedure and tested its reliability. This study parallels and extends Crone-Todd et al. (2000), which details the development of the question-level assessment procedure.

Method


The data for this study consisted of the archived answers provided by four randomly selected students on two midterm examinations and a final examination from an undergraduate computer-mediated Behavior Modification Principles course, taught from January to April 2000. Each midterm examination consisted of three short-answer essay-type questions, and the final examination consisted of 10 such questions. The examination questions were sampled from study questions that had been assessed in the Crone-Todd et al. (2000) study. The assessment focused on components of questions rather than individual questions, since a given question may have more than one component and the components may be at different levels. For example, a question that asks, "With examples, distinguish between rule-governed and contingency-shaped behavior" is broken into the following three components for assessment purposes: (1) an example of rule-governed behavior; (2) an example of contingency-shaped behavior; and (3) a clear distinction made between the two types of behavior. The analyses included 64 examination questions, which yielded 168 such components. The numbers of components assessed at each level were as follows: (a) Level 1: 49; (b) Level 2: 92; (c) Level 3: 11; (d) Level 4: 14; (e) Level 5: 1; and (f) Level 6: 0. The fact that levels 5 and 6 were virtually unrepresented on the examinations is not unusual; we have found that typically few students do well on these types of questions and most are frustrated by them. Thus, in essence, levels 5 and 6 were not assessed in this study.

As a guide to assessing the answers, the assessors used a flowchart (see Figure 1). The flowchart, along with the assessment instructions, was revised by the raters in the present study, as had been done for the flowchart in the Crone-Todd et al. (2000) study, to increase the precision of the assessment, as determined by the level of agreement obtained. There were two independent groups of raters, each consisting of two research assistants. One group consisted of a doctoral candidate (the second author) and a third-year undergraduate student (the third author), while the other group consisted of a third-year undergraduate student and an individual with a B.A. in Psychology (the fourth author).

Answers were assessed for each component of a question. As seen in Figure 1, answer components were assessed only if they used appropriate terminology and were correct. Raters were given the questions for the exams and the students' answers, but not the students' identities, the question levels, the feedback, or the grades on the exams. After each rater had independently assessed the exams, each group met separately to discuss and come to an agreement on the answer levels. The two groups then met and compared (without changing) their assessments for the purpose of obtaining an estimate of the intergroup reliability. The groups then discussed disagreements and came to a final agreement on the levels of the answers. Raters had to agree on a given student's answers before assessing the next student's answers. This process was repeated for the assessment of all four students' examinations.

Results


For the purpose of calculating reliability, levels 1 and 2 (lower-level thinking) were combined into one category and levels 4, 5, and 6 into another, to increase the instances in each category and to reduce the fineness of the discriminations required to obtain agreement. (Note that level 0 corresponds to an answer that used inappropriate terminology or that was otherwise incorrect.) Table 1 shows the total number of agreements and disagreements in each category, the point-to-point agreement or percent of agreements (i.e., agreements divided by agreements plus disagreements and multiplied by 100) between groups for each level, the value of the between-groups Kappa statistic for each level, and the interpretation of the Kappa values obtained (Landis & Koch, 1977). The main advantage of the Kappa statistic is that it takes chance agreements into account (Kazdin, 1982). Note that the point-to-point agreements are all above 80% and that the Kappa values are all in the substantial range.
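
As an illustration only, the reliability calculations described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the procedure actually used to produce Table 1: the function names, the example ratings, and the choice to compute each category's Kappa on a "this category versus all others" basis are all hypothetical. The category grouping (0, 1-2, 3, 4-6) and the percent-agreement formula follow the text, and the interpretation labels follow Landis and Koch (1977).

```python
from collections import Counter


def collapse(level):
    """Map a code 0-6 onto the combined categories used for reliability."""
    if level == 0:
        return "0"      # inappropriate terminology or otherwise incorrect
    if level in (1, 2):
        return "1-2"    # lower-level thinking
    if level == 3:
        return "3"
    if level in (4, 5, 6):
        return "4-6"
    raise ValueError(f"unknown level: {level!r}")


def percent_agreement(a, b, category):
    """Agreements / (agreements + disagreements) * 100 for one category."""
    pairs = list(zip(a, b))
    agree = sum(1 for x, y in pairs if x == category and y == category)
    # A disagreement is a component that exactly one group placed here.
    disagree = sum(1 for x, y in pairs if (x == category) != (y == category))
    if agree + disagree == 0:
        return None  # neither group used this category
    return 100.0 * agree / (agree + disagree)


def cohen_kappa(a, b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = len(a)
    p_observed = sum(1 for x, y in zip(a, b) if x == y) / n
    count_a, count_b = Counter(a), Counter(b)
    p_chance = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_chance == 1.0:
        return None  # undefined: both groups used a single identical code
    return (p_observed - p_chance) / (1.0 - p_chance)


def category_kappa(a, b, category):
    """Assumed per-category Kappa: this category versus all others."""
    return cohen_kappa([x == category for x in a], [y == category for y in b])


def landis_koch(kappa):
    """Verbal interpretation of Kappa (Landis & Koch, 1977)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical answer levels assigned by two groups to ten components.
group_a = [collapse(v) for v in [1, 2, 2, 3, 0, 4, 2, 1, 3, 5]]
group_b = [collapse(v) for v in [2, 2, 1, 3, 0, 4, 3, 1, 3, 4]]

for cat in ["0", "1-2", "3", "4-6"]:
    print(cat, percent_agreement(group_a, group_b, cat),
          category_kappa(group_a, group_b, cat))
print("overall kappa:", cohen_kappa(group_a, group_b))
```

Collapsing the levels before comparing means that a Level 1 versus Level 2 difference between groups counts as an agreement, which is the coarser discrimination the text describes.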

See issue's website.

Discussion


The results show that the reliability of the answer assessment method described in this paper is high. While the Crone-Todd et al. (2000) study provided a reliable method for assessing questions, the present study extends this research by advancing the methodology. Reliable methods now exist for distinguishing both questions and answers at levels 1 and 2 from levels 3 and 4. While more work is needed on reliably assessing finer differences between answer levels, the present study in combination with the Crone-Todd et al. (2000) study represents an advance in higher-level assessment. This advance in methodology has advantages for both researchers and teachers.

As researchers, we are in a better position to study how answer levels, and therefore thinking levels, may be increased. For instance, it has been suggested (Solman & Rosen, 1986) that having too many higher-level questions in a course may lead to higher dropout rates and lower average scores. If this is the case, then thinking level may be adversely affected by too many higher-level questions. On the other hand, it may be possible to systematically incorporate higher- and lower-level questions in a way that facilitates higher-level thinking. Hence, one empirical issue to address is the optimal ratio of higher- to lower-level questions in a given course as the course progresses. It may be, for example, that if this ratio is too high, or if it increases too rapidly as a course progresses, student performance decreases.

There are other research questions that can be answered with this assessment methodology used as a base. For example, the methodology will permit us to identify early in a course which individuals are having difficulty answering the higher-level questions, even though they may be mastering the lower-level questions (and thus showing appropriate motivation). Once such individuals are identified, we would then be able to study what types of remedial procedures would be required to help raise these individuals' thinking levels to the point at which they could answer more of the higher-level questions.

The methodology incorporated in this study would also permit us to study how the addition of various supports (e.g., class discussions, specific teaching of the levels, feedback on the levels) might generate higher-level thinking in students' answers to the questions. This might be extended to helping students generate answers above the level of the question (e.g., Crone-Todd, 2001). The two assessment procedures go hand-in-hand because we need both in order to determine whether a student has answered a given question at, above, or below the level of the question.
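
The comparison described above can be sketched as a small Python function. This is a hypothetical illustration, not part of the published procedure: the function name and the returned labels are assumptions, and treating a level 0 answer (inappropriate terminology or otherwise incorrect) as its own outcome is a choice made here for clarity.

```python
def compare_to_question(question_level, answer_level):
    """Classify an answer component relative to its question component.

    question_level: 1-6, from the question-level assessment.
    answer_level:   0-6, from the answer-level assessment, where 0 means
                    inappropriate terminology or an otherwise incorrect answer.
    """
    if not 1 <= question_level <= 6:
        raise ValueError("question_level must be between 1 and 6")
    if not 0 <= answer_level <= 6:
        raise ValueError("answer_level must be between 0 and 6")
    if answer_level == 0:
        return "incorrect"  # assumed label; no thinking level is assigned
    if answer_level > question_level:
        return "above"
    if answer_level < question_level:
        return "below"
    return "at"


# Hypothetical (question level, answer level) pairs for four components.
for q, a in [(2, 2), (2, 3), (4, 2), (3, 0)]:
    print(q, a, compare_to_question(q, a))
```

Both inputs are needed for every component, which is why neither assessment procedure is sufficient on its own.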

The importance of this study for teachers is grounded in the research. If higher education involves learning to think at the higher levels identified in the modified Bloom's taxonomy, then the assessment methodology described here provides a direction for helping institutions of higher education fulfill their purpose. Learning systems, such as CAPSI, that are designed with links to theory and empirically validated research (i.e., "grounded designs"; Hannafin, Hannafin, Land, & Oliver, 1997) provide one approach for achieving the aims of educators who wish to help their students develop higher-order thinking skills.

References


Bloom, B.S. (1956). Taxonomy of educational objectives: Cognitive and affective domains. New York: David McKay.

Calder, J.R. (1983). In the cells of the "Bloom Taxonomy". Journal of Curriculum Studies, 15, 291-302.

Carnine, D. (1991). Curricular interventions for teaching higher order thinking to all students: Introduction to the special series. Journal of Learning Disabilities, 24 (5), 261-269.

Crone-Todd, D.E. (in progress). Increasing the levels at which undergraduate students answer questions in a computer-aided personalized system of instruction course. Unpublished doctoral dissertation, University of Manitoba, Winnipeg, Manitoba, Canada.

Crone-Todd, D.E., Pear, J.J., & Read, C.N. (2000). Operational definitions for higher-order thinking objectives at the post-secondary level. Academic Exchange Quarterly, 4, 99 - 106. [There is an error in Figure 1 of this article; for the correct figure, see]

Gierl, M. J. (1997). Comparing cognitive representations of test developers and students on a mathematics test with Bloom's taxonomy. The Journal of Educational Research, 91, 26-32.

Hannafin, M.J., Hannafin, K.M., Land, S.M., & Oliver, K. (1997). Grounded practice and the design of constructivist learning environments. Educational Technology Research and Development, 45, 101-117.

Hohn, R.L., Gallagher, T., & Byrne, M. (1990). Instructor-supplied notes and higher-order thinking. Journal of Instructional Psychology, 17 (2), 71-74.

Kazdin, A.E. (1982). Single-case research designs: Methods for clinical and applied settings. New York: Oxford University Press.

Kottke, J. L., & Schuster, D. H. (1990). Developing tests for measuring Bloom's learning outcomes. Psychological Reports, 66, 27-32.

Landis, J., & Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.

Mayer, R., & Goodchild, F. (1990). The critical thinker. Santa Barbara, CA: Wm. C. Brown Publishers.

Newman, F.M. (1991a). Promoting higher order thinking in social studies: Overview of a study of 16 high school departments. Theory and Research in Social Education, 19, 324-340.

Newman, F.M. (1991b). Classroom thoughtfulness and students' higher order thinking: Common indicators and diverse social studies courses. Theory and Research in Social Education, 19, 410-433.

Paul, R.W. (1985). Bloom's taxonomy and critical thinking instruction. Educational Leadership, 42(8), 36-39.

Paul, R.W., & Heaslip, P. (1995). Critical thinking and intuitive nursing practice. Journal of Advanced Nursing, 22, 40-47.

Pear, J. J., & Crone-Todd, D. E. (1999). Personalized system of instruction in cyberspace. Journal of Applied Behavior Analysis, 32, 205-209.

Roberts, N. (1976). Further verification of Bloom's taxonomy. Journal of Experimental Education, 45, 16-19.

Seddon, G. (1978). The properties of Bloom's taxonomy of educational objectives for the cognitive domain. Review of Educational Research, 48, 303-323.

Seddon, G. M., Chokotho, N. C., & Merritt, R. (1981). The identification of radex properties in objective test items. Journal of Educational Measurement, 18, 155-170.

Semb, G., & Spencer, R. (1976). Beyond the level of recall: An analysis of complex educational tasks in college and university instruction. In L.E. Fraley & E.A. Vargas (Eds.), Behavior Research and Technology in College and University Instruction (pp. 115-126). Gainesville, FL: Department of Psychology, University of Florida.

Solman, R., & Rosen, G. (1986). Bloom's six cognitive levels represent two levels of performance. Educational Psychology, 6, 243-263.

Williams, R. L. (1999). Operational definitions and assessment of higher-order cognitive constructs. Educational Psychology Review, 411-427.

J.J. Pear is a Professor of Psychology, conducting research in basic and applied behavior analysis. He has co-authored a popular textbook on behavior modification, authored a textbook on learning, and has been awarded the Fred S. Keller Behavioral Education Award by the American Psychological Association (2002) for distinguished contributions to education. D.E. Crone-Todd is a doctoral student in psychology, completing her dissertation on higher-order thinking. In addition, she has published on the use of computer-aided personalized system of instruction. K.M. Wirth is a 4th year student in psychology, and has presented reviews of personalized system of instruction at conferences. H.D. Simister recently graduated from the psychology program, and will be pursuing a graduate program in psychology.
COPYRIGHT 2001 Rapid Intellect Group, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2001, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Author:Simister, Heather D.
Publication:Academic Exchange Quarterly
Geographic Code:1USA
Date:Dec 22, 2001