Building inter-rater reliability into performance assessment.

Abstract

The purpose of this paper is to examine issues ranging from strategic aspects of assessment planning to the more fundamental aspects of assessment such as measuring student evaluation processes. Assessment activities in the capstone course of a college of business were used as the focus of a study on inter-rater reliability among instructors and as a foundation for a discussion on assessment processes. The details of establishing inter-rater reliability for one instrument are provided, and conclusions are drawn concerning the value and limitations of this type of performance assessment.

Introduction

The main driver behind outcomes assessment in schools of business has been the accountability movement (Apostolou, 1999; Burke & Modarresi, 1999).
Accreditation agencies, such as AACSB, are consequently held responsible for ensuring that business schools engage in assessment. Much effort has gone into determining exactly what business school administrators must do with assessment in order to get accredited (Kerby & Weber, 2000; Sinning & Dykxhoorn, 2001).

Accountability v. Assessment

While accountability has provided a motivation behind assessment, it is important to distinguish the two concepts. Accountability refers to indicators used as a basis for resource allocation across divisions or for individual incentives. Assessment refers to indicators used for continuous improvement. Accountability indicators provide a means for outsiders to evaluate the quality of an institution or program, while assessment indicators provide a means for insiders to strengthen quality within the institution. Outcomes assessment has been lauded by some as a means to simultaneously provide a basis for both program improvement and accountability (McCoy & Chamberlain, 1994).
[...] confuse assessment processes with accountability indicators. To be effective, each requires different tools (Wellman, 2000). Accreditation bodies are looking for evidence that assessment data are being systematically collected and, more importantly, being widely disseminated and used effectively (Kimmell, Marquette & Olsen, 1998). Alternatives, such as seeking quantitative criteria from all universities on a series of standardized tests, would probably do little to improve student learning and might instead actually impair student preparation for future careers (Henninger, 1994). Apostolou's (1999) thorough review of the assessment literature relevant to accounting programs suggests that there is no one best assessment measure and that multiple measures of outcomes should be used to obtain feedback on programs. Most research on the validity and reliability of assessment measures has focused on general education outcomes.
Very little has been published on the successful development of instruments to measure learning in business programs specifically.

Method

Faculty at a large public university in Southern California agreed to participate in this study to evaluate inter-rater reliability of a capstone course assessment instrument. Data were collected from three faculty members, two of whom were instructors in a capstone business course and one of whom was not an instructor of the course. One faculty-rater, a course instructor, teaches accounting, and two faculty-raters, one course instructor and one outside rater, teach primarily in management. The capstone course emphasizes integrating business concepts from each of the business functional areas covered in the shared core required of all business administration majors and applying these concepts in a simulated business game. Students are organized into cross-functional teams formed at the beginning of each course on the basis of their declared areas of emphasis in the business administration program. Student teams develop a two-year business plan mid-way through the sixteen quarterly decisions of the simulation.
The business plan was evaluated based on the quality of writing, the depth of analysis, and the effectiveness with which software and quantitative techniques were used in developing sales forecasts, production schedules, and pro forma financial statements. The faculty-raters evaluated a total of 28 projects completed by student teams from sections of the capstone business course taught in one of several terms from Fall 1999 through Spring 2000. Criteria (i.e., writing, analysis, forecasting, production schedules, and financial statements) were evaluated by faculty-raters on a 4-point scale where 4 indicated complete and excellent work and 1 indicated serious problems or missed requirements.

Data Analysis

After evaluating each project, faculty-raters recorded scores on grading sheets. The grading sheets were collected and analyzed for inter-rater reliability across all three faculty-raters and for the two instructor-raters. Inter-rater reliability was calculated by counting occurrences of exact agreement of scores among raters. The number of agreements was then divided by the sum of agreements and disagreements (Bijou, Peterson, & Ault, 1968). Exact agreements were considered as those in which raters provided the same score for an evaluation criterion. For example, if all three raters evaluated a team's writing with a score of 3, then 3 exact agreements would be counted.
If, however, two raters gave the team a score of 3, but one rater scored the team's writing with a 2, then 2 exact agreements would be counted, as the scores 3, 3, and 2 indicate only two raters were in agreement. An additional comparison was made by counting the number of agreements among raters when raters were within 1 point of the highest score. For example, if one rater evaluated a team's writing with a score of 3, but two raters gave the team a score of 2, then 3 agreements would be counted, as the scores 2, 2, and 3 are within 1 point of each other and the highest score. If, however, three raters gave respective scores of 2, 3, and 4, this would count as 2 agreements, as the score of 4 is the highest score, and the scores 3 and 4 are within 1 point of each other. The score of 2 is not within 1 point of the highest score and would not be counted as an agreement. After inter-rater reliability was calculated, the frequency of reliability scores was counted.

Results and Findings

Reviewing the data in the broadest terms, that is, looking across 140 possible data points (i.e., 5 factors comprising the evaluation criteria for 28 student projects), the two course instructors scored exact agreements with each other 72 times, or 51 percent. When counting scores from all faculty-raters, there were 36 scores of exact agreement, or 26 percent. Adding an outside faculty-rater decreased the frequency of 100 percent agreement considerably.

In the rating scenario in which agreement was considered as any score within 1 point of the highest, the frequency of perfect reliability scores increased. There were 129 scores of agreement within 1 point of the highest score between the two instructor-raters (92 percent) and 114 matches among all three faculty-raters (82 percent). Again, there was a drop in frequency of matches from 92 to 82 percent upon the addition of an outside faculty member.
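The counting rules described under Data Analysis can be illustrated with a short sketch. This is only a minimal Python illustration under stated assumptions, not the procedure actually used in the study: the function names are hypothetical, and the handling of cases the text does not cover (for example, three raters giving three different scores, counted here as zero agreements) is an assumption.

```python
from collections import Counter


def exact_agreements(scores):
    """Count raters who share the most common score for one criterion.

    Assumption: if no two raters gave the same score, zero agreements
    are counted (the text does not state this case explicitly).
    """
    largest_group = Counter(scores).most_common(1)[0][1]
    return largest_group if largest_group > 1 else 0


def near_agreements(scores, tolerance=1):
    """Count raters whose score is within `tolerance` of the highest score.

    Assumption: a lone highest score with no other score inside the
    tolerance counts as zero agreements.
    """
    highest = max(scores)
    count = sum(1 for s in scores if highest - s <= tolerance)
    return count if count > 1 else 0


def reliability(agreements, n_raters):
    """Agreements divided by the sum of agreements and disagreements."""
    return agreements / n_raters


def perfect_agreement_rate(ratings, count_fn):
    """Share of rated items on which every rater is counted as agreeing."""
    perfect = sum(1 for scores in ratings if count_fn(scores) == len(scores))
    return perfect / len(ratings)


if __name__ == "__main__":
    # The worked examples from the text: scores for one team's writing.
    for scores in ([3, 3, 3], [3, 3, 2], [2, 2, 3], [2, 3, 4]):
        print(scores, exact_agreements(scores), near_agreements(scores))
```

Under this reading, the percentages reported in the results correspond to `perfect_agreement_rate` applied to the 140 criterion-by-project score sets, once with each counting rule.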
Of 140 possible scores, the outside faculty-rater and instructor-rater, both with backgrounds in Management, agreed with each other 60 times, or 43 percent. When counting agreements as any score within 1 point of the highest, there were 128 agreements, or 91 percent. Comparing the scores of the Accounting instructor-rater with those of the outside Management faculty-rater, there was exact agreement 67 out of 140 times, or 48 percent. When considering agreement as scores within 1 point of each other, the outside Management faculty-rater and the Accounting instructor-rater agreed 129 out of 140 times, or 92 percent.

Overall, the frequency of agreement is quite similar between sets of raters regardless of discipline or experience teaching the course. The two instructor-raters, one in Accounting and one in Management, were in exact agreement 51 percent of the time, while the two Management raters were in exact agreement 43 percent of the time. When looking at agreement of raters' scores within 1 point of each other, the two instructor-raters agreed 92 percent of the time while the two Management raters agreed 91 percent of the time.

In reviewing specific evaluation criteria such as writing, analysis, and forecasting, it was found that writing had the lowest frequency of exact agreement between the two instructor-raters, with agreement only 13 out of 28 times (i.e., one score per project on each of the 5 criteria, for 28 possible scores per criterion), or 46 percent. Evaluation of forecasting had the lowest frequency of perfect reliability scores for all three faculty-raters, with 5 of 28 scores resulting in exact agreement, or 18 percent. However, when using the less restrictive definition of agreement, that is, any score within 1 point of the highest, forecasting had the highest frequency of agreement, 100 percent.
Evaluation of production schedules had the highest frequency of agreement in both rating scenarios, with 16 of 28 possible scores in exact agreement between the two instructor-raters, or 57 percent, and 11 of 28 possible scores in exact agreement among all three faculty-raters, or 39 percent.

Discussion

Broad perspectives of the assessment process were delineated and a specific study of inter-rater reliability among faculty in regard to student projects was completed. In the following, we will examine questions which arise out of this research: What value was created and was this method of assessment cost-effective? What are the broader implications for developing reliable instruments for performance assessment? How do we create a culture of evidence with information feedback loops to faculty that ensures that the effort taken in data collection is not wasted? What are the implications for performance assessment and accountability?

Value

In terms of costs for this embedded assessment instrument, the amount of faculty time involved in filling out the rubrics was minimal. Additional time was needed to gather the data. In future, sampling could be used to examine results from one term per year, for example. In a capstone course with approximately six to eight sections of 30 students each per term, running four terms per year, this provides a sufficiently large [...]

Achieving Reliability

Once inter-rater reliability has been measured, the next question is what level of reliability is required for professors to have confidence in using this instrument, and how much effort is required to achieve it. In this study, student projects from a team-taught course were evaluated by two instructor-raters and a third faculty-rater who had not taught the course. The frequency of reliability scores of 100 percent, indicating exact matches among all three faculty members, was quite low. Focusing on the frequency of reliability scores of only the two instructor-raters improved the results, but only slightly. In fact, the frequency of perfect reliability scores between the two instructor-raters was close to the frequency of perfect reliability between the outside faculty-rater and either instructor-rater.

Frequency of reliability scores was compared for the two course instructors and among three faculty-raters. In addition, comparisons were made for each pair of raters. Of course, adding a third rater decreased the frequency of agreement, but in comparing frequency of agreement between pairs of raters there was little difference. That is, frequency of agreement was similar between the two instructor-raters, the outside Management faculty-rater and the Management instructor-rater, and the outside Management faculty-rater and the Accounting instructor-rater.

The similarity between pairs of raters can be looked at in a positive or negative light. On the positive side, these findings may indicate a reasonable similarity among faculty in evaluating student work. Faculty can be rotated
in and out of the capstone course without lowering the level of reliability currently attained. On the negative side, it appears that experience with a course, involvement in developing evaluation criteria, and background of the faculty-rater made little difference to reliability of scoring student work. Whether these findings are positive or negative has to do with the expectations faculty have for reliability.

Typically the rater is the focal point of measuring student learning, but the characteristics of the criteria being evaluated also made a difference in frequency of agreement. In this study, more well-defined criteria and explicit requirements, such as those required to develop production schedules, had a higher frequency of agreement among raters. Agreement among raters on other criteria such as forecasting and writing was much lower. The necessity of, and effort required for, improving rater reliability depend upon the nature of the work being evaluated.

Careful definition of evaluation criteria and training of all faculty-raters is the first step in developing an accurate and consistent measure of student learning. Ideally, training at the beginning of a course and refresher training meetings among instructor-raters would take place to ensure that evaluation criteria are interpreted similarly among raters. Discussing differences in perspective among faculty-raters during training sessions, so that background, area of expertise, and personal objectives for the course are voiced and a reasonable standard by which to measure student work is developed, would improve reliability.

Culture

Meaningful assessment involves continuous improvement and the development of a culture of evidence.
As mentioned by Sinning and Dykxhoorn (2001), faculty involvement in assessment is extremely important if it is to be successful in the long run. Because of accreditation pressures to have assessment processes in place, there is a tendency for deans or associate deans to take this task upon themselves for the sake of expediency. This can lead to a heavy reliance on standardized tests, which measure only a small fraction of what students learn in obtaining their degree. Performance assessment, which can measure a much broader array of skills, requires intensive faculty involvement. It needs to be institutionalized in academic governance processes, so that this work is done on a steady basis. In a learning organization, knowledge must be stored and effectively transmitted from one group of faculty working on assessment to the next, without gaps in time which require faculty to reinvent assessment work already done.

Accountability

The process of assessment can clearly provide incremental improvements in teaching effectiveness and curricular revision.
It is not clear, however, whether it is useful or destructive to use the actual outcomes for accountability purposes. In the case of the instrument used here, a decision would have to be made concerning what is a satisfactory outcome. Should we aim for an average score of 3 on this instrument? Or, recognizing the stresses of intra-team interactions and time pressure, should a 2.5 average be sufficient? Is it worth spending extra time on the particular activities we are measuring, at the expense of other learning activities, simply to push the score on this instrument higher? By defining the standards of what is acceptable, we control the definition of success or failure.

In determining standards for accountability purposes, it is important to make clear who is being held accountable. Are we holding faculty accountable for ensuring that students in their programs achieve high scores on a certain set of instruments and allocating rewards accordingly? Or are we holding students accountable for their own learning and refusing to graduate students who do not achieve certain scores on selected instruments?

The importance of achieving reliability increases the difficulty of pushing performance assessment to an accountability indicator. Apostolou (1999) has suggested that, given the increased public scrutiny of standardized exams for their potential bias or lack of fairness, using performance assessment for accountability purposes poses a threat of litigation.
Clearly, refusing to give degrees to students unable to attain certain scores on particular tests or exercises opens up a host of potential problems. Yet achieving ironclad reliability may entail rubrics that are overly narrow and that are expensive to change, thereby skewing education in inappropriate directions and stifling innovation. While it is feasible to develop reliable instruments for accountability purposes, it does not appear to be desirable or cost-effective to do so.

As more comprehensive measures of student learning outcomes are developed, invariably faculty members from varied backgrounds will be
requested to share in the evaluation of student work. As workload increases and more faculty become involved, there may be a tendency to forgo detailed training or refresher training on rating criteria and evaluation processes. This study indicates clearly that reliability cannot be taken for granted.

References

Apostolou, B.A. (1999). Outcomes assessment. Issues in Accounting Education, 14(1), 177-198.

Bijou, S.W., Peterson, R.F., & Ault, M.H. (1968). A method to integrate descriptive and experimental field studies at the level of data and empirical concepts. Journal of Applied Behavior Analysis, 1(2), 175-191. Reprinted in Methodological and Conceptual Issues in Applied Behavior Analysis 1968-1988 from the Journal of Applied Behavioral Analysis. (1989). The Society for the Experimental Analysis of Behavior, (Iwata, Bailey, Fuqua, Need, Page, & Reid, Eds.), 4, 83-99.

Burke, J.C. & S. Modarresi. (1999). Performance. Change, 31(6), 16-23.

Henninger, E.A. (1994). Outcomes assessment: The role of business school and program accrediting agencies. Journal of Education for Business, 69(4), 296-298.

Kerby, D. & S. Weber. (2000). Linking mission objectives to an assessment plan. Journal of Education for Business, 75(4), 202-209.

Kimmell, S.L., R.P. Marquette, & D.H. Olsen. (1998). Outcomes assessment programs: Historical perspective and state of the art. Issues in Accounting Education, 13(4), 851-868.

McCoy, J.P. & D. Chamberlain. (1994). The status and perceptions of university outcomes assessment in economics. Journal of Economic Education, 25(3), 358-365.

Sinning, K.E. & H.J. Dykxhoorn. (2001). Processes implemented for AACSB accounting accreditation and the degree of faculty involvement. Issues in Accounting Education, 16(2), 181-204.

Wellman, J.V. (2000). Accreditors have to see past 'learning outcomes'. The Chronicle of Higher Education, 9/22/00.

Laura L. Whitcomb, California State University, Los Angeles
Angela M. Young, California State University, Los Angeles

Laura Whitcomb received her Ph.D. from Indiana University, Bloomington. Her teaching and research interests include strategic management, international business, and cross-cultural organizational studies. Angela Young received her Ph.D. from Florida State University. Her teaching and research interests include human resource management, organizational relationships, and training and development.