According to Jim Gallagher: How to shoot oneself in the foot with program evaluation

One of the most pressing needs in gifted education, and in education in general, is how to conduct a responsible evaluation. We are increasingly being asked to be "accountable," which essentially means showing that we are carrying out our educational programs as we promised. Yet our attempts to follow through on this evaluation requirement often result in our damaging our own programs, or at least underestimating their effects. This paper attempts to identify several of the most common ways in which we may do damage to ourselves and our programs through unintended errors in the evaluation.

One distinguishing feature of program evaluation is that the data collected will presumably lead to a decision about a particular project or program. This purpose separates it from research studies, even though similar instruments and methodology may be used, since research is seeking knowledge with no aspirations for immediate relevance. The evaluation decision can be to continue to support the program if evidence of the program's effectiveness has been obtained.
On the other hand, if evidence is lacking for the program's effectiveness, calls may be heard for reducing or eliminating the program. The practical implications of such negative evaluations are obviously very important for the persons involved, so it is important that the evaluation be conducted in a valid fashion.

The newly established Institute of Education Sciences in the U.S. Department of Education has stressed the importance of evidence-based practice in education. Table 1 presents the goals of this research institute and the procedures it expects professionals applying for funds to follow. Establishing the efficacy of programs and practices is a high priority.

Summative vs. Formative Evaluation

One of the most important procedures in program evaluation is to distinguish between summative evaluation and formative evaluation. Each is important for a comprehensive evaluation, but the purposes, and even the tools, of these two procedures may be quite different.

In summative evaluation, the data is being collected for someone else's purposes, often a funding agency that wishes to discern whether or not its money is being spent wisely. Consequently, such evaluation will usually be done by a person or persons outside the program. Sometimes legislators call for summative evaluation of such programs as Head Start, the Javits program, or the Eisenhower Math program, either to defend them or to justify cutting or even abolishing them (Zigler & Styfco, 1994).
Understandably, summative evaluation makes program people very uncomfortable and wary. Their program's future and very existence may be at stake in such an evaluation. In summative evaluation the emphasis is on product, or output. If Head Start claims to prepare young students for school, the summative evaluation question is "Are those students better prepared for school in reality?" If high cognitive performance is the Javits program goal, then where are the products that illustrate a gain in students' ability to produce creative products or to solve difficult problems?

In contrast, formative evaluation is designed for the program staff's own purposes. Its basic goal is to find out how well the program is working so that procedures can be changed to increase efficiency. "How can I make this program better?" is the basic question. In formative evaluation, the emphasis is on processes. How am I carrying out the program? Are there inefficiencies that are reducing program performance? Can I cut costs to allow for more to be done with the same amount of money? While any change, even change for the better, is still a threatening element in its own right, change for the purpose of improving performance is a much better motivator than is a threat (Fullan, 2001).

In formative evaluation you may be concerned with how well teachers are being trained to deliver "problem-based learning (PBL)," for example. You may wish to interview the teachers to determine in what ways they are having difficulty in carrying out the PBL, or to determine what has been successful, so that the next application of the method will be more efficient.
If the project's goal is to improve student inquiry in science, then a key question in a formative evaluation would be: do the teachers involved know enough about science to be able to handle a curriculum focusing on inquiry? Are the teachers skilled at asking questions, or do they need workshops, tutoring, etc. to improve their question-asking mastery? In this formative process, data is being used as a positive force, not the potential hammer of summative evaluation (Coleman, 2003).

Furthermore, formative evaluation is more likely to reach its goal than summative evaluation. Consider what happens when a summative evaluation yields negative information, as has happened with Head Start and with programs for children with disabilities (Westinghouse Learning Corporation, 1969), yet the parents are close to fanatical in their belief that the programs are helping their children (Curice, 2001; Zigler & Styfco, 1994). How many political leaders will step forward and demand the abolition of a particular program in the face of angry and outraged parents? Also, as we shall see, there may be good reasons for questioning the negative findings of such summative evaluations.

Two Steps to Evaluation

A similar common problem emerges when we try to shortcut what is a two-step evaluation process into one step. We certainly want to include student performance in any evaluation, since student performance is the acid test as to whether a new or experimental teaching unit is having an effect. Many of the innovators will provide special instruction in the new methods to a group of teachers who will be the focus of our investigation. Let us say that it is problem-based learning, for example. In the evaluation we will look at the student performance on problem-based-learning samples to see if the students have expanded their skills in the program.

But we have overlooked a key step. We must first demonstrate that the teachers have learned the new techniques of problem-based learning following their instruction in the new method! If the teachers cannot demonstrate that they can use the special method, and we know there are almost always some teachers who, for a variety of reasons, will not master the new method, then what is the use of continuing further with collecting data from these teachers?
How can we expect treatment fidelity from those teachers when they haven't even mastered the new content or procedures themselves? Treatment fidelity is a term used to answer the question "Did the teacher present the content and process that she said she was going to teach?" Sometimes teachers will confidently assert that they are teaching problem-based learning in their classrooms. Evaluation is designed to demonstrate that they are, in fact, teaching problem-based learning and doing it in an accepted manner.

Consider the effect on the program evaluation if the teacher is wrong. You will be evaluating student performance in what is, in essence, a nontreatment or placebo. Obviously the innovative program won't achieve its goals in such circumstances (Gallagher, Cook, & Shoffner, 2003).

To think about this nontreatment situation in other settings, consider that some eminent scientist has developed a new drug, but the nurse or doctor administering it gives less than the dosage that is recommended and sometimes forgets to administer the drug entirely. What will be your opinion on the effectiveness of the new drug? Probably not too positive even if, in reality, the drug might be very effective. Without a close examination of the treatment process we have not demonstrated treatment fidelity, or the appropriate administration of the treatment. Instead, we may be responsible for a "nontherapeutic dosage" (Gallagher, 2000). We may be underestimating the effectiveness of many of our educational programs because we have ignored treatment fidelity, that is, we have not tested whether the new curriculum or thinking skills are being administered properly.

Well, what can we do about it? There are several ways to reassure ourselves. One, we can use classroom observers who will rate the teachers on the key elements of the experimental treatment. If question asking is a key element in the new program, then the amount and kind of questions asked by the teacher can be documented. A certain standard can be applied to the kind and amount of questions, and the teachers who achieve that standard are now a part of the program to be assessed because they have demonstrated the treatment that we hoped for. On the other hand, teachers who do not meet the performance standard should not be a part of the evaluation of the program, because they have performed less than satisfactorily, even as the nurse or doctor did in misapplying the new drug. Alternatively, the teachers who failed to meet the standard might be assigned to some additional training until they meet the criterion of the program to be evaluated.
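
The screening step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Classroom` record, the 0-to-1 fidelity rating, the 0.7 cut-off, and all of the numbers are hypothetical and do not come from any of the studies cited. The sketch shows how classrooms can be split by an observer standard, and how much the apparent gain shrinks when classrooms that never delivered the treatment are averaged in.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Classroom:
    teacher: str
    fidelity_rating: float  # observer rating of key treatment elements (hypothetical 0-1 scale)
    mean_gain: float        # average student gain over the comparison group (made-up values)


FIDELITY_STANDARD = 0.7  # assumed cut-off; a real project would set its own criterion


def split_by_fidelity(classrooms, standard=FIDELITY_STANDARD):
    """Return (classrooms meeting the standard, classrooms needing more training)."""
    meets = [c for c in classrooms if c.fidelity_rating >= standard]
    below = [c for c in classrooms if c.fidelity_rating < standard]
    return meets, below


classrooms = [
    Classroom("A", 0.90, 6.0),
    Classroom("B", 0.80, 5.0),
    Classroom("C", 0.75, 4.0),
    Classroom("D", 0.40, 0.0),  # treatment not actually delivered
    Classroom("E", 0.30, 0.0),  # treatment not actually delivered
]

meets, below = split_by_fidelity(classrooms)

# Average gain if every trained teacher's classroom is counted.
gain_all = mean(c.mean_gain for c in classrooms)
# Average gain in classrooms where observers confirmed the treatment.
gain_faithful = mean(c.mean_gain for c in meets)

print("evaluate:", [c.teacher for c in meets])
print("retrain: ", [c.teacher for c in below])
print("gain, all classrooms:     ", gain_all)
print("gain, faithful classrooms:", gain_faithful)
```

With these invented numbers, counting all five classrooms reports an average gain of 3.0 points, while the three classrooms that met the standard average 5.0 points. The program itself is unchanged; only the decision about which classrooms count as "treated" differs.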
We cannot stress too strongly that when we include in the program evaluation the students in classrooms where the teachers are not teaching appropriately, whether it be "problem-based learning," or the "new language arts," or "creative imagery," we are sapping the apparent effectiveness of the program.

Even if we demonstrate, through the use of classroom observation or videotape analysis of their classroom activities, that the teachers know how to present problem-based learning, we are still not done. The further question is, "Do they use the new instructional method in their classroom?" We should not assume that because teachers know the new content or methods they will use them in their own classrooms. There can be a number of reasons why the teachers might not use the experimental program. It might be too difficult for them to use with a particular class. Perhaps the principal or supervisor in their school is against the use of the new content or procedures. Classroom videotapes can be one way of reassuring us that the "treatment" is in place.

Treatment Drift

There always is the problem of treatment drift. Over time some teachers will revert to the style of teaching with which they are most comfortable and abandon the new methods, even while they remain convinced that they are still using the special method.
So, once again, some demonstration of the new content or procedures in the classroom is necessary before that class becomes a part of the program evaluation. If we routinely include all of the classrooms of the teachers who attended the training workshops, we run the serious risk of underestimating our programs' effect by including poor transmission of the experimental program.

Performance Evidence

Faced with a group of Doubting Thomases in your school or in your community or state, where are you going to get the evidence to put their concerns to rest? We could ask the students, of course, how they liked the new program, and such information can be a part of a total evaluation, but the more hardheaded skeptics will demand objective evidence of gains. When faced with such a question we often thrash around to find some measuring instrument that might satisfy the critics.
One of the more common answers has been to use standardized measures such as the Stanford Achievement Test or the Iowa Test of Basic Skills (ITBS), or even the statewide tests already in use in your community. To the casual observer, this seems to be just right. They are respected instruments that produce numbers that should convince the critics of the utility of the program. But we are likely to be confounded by the results that often appear on these standardized tests. Our treatment or experimental group shows no gain over a comparison group or class that is receiving the usual curriculum, even if the teachers or students are raving about the virtues of the new method.

What happened? Let us say that we have a new way of teaching inquiry in science, so we will look at the results on the Science subtest of the ITBS. If there is no measurable gain in the students' scores on the subtest, does this mean that the students haven't learned anything about inquiry in science? Maybe it means that what you were teaching the students isn't covered by that Science subtest!
If this disjunction in coverage is true, you really have no reason to believe that the students would improve on the items in a science test that were far distant from your special program (Coleman, 2003). The use of subtests of standard achievement tests whose content has little to do with your special curriculum program has been one of the most common ways to underestimate your apparent effectiveness.

Well, if you can't use these standard achievement tests, what can you do to show people what the students have learned? One approach is to design tasks that require the application of the skills you have taught the students. If you have taught scientific inquiry, you might use a technique used by the University of Connecticut known as the Diet Coke test. The question the student faces is "How do you design procedures that will demonstrate whether bees have a special liking for Diet Coke?" If the students have mastered some of the skills essential to the scientific method, they should be able to present a reasonable set of procedures whereby you can test the proposition that bees like Diet Coke.
Such student productions may require the use of judges to evaluate the answers given by the students, and some training of the judges themselves on what is important for the students to include, but you will have a convincing demonstration that your students mastered the key elements of your science instruction.

What Determines Significance of Findings?

To answer this question we must take a short trip into the world of statistics, since these are the tools used to help us determine whether our findings have significance or not. Over the years there has been a standard of statistical significance that has been accepted by most social scientists without much thought or consideration. The .05 level of statistical significance has been the standard for determining whether a result can be considered real or not. Essentially this means that, if there were in truth no difference between the groups, a result as large as the one observed would turn up by chance fewer than 5 times out of a hundred. Setting the boundary line at the .05 level means that scientists accept being wrong five times out of a hundred, in the sense of declaring a difference when there is none, as an acceptable standard. Such a standard was set many years ago by Sir Ronald Fisher (1954) in his work on agricultural research and has survived to this day.

Such a standard recognizes that there are two types of errors that can be made in a statistical analysis:

1. Type I error: You find a statistical difference between groups when, in fact, there isn't one.
2. Type II error: You find no difference between groups when, in truth, there is one.

Setting a stringent standard (.05 or .01) accepts the notion that a Type I error is the more important of these two errors to consider. But is that really true in program evaluation?
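
To put rough numbers on the two error types listed above, the following Python sketch computes the chance of a Type II error for a simple two-group comparison. It rests on simplifying assumptions that are not in the original article: a two-sided z-test with a known standard deviation, equal group sizes, and a true difference expressed in standard deviation units. The function name `type_ii_rate` and the sample sizes are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_rate(true_difference_sd, n_per_group, alpha=0.05):
    """Chance of missing a real difference with a two-sided, two-group z-test.

    true_difference_sd: true group difference in standard deviation units.
    n_per_group: number of students in each group (equal sizes assumed).
    alpha: the Type I error rate accepted in advance (.05 by convention).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                  # cut-off for declaring significance
    shift = true_difference_sd * sqrt(n_per_group / 2)  # expected size of the z statistic
    power = (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)
    return 1 - power


# A modest true effect (0.4 SD) studied with 20 students per group.
small_study = type_ii_rate(0.4, 20)
# The same effect with 100 students per group.
large_study = type_ii_rate(0.4, 100)
# The same small study with a more lenient .10 standard.
lenient = type_ii_rate(0.4, 20, alpha=0.10)

print(f"Type II rate, n=20,  alpha=.05: {small_study:.2f}")
print(f"Type II rate, n=100, alpha=.05: {large_study:.2f}")
print(f"Type II rate, n=20,  alpha=.10: {lenient:.2f}")
```

Under these assumptions, a small study of a modest but real effect misses it roughly three times out of four while holding the Type I risk at five in a hundred. Tightening the standard guards against one error only by making the other more likely.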
If we have a Type II error, that means we have missed some possibly promising intervention or innovation. Once it has been officially declared in the journals that there is nothing significant in this intervention program, the tendency is for investigators to abandon that line of investigation, because who wants to be testing a program that has been found to be nonsignificant? The reason for pursuing this point is that there is good reason to believe that many Type II errors are being made in educational research and educational evaluation. This would mean that we are throwing away interventions and programs that need to be considered.

The standard indicator of significance is the F test, which is determined by the formula:

F = Between variance / Within variance

If we increase the within or error variance we obviously decrease the chances of getting a large F ratio. And there are lots of ways of increasing the error variance. We already have noted some. If the teacher is not teaching the proposed intervention, that increases the error variance.
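
A small numerical sketch may make the F ratio concrete before the remaining sources of error variance are considered. The Python below computes a one-way F ratio by hand for two invented sets of scores that have the same group means but different amounts of spread within each group. The data and the function name `f_ratio` are hypothetical, chosen only to show the arithmetic.

```python
from statistics import mean


def f_ratio(groups):
    """One-way F ratio: between-groups variance divided by within-groups variance."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k = len(groups)      # number of groups
    n = len(all_scores)  # total number of students

    # Between variance: how far each group mean sits from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within (error) variance: how far each score sits from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Same group means (15 vs. 12) in both scenarios; only the spread differs.
tight_treatment, tight_comparison = [14, 15, 16, 15, 15], [11, 12, 13, 12, 12]
noisy_treatment, noisy_comparison = [10, 15, 20, 15, 15], [7, 12, 17, 12, 12]

f_tight = f_ratio([tight_treatment, tight_comparison])
f_noisy = f_ratio([noisy_treatment, noisy_comparison])

print(f"F with little error variance: {f_tight:.1f}")
print(f"F with much error variance:   {f_noisy:.1f}")
```

With these made-up scores the three-point difference between the groups yields F = 45.0 when the scores cluster tightly and F = 1.8 when they are widely scattered. For two groups of five students the .05 critical value of F is roughly 5.32, so the same difference in means is declared significant in the first case and nonsignificant in the second.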
If the measuring instruments are not measuring what you intended, that increases the error variance. If you have students who are just not responding to the teacher, you are increasing the error variance, and all of that contributes to obtaining an F ratio that does not reflect the true effects of the intervention. We have shot ourselves in the foot!

One answer to this dilemma is to find different ways of judging significance. One of these is the concept of effect size, which simply takes the difference between the group means, divides it by the standard deviation of the sample, and reports that number, allowing the investigator and the reader to make up their own minds about how meaningful such an effect size would be. Generally, .00 to .20 is considered no effect, .20 to .50 a modest effect, and .50 to .80 or higher a stronger effect. The effect size allows us to present all of the data instead of a significant-nonsignificant dichotomy.

Responsible evaluation of educational practice would mean establishing the true presence of the proposed treatment in the classrooms involved, using measuring instruments appropriate to the treatment involved, and using statistical approaches appropriate to the task. Currently less than one fourth of the articles testing new techniques even mention treatment fidelity, much less take it into account in their analyses (Gresham, MacMillan, Beck-Frankenberger, & Bocan, 2000). But we still use Professor Fisher's standards of statistical significance in exploratory research when they were meant for testing hypotheses or theoretical constructs, raising the chance of Type II error.

Short budgets and skeptical critics will make sure that demands for accountability will continue and increase. This article does not provide an excuse for dodging these responsibilities but rather encouragement for doing the necessary tasks effectively so as to present a true picture of the program being evaluated.

REFERENCES

Coleman, M. (2003). U-STARS report. Chapel Hill, NC: FPG Institute.

Curice, J. (2001). A fresh start for Head Start. Washington DC: The Brookings Institute.

Fisher, R. (1954). The design of experiments (6th ed.). New York: Hafner.

Fullan, M. (2001). The new meaning of educational change (3rd ed.). New York: Teachers College Press.

Gallagher, J. (2000). Unthinkable thoughts: Education of gifted students. Gifted Child Quarterly, 44, 5-12.

Gallagher, J., Cook, E., & Shoffner, M. (2003). Project insight II: Program evaluation. Chapel Hill, NC: FPG Institute.

Gresham, F., MacMillan, D., Beck-Frankenberger, M., & Bocan, K. (2000). Treatment integrity in learning disabilities intervention research: Do we really know that treatments are implemented? Learning Disabilities Research and Practice, 15, 198-205.

Westinghouse Learning Corporation. (1969). The impact of Head Start: An evaluation of the effects of Head Start on children's cognitive and affective development. Athens: Ohio University.

Zigler, E., & Styfco, S. (1994). Head Start ... in a constructive context. American Psychologist, 49, 127-132.

Table 1
Institute of Education Sciences Research Goals
IES Research Goals (2005)
Goal 1 Identify existing programs, practices, and policies that
may have an impact on student outcomes and the factors
that may mediate or moderate the effects of these programs,
practices, and policies
Goal 2 Develop programs, practices, and policies that are
potentially effective for improving outcomes
Goal 3 Establish the efficacy of fully developed programs,
practices, or policies that either have evidence of
potential efficacy or are widely used but have not been
rigorously evaluated
Goal 4 Provide evidence on the effectiveness of programs,
practices, and policies implemented at scale
Goal 5 Develop or validate data and measurement systems and tools
Note. From Institute of Education Sciences Request for Application
# NCER-06-08, June 27, 2005 (CFDA #84.305).
http://www.ed.gov/about/offices/list/ies/programs.html