Does accountability work?

Margaret Raymond and Eric Hanushek harshly criticize (see "High-Stakes Research," Feature, Summer 2003) our study of high-stakes testing policies. Before reporting the results from our study, the New York Times journalist obtained feedback from our study's external reviewers as well as from scholars and advocates who support high-stakes testing. Raymond and Hanushek ask the media, "Why not bring in some outside expertise to review such a report before heralding its arrival?" Actually, the media did.

Our study analyzed data across multiple indicators of academic achievement, not simply the National Assessment of Educational Progress (NAEP).
Yet Raymond and Hanushek's review looked only at the results from our analysis of NAEP, ignoring the effects that high-school graduation exams have had on college-admissions tests like the SAT and on participation and performance in Advanced Placement courses. They also ignored the fact that high-school graduation exams have resulted in increased dropout rates and an increasing use of the General Educational Development, or GED, tests as a substitute for a high-school diploma. Do these consistently negative effects matter when assessing high-stakes testing? We think so.

Raymond and Hanushek discard our findings on the basis that our methods were flawed. All of our findings were derived using one of the strongest designs in empirical research--the archival time-series analysis, a method that some claim is second in quality only to a true controlled experiment. An archival time-series analysis is simple enough that readers do not need a background in statistics to understand the underlying logic.
Readers need not get caught up in more-complicated analyses, such as significance testing, effect sizes, and even regression--statistical methods that Raymond and Hanushek criticize us for not using. However, many statistical textbooks recommend against using complicated statistical methods with archival time-series analyses.

Raymond and Hanushek throw the bias card into their critique, writing, "When a report is commissioned by an organization like the Great Lakes Center for Education Research and Practice, a Midwestern group sponsored by six state affiliates of the National Education Association, it would seem to call for a reasonable dose of skepticism." Not mentioned by Raymond and Hanushek is the fact that the research was originally funded by the Rockefeller Foundation and was published in a peer-reviewed scholarly journal six months before the consortium of teacher unions released this version of the study. The fact that teacher unions backed the study had no impact on its conclusions.

Raymond and Hanushek claim that the "accumulated literature" supports the conclusion that "student performance on the available measures, usually state tests, improves after accountability reforms are introduced." We believe that is patently false. We conducted a thorough review of the literature on high-stakes testing and found very few articles that would support such a proposition.

AUDREY AMREIN, DAVID BERLINER
Arizona State University
Tempe, Arizona

Margaret Raymond and Eric Hanushek respond: The assertion that "archival time-series analysis" is second in quality only to a true controlled experiment is ludicrous. Long ago, in their classic discussion of research design, Donald Campbell and Julian Stanley said that the time-series design "rarely has accepted status in the enumerations of available experimental designs in the social sciences." The obvious inability of simplistic historical approaches to establish "experimental isolation"--to rule out other factors that might have influenced the observed outcomes--opens up results from such analyses to significant interpretative questions.

Another problem with Amrein and Berliner's study is that they did not define an adequate comparison group. Instead, they compared student-performance trends in (some of) the states that adopted high-stakes testing with the average gain among states participating in NAEP--a trend that partially reflects the gains among high-stakes states, thereby corrupting the analysis. Amazingly, they make no attempt to defend this faulty approach. Instead, they trumpet the fact that they reached similar conclusions when they applied the same troubled analysis to other measures of student performance, such as SAT scores and drop-out rates.
When we applied Amrein and Berliner's own time-series methodology to the data (with an appropriate comparison group of states that have not adopted high-stakes testing), their conclusions were completely reversed. Yet Amrein and Berliner don't even address this. Their response ignores the egregious errors in implementation that we identified, namely the fact that they threw out a majority of the state observations, miscoded outcome information, and completely confused the sequence of test introduction and achievement measurement in several states.

We know of no legitimate statistical text that argues it is irrelevant to use tests of statistical significance to guard against random fluctuations in the data--in this case, scores on tests of student performance. Each administration of the NAEP involves a different group of students, a different set of test questions, and a different testing environment. Across test administrations, these differences can lead to random changes in scores that bear little relation to actual changes in students' knowledge and skills. The purpose of tests of statistical significance is to determine whether results reflect genuine changes in performance or simply random fluctuation.

That four of Amrein and Berliner's colleagues from education schools approved of their report says more about the quality of the standards for research at too many schools of education than about the validity of this particular study. The disregard for standard scientific principles reveals why so little has been learned about effective educational practices.