Randomly accountable: failing to account for natural fluctuations in test scores could undermine the very idea of holding schools accountable for their efforts--or lack thereof. (Research).
THE ACCOUNTABILITY DEBATE TENDS TO DEVOLVE INTO A BATTLE between the pro-testing and anti-testing crowds. But when it comes to the design of a school accountability system, the devil is truly in the details. A well-designed accountability plan may go a long way toward giving school personnel the kinds of signals they need to improve performance. However, a poorly designed scheme, which ignores the statistical properties of schools' average test scores, may do more harm than good.
The recent debate over the reauthorization of the federal Elementary and Secondary Education Act (ESEA) is a case in point. From his first days in office, President Bush promised to make education reform a centerpiece of his administration, using the reauthorization of the ESEA as an opportunity to give the state-led accountability movement a dramatic shove forward. Within six months of his taking office, both houses of Congress had passed bills that imposed new federal standards for the states' accountability efforts.
However, both bills were seriously flawed. They created standards that, over time, would have identified nearly every school in the nation as "low performing," forcing them to spend precious resources developing unnecessary school-improvement plans. A tide of paperwork would have crowded out time for learning. This almost turned the most significant federal foray into education policy in decades into an embarrassment. Changes were made by a House-Senate conference committee, so the law, as enacted, remedied the most glaring problems, but created others. The saga illustrates the difficulties of designing an effective accountability system.
The House and Senate Bills
At the heart of both bills was a detailed formula for determining when a school is making "adequate yearly progress." The consequences for schools that failed to meet their performance targets were progressively severe: after one year, districts would be required to offer public school choice to all the students in a school; after several years, districts would be required to replace school staff, convert the school into a public charter school, or hand the school over to a private contractor.
The problem is that such consequences place too much weight on single-year changes in test scores at the school level. Either bill would have required an increase in the proportion of students scoring above the proficient level in both math and reading, each and every year. However, test scores at the school level often fluctuate for reasons other than any underlying change in a school's performance. Such volatility arises from two sources. The first is variation due to differences in the groups of students being tested each year. Even if the students are being drawn from the same families and the same neighborhoods, the average performance of a school can fluctuate from year to year depending on the attitudes and abilities of the students in each cohort. The average elementary school contains only 68 students per grade level. With a sample this small, having five particularly bright students (or a few students with undiagnosed learning disabilities) in any one year can lead to large fluctuations in a school's test scores from one year to the next. The Department of Labor measures the monthly unemployment rate with a sample of nearly 60,000 households. Congress was proposing that the Department of Education measure the performance of the typical elementary school grade with a sample nearly 1/1000 the size.
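To give a rough sense of scale, the sampling error in a grade's proficiency rate can be approximated with the binomial standard error. The sketch below is only illustrative: it assumes students are independent draws and uses a hypothetical true proficiency rate of 50 percent. Only the figure of 68 students per grade comes from the text.

```python
import math


def proficiency_standard_error(true_rate: float, n_students: int) -> float:
    """Binomial standard error of a measured proficiency rate."""
    return math.sqrt(true_rate * (1.0 - true_rate) / n_students)


# 68 students per grade is from the article; the 50 percent rate is an assumption.
se_one_year = proficiency_standard_error(0.50, 68)

# The change between two independent cohorts is noisier by a factor of sqrt(2).
se_change = math.sqrt(2.0) * se_one_year

print(f"one-year standard error: {100 * se_one_year:.1f} percentage points")
print(f"year-to-year change:     {100 * se_change:.1f} percentage points")
```

Under these assumptions, chance alone produces a standard error of roughly 6 percentage points in a single year's proficiency rate, and a larger one for the change between two years.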
The second source of variation is one-time factors that lead to temporary fluctuations in test performance. Some of these factors are likely to be unrelated to the educational practices of a school. Examples include a dog barking on the day of the test, a severe flu season, favorable chemistry between a teacher and a particular group of students, or teacher turnover. Whatever the source of variation, single-year changes in test performance are very unreliable indicators of where a school is headed over the long term.
Consider the examples of North Carolina and Texas. Between 1994 and 1999, these states were the educational envy of the nation, raising proficiency rates in math and reading by 2 to 5 percentage points in the average year. However, the vast majority of schools in those states exhibited much less consistent progress: less than 2 percent of schools witnessed an increase in math and reading proficiency each and every year for those five years. Indeed, we estimate that between 98 and 100 percent of the elementary schools in North Carolina and Texas would have failed the House and Senate's initial definitions of adequate yearly progress at least once between 1994 and 1999.
Furthermore, both bills would have compounded the error by requiring annual increases in test scores for every racial subgroup in a school. The intent was admirable: to ensure that schools do not ignore minority children. But this provision was likely to have harmed its intended beneficiaries, by arbitrarily sanctioning schools that enroll students from several different racial or ethnic subgroups. Suppose that a school is solidly on the path to improvement, with a 70 percent chance of increasing the proficiency of any racial subgroup in a given year. A school with two racial subgroups in its student body would have a less than 50-50 chance of achieving an increase for both groups in a given year, because the year-to-year fluctuations are nearly independent for each racial group (therefore the probability is .70 times .70, or .49). The odds would be even longer for a school with three racial subgroups (.70 times .70 times .70, or .34). Since African-American and Latino students are more likely to attend schools with more than one racial group, they are more likely to see their education disrupted arbitrarily.
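The arithmetic in this example generalizes to any number of subgroups. The short sketch below simply raises the single-subgroup probability to the power of the number of subgroups. It relies on the same assumption the text makes, namely that fluctuations are independent across subgroups, and the function name is ours.

```python
def chance_all_subgroups_improve(p_single: float, n_subgroups: int) -> float:
    """Probability that every subgroup improves in a given year,
    assuming each subgroup's fluctuation is independent of the others."""
    return p_single ** n_subgroups


# 0.70 is the article's illustrative figure for a school on the path to improvement.
for k in (1, 2, 3, 4):
    print(k, "subgroup(s):", round(chance_all_subgroups_improve(0.70, k), 2))
```

With four subgroups the chance falls to about one in four, even though the school is improving by construction.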
A number of states have established accountability programs that track the performance of racial and ethnic subgroups separately. For example, California requires schools to meet certain growth targets for all "numerically significant" subgroups in a school. In order to be numerically significant, a group must either represent at least 15 percent of the student body and have more than 30 students or have more than 100 students regardless of what percentage they are. There are eight different groups that can qualify as numerically significant, depending on the number of students in each group in a school: African-American, American Indian (or Alaska Native), Asian, Filipino, Hispanic, Pacific Islander, white non-Hispanic, and "socioeconomically disadvantaged" students.
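The "numerically significant" rule as described above can be written as a short predicate. This is a sketch of the rule as paraphrased in this article, not the official regulatory definition, and the function and parameter names are hypothetical.

```python
def is_numerically_significant(group_size: int, school_enrollment: int) -> bool:
    """Apply the rule as described in the text: at least 15 percent of the
    student body AND more than 30 students, OR more than 100 students."""
    share = group_size / school_enrollment
    return (share >= 0.15 and group_size > 30) or group_size > 100


# Hypothetical schools, for illustration only.
print(is_numerically_significant(40, 200))    # 20 percent and more than 30 students
print(is_numerically_significant(25, 100))    # 25 percent but only 25 students
print(is_numerically_significant(101, 2000))  # about 5 percent but more than 100 students
```

Because the rule depends on both share and head count, two schools with the same racial mix can face different numbers of targets purely because of enrollment.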
We calculated the likelihood of a California school's winning a Governor's Performance Award by size and by the number of numerically significant subgroups. Among the smallest quintile of elementary schools, 47 percent of racially heterogeneous schools (those with four or more racial subgroups) won performance awards, versus 82 percent of similarly sized but racially homogeneous schools. This is particularly ironic given the fact that overall growth in performance was slightly higher for more integrated schools between 1999 and 2000. Moreover, the reason for a school's failure to win an award was often not that African-American and Latino students were lagging behind, but that white non-Hispanic students experienced slower growth in achievement: the average school with multiple racial subgroups witnessed larger gains for African-American and Latino students than for white students.
Separate achievement targets for racial and ethnic subgroups seem to be neither necessary nor especially effective in coaxing schools to focus on the performance of racial and ethnic minorities. In North Carolina, where there are no separate racial targets, African-American and Latino students experienced slightly higher improvements in proficiency than white non-Hispanic youth. Until this year, the rating system in Texas specified separate targets for racial subgroups that accounted for more than 10 percent of the student body (and more than 30 students). However, African-American and Latino students saw the same improvements in their test scores whether or not they attended schools with enough minority students to require a separate racial target.
The conference committee's compromise bill remedied some problems but created new ones. Earlier versions of the legislation rated schools according to their year-to-year improvements in the share of their students who achieve a certain proficiency level. Now schools will simply need to have a certain minimum percentage of their students (and of racial, ethnic, and socioeconomic subgroups within each school) deemed "proficient" each year. The initial minimum proficiency rate will be the greater of the proficiency rate of the 20th-percentile school or the average statewide proficiency rate of the lowest-scoring subgroup. In many states, the effective minimum will be the proficiency rate of the 20th-percentile school. However, more than 20 percent of all schools are likely to fail, because the threshold will apply not only to the school as a whole but also to all the subgroups in a school. If any racial, ethnic, or socioeconomic subgroup within a school fails, the school fails. As a result, a disproportionate number of the schools that enroll disadvantaged minority subgroups are likely to fail. The minimum proficiency rate that schools are required to meet will be raised gradually to 100 percent over the next 12 years.
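A minimal sketch of the pass/fail logic described in this paragraph may help show why subgroup thresholds push the failure rate above 20 percent. The function names and all the numbers are hypothetical, and treating "meets the minimum" as greater-than-or-equal is our assumption.

```python
def initial_minimum_rate(rate_of_20th_percentile_school: float,
                         statewide_rate_of_lowest_subgroup: float) -> float:
    """The greater of the two benchmark rates named in the text."""
    return max(rate_of_20th_percentile_school, statewide_rate_of_lowest_subgroup)


def school_passes(school_rate: float, subgroup_rates: dict, minimum_rate: float) -> bool:
    """A school passes only if the school as a whole AND every subgroup
    reaches the minimum rate (>= is an assumption)."""
    return school_rate >= minimum_rate and all(
        rate >= minimum_rate for rate in subgroup_rates.values()
    )


# Hypothetical state benchmarks and a hypothetical school.
minimum = initial_minimum_rate(0.40, 0.32)
overall = 0.55
subgroups = {"group_a": 0.62, "group_b": 0.38}

print(minimum)                                    # the 20th-percentile rate binds here
print(school_passes(overall, subgroups, minimum)) # one subgroup below the minimum
print(school_passes(overall, {}, minimum))        # same school with no separate subgroups
```

In this made-up case the school is comfortably above the threshold overall, yet it fails because one subgroup falls short.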
The main beneficiaries of the conference committee's changes will be suburban schools whose initial rates of proficiency are above the minimum, for they will no longer be penalized for temporary downward fluctuations in scores. The primary losers will be schools with initially low levels of proficiency for any subgroup. They will now be required to achieve a 10-percentage-point increase in proficiency for those subgroups to avoid the sanctions.
One flaw in the new formula is that it provides a strong incentive for states to lower the score students must exceed on their state tests in order to achieve "proficiency." The problem is that redefining proficiency simply because of the new federal requirements may create a credibility problem for the standards movement in a number of states.
Another problem not remedied in the final bill is that any federal definition of adequate yearly progress is likely to conflict with at least one of the state accountability plans that are already in place. There are three common variants in state accountability systems: some states, such as North Carolina, Arizona, and Tennessee, rate their schools with a measure of a school's value-added, using the growth in performance for a given group of students since the end of the preceding school year; other states, such as Texas and Illinois, rate their schools on the percentage of students scoring above certain thresholds; still other states, such as California, rate their schools based on their change in test scores from one year to the next. (A fourth category of states rates schools based on some mixture of value-added, levels, or changes.) Thus states that have been rewarding schools based on value-added measures or on changes in scores may be required to sanction the very schools they have been rewarding.
The next battleground is likely to be the issue of how many students it takes to create a separate racial, ethnic, or socioeconomic subgroup for accountability purposes. The legislation only requires that there be a sufficient number of students to yield statistically reliable information in order for the subgroup to count separately. The higher the threshold--say, requiring a subgroup to represent at least 15 percent of the student body, as opposed to 5 or 10 percent--the lower the failure rate will be for schools with small percentages of disadvantaged minority students.
State and federal officials ought to keep three basic principles in mind in designing test-based accountability systems:
+ Multiple years of data are required to measure improvements in performance reliably.
Children arrive at school with widely varying levels of preparation. Even a mediocre school can expect high test scores if its students come from wealthy backgrounds. As a result, policymakers in many states have attempted to level the playing field by focusing on improvements in test scores. However, improvements are very difficult to discern with a couple of years' worth of data, for two reasons. First, schools differ much less in the extent to which they improve test scores from year to year than they do in their beginning level of performance. Second, any measure of change in performance is likely to amplify the effect of sampling variation and other one-time factors that lead to fluctuations in performance. In other words, identifying improvements in performance as opposed to levels of performance in a single year is like looking for a smaller needle in a bigger haystack. If policymakers intend to measure and reward improvements in test performance at the school level, they will need to rely on multiple years of data.
Improvements can be measured in two basic ways: the improvement in performance for a given group of students from one year to the next (known as a value-added approach), or the improvement in performance across different groups of students (which we will refer to as cross-cohort changes). The improvement of scores for at least two contiguous grades (for example, grades 4 and 5) from one year to the next is a mixture of value-added changes (the 4th grade students who become 5th graders) and cross-cohort changes (the 4th grade students this year are a different group from the 4th grade students last year).
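The two definitions can be made concrete with a toy calculation. The mean scores below are invented for illustration and are not data from the study. The sketch also assumes that the same students move from 3rd to 4th grade with no mobility.

```python
# Hypothetical school mean scores keyed by (year, grade); not data from the study.
mean_score = {
    (2000, 3): 0.10,  # this cohort's 3rd grade score
    (2000, 4): 0.20,  # the previous cohort's 4th grade score
    (2001, 4): 0.35,  # the first cohort again, one year later, in 4th grade
}

# Value-added: the same group of students followed from one year to the next.
value_added = mean_score[(2001, 4)] - mean_score[(2000, 3)]

# Cross-cohort change: the same grade, but different students in each year.
cross_cohort_change = mean_score[(2001, 4)] - mean_score[(2000, 4)]

print(f"value-added:         {value_added:.2f}")
print(f"cross-cohort change: {cross_cohort_change:.2f}")
```

The two measures differ whenever successive cohorts start from different levels, which is why a change measured across cohorts mixes school performance with differences between groups of students.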
Kane and Staiger have analyzed the statistical properties of value-added and cross-cohort changes in test scores, using data from North Carolina (see Figure 1). (Full citations are available at www.educationnext.org.) We measured value-added with the average change in combined reading and math scores for a school's students between the end of 3rd grade and the end of 4th grade; we measured cross-cohort changes with the change in 4th grade scores from one year to the next. Among median-size schools in North Carolina, roughly half of the variance between schools in value-added in 4th grade math and reading was due to sampling variation and other one-time factors. For the smallest quintile of schools, the percentage of variance due to non-persistent factors was even higher (58 percent), while for the largest quintile of schools the percentage was somewhat lower (29 percent). Cross-cohort changes in mean test scores from one year to the next were measured even more unreliably. More than three-quarters of the variance in the annual change in mean test scores among the smallest quintile of schools was due to one-time, non-persistent factors. This percentage was only slightly smaller (73 percent) for the largest quintile of schools. Such volatility can wreak havoc when rewards and punishments are doled out on the basis of changes in test scores; school personnel are at risk of being punished or rewarded for results that are beyond their control.
Therefore, when policymakers seek to reward schools for improvements in test scores, they should do so based on multiple years rather than a single year of data. Moreover, while a simple arithmetic average of improvements over multiple years would be an improvement, there are even more efficient ways to pool information over time. For instance, building on work by McClellan and Staiger (1999) in rating hospital performance, we have proposed a simple technique for pooling information over time, which improves on a simple arithmetic mean by taking into account the amount of "signal" and "noise" in a given measure of performance. For instance, for large schools, for which we would expect less noise in any given year's measure, the proposed method would place more weight on more recent scores; for small schools, the method would place more equal weights on each of several years' worth of scores.
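The qualitative behavior described here, more weight on recent scores for large schools and more even weights for small ones, can be illustrated with a generic signal-and-noise filter. The sketch below is not the estimator proposed by the authors or by McClellan and Staiger. It is a textbook scalar Kalman filter for a slowly drifting "true" performance level, and the variance parameters, function name, and scores are all assumptions chosen for illustration.

```python
def pool_over_years(yearly_means, n_students, student_var=1.0, drift_var=0.01):
    """Pool a school's yearly mean scores with a scalar random-walk Kalman filter.

    Returns (pooled estimate for the latest year, weight on the latest score).
    student_var and drift_var are assumed values, not estimates from any data.
    """
    noise_var = student_var / n_students  # sampling noise shrinks with school size
    estimate = yearly_means[0]            # start from the first year's score
    estimate_var = noise_var
    gain = 1.0
    for observed in yearly_means[1:]:
        predicted_var = estimate_var + drift_var  # true performance may drift
        gain = predicted_var / (predicted_var + noise_var)
        estimate = estimate + gain * (observed - estimate)
        estimate_var = (1.0 - gain) * predicted_var
    return estimate, gain


scores = [0.10, 0.30, 0.05, 0.25]  # hypothetical yearly mean gains
small_estimate, small_gain = pool_over_years(scores, n_students=20)
large_estimate, large_gain = pool_over_years(scores, n_students=200)

print(f"small school: estimate {small_estimate:.3f}, weight on latest year {small_gain:.2f}")
print(f"large school: estimate {large_estimate:.3f}, weight on latest year {large_gain:.2f}")
```

For the same sequence of scores, the filter leans harder on the latest year when the school is large and spreads its weight over earlier years when the school is small. That is the pattern the text describes, though the exact weights depend entirely on the assumed variances.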
+ Incentives targeted at schools with test scores at either extreme--rewards for those with very high scores or sanctions for those with very low scores--affect primarily small schools and provide very weak incentives for large schools.
Each year since 1997, North Carolina has recognized the 25 elementary and middle schools in the state with the highest scores on the "growth composite," a measure reflecting the average gain in performance among students enrolled at a school. Winning schools receive financial awards.
One indicator of the volatility of test scores is the rarity of repeat winners. Between 1997 and 2001, 101 awards were handed out for schools ranking in the top 25. (One year, two schools tied at the cut-off.) These 101 awards were won by 90 different schools, with only 9 schools winning twice and only 1 school winning three times. No school was in the top 25 in all four years.
Of the 840 elementary schools we analyzed, 59 were among the top 25 at some point between 1997 and 2000 (the top 25 each year included middle schools, which we are not analyzing here). Among all the schools, the average gain score was not strongly related to school size, but the variance between schools was much larger for small schools. The variance in mean gain scores among schools in the smallest size decile was nearly five times the variance among the largest decile of schools (.048 compared with .011). As a result, schools in the smallest decile were much more likely to be among the top 25 schools at some point over the period: Even though their mean gains were not statistically different, the smallest schools were 23 times more likely to win a top-25 award than the largest schools.
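A back-of-the-envelope calculation shows how a gap in variance alone can produce a large gap in the chance of clearing an extreme cutoff. The sketch below uses the two variances reported above. It also assumes that school mean gains are normally distributed around a common mean and picks a hypothetical cutoff, so the resulting ratio is illustrative and is not a reproduction of the 23-to-1 figure.

```python
import math


def chance_above(threshold: float, mean: float, variance: float) -> float:
    """P(X > threshold) for a normally distributed school mean gain."""
    z = (threshold - mean) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


cutoff = 0.30  # hypothetical award cutoff, relative to a common mean of 0

# Variances of mean gain scores for the smallest and largest deciles, from the text.
p_small = chance_above(cutoff, 0.0, 0.048)
p_large = chance_above(cutoff, 0.0, 0.011)

print(f"smallest decile: {p_small:.4f}")
print(f"largest decile:  {p_large:.4f}")
print(f"ratio:           {p_small / p_large:.0f}")
```

By symmetry, the same calculation applies to a cutoff far below the mean, which is why small schools would also be expected to dominate the bottom of any ranking.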
For the very same reason, small schools are also overrepresented among those with extremely low test scores. Beginning in 1997, the state assigned assistance teams to intervene in schools that performed poorly on state tests and failed to meet their growth targets from the previous year. All but one of the elementary schools assigned an assistance team were among the smallest 40 percent of schools. (The smallest decile of schools would have received an even larger share of the assistance teams, except for a rule requiring that the proportion of students scoring below grade level be statistically significantly less than 50 percent.)
This year, the state of California distributed $100 million to teachers in schools that started with test scores in the bottom half of schools in 1999 and achieved large gains in performance between 1999 and 2000. A thousand teachers in schools with the largest improvements received $25,000 bonuses on average. Small schools in California were considerably more likely to win one of these awards than were larger schools. Given the importance of sampling variation and the fact that the largest bonuses were reserved for teachers in schools with the most extreme increases in test scores, this is hardly a surprise.
A threshold at either extreme is likely to be irrelevant for large schools, since they are unlikely to experience such large swings in performance regardless of their efforts. If the marginal costs of improving are also higher at large schools, the problem of weak incentives for large schools would only be compounded. A remedy would be to establish different thresholds for schools of different sizes. For example, grouping schools according to size (as is done in high-school sports) and giving awards to the top 5 percent in each size class would tend to even out the incentives (and disparities) between large and small schools. An alternative solution would be to establish thresholds closer to the middle of the test-score distribution, where the disparity for large and small schools is less extreme.
Helen Ladd and Charles Clotfelter in 1996 and David Grissmer et al. in 2000 reported evidence suggesting that schools respond to incentives by raising student performance. However, the long-term effects of incentives may be quite different from their short-term effects. Even if teachers are not sufficiently aware of the statistical forces at work to recognize their rather limited influence on test scores in the short run, they may well become aware of this over time. If their best efforts are rewarded with failure one year and less work the following year is rewarded with success, they are likely to form negative opinions regarding the value of their efforts.
+ When evaluating the impact of policies on changes in test scores over time, the natural fluctuations in test scores must be accounted for.
In 1997, North Carolina identified 15 elementary and middle schools with poor performance in both levels and gains and assigned "assistance teams" of three to five educators to work in these schools. The next year, all of the schools had improved enough to escape being designated "low performing." The state Department of Public Instruction ascribed the improvements to the efforts of the assistance teams; the assistance teams were lauded in Education Week's annual summary of the progress of school reform efforts in the states as well. However, given the amount of sampling variation and other non-persistent fluctuations in test-score levels and gains, schools with particularly low test scores in one year would be expected to bounce back in subsequent years.
The schools that were assigned assistance teams seem to have had a particularly bad year the year they received the sanction. In the year before assignment, such schools had an average 4th grade combined reading and math test score that was .67 student-level standard deviations below the average school. This reveals that they were weak schools the year before being sanctioned. However, in the year of assignment, their average score was even lower, .79 student-level standard deviations below the average school. The year after assignment, their scores seemed to rebound to .52 student standard deviations below the mean. One is likely to greatly overestimate the impact of assistance teams by taking the change in performance in the year after assignment.
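The three figures above make the size of the potential overstatement easy to see. The sketch below compares the naive one-year change with the change measured from the year before assignment. Neither number is an estimate of the assistance teams' true impact, and using the earlier year as a baseline is only a rough way to set aside the one-year dip.

```python
# Average scores of assisted schools, in student-level standard deviations
# relative to the average school (figures from the text).
year_before_assignment = -0.67
year_of_assignment = -0.79
year_after_assignment = -0.52

# Naive comparison: the change from the (unusually bad) assignment year.
naive_gain = year_after_assignment - year_of_assignment                     # about 0.27

# Rough alternative: the change from the year before assignment.
gain_from_earlier_baseline = year_after_assignment - year_before_assignment  # about 0.15

# The one-year dip that the naive comparison counts as improvement.
dip_in_assignment_year = year_before_assignment - year_of_assignment         # about 0.12

print(f"naive gain:                 {naive_gain:.2f}")
print(f"gain from earlier baseline: {gain_from_earlier_baseline:.2f}")
print(f"dip in assignment year:     {dip_in_assignment_year:.2f}")
```

Nearly half of the naive gain is simply the recovery from the dip in the assignment year, which would be expected even if the assistance teams had no effect.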
There are real differences in performance at the school level. And schools that are not improving should be identified for intervention. However, one year's worth of test-score data is insufficient to discern such differences in a meaningful way. States should be allowed to experiment until the nation finds the ideal way to determine which schools are making adequate yearly progress. We understand the impulse to create a system that requires specific remedies sooner rather than later. However, impatience is an insufficient excuse for bad education policy.
[FIGURE 10 OMITTED]
Thomas J. Kane is a professor of policy studies and economics at the School of Public Policy and Social Research at the University of California at Los Angeles. Douglas O. Staiger is an associate professor of economics at Dartmouth College. Jeffrey Geppert is a senior research analyst at the National Bureau of Economic Research in Palo Alto, California. A portion of this article is drawn from a chapter that will appear in the Brookings Papers on Education Policy 2002 (Brookings, 2002).