# DYNAMIC TREATMENT EFFECTS OF TEACHER'S AIDES IN AN EXPERIMENT WITH MULTIPLE RANDOMIZATIONS

I. INTRODUCTION

In 2012, there were over 1.2 million teacher's aides employed in the United States. While the primary function for most is to perform administrative and noninstructional tasks, many spend a considerable amount of time instructing students individually or in small groups. (1) Despite their prevalence and stated purpose, little research has been performed examining their effect on academic achievement. This is an important question because teacher's aides are often seen as a less-expensive alternative to small classes (Hough 1993) or as a stopgap measure to deal with teacher shortages (Stafford 1962).

Compared to an otherwise identical class without a teacher's aide, it can be argued that their use may increase academic achievement through a number of different channels: more time for teachers to spend on lesson plans or instruction since their administrative and other noninstructional burdens are decreased, more individual attention for students from an aide, a higher level of classroom discipline due to the presence of an additional adult, among others. Despite these and other potential methods through which test score benefits can accrue, most of the research examining the effectiveness of teacher's aides in raising test scores has found that they do not appreciably raise student achievement. An analysis of the Bay City, Michigan, Teacher's Aide experiment, wherein a number of teacher's aides were allocated across classrooms in the city's school districts, finds no benefit to academic outcomes when comparing the teacher's aide classes to classes of the same size without aides (Park 1956). Goralski and Kerl (1968) uncover minor benefits to reading readiness scores in kindergarten in an education experiment involving nine classrooms when one teacher's aide is assigned to a classroom; additional benefits beyond those obtained from having a single aide were not found when instead using five per classroom. Using data from Project STAR, an education experiment that took place over a number of years in Tennessee in the mid-1980s involving over 11,000 students and 80 schools, Krueger (1999) does not find any benefit of teacher's aides on average percentile test scores in his models which pool data over all grades. Using a hierarchical linear model on the same data, Gerber et al. (2001) find very few statistically significant differences in reading and word recognition scores between teacher's aide classes and normal classes when examining treatment duration, and none for mathematics.
The lack of strong findings has led some authors to not take into account the presence of teacher's aides when performing inference on education data (e.g., Chetty et al. 2011; Ding and Lehrer 2010; Jackson and Page 2013). (2)

Many educational jurisdictions cap class sizes at different numbers according to grade. Implicit in the support for this policy is the belief that the timing of inputs may matter for outcomes in horizons that lie beyond the immediate future. Taking these considerations into account suggests that looking only at the contemporaneous effect of inputs may paint an incomplete picture under some circumstances. For example, Krueger (1999) finds sizeable contemporaneous benefits to the average percentile of test scores across mathematics, reading, and word recognition when students are assigned to small classes. Using the same data but instead employing a dynamic model that accounts for past inputs, Ding and Lehrer (2010) conclude that in most cases, the benefits of small classes do not persist in the following year at a level that is statistically significant.

In this article, I examine the effects of teacher's aides on academic achievement, a subject which has received relatively little attention in the literature. I employ a modified version of the novel estimation strategy used in Ding and Lehrer (2010) that explicitly takes into account past inputs and includes interaction terms in treatment status. This specification allows me to estimate the short- and medium-run effects of teacher's aides as well as to investigate issues related to dosage and timing. For example, I can determine whether there are delayed benefits of exposure to a teacher's aide, and whether they persist in the longer term even after the student is no longer in a classroom with a teacher's aide; moreover, these benefits may also depend on when treatment was received. I conduct the analysis using a subset of data from Project STAR wherein students were randomized into either a regular class or a regular class with a full-time teacher's aide in kindergarten and randomized again in first grade. The strength of using this subsample is that the bias typically introduced into analyses by attrition and noncompliance is minimized or absent due to the experimental protocol. This is the first study to take advantage of this rerandomization in Project STAR to study teacher's aides. (3) Because of the preponderance of statistically insignificant results in the literature on teacher's aides, I perform calculations to verify whether statistically insignificant or weakly significant results are due to a lack of power. (4) I also examine whether benefits exist from multiple treatments of the full-time teacher's aide intervention; where the results are statistically insignificant, I employ a recently developed test of arbitrary bounds to determine whether the effects of additional doses constitute precisely estimated zeros. I then perform a cost-benefit analysis comparing full-time teacher's aides to smaller classes.

The following is a summary of the results. I find that being assigned to a full-time teacher's aide classroom in kindergarten produces no statistically significant gains in kindergarten test scores for mathematics and reading, but may provide a delayed benefit as there are statistically significant benefits in first grade for these subjects. Exposure to a full-time aide in first grade provides benefits to first-grade test scores, but the benefits to mathematics are imprecisely estimated; this occurs whether or not the child was treated in kindergarten. These results should be interpreted as a lower bound of the benefit of supplying a full-time teacher's aide to a classroom since the untreated classes for first grade in these data still had access to a part-time teacher's aide. (5) That some of the results are significant at only the 10% level despite the estimates being consistently positive motivates a power analysis to determine whether the sample size is sufficiently large to find effects of 0.1 and 0.05 test score standard deviations at the 5% level with a power of 0.8; I find that the sample is too small to do so. This finding suggests that small positive effects of teacher's aides on academic achievement cannot be ruled out in this analysis. Arbitrary bounds testing finds that we cannot credibly say that additional treatments provide economically insignificant effects except perhaps in the case of reading scores. Examining the effects of teacher's aides on various subpopulations, the effects are found to be more pronounced for White students and for students who did not receive a free or reduced price lunch; no statistically significant results are found for those who received a subsidized lunch, which I use as a proxy for being of low socioeconomic status. The benefits to Black students appear to be rather limited and are imprecisely estimated in most cases. 
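The power analysis mentioned above follows the standard two-sample formula relating effect size, significance level, and power to the required sample size. The sketch below is purely illustrative (it is not the paper's actual computation and assumes a two-sided z-test on standardized scores with equal group sizes):

```python
import math
from scipy.stats import norm

def n_per_group(delta, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided two-sample test that detects
    a mean difference of `delta` standard deviations of the outcome."""
    z_a = norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_b = norm.ppf(power)          # normal quantile delivering the desired power
    return math.ceil(2 * (z_a + z_b) ** 2 / delta ** 2)

# Detecting 0.1 SD at the 5% level with power 0.8 requires roughly 1,570
# students per group under these assumptions; 0.05 SD requires roughly 6,280.
```

Because the available sample per treatment path is far smaller than these figures, small positive effects cannot be ruled out, which is consistent with the discussion above.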
The results of the empirical analysis on the effect of teacher's aides suggest that they should be taken into account when examining educational achievement when possible, and that further research is needed to learn more. The cost-benefit analysis performed here shows that teacher's aides may be competitive with small classes on this dimension; at worst, they still provide a net social benefit over the long term.

This article is organized as follows. Section II describes the education production function (EPF), the regression equations, and the method through which the dynamic average treatment on the treated (DATT) estimates are derived. The Project STAR data are detailed in Section III. In Section IV, I estimate the dynamic average treatment effects for the full sample and various subgroups as well as conduct power analyses and tests for economic insignificance. I perform a cost-benefit analysis comparing full-time teacher's aides with small classes in Section V. The paper concludes with a discussion of the results and its implications for research and policy in Section VI.

II. MODEL

A. Theory and Estimation

The standard specification of an EPF typically includes only contemporaneous inputs; when past inputs are present in the EPF, they are usually included through means of a lagged test score that is intended to serve as a sufficient statistic for them. However, I explicitly include past inputs in the model in order to allow for the estimation of sequences of interventions. I specify the EPF as

(1) [A.sub.ig] = f ([x.sub.ig], [w.sub.ig], [s.sub.ig], [T.sub.ig], [u.sub.ig]),

where [A.sub.ig] is the intellectual human capital of student i in grade g, x is a vector of teacher characteristics, w is a vector of family inputs, s is a vector of school inputs (excluding treatment status of whether the student is in a teacher's aide class), T is a vector of treatment assignments, and u is the vector of unobservables. The vectors contain both current and past inputs; for example, [x.sub.ig] contains the teacher characteristics for the current and completed grades (i.e., [x.sub.ig] = ([x.sub.ig], [x.sub.i,g-1], ...)).

The estimation strategy here closely follows Ding and Lehrer (2010). (6) I set up the econometric model as follows. Let [X.sub.ig] be a matrix containing a constant term and the control variables x and w for person i in grade g, and denote kindergarten as grade g = k. The [alpha] vectors denote the estimated effects of the controls, and the [beta] coefficients the effects of the treatments. For a given coefficient [[gamma].sub.lm] where [gamma] = {[alpha], [beta]}, l denotes the level of achievement that is affected by the input, and m is the time period of the input; for example, [[beta].sub.1k] is the estimated effect of the teacher's aide treatment in kindergarten on first-grade academic achievement. Note that I refer to the [beta] coefficients as structural coefficients, as we use them in the construction of the dynamic treatment effects in the next section. Because the EPF is cumulative in the inputs and an interaction term in treatment assignment is present, each grade requires a different specification. I specify the kindergarten and first-grade achievement equations as

(2) [A.sub.ik] = [X.sub.ik][[alpha].sub.kk] + [[beta].sub.kk][T.sub.ik] + [[epsilon].sub.ik],

(3) [A.sub.i1] = [X.sub.i1][[alpha].sub.11] + [X.sub.ik][[alpha].sub.1k] + [[beta].sub.11][T.sub.i1] + [[beta].sub.1k][T.sub.ik] + [[beta].sub.1,1k][T.sub.i1][T.sub.ik] + [[epsilon].sub.i1],

where [[epsilon].sub.ig] contains a school by grade fixed effect [s.sub.i], the unobservable inputs [u.sub.i], and a random component. Teacher characteristics x include three dummy variables: whether the teacher is Black, (7) whether the teacher is inexperienced (defined here as having fewer than 3 years of experience), and whether the teacher has a graduate degree. (8) Family inputs w are summarized by a dummy variable indicating whether the student receives a free or subsidized school lunch. School inputs s are school by grade fixed effects. Since treatment is binary, [T.sub.ig] is a dummy variable equal to 1 if student i is assigned to a full-time teacher's aide classroom in grade g and 0 otherwise. I proxy human capital [A.sub.ig] using standardized test scores whose distributions have been transformed to have mean zero and variance one.
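To make the first-grade specification concrete, the following purely illustrative sketch simulates randomized treatment assignments with known effects and recovers the two treatment coefficients and their interaction by OLS. All variable names and effect sizes are hypothetical, and the kindergarten controls and school-by-grade fixed effects of Equation (3) are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical simulated inputs standing in for the STAR variables.
x = rng.normal(size=n)            # a generic control variable
Tk = rng.integers(0, 2, size=n)   # kindergarten full-time-aide assignment
T1 = rng.integers(0, 2, size=n)   # first-grade assignment (rerandomized)

# First-grade score in the spirit of Equation (3): two treatment
# effects plus an interaction, with assumed (made-up) magnitudes.
beta_11, beta_1k, beta_int = 0.10, 0.08, 0.03
A1 = 0.5 * x + beta_11 * T1 + beta_1k * Tk + beta_int * T1 * Tk \
     + rng.normal(scale=1.0, size=n)

# OLS on [const, x, T1, Tk, T1*Tk] recovers the structural coefficients.
X = np.column_stack([np.ones(n), x, T1, Tk, T1 * Tk])
coef, *_ = np.linalg.lstsq(X, A1, rcond=None)
# coef[2], coef[3], coef[4] estimate beta_11, beta_1k, and beta_{1,1k}.
```

Because assignment is random, no further controls are needed for consistency; the extra regressors in the paper serve to absorb residual variance.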

The above specification of the system of equations assumes that the error terms are additively separable of the inputs and treatments and that there are no pretreatment effects. (9) Readers should note the lack of any attempt to control for the presence of unobservables in the regression equation (e.g., such as the inclusion of student fixed effects to control for unobserved ability): while certainly relevant factors in education production, they are not an obstacle in obtaining unbiased and consistent estimates of the treatment effect parameters because treatment is randomly assigned at every grade in the sample. (10) The coefficients of interest (those on the [T.sub.ik] and [T.sub.i1] variables) in the system of Equations (2) and (3) are identified using the experimental variation in Project STAR that was induced via the random assignment of both students and teachers to classrooms of different types in each grade.

The regression equations can be thought of as value-added models. Using the terminology of Rothstein (2010), a VAM1 model for achievement in first grade corresponds to Equation (3) with the restrictions [[alpha].sub.kk] = [[alpha].sub.1k], [[beta].sub.kk] = [[beta].sub.1k], and [[beta].sub.1,1k] = 0; that is, the effects of past inputs are permanent, and there are no interaction effects between treatment in first grade and treatment in kindergarten. The same estimates as a VAM3 model are obtained if, in addition to the VAM1 conditions, the changes in unobserved inputs are assumed to be uncorrelated with the included variables. (11) The model is perhaps closest to a VAM2 model, which includes a lagged term in achievement. This specification carries with it the assumption that all past inputs, both observable and unobservable, decay at the same constant rate (Todd and Wolpin 2003); the model employed here instead allows for the nonuniform decay of the effect of observable inputs.

B. Dynamic Treatment Effects

In an experiment composed of two sequential stages wherein each person is randomly assigned to the treatment group or the control group in the first period and again in the following period, there are four possible treatment paths. Denote t(a, b) to be the treatment sequence of an individual where a is the assigned treatment in the first period and b is the assigned treatment in the second. For i [member of] {a, b}, let i = 1 if treatment was received and i = 0 otherwise. Then, a person assigned treatment in both periods would be denoted as receiving the treatment sequence t(1, 1), a person assigned treatment in the second period but not in the first is denoted as experiencing the sequence t(0, 1), and so forth.

Using this notation, I can define the dynamic treatment effects of interest. In this article, the causal effects of the sequences of interventions are DATT estimates similar to those derived in Ding and Lehrer (2010). Define [tau](a, b)(c, d) to be the DATT of receiving treatment sequence t(a, b) instead of receiving the treatment sequence t(c, d) for those that have undergone the treatment sequence t(a, b). To put it more simply, [tau](a, b)(c, d) is the average net difference in the outcome of interest from taking treatment path t(a, b) instead of taking the treatment path t(c, d), but only for those who have taken the treatment path t(a, b). (12)

A number of examples follow to illustrate the use of this notation. [tau](1, 1)(0, 0) would be the effect of receiving treatment in both periods for those who have received treatment in both periods. [tau](1, 1)(1, 0) is the benefit of treatment in the second period for those who have had treatment in both periods. [tau](0, 1)(0, 0) is the benefit of the treatment in the second period for those who only received treatment in the second period. [tau](0, 1)(1, 0) is the effect of receiving treatment in the second period instead of the first period for those who received treatment in the second period. For these examples, the DATT estimates are calculated as:

[tau](1, 1)(0, 0) = [[beta].sub.11] + [[beta].sub.1k] + [[beta].sub.1,1k],

[tau](1, 1)(1, 0) = [[beta].sub.11] + [[beta].sub.1,1k],

[tau](0, 1)(0, 0) = [[beta].sub.11],

[tau](0, 1)(1, 0) = [[beta].sub.11] - [[beta].sub.1k],

where the values of [beta] are obtained from the system of Equations (2) and (3). A key distinction is the difference between [tau](1,1)(1,0) and [tau](0, 1)(0, 0): the former has an interaction term because the effect of the intervention in the second period may vary because of the treatment received in the first period, while the latter does not. The standard errors of the DATT are calculated as per usual using the rules for sums of random variables. (13) Being composed of the individual estimated parameters from the system of Equations (2) and (3), the causal effects of various treatment paths are also cleanly identified using the experimental variation in Project STAR.
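In practice, each DATT is a linear combination c'[beta] of the structural coefficients, and by the usual rules for sums of random variables its standard error is sqrt(c'Vc), where V is the covariance matrix of the [beta] estimates. A small sketch with hypothetical numbers (these are not estimates from the paper):

```python
import numpy as np

# Hypothetical estimates of (beta_11, beta_1k, beta_{1,1k}) and an
# assumed covariance matrix for them.
beta = np.array([0.09, 0.06, 0.02])
V = np.array([[ 0.0016,  0.0004, -0.0006],
              [ 0.0004,  0.0016, -0.0006],
              [-0.0006, -0.0006,  0.0020]])

def datt(c):
    """A DATT is c'beta; its standard error is sqrt(c'Vc)."""
    return float(c @ beta), float(np.sqrt(c @ V @ c))

# tau(1,1)(0,0) = beta_11 + beta_1k + beta_{1,1k}
est, se = datt(np.array([1.0, 1.0, 1.0]))
# tau(0,1)(1,0) = beta_11 - beta_1k
est2, se2 = datt(np.array([1.0, -1.0, 0.0]))
```

The weight vector c simply encodes which structural coefficients enter the contrast, which is why any treatment path can be compared against any counterfactual path without re-estimating the model.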

Using this methodology allows us to investigate the effects of teacher's aides that have gone unexplored in previous literature, such as medium-run effects and the importance of timing considerations for various treatment combinations. For example, consider [tau](1, 0)(0, 0), which is the estimated effect of a teacher's aide in kindergarten on first-grade test scores relative to not receiving treatment at all. A positive and statistically significant estimate of this parameter tells the story that teacher's aides in kindergarten provide delayed benefits in the form of elevated first-grade test scores. Another example is a scenario wherein budget constraints permit only two full-time teacher's aide interventions in the first 4 years of school. Then, if the outcome of interest is accumulated human capital in third grade, [A.sub.i3], one could compare the effect of various two-dose sequences to see which treatment sequence ultimately raises it the most: t(1, 1, 0, 0), t(1, 0, 1, 0), and so forth.

The primary advantage of constructing the dynamic average treatment effect estimates in the fashion outlined in this section rather than using the alternative methodology of simply estimating a model with a dummy variable for every treatment path is that the effect of any treatment sequence relative to a given counterfactual path can easily be estimated using the above approach. Issues of power are less of a concern since both econometric strategies require the estimation of the same number of coefficients at each level: for example, for second grade, there are [2.sup.3] = 8 possible treatment paths, for a total of seven dummy variable coefficients (we drop the untreated path to use as the reference category); for the method outlined here, the coefficients on the treatments and their interactions are {[[gamma].sub.2k], [[gamma].sub.21], [[gamma].sub.22], [[gamma].sub.2,k1], [[gamma].sub.2,k2], [[gamma].sub.2,12], [[gamma].sub.2,k12]}, which also totals seven parameters.

That many different DATT estimates for various treatment and counterfactual paths can be calculated may raise concerns related to multiple testing. Since every DATT estimate is derived using a linear combination of the structural coefficients, and these coefficients have a set covariance matrix, the usual multiple comparison problem is not present because no more random "draws" are being made to fish for statistical significance: the same components are being used repeatedly to calculate the DATT and their significance levels. (14)

III. DATA

A. Description

The data employed in this study come from a cohort of students that participated in Project STAR, an education experiment in Tennessee that ran from 1985 until 1989. (15) The experiment was legislated into existence and funded by the state government at a cost of approximately $12 million over 5 years (16); this figure included the data analysis and reporting that took place in the fifth year. The primary goal of the experiment, as its acronym (Student-Teacher Achievement Ratio) implies, was to determine the effect of class size on student achievement in primary education (Finn et al. 2007). Across the state, 79 schools signed up for the experiment; data were also gathered from 21 nonparticipating schools to use as a benchmark. To qualify for participation in Project STAR, schools required enough students to support at least three different classes per grade, had to commit to participation for 4 years, and agreed to implement the experimental protocol. Students and teachers were randomly assigned within schools to one of three class types: a small class (13 to 17 students), a regular class (22 to 25 students), or a regular class with a full-time teacher's aide. (17) Teacher's aides were also randomly matched with teachers (Gerber et al. 2001). However, regular classes in first through third grade still had a part-time teacher's aide available to assist the class from approximately 25% to 33% of the time on average. (18) It was initially intended that students stay in their assigned class type from kindergarten through third grade; however, at the beginning of first grade, students in regular or regular with aide classes were randomly and permanently reassigned between these two class types. (19) An examination of 1,581 students enrolled in kindergarten found that compliance was almost perfect, with only about 0.3% of students violating their initial treatment assignment (Krueger 1999).
However, in first grade and beyond, there were some problems with noncompliance, with a number of students switching in or out of small classes. Noncompliance was primarily due to parental complaints or discipline problems (Krueger 1999). At the end of each year, all participating students were given a battery of academic and nonacademic tests. Among the academic tests used were the Stanford Achievement Tests (SAT); for these tests, the scores were scaled using item response theory and were designed to be comparable across grades (Finn et al. 2007). (20) In addition to test scores, data were collected on teacher and student observable characteristics.

This paper makes use of only a subsample of the Project STAR data. I use the SAT mathematics and reading scores as the outcome variables of interest, and normalize them to have mean zero and a standard deviation of one. (21) To investigate the effects of teacher's aides in kindergarten on academic achievement (Equation (2)), I use students that entered the experiment in that grade and were assigned to either regular classes or classes with a full-time teacher's aide. To estimate Equation (3), I include students from the kindergarten sample who did not leave Project STAR and did not violate treatment assignment by entering a small class within the same school. (22) I exclude the students that are not present in the sample in kindergarten to more credibly estimate the full sequence of dynamic effects (Ding and Lehrer 2010). Students with missing information on the control variables or who do not have at least one valid test score for the grade of interest are excluded from the analysis. Note that those who transferred to a different Project STAR school and were randomized into a small class there are considered to have attrited from the sample.

Summary statistics of the data are displayed in Table 1. Of note is that approximately half the students in the sample were on free or reduced-price lunches. (23) The number of minority teachers was quite low; moreover, in the sample of interest, there were only Black or White teachers. Over a third of the teaching staff had earned a Master's degree or more. Overall, the proportions are well balanced across all treatment paths. (24)

Table 2 displays the various treatment paths that students underwent in the experiment. (25) Recall that [T.sub.ig] denotes whether student i in grade g experienced treatment (i.e., assignment to a full-time aide classroom rather than a regular class). Here, I additionally define [L.sub.ig] = 1 to signify that the student left the sample in grade g, and [N.sub.ig] = 1 to denote that the student did not comply with their treatment assignment in grade g because they changed into a small class within the same school, which is not possible according to the rerandomization protocol. A downward move on the table indicates a treatment path wherein students did not receive treatment in the previous period, while an upward move indicates that they did. For example, 587 students left the experiment in first grade after having received treatment in kindergarten, 737 students experienced the treatment path t(0, 0), and 761 students completed the treatment sequence t(1, 0). Overall, the transition tree shows that all the paths are well balanced, and similar numbers of students from both class types in the first period did not comply with their treatment assignment by entering a small class within the same school or attrited from the sample altogether in the second period.

An initial look at the test score distributions according to treatment is shown in Figure 1. Examining the first-grade test scores, we see that students who experienced treatment paths with no full-time teacher's aides usually scored lower than those who were treated at least once. The benefits to mathematics scores seem to be mild, while the gains for reading appear to be quite pronounced. Overall, the initial evidence supplied by these graphs suggests that there may be mild to moderate benefits from teacher's aides on academic achievement.

B. Threats to Validity

The general consensus of the literature is that the randomization process of Project STAR was successful; Krueger (1999) provides perhaps the most in-depth analysis of this topic of discussion. Examining observable dimensions outside those provided by the Project STAR data by merging it with tax records, Chetty et al. (2011) find that the covariates are also balanced between small and regular class types along many other observable traits such as parental assets and demographics. (26) However, to date, there has been no analysis of whether the rerandomization that took place between kindergarten and first grade for those that were initially assigned to the regular and regular with full-time aide class types was successful.

To investigate the outcome of the randomization process, I perform F-tests of equality to determine whether there are any statistically significant differences along any observable dimension between the different treatment paths in the experiment. (27) The results of this exercise are displayed in Table 3. None of the tests of equality rejects at the 10% level, indicating that the randomization procedure produced a sample of students and teachers whose observable traits were balanced along the different treatment paths in both grades. (28)
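For a single covariate, a test of equality of means across the four treatment paths is a one-way ANOVA F-test. The sketch below runs such a test on simulated data; the covariate, group sizes, and path labels are all hypothetical:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Hypothetical free-lunch indicators for students on the four paths
# t(0,0), t(0,1), t(1,0), t(1,1); under successful randomization the
# group means should differ only by chance.
groups = [rng.binomial(1, 0.5, size=700) for _ in range(4)]
F, p = f_oneway(*groups)
# Balance is called into question only if p falls below the chosen level.
```

Running one such test per observable characteristic reproduces the structure of the exercise reported in Table 3.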

As an additional test of randomization, I examine the possibility of pretreatment or anticipation effects. The idea here is to test whether different treatment paths with the same kindergarten treatment have the same effect in kindergarten, that is, whether there are pretreatment effects depending on which group students are assigned to in the following grade. Since I assume that there are no pretreatment effects given the experimental protocol, we should not find any statistically significant differences. To test for this, I run the following regression:

(4) [A.sub.ik] = [X.sub.ik][[alpha].sub.1k] + [[alpha].sub.2k][T.sub.i01] + [[alpha].sub.3k][T.sub.i10] + [[alpha].sub.4k][T.sub.i11] + [[epsilon].sub.ik],

where [T.sub.i01] is a dummy variable equal to 1 if a person ultimately undergoes the treatment sequence t(0, 1), and similarly for the other T variables. If there are no pretreatment effects, we do not expect to find any statistically significant difference between [[alpha].sub.3k] and [[alpha].sub.4k] since these students received the same treatment in kindergarten and have not experienced their second treatment assignment yet, nor do we expect [[alpha].sub.2k] to be statistically significant for the same reason (note that both the baseline category t(0, 0) and t(0, 1) experience the same treatment in the first period). None of these tests produce statistically significant results at the 10% level, giving us further confidence that randomization was successful. Note that in this particular circumstance, anticipation effects may be credibly ruled out ex ante: the decision to rerandomize the classes was made after the kindergarten year, after an initial analysis found no statistically significant differences between regular and full-time teacher's aide classes (Finn et al. 2007, 9).

Noncompliance with class type assignment is a bugbear in most studies that employ Project STAR data, with moderate numbers of students switching class types in each grade (Krueger 1999). In these data, 4.9% of students switch from a full-time teacher's aide class in kindergarten to a small class in first grade, while 5.7% of those in the regular kindergarten classes do so. I do not find a statistically significant difference in these noncompliance rates between the two class types (p = .535). For purposes of this study, these students effectively attrited from the sample, (29) since they do not provide any information in identifying the difference between the regular and full-time teacher's aide treatments. (30) There is no information as to how many students defied their assignment by entering a full-time teacher's aide class instead of a regular class or vice versa; I believe the number of students that have done this to be low. (31)
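The comparison of noncompliance rates above is a two-proportion test. A minimal pooled z-test can be sketched as follows; the counts passed in at the end are made up for illustration and are not the paper's:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions,
    using the pooled proportion under the null of equal rates."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion under H0
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, pval

# Hypothetical counts: 49/1000 switchers in one class type, 57/1000 in the other.
z, pval = two_prop_z(49, 1000, 57, 1000)
```

With rates this close, the test unsurprisingly fails to reject, mirroring the null result reported in the text.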

Differential attrition is a serious threat to analyses of experiments with multiple treatment stages that have only an initial randomization stage at the first treatment because it may result in the treated and untreated groups no longer being comparable in periods after the randomization takes place. Attrition is quite high in the data employed here, with 29.6% of those who joined a STAR school in kindergarten failing to continue the experiment into first grade. The experimental design employed in this paper largely assuages concerns related to attrition, however, since the randomizations at the beginning of each grade should effectively balance both the observable and unobservable traits of the students in both kindergarten and first grade. (32) Nonetheless, I formally test and fail to find a statistically significant difference in the attrition rate between students who were enrolled in a class with a full-time teacher's aide and those in a regular classroom (p = .275). In terms of attrition due to unobservable characteristics, previous research has found that attrition patterns across schools that participated in STAR did not systematically differ from those that did not (Ding and Lehrer 2010). I examine the possibility of differential attrition in the data according to observable characteristics using a procedure developed by Becketti et al. (1988) by running the following regression:

(5) [A.sub.ik] = [[beta].sub.0] + [X.sub.ik][[beta].sub.1] + [[beta].sub.2][L.sub.i1] + [L.sub.i1][X.sub.ik][[beta].sub.3] + [[theta].sub.l] + [[epsilon].sub.ik],

where X is a row vector of observable characteristics, [L.sub.i1] is a dummy variable taking the value of 1 if the student leaves the sample in first grade and 0 otherwise, and [[theta].sub.l] is a school fixed effect for school l. If the coefficients on the interaction terms in [[beta].sub.3] have a jointly statistically significant effect, then selection on observables is present: based on known characteristics, those who left the experiment in first grade had different achievement scores in kindergarten compared to those who remained. I find that there is no evidence of differential attrition for mathematics (p = .240) or reading (p = .614). Performing a similar test but only looking at those who switched to a small class within the same school after being initially assigned to a regular or full-time teacher's aide class in kindergarten, we similarly find no evidence of this type of noncompliance (p = .805 for math and p = .181 for reading).
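The check in Equation (5) amounts to a joint test that the leaver-by-covariate interaction coefficients are all zero. The sketch below implements the standard restricted-versus-unrestricted F-statistic on simulated data; all variable names and magnitudes are hypothetical, and the school fixed effects are omitted for brevity:

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(2)
n = 2000

x = rng.normal(size=(n, 2))        # two hypothetical observables
L = rng.integers(0, 2, size=n)     # 1 if the student later leaves the sample
# Outcome generated with NO differential attrition (interactions truly zero).
A = x @ np.array([0.3, -0.2]) + rng.normal(size=n)

def ssr(X, y):
    """Sum of squared residuals from OLS of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

Xr = np.column_stack([np.ones(n), x, L])       # restricted: no interactions
Xu = np.column_stack([Xr, x * L[:, None]])     # unrestricted: adds the L*x terms
q = Xu.shape[1] - Xr.shape[1]                  # number of restrictions
F = ((ssr(Xr, A) - ssr(Xu, A)) / q) / (ssr(Xu, A) / (n - Xu.shape[1]))
p = f_dist.sf(F, q, n - Xu.shape[1])           # joint p-value
```

A large p-value here, as in the results reported above, indicates no detectable selection on observables among leavers.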

One limitation of this study is the lack of information on the observable characteristics of the teacher's aides in the sample. This should be of limited importance because causal estimates are still obtained due to random assignment; in this case, the additional controls would serve only to reduce the residual variance. There is reason to believe that including such covariates would not accomplish much were they available: very little of a student's predicted variation in academic achievement can be explained by observable teacher characteristics such as educational qualifications; instead, unobservable teacher quality is a far more important factor (Rivkin, Hanushek, and Kain 2005). It appears likely that a similar pattern should be present for teacher's aides.

IV. EMPIRICAL ANALYSIS

A. Main Results

The structural coefficients that are required to construct the DATT estimates are displayed in Table 4. Because the first-grade equation has three variables of interest that are randomly assigned (namely, [T.sub.ik], [T.sub.i1], and [T.sub.i1][T.sub.ik]), one may wish to use multiple-testing-robust inference. Examining the Simes (1986) q-values, we see that the structural coefficients for the full-time teacher's aide treatments in first grade are precisely estimated for reading scores, but the q-values fall outside of traditional levels of statistical significance for mathematics test scores. (33)
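Adjusted p-values of this kind can be produced by a step-up adjustment built on the Simes inequality (the Benjamini-Hochberg-style adjustment). The sketch below is one standard implementation of that idea, not necessarily the exact routine used to build Table 4.

```python
import numpy as np

def simes_qvalues(pvals):
    """Step-up adjusted p-values based on the Simes (1986) inequality.

    Sort the m p-values, scale the i-th smallest by m/i, then enforce
    monotonicity from the largest down and cap at 1.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]  # monotone step-up
    q = np.empty(m)
    q[order] = np.minimum(adjusted, 1.0)                  # back to input order
    return q

# Three hypothetical raw p-values, one per structural coefficient
print(simes_qvalues([0.01, 0.04, 0.03]))
```

A coefficient whose raw p-value clears the 5% bar can fail to do so after adjustment, which is the pattern described for the mathematics regression.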

Table 5 presents the coefficient estimates of the DATT parameters that were outlined in Section II.B. There does not appear to be a statistically significant effect on academic achievement in the kindergarten grades between the full-time aide and regular class types, and the coefficients are small in absolute magnitude.

In first grade, there is a clear pattern of positive treatment effects of a full-time aide relative to the counterfactual path with no treatment, and on average, the size of the coefficients appears to be roughly between half and two-thirds of what is typically seen in the small class literature for Project STAR (e.g., Mueller 2013). For first-grade mathematics test scores, the only statistically significant result is on the treatment path t(1, 0), but this significance disappears once a multiple testing correction is employed (since this coefficient is based on [[beta].sub.ik]). We can conclude that the benefits to mathematics test scores, if they exist at all, are likely small. The regression results for reading indicate that full-time teacher's aides do provide benefits to reading test scores. Relative to experiencing no treatment, all treatment paths with at least one treatment provide a first-grade test score benefit of 0.089 standard deviations at a minimum. These effects are precisely estimated. Overall, there does appear to be sufficient evidence that a full-time teacher's aide raises achievement relative to regular classes, but the benefits may be confined to reading scores.

Of particular note is the existence of a delayed benefit from having a full-time teacher's aide in kindergarten on first-grade reading test scores when one is assigned to a regular class in first grade; this benefit is not much smaller than that of having a full-time teacher's aide in first grade but not in kindergarten. What makes this phenomenon particularly puzzling is that it occurs despite the absence of a detectable benefit from a full-time teacher's aide to kindergarten test scores, so one cannot appeal purely to increased cognition that persists into the next grade. These facts raise some interesting questions as to why these delayed benefits of a full-time teacher's aide occur. One possibility is that first-grade teachers without a full-time aide exert additional effort toward those students who had one in kindergarten, but a cogent reason as to why they would behave this way does not appear to be present. I instead put forth the theory that the delayed benefit is primarily due to improved noncognitive skills. Chetty et al. (2011) advance the idea that student noncognitive skills are improved when they are assigned to a small class; I suggest that similar effects occur for full-time teacher's aide classes. Moreover, I believe that it may also be partly due to improved cognitive skills that were developed in kindergarten, since it is possible that there are benefits in that grade that are too small to estimate precisely but that persist into the next grade. A possible reason is that kindergarten curricula tend to focus more on play-type than on academic activities, so increased attention may provide less of an immediate benefit in this grade. Krueger (1999, 512) provides evidence for this sort of phenomenon: the small class intervention in first grade provides between 50% and 80% more of a test score benefit in first grade compared to kindergarten.

The effects of small classes in the Project STAR literature typically range from 0.10 to 0.20 standard deviations; these effects are frequently described as large. That some of the estimates of the dynamic average treatment effects on the treated are weakly significant or statistically insignificant despite most of the estimates in Table 5 being positive suggests that the regressions may be suffering from a lack of power. In Table 6, I calculate the sample sizes needed to obtain a power of 0.80 to detect an effect of a given magnitude at the 95% confidence level. (34) I select effect sizes that would represent approximately half the small class size effect: [tau] = 0.10 and [tau] = 0.05. In kindergarten, we can be fairly sure that a full-time teacher's aide does not have a large positive effect of [tau] = 0.10 or higher on academic achievement compared to being in a regular classroom. However, the samples are too small to credibly rule out small positive effects of [tau] = 0.05 for kindergarten. For first-grade test scores, we see that the sample sizes are too small to reliably detect effects of these sizes. This is convincing evidence that the lack of precision is, more often than not, expected given the sample employed. (35)
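As a back-of-the-envelope check on such power calculations, the usual normal-approximation formula for a two-sided, two-sample comparison of means gives the per-group n needed to detect a difference of [tau] standard deviations. This sketch ignores covariates and clustering, which the paper's calculation may account for, so its figures need not match Table 6 exactly.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(tau, power=0.80, alpha=0.05):
    """Per-arm sample size for a two-sided two-sample test of means:
    n = 2 * (z_{1-alpha/2} + z_{power})^2 / tau^2, with tau the effect
    size in standard-deviation units (normal approximation)."""
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 / tau ** 2)

print(n_per_group(0.10))  # → 1570
print(n_per_group(0.05))  # → 6280
```

Halving the detectable effect size quadruples the required sample, which is why ruling out [tau] = 0.05 demands far more observations than ruling out [tau] = 0.10.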

B. Multiple Treatments versus One Treatment

The minor differences in the estimated benefits between the treatment path t(1, 1) and both t(1, 0) and t(0, 1) relative to the counterfactual path with no treatment motivate the following questions: how much of a benefit are students experiencing from an additional treatment? Can it be credibly claimed that the difference is economically insignificant?

To answer these questions, I estimate the DATT using the counterfactual paths that experience only a single treatment, as well as compare the benefit of a single treatment in the first period with that of a single treatment in the second. The results are displayed in Table 7. We cannot conclude that two exposures to a full-time aide are any better or worse than one, regardless of the timing, since the coefficients on [tau](1, 1)(1, 0) and [tau](1, 1)(0, 1) are imprecisely estimated for both subjects. A surprising result is that, despite the large estimated difference in the effect on first-grade reading test scores between treatment in kindergarten and treatment in first grade, the difference between these two treatment paths cannot be conclusively determined for reading since [tau](0, 1)(1, 0) is estimated imprecisely. (36) We also do not find any statistically significant differences between one-time treatments in mathematics.

That most of the estimates of the benefits of an additional treatment after having received one are small in magnitude raises the question of whether one can comment on their economic significance. Given the discussion above, I use the approximate lower bound of half the small class size treatment effect (0.05 standard deviations) as a threshold for economic significance, and perform an arbitrary bounds test for the estimates of the previous table. (37) The outcomes of these tests are displayed in Table 8. (38)
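Penney (2013) develops the arbitrary bounds test used here; a closely related two-one-sided-tests (TOST) style check conveys the idea. The function below is a hedged sketch rather than the exact procedure of that paper: it rejects the null of economic significance (|[tau]| >= c) when the estimate is bounded inside (-c, c) at the chosen level.

```python
from statistics import NormalDist

def within_bounds(est, se, c=0.05, alpha=0.05):
    """Two one-sided tests for H0: |tau| >= c vs H1: |tau| < c.

    Rejecting H0 supports economic *insignificance*: the effect is
    bounded inside (-c, c). Equivalent to the (1 - 2*alpha) confidence
    interval lying entirely within the bounds.
    """
    z = NormalDist().inv_cdf(1 - alpha)
    lower_reject = (est + c) / se > z    # rules out tau <= -c
    upper_reject = (est - c) / se < -z   # rules out tau >= c
    return lower_reject and upper_reject

# A precisely estimated near-zero effect lies inside the bounds ...
print(within_bounds(0.004, 0.01))   # → True
# ... while an equally small but noisy estimate does not.
print(within_bounds(0.004, 0.05))   # → False
```

The second case mirrors the paper's finding: many estimated differences are small in magnitude yet too imprecise to be declared economically insignificant.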

According to the table, there is insufficient evidence to conclude that most of the DATTs constitute a null effect. In the case of [tau](1, 1)(1, 0) for reading, we can claim that for those who had a full-time teacher's aide in both kindergarten and first grade, the benefit of a full-time aide in first grade is economically insignificant; however, a similar conclusion cannot be reached for [tau](1, 1)(0, 1). These results suggest that there is weak initial evidence that an additional treatment beyond the first does not raise achievement in first-grade reading scores.

Why does there not appear to be any significant benefit from continuous treatment in kindergarten and first grade relative to receiving treatment only once in either of these grades? One possibility is that one treatment imparts much of the benefit: Krueger (1999, 521) found that the effect of an initial exposure to a teacher's aide classroom is approximately three times larger than for additional treatments. (39) In the case here, it can be hypothesized that multiple treatments still bring additional benefits, but their magnitude is too small to be economically meaningful and to statistically detect. A second idea is the possibility of John Henry effects: either the student or the teachers are adjusting their inputs (such as effort) in response to being assigned to a second (worse) treatment. For example, perhaps people who are not receiving a second treatment are obtaining additional tutoring after school that they otherwise would not have in order to compensate for being in the control group for the current period. I find this explanation to be lacking because we would expect John Henry effects to be strongest when students are not exposed to full-time teacher's aide classrooms at all since there are two periods where they are not being treated, but we nonetheless find statistically significant effects of treatment in any period when compared to this baseline group. I believe the evidence strongly points to the first possibility as being the likely reason.

C. Subsample Heterogeneity

In examining the effects of assignment to a small class using data from Project STAR, Krueger (1999) found at times substantial heterogeneity in the effects according to subgroup. For this reason, I examine whether the benefits of a full-time teacher's aide differ by subpopulation, such as race and socioeconomic status. Using free or reduced price lunch status as a proxy for socioeconomic status, I begin by estimating the model separately on the sample of students who received a free or reduced price lunch in kindergarten and the sample of those who did not; the results are displayed in Table 9. We see that the coefficients for the no free lunch children tend to be larger than those in the full sample and are also more precisely estimated. However, there is only one statistically significant difference at the 5% level or better between the two groups for these subjects: the difference in [tau](0, 1)(0, 0) for reading.

I next conduct a similar analysis, this time dividing by race (Table 10): Black and non-Black students (note that over 99% of non-Black students in this sample are White). There are no statistically significant differences between the estimated DATTs for Blacks and non-Blacks for mathematics. For reading, there is a difference in kindergarten and in the first-grade [tau](0, 1)(0, 0) DATT. That the coefficient for kindergarten reading scores in the regression with Black students is only significant at the 10% level and that the first-grade reading score coefficients vary wildly indicate that any potential benefit in kindergarten for this subgroup should be met with skepticism.

The results of this section stand in stark contrast with the current research on small classes: that literature showed that Blacks and students who receive free or reduced price lunches benefit even more from smaller classes (Krueger 1999); in contradistinction, the people in this analysis who appear to benefit the most from teacher's aides are non-Blacks and those who do not receive free or reduced price lunches.

The reason why the results point to more effective treatment for students of high socioeconomic status and Whites is unclear, especially given the contrasting results for small classes. One possibility is a John Henry effect wherein White or high socioeconomic status students who were assigned to a full-time teacher's aide classroom obtained additional tutoring because they were not assigned to a small classroom, whereas Black students (who on average tend to be of lower socioeconomic status compared to Whites) and students of low socioeconomic status could not afford to engage in this compensatory behavior. I do not believe this to be the reason because, following this logic, the John Henry effect would have been even stronger for those who neither had a full-time teacher's aide nor assignment to a small classroom, which is not the case here.

V. POLICY ANALYSIS

The power analysis conducted in Section IV suggests that large positive effects of full-time teacher's aides can likely be ruled out, so a cost-benefit analysis of their use relative to shrinking class sizes will produce nontrivial results. I follow in the vein of Krueger and Whitmore (2001) and Mueller (2013) and perform an analysis of the cost-effectiveness of full-time teacher's aides versus that of smaller classes by calculating an internal rate of return. (40) For this exercise, an internal rate of return is calculated as the discount rate r such that the following equality holds:

(6) [4.summation over (t=1)][C.sub.t]/[(1 + r).sup.t] = [58.summation over (t=17)][W.sub.t][beta][delta]/[(1 + r).sup.t],

where [C.sub.t] is the cost of the intervention (either a small class or a full-time teacher's aide) in period t, [W.sub.t] is the wage in period t, [beta] is the increase in wage due to a one standard deviation increase in human capital (proxied by test score performance), and [delta] is the gain in test score standard deviations due to the intervention.

For this analysis, I choose to examine the case of 4 years of the full-time teacher's aide and small class interventions, starting from kindergarten and ending at the completion of third grade. I assume that students will enter the labor force at age 21 (rather than age 18 as has been done in previous literature); I do this to take into account additional schooling and interruptions in work during peoples' careers for reasons such as parental leave and unemployment. (41) I choose age 62 as the retirement age, which is the average age of retirement as of 2014 (Gallup 2014). (42) The rest of the figures are chosen as follows. Chetty et al. (2011) find [beta] to be approximately .20, so I use this parameter value in this analysis. (43) For the wage [W.sub.t], I use a calculation based on average weekly earnings by age group using data from the Bureau of Labor Statistics (see Bureau of Labor Statistics 2016a); I examine the cases of no wage growth and 1% annual wage growth. To measure the cost per year for a teacher's aide, I first take the median salary of an elementary school teacher's aide, which is $24,900 (Bureau of Labor Statistics 2016b), and augment it by 25% in order to take into account benefits and additional overhead. (44) I divide this number by the average number of students in primary school classrooms in Project STAR that were not in small classes, approximately 23.24, in order to arrive at a per-pupil figure. Thus, the cost of a teacher's aide per student per year is $1,339. For the annual cost of small classes, I take the median salary of a teacher, which is $54,550 (Bureau of Labor Statistics 2016b), and increase it by 40% since hiring a teacher entails extra costs, in the form of additional classroom space, that hiring a teacher's aide does not.
Decreasing class size from a target of 22 students to 15 students means that 46.7% more teachers need to be hired, (45) so only this percentage of the cost is relevant for the calculation. Dividing by the same 23.24 students to obtain a comparable per-pupil figure, the cost is $1,535 per student. (46,47)
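Equation (6) can be solved numerically for r. The sketch below uses the per-pupil aide cost of $1,339 from the text but plugs in an illustrative flat annual wage of $45,000 and an assumed [delta] = 0.05; these are stand-in values rather than the paper's exact inputs, so the resulting rate is not one of the Table 11 figures.

```python
def npv(r, cost, wage, beta, delta):
    """Net side of equation (6) at discount rate r: discounted earnings
    gains (wage * beta * delta over t = 17..58) minus discounted
    intervention costs (t = 1..4)."""
    pv_cost = sum(cost / (1 + r) ** t for t in range(1, 5))
    pv_gain = sum(wage * beta * delta / (1 + r) ** t for t in range(17, 59))
    return pv_gain - pv_cost

def irr(cost, wage, beta, delta, lo=1e-9, hi=1.0):
    """Find the r where npv crosses zero by bisection
    (npv is decreasing in r over this range)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if npv(mid, cost, wage, beta, delta) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# $1,339/yr aide cost from the text; wage and delta are illustrative
print(round(irr(cost=1339, wage=45_000, beta=0.20, delta=0.05), 3))
```

Because the earnings gains arrive only from period 17 onward, the internal rate of return is quite sensitive to the discounting of that long delay.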

What is left is to provide a value for [delta]. As in Krueger and Whitmore (2001), I base my figure on the average increase in mathematics and reading test scores due to the interventions of interest. I use two figures for [delta] for the full-time teacher's aide intervention and one for the small class intervention, all of which are based on the average test score increases across subjects due to the classroom interventions. I do this in order to provide a range of answers based on different assumptions of the magnitude of the teacher's aide effect and due to the uncertainty over the effect of teacher's aides given the general lack of statistically significant results found in the literature. All estimates of [delta] are obtained based on regression results using the Project STAR data. The first value of [delta] that I examine is the estimated value of [tau](1, 1)(0, 0), which assumes future treatments serve only to maintain the benefit to human capital (and thus, [tau](1, 1)(0, 0) = [tau](1, 1, 1, 1)(0, 0, 0, 0)). The second value of [delta] for teacher's aides and the value for the small class effect are the coefficients [[gamma].sub.1] and [[gamma].sub.2], respectively, from the pooled panel data regression

(7) [A.sub.it] = [[gamma].sub.0] + [[gamma].sub.1] [aide.sub.it] + [[gamma].sub.2][small.sub.it] + [[gamma].sub.3][X.sub.it] + [[epsilon].sub.it],

where aide is a dummy variable for being enrolled in a full-time teacher's aide class, small is a dummy variable for enrollment in a small class, X is a vector of controls, and [epsilon] is the usual random error term; the specification also contains school and grade fixed effects. (48)
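A minimal version of the pooled regression in Equation (7) can be sketched on synthetic data, with the school and grade fixed effects entered as dummy variables. The treatment effects below (0.05 for aide, 0.15 for small) and every other quantity are invented for illustration, not estimates from the STAR data.

```python
import numpy as np

rng = np.random.default_rng(1)
n_students, n_schools, n_grades = 300, 5, 2
n = n_students * n_grades

school = rng.integers(n_schools, size=n)
grade = np.repeat(np.arange(n_grades), n_students)
aide = (rng.random(n) < 0.3).astype(float)
small = (rng.random(n) < 0.3) * (1 - aide)       # class types are exclusive
school_fx = rng.normal(size=n_schools)

# Illustrative truth: gamma1 = 0.05 (aide), gamma2 = 0.15 (small)
y = (0.05 * aide + 0.15 * small + school_fx[school]
     + 0.1 * grade + rng.normal(scale=0.5, size=n))

# Design: intercept, aide, small, school dummies (one dropped), grade dummy
X = np.column_stack([
    np.ones(n), aide, small,
    *(np.array(school == s, dtype=float) for s in range(1, n_schools)),
    grade.astype(float),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef[1].round(2), coef[2].round(2))        # estimates of gamma1, gamma2
```

With enough observations, the OLS coefficients on the two class-type dummies recover the planted effects up to sampling noise.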

The internal rates of return are displayed in Table 11. The estimated internal rate of return for small classes is similar to Krueger and Whitmore (2001) and Mueller (2013), whose estimates vary between 5% and 8%. In the best case scenario for full-time teacher's aides, the internal rate of return is roughly 5% or 6%, which is equal to or less than similar estimates for small classes. Even in the worst case, the full-time teacher's aide intervention pays for itself.

These estimates of the internal rates of return allow us to conclude that, where there are teacher shortages, it is cost-effective to hire full-time teacher's aides as a stopgap measure even if we take the pessimistic view of their positive effects on student achievement (relative to the estimates derived in this paper). Moreover, teacher's aides may be competitive in terms of cost when compared with the alternative policy of reducing class sizes. Both of these inputs provide net benefits to society, but the benefits may take a long time to materialize.

The calculations here may be underestimating the values of the benefits of the interventions. As previously discussed, Chetty et al. (2011) provide evidence that the small class intervention improves adult outcomes at least partly via increases in noncognitive skills. Therefore, additional potential benefits such as lower rates of institutionalization and lower all-cause mortality are not factored into the numbers derived in this section. (49)

VI. DISCUSSION

Despite the paucity of observations, the results of Section IV provide evidence to support the claim that a full-time teacher's aide increases academic achievement, and the positive coefficients suggest further investigation to see if they affect mathematics scores as well. These facts suggest that pooling the full-time teacher's aide classes with regular classes in econometric specifications, as is frequently done in the literature, may lead to incorrect conclusions if the results are sensitive to this pooling. Future research should split the regular and full-time aide groups; alternatively, observations in the full-time aide group can simply be dropped (as was done in Gilraine 2017) because random assignment guarantees this will not bias the estimates of interest. In addition, since the treatment effects estimated in this study for first grade are relative to a part-time aide, they likely constitute a good estimate of the lower bound of the benefits of a full-time aide in first grade relative to a class without an aide if we make the plausible assumption that teacher's aides do not decrease academic achievement.

An additional note of consideration pertains to the heterogeneous effects of teacher's aides relative to that of smaller class sizes when deciding between interventions. Previous literature has found that small classes more strongly benefit Blacks and those on free or reduced price lunches (e.g., Krueger 1999), while in this paper, I find that the statistically significant benefits of full-time teacher's aides are mostly limited to Whites and those who are not receiving subsidized lunches. I advise caution in implementing policies based on the results of this paper concerning the effects by race or by free lunch status because I may be running into problems related to statistical power in determining their effect. I recommend further research on the topic of heterogeneous effects of teacher's aides by race and socioeconomic status before any policy changes are pursued.

The primary limitation of this study is that almost no information is available on the teacher's aides in the sample, such as race, level of education, or other observable characteristics. The statistically significant effects found here, coupled with their possible cost-competitiveness relative to small classes found in Section V suggest that additional investigation and data collection is merited. Future research should explore whether observable characteristics matter for their effectiveness and investigate the channels through which teacher's aides help to increase student achievement.

ABBREVIATIONS

DATT: Dynamic Average Treatment Effect on the Treated

EPF: Education Production Function

SAT: Stanford Achievement Tests

doi: 10.1111/ecin.12511

Online Early publication October 16, 2017

REFERENCES

Becketti, S., W. Gould, L. Lillard, and F. Welch. "The Panel Study of Income Dynamics after Fourteen Years: An Evaluation." Journal of Labor Economics, 6, 1988, 472-92.

Bureau of Labor Statistics. "Usual Weekly Earnings of Wage and Salary Workers: First Quarter 2016." 2016a. Accessed June 6, 2016. http://www.bls.gov/news.release/pdf/wkyeng.pdf.

Bureau of Labor Statistics. "Occupational Outlook Handbook: Education, Training, and Library Occupations." 2016b. Accessed June 4, 2016. http://www.bls.gov/ooh/education-training-and-library/home.htm.

Cascio, E. U., and D. O. Staiger. "Knowledge, Tests, and Fadeout in Educational Interventions." Working Paper No. 18038. National Bureau of Economic Research. Cambridge, MA, May, 2012.

Chetty, R., J. N. Friedman, N. Hilger, E. Saez, D. W. Schanzenbach, and D. Yagan. "How Does Your Kindergarten Classroom Affect Your Earnings? Evidence from Project STAR." Quarterly Journal of Economics, 126, 2011, 1593-660.

Chung, K. C., L. K. Kalliainen, S. V. Spilson, M. R. Walters, and H. M. Kim. "The Prevalence of Negative Studies with Inadequate Statistical Power: An Analysis of the Plastic Surgery Literature." Plastic and Reconstructive Surgery, 109, 2002, 1-6.

Ding, W., and S. F. Lehrer. "Estimating Treatment Effects from Contaminated Multiperiod Education Experiments: The Dynamic Impacts of Class Size Reductions." Review of Economics and Statistics, 92, 2010, 31-42.

Finn, J. D., J. Boyd-Zaharias, R. M. Fish, and S. B. Gerber. "Project STAR and Beyond: Database User's Guide." HEROS, Incorporated. Lebanon, Tennessee. January 1, 2007.

Gallup Poll conducted by Jeff Jones and Lyndia Saad on behalf of the Gallup News Service. "Gallup Poll Social Series: Economy and Personal Finance." Performed April 3-6, 2014.

Gerber, S. B., J. D. Finn, C. M. Achilles, and J. Boyd-Zaharias. "Teacher Aides and Students' Academic Achievement." Educational Evaluation and Policy Analysis, 23, 2001, 123-43.

Gilraine, M. "Multiple Treatments from a Single Discontinuity: An Application to Class Size." Working Paper, January 2, 2017.

Goralski, P. J., and J. M. Kerl. "Kindergarten Teacher Aides and Reading Readiness: Minneapolis Public Schools." Journal of Experimental Education, 37, 1968, 34-38.

Hadzima, J. G. Jr. "How Much Does an Employee Cost?" Boston Business Journal, 2005. Accessed June 6, 2016. http://web.mit.edu/e-club/hadzima/how-much-does-an-employee-cost.html.

Hough, J. R. "Educational Finance Issues in North America." Education Economics, 1, 1993, 35-42.

Jackson, E., and M. E. Page. "Estimating the Distributional Effects of Education Reforms: A Look at Project STAR." Economics of Education Review, 32, 2013, 92-103.

Krueger, A. B. "Experimental Estimates of Education Production Functions." Quarterly Journal of Economics, 114, 1999, 497-532.

Krueger, A. B., and D. M. Whitmore. "The Effect of Attending a Small Class in the Early Grades on College-Test Taking and Middle School Test Results: Evidence from Project STAR." The Economic Journal, 111, 2001, 1-28.

Mueller, S. "Teacher Experience and the Class Size Effect--Experimental Evidence." Journal of Public Economics, 98, 2013, 44-52.

Park, C. B. "The Bay City, Michigan Experiment: A Cooperative Study for the Better Utilization of Teacher Competencies." Journal of Teacher Education, 7, 1956, 99-153.

Penney, J. "Hypothesis Testing for Arbitrary Bounds." Economics Letters, 121, 2013, 492-4.

Rivkin, S. G., E. A. Hanushek, and J. F. Kain. "Teachers, Schools, and Academic Achievement." Econometrica, 73, 2005, 417-58.

Rothstein, J. "Teacher Quality in Educational Production: Tracking, Decay, and Student Achievement." Quarterly Journal of Economics, 125, 2010, 175-214.

Simes, R. J. "An Improved Bonferroni Procedure for Multiple Tests of Significance." Biometrika, 73, 1986, 751-54.

Sojourner, A. "Identification of Peer Effects with Missing Peer Data: Evidence from Project STAR." Economic Journal, 123, 2012, 574-605.

Stafford, C. "Teacher Time Utilization with Teacher Aides." Journal of Educational Research, 56, 1962, 82-88.

Todd, P. E., and K. I. Wolpin. "On the Specification and Estimation of the Production Function for Cognitive Achievement." Economic Journal, 113, 2003, F3-F33.

JEFFREY PENNEY *

* I would like to thank Simone Balestra and seminar participants at Queen's University and the International College of Economics and Finance for their comments and suggestions.

Penney: Assistant Professor, Department of Economics, Pontificia Universidad Javeriana, Bogota, Colombia. E-mail dr.jeffrey.penney@gmail.com

(1.) Some educational institutions employ teacher's aides exclusively to assist in the integration of special needs students into the classrooms; the people who fulfill these roles are not the focus of this paper.

(2.) For example, authors would not include teacher's aide dummies in their regression specification of the education production function.

(3.) Sojourner (2012) uses the same subset of data to identify peer effects.

(4.) Power analyses are rarely performed in the face of statistically insignificant results in the economics literature. An example of the pervasiveness of this problem in the field of plastic surgery can be found in Chung et al. (2002).

(5.) Regular classes in Project STAR in first through third grade had access to the services of a teacher's aide between 25% and 33% of the time on average (Krueger 1999).

(6.) The only notable change in this paper is that the differencing step for the regression equations beyond kindergarten that occurs in Ding and Lehrer (2010) does not take place: this is because the treatment is completely randomized at both kindergarten and first grade rather than just kindergarten in this sample, so this step is not needed to generate unbiased and consistent estimates.

(7.) In the sample that I employ, all teachers are either White or Black.

(8.) Male teachers are rare in the data: there are none in kindergarten and only three in first grade. Because of this, male teacher dummies are not used in the regression specifications.

(9.) A pretreatment effect is defined as an effect of a treatment in some time period t on an outcome in time t' where t > t' (that is, treatments cannot affect past outcomes). Here, this assumption corresponds to the restriction that the coefficient on the first-grade treatment assignment dummy [T.sub.i1] in the equation for [A.sub.ik] is zero were it present.

(10.) Most past research using Project STAR data relies on just the initial randomization into different class types that took place in kindergarten to justify ignoring the presence of unobservables; while this is sufficient to do so, it relies on stronger assumptions concerning the attrition patterns.

(11.) Since the teacher's aide treatment is randomly assigned and since teachers are also randomly assigned to students, and since student characteristics are fixed from the student's point of view, it is likely that this condition is satisfied.

(12.) To emphasize that these are treatment on the treated effects, other literature has employed the notation [[tau].sup.(a,b)(c,d)] to denote what is referred to here as [tau](a, b)(c, d).

(13.) For example, the standard error of [tau](1, 1)(0, 0) is [mathematical expression not reproducible]

(14.) This is similar to running different t tests on the same coefficient for different reasons: for example, consider the regression y = [[rho].sub.0] + [[rho].sub.1]x + [epsilon]. One may be interested in whether [[rho].sub.1] is equal to zero (the usual question of statistical significance) and whether it is equal to 3 (because some economic theory says it should be equal to 3). In this particular example, one does not need to do a multiple testing correction even though two different t tests are being conducted because one is simply testing different hypotheses using the same coefficient and the same covariance matrix [[??].sup.2][(X'X).sup.-1].

(15.) More detailed descriptions of the data can be found in Krueger (1999) and Finn et al. (2007).

(16.) See House Bill (HB) 544, Tennessee Legislature, 1985.

(17.) Students who entered the school in later grades were also randomly assigned.

(18.) For simplicity of exposition, I will refer to both regular classes in kindergarten and regular classes with a part-time teacher's aide in first grade as regular classes.

(19.) The rerandomization took place due to complaints from the parents of students not assigned to small classes in Project STAR (Krueger 1999).

(20.) There is considerable overlap in the test scores across grades. For example, the top kindergarten students performed similarly to the median third grade students in mathematics.

(21.) Cascio and Staiger (2012) show that the use of normalized scores can mechanically cause the estimated impacts of interventions to appear to fade out over time; these results stem from the increasing variance in accumulated knowledge as students move through school. There is no such pattern of increasing variance in the scaled test scores in the data employed here.

(22.) Recall that the randomization that took place in this grade excluded students that were in small classes in kindergarten; however, students that changed to different Project STAR schools could have been randomized into any class type upon entering first grade. Students that entered a small class in first grade within the same school are considered noncompliers, since this transition was not possible in the experimental design.

(23.) Herein, I may at times refer to the category of students who receive a free or reduced price lunch simply as "free lunch." The data do not differentiate between students who receive their school lunches at no cost and those who receive them at a reduced cost.

(24.) Note that in Section III, we fail to find any statistically significant differences in these observable characteristics according to treatment path.

(25.) The treatment path analysis is performed without three schools that leave Project STAR in the first-grade period.

(26.) Because the authors combine the regular classrooms and full-time aide class types in their analysis, they do not specifically address the question of whether the additional covariates from the tax record data are balanced between these two treatments.

(27.) Formally, in the case of kindergarten, the null hypothesis is that the difference in means for the observable variable between the regular and full-time aide groups is zero. For first grade, the joint null hypothesis is that the difference in means for the observable variable between every treatment path is zero. A multiple testing procedure is used to adjust the p values because we know ex ante that randomization has occurred, producing what are known as q-values; see the footnote of Table 3 for details. The regression sample is used; that is, the sample for which all observables are present and with at least one valid test score for the particular grade, sans the small class enrollees.

(28.) Because testing took place at the end of each grade, I am able to include kindergarten test scores in the first-grade randomization tests; this ensures that the rerandomization check includes academic ability.

(29.) All references to attrition in this study include those that have switched to a small class within the same school.

(30.) If we knew their class type assignment before they entered a small class within the same school, we would be able to perform bounding exercises. However, this data is not generally available and may not even exist.

(31.) Both randomizations were performed by members of the Project STAR consortium and were overseen at the level of the school by graduate students at four different universities (Finn et al. 2007).

(32.) The exception is attrition that takes place after one's treatment in the current period is learned but before any outcomes can be observed; this is a common limitation of randomized controlled trials.

(33.) The q-values displayed are corrected using the Simes (1986) procedure for the three hypothesis tests of interest (one for each of the structural coefficients) being conducted for each regression. An even more conservative approach to the p values would be to correct them across both regressions. In this case, the q-values for the coefficients in the first-grade reading regression are as follows: [mathematical expression not reproducible]. Of course, the coefficients for the mathematics regression remain insignificant.

(34.) In experimental design, standard practice is to choose a sample size at least this large so that these conditions are satisfied.

(35.) Please note that these are not the sample sizes required to obtain statistically significant effects of the given magnitudes; rather, these are the sample sizes that would give the data sufficient statistical power to determine with reasonable confidence whether effects of the given magnitudes exist.
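The distinction drawn in this footnote, sample sizes that deliver adequate power rather than guaranteed significance, can be sketched with the textbook normal-approximation formula for a two-sample comparison of means. This is a hypothetical illustration assuming two equal groups and unit outcome variance; the figures in Table 6 differ because they are based on the estimated, cluster-adjusted standard errors from the regressions.

```python
import math
from statistics import NormalDist

def total_n_two_sample(tau, alpha=0.05, power=0.80):
    """Total N (two equal groups, outcome SD = 1) needed to detect a mean
    difference of tau at significance level alpha with the given power,
    using the standard normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    n_per_group = math.ceil(2 * (z / tau) ** 2)
    return 2 * n_per_group

print(total_n_two_sample(0.10))  # effect of 0.10 SD
print(total_n_two_sample(0.05))  # effect of 0.05 SD
```

Note that halving the detectable effect size roughly quadruples the required sample, which is the pattern visible in Table 6.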

(36.) Note that, by construction, τ(0, 1)(1, 0) = -τ(1, 0)(0, 1); moreover, both estimates will always have the same standard errors.

(37.) The test employed here is more powerful than the typical approach of seeing whether an estimated coefficient's confidence interval lies entirely within the arbitrary bounds; see Penney (2013) for details.

(38.) Briefly, Penney (2013) develops a test to determine whether a number falls within two arbitrarily chosen bounds. The null hypothesis in Table 8 is that the estimated coefficient is outside of the bounds, and the alternative is that the number lies between the bounds. For the case examined here, a rejection of the null hypothesis allows one to conclude that the effect size is between -0.05 and 0.05 test score standard deviations.
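For comparison, the conventional approach mentioned in footnote 37, checking whether the estimated coefficient's confidence interval lies entirely within the bounds, can be sketched as follows; a 90% interval corresponds to two one-sided tests at the 5% level. This is the baseline procedure the Penney (2013) test improves upon, not that test itself.

```python
from statistics import NormalDist

def ci_within_bounds(estimate, se, lower=-0.05, upper=0.05, alpha=0.05):
    """Conventional equivalence check: does the confidence interval matching
    two one-sided tests at level alpha (here, the 90% CI) lie entirely
    inside [lower, upper]?"""
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value, about 1.645
    return lower < estimate - z * se and estimate + z * se < upper

# The reading estimate tau(1, 1)(1, 0) from Table 7: 0.001 with SE 0.051
print(ci_within_bounds(0.001, 0.051))  # CI is too wide to conclude equivalence
```

Table 8 nonetheless rejects the null for this estimate, consistent with the claim that the bounds test is more powerful than the simple interval check.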

(39.) However, neither the initial benefit of a teacher's aide class nor the benefit of additional years is precisely estimated.

(40.) I perform an analysis of both the small class and full-time teacher's aide interventions in order to directly compare the benefits of the two proposed policies (shrinking class sizes or adding full-time aides to classrooms); cost-benefit analyses in the previous literature use a number of different assumptions in calculating internal rates of return and as such would not be comparable with the results derived here. For example, Krueger and Whitmore (2001) assume only 2.3 years of small class attendance in their figures, since this is the average duration of small class attendance for Project STAR students who were at any point assigned to a small class; the calculations here assume 4 years of the small class treatment assignment.

(41.) Hence, we begin the stream of benefits at t = 17 on the right-hand side of Equation (6) since students enter kindergarten at age 5 and the first period is normalized to t = 1. Thus, t = 17 corresponds to age 21.
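The mechanics of the internal rate of return calculation, costs incurred during the school years and benefits accruing from t = 17 onward, can be sketched with a simple bisection solver. The cash flows below are entirely hypothetical placeholders (the paper's Equation (6) uses estimated earnings effects and program costs); the function only shows how one finds the r at which the net present value is zero.

```python
def irr(cashflows, lo=-0.5, hi=1.0, tol=1e-9):
    """Find r such that NPV(r) = 0 by bisection; cashflows[t] is the net
    flow in period t + 1. Assumes costs precede benefits, so NPV is
    decreasing in r over the bracket."""
    def npv(r):
        return sum(cf / (1 + r) ** (t + 1) for t, cf in enumerate(cashflows))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid) > 0:
            lo = mid  # NPV still positive: the rate of return is higher
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical flows: intervention costs in t = 1..4, benefits in t = 17..60
flows = [-1.0] * 4 + [0.0] * 12 + [0.25] * 44
r = irr(flows)
```

Raising assumed wage growth shifts the benefit stream up and therefore raises the solved rate, which is the pattern reported in Table 11.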

(42.) The age of retirement has started to trend upward after being relatively stable during the 1990s. As it is expected to continue to rise, the figures here may understate the benefits of the proposed interventions.

(43.) Both Krueger and Whitmore (2003) and Mueller (2013) use this value for beta in their analyses.

(44.) The true cost of an employee is usually found to be between 1.25 and 1.4 times their salary due to considerations such as benefits and additional office space (Hadzima 2005).

(45.) This number is derived as follows. Suppose a school with enrollment n has t₁ teachers, with classrooms whose average size is 22: n/22 = t₁. The school moves to an average classroom size of 15, which requires t₂ teachers: n/15 = t₂. Normalize t₁ = 1, so that n = 22. Then t₂ = 22/15 ≈ 1.467, which is 46.7% more teachers.
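The arithmetic in this footnote can be verified directly; the class sizes of 22 and 15 are those used in the footnote.

```python
n = 22.0      # normalize enrollment so that the initial school has t1 = 1 teacher
t1 = n / 22   # teachers needed at an average class size of 22
t2 = n / 15   # teachers needed at an average class size of 15
extra = (t2 / t1 - 1) * 100
print(round(extra, 1))  # → 46.7 (percent more teachers)
```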

(46.) To derive a per-student figure comparable with that for teacher's aides, I also divide the cost by the number of students in regular classrooms, which is approximately 23.24.

(47.) The cost numbers derived here for small classes are generally lower than those in other literature, such as Krueger and Whitmore (2001), who assume that the per-pupil cost increases from smaller class sizes rise in proportion to the number of additional teachers hired.

(48.) This regression specification is similar to that used in Krueger (1999) and Mueller (2013).

(49.) It is also important to remember that the calculations here are based on the value of the interventions relative to a regular size class which has access to a part-time teacher's aide in first through third grade; the returns are likely even larger compared to the case when no part-time teacher's aide is present at all.

FIGURE 1: Distributions of First-Grade Test Scores (figure not reproduced)

TABLE 1
Summary Statistics

Kindergarten
Variable             All          t(0)         t(1)
Student level
  Female             0.49 (0.50)  0.49 (0.50)  0.48 (0.50)
  White or Asian     0.67 (0.47)  0.67 (0.47)  0.66 (0.47)
  Free lunch         0.49 (0.50)  0.48 (0.50)  0.50 (0.50)
Teacher level
  Black              0.17 (0.37)  0.19 (0.35)  0.14 (0.35)
  Rookie             0.13 (0.34)  0.15 (0.48)  0.11 (0.32)
  Master's degree    0.37 (0.48)  0.37 (0.40)  0.37 (0.48)

First Grade
Variable             All          t(0,0)       t(0,1)       t(1,0)       t(1,1)
Student level
  Female             0.50 (0.50)  0.53 (0.50)  0.49 (0.50)  0.51 (0.50)  0.46 (0.50)
  White or Asian     0.70 (0.46)  0.68 (0.47)  0.74 (0.44)  0.66 (0.48)  0.72 (0.45)
  Free lunch         0.46 (0.50)  0.45 (0.50)  0.44 (0.50)  0.47 (0.50)  0.48 (0.50)
Teacher level
  Black              0.14 (0.35)  0.10 (0.30)  0.17 (0.38)  0.11 (0.32)  0.19 (0.40)
  Rookie             0.16 (0.36)  0.19 (0.39)  0.11 (0.32)  0.20 (0.40)  0.11 (0.32)
  Master's degree    0.34 (0.47)  0.28 (0.45)  0.40 (0.49)  0.31 (0.46)  0.38 (0.48)

Notes: The table shows the means of the observables for every treatment path; standard deviations are in parentheses. Numbers are calculated from sample data. t(a) denotes the treatment in kindergarten and t(a, b) denotes the treatment sequence in first grade, where each element is equal to 1 if assigned to a full-time aide in the corresponding grade and 0 otherwise; see Section II.B for more details. Rookie is a dummy variable equal to 1 if the teacher has fewer than 3 years of experience. The regression sample is used to calculate the above statistics.

TABLE 2
Transition Tree

Kindergarten      First Grade
t(1) = 2,162      t(1, 1) = 706
                  t(1, 0) = 761
                  t(1, N_i1) = 108
                  t(1, L_i1) = 587
t(0) = 2,104      t(0, 1) = 661
                  t(0, 0) = 737
                  t(0, N_i1) = 122
                  t(0, L_i1) = 584

Notes: The table displays the number of students that experience every treatment path. N_i1 denotes noncompliance with treatment assignment in first grade by entering a small class within the same school, and L_i1 denotes attrition in first grade. The full sample is used to construct this tree, minus the schools that leave the sample in first grade.
See the text for more details.

TABLE 3
Tests of Randomization

                               Kindergarten   First Grade
Student characteristics
  Female                       0.951          0.292
  White or Asian               0.683          0.605
  Free lunch                   0.683          0.777
Teacher characteristics
  Black                        0.683          0.605
  Rookie                       0.683          0.605
  Master's degree              0.951          0.605
Math score, kindergarten       --             0.605
Reading score, kindergarten    --             0.605

Notes: The table displays the q-values of the F-statistics for the test of equality of means across all treatment paths along the listed observable dimension and grade, controlling for school fixed effects and based on standard errors that are clustered at the level of the school. The q-value is defined here as a p value that has undergone a multiple testing correction taking into account the multitude of checks at each grade; here, the Simes (1986) procedure is employed. This correction is needed since we know ex ante that randomization has been attempted. Rookie is a dummy variable equal to 1 if the teacher has fewer than 3 years of experience. The regression sample is used for these tests.

TABLE 4
Structural Coefficient Estimates

                    Mathematics   Reading
Kindergarten
  β_kk              -0.005        0.026
                    (0.041)       (0.036)
  Observations      4,052         3,994
First grade
  β_11              0.061         0.146 **
                    (0.060)       (0.052)
                    [0.225]       [0.018]
  β_1k              0.075         0.089 **
                    (0.042)       (0.044)
                    [0.309]       [0.045]
  β_1,1k            -0.082        -0.145 **
                    (0.063)       (0.064)
                    [0.287]       [0.038]
  Observations      2,707         2,640

Notes: The table displays the estimates of the structural coefficients in Equations (2) and (3). Standard errors clustered at the level of the classroom are given in parentheses; q-values corrected for multiple testing using the Simes (1986) procedure are given in square brackets. Normalized test scores are used as the response variable.
Regressions include class type, free lunch status, a rookie dummy indicating the teacher has fewer than 3 years of experience, whether the teacher has a graduate degree, whether the teacher is Black, whether the respondent is female, whether the respondent is either White or Asian, and school fixed effects as controls. * Significance at 10%, ** significance at 5%, *** significance at 1%.

TABLE 5
Dynamic Average Treatment on the Treated Estimates

                    Mathematics   Reading
Kindergarten
  τ(1)(0)           -0.005        0.026
                    (0.041)       (0.036)
  Observations      4,052         3,994
First grade
  τ(1, 1)(0, 0)     0.053         0.090 *
                    (0.060)       (0.054)
  τ(1, 0)(0, 0)     0.075 *       0.089 **
                    (0.042)       (0.044)
  τ(0, 1)(0, 0)     0.061         0.146 ***
                    (0.060)       (0.052)
  Observations      2,707         2,640

Notes: The table displays the dynamic average treatment on the treated estimates for exposure to a full-time teacher's aide classroom where the counterfactual path receives no treatment. Standard errors clustered at the level of the classroom are given in parentheses. Normalized test scores are used as the response variable. Regressions include class type, free lunch status, a rookie dummy indicating the teacher has fewer than 3 years of experience, whether the teacher has a graduate degree, whether the teacher is Black, whether the respondent is female, whether the respondent is either White or Asian, and school fixed effects as controls. * Significance at 10%, ** significance at 5%, *** significance at 1%.
TABLE 6
Power Calculations

                    Mathematics                Reading
                    Detect       Detect        Detect       Detect
                    τ = 0.10     τ = 0.05      τ = 0.10     τ = 0.05
Kindergarten
  τ(1)(0)           2,355        9,162         2,414        9,399
First grade
  τ(1, 1)(0, 0)     4,786        18,878        4,840        19,097
  τ(1, 0)(0, 0)     4,211        16,577        4,258        16,768
  τ(0, 1)(0, 0)     4,874        19,227        4,944        19,512

Notes: The table displays the number of observations needed in the sample to achieve a power of π = 0.80 to detect an effect of the given size at the 95% confidence level. Recall that the kindergarten regression has 4,052 observations in mathematics and 3,994 in reading, while the first-grade regression has 2,707 observations in mathematics and 2,640 in reading.

TABLE 7
DATT Estimates for Additional Treatments

                    Mathematics   Reading
First grade
  τ(1, 1)(1, 0)     -0.022        0.001
                    (0.052)       (0.051)
  τ(1, 1)(0, 1)     -0.007        -0.056
                    (0.046)       (0.047)
  τ(0, 1)(1, 0)     -0.014        0.057
                    (0.051)       (0.048)
  Observations      2,707         2,640

Notes: The table displays the dynamic average treatment on the treated estimates for exposure to a full-time teacher's aide for a given treatment and counterfactual path τ(·). Standard errors clustered at the level of the classroom are given in parentheses. Normalized test scores are used as the response variable. Regressions include class type, free lunch status, a rookie dummy indicating the teacher has fewer than 3 years of experience, whether the teacher has a graduate degree, whether the teacher is Black, whether the respondent is female, whether the respondent is either White or Asian, and school fixed effects as controls.
TABLE 8
Precise Zero Tests

H₀: τ(·) ≤ -0.05 or τ(·) ≥ 0.05

                    Mathematics   Reading
First grade
  τ(1, 1)(1, 0)     1.000         0.000 ***
  τ(1, 1)(0, 1)     1.000         0.971
  τ(0, 1)(1, 0)     1.000         0.548

Notes: The table displays the p values for the arbitrary bounds test described in Penney (2013). A rejection of the null means that there exists sufficient evidence to claim an economically insignificant effect. * Significance at 10%, ** significance at 5%, *** significance at 1%.

TABLE 9
DATT Estimates by Free Lunch Status

                    Mathematics             Reading
Free Lunch          No          Yes         No          Yes
Kindergarten
  τ(1)(0)           0.010       -0.016      0.007       0.066
                    (0.047)     (0.053)     (0.040)     (0.049)
  Observations      2,073       1,979       2,040       1,954
First grade
  τ(1, 1)(0, 0)     0.122 *     -0.020      0.168 **    0.022
                    (0.068)     (0.087)     (0.069)     (0.071)
  τ(1, 0)(0, 0)     0.094 *     0.062       0.109 *     0.075
                    (0.054)     (0.067)     (0.058)     (0.067)
  τ(0, 1)(0, 0)     0.038       0.070       0.213 ***   0.008
                    (0.066)     (0.090)     (0.065)     (0.075)
  Observations      1,488       1,219       1,460       1,180

Notes: The table displays the dynamic average treatment on the treated estimates for exposure to a full-time teacher's aide for a given treatment and counterfactual path τ(·), by whether the student receives a free or reduced price lunch. Standard errors clustered at the level of the classroom are given in parentheses. Normalized test scores are used as the response variable. Regressions include class type, a rookie dummy indicating the teacher has fewer than 3 years of experience, whether the teacher has a graduate degree, whether the teacher is Black, whether the respondent is female, whether the respondent is either White or Asian, and school fixed effects as controls. * Significance at 10%, ** significance at 5%, *** significance at 1%.
TABLE 10
DATT Estimates by Race

                    Mathematics             Reading
Black Student       No          Yes         No          Yes
Kindergarten
  τ(1)(0)           0.015       -0.021      -0.019      0.144 *
                    (0.044)     (0.080)     (0.038)     (0.075)
  Observations      2,710       1,342       2,683       1,311
First grade
  τ(1, 1)(0, 0)     0.061       0.083       0.152 **    -0.093
                    (0.066)     (0.115)     (0.067)     (0.068)
  τ(1, 0)(0, 0)     0.071       0.141 *     0.117 **    0.034
                    (0.049)     (0.083)     (0.056)     (0.068)
  τ(0, 1)(0, 0)     0.034       0.158       0.200 ***   -0.067
                    (0.065)     (0.119)     (0.062)     (0.080)
  Observations      1,889       818         1,830       810

Notes: The table displays the dynamic average treatment on the treated estimates for exposure to a full-time teacher's aide for a given treatment and counterfactual path τ(·), by whether the student is Black. Standard errors clustered at the level of the classroom are given in parentheses. Normalized test scores are used as the response variable. Regressions include class type, free lunch status, a rookie dummy indicating the teacher has fewer than 3 years of experience, whether the teacher has a graduate degree, whether the teacher is Black, whether the respondent is female, and school fixed effects as controls. * Significance at 10%, ** significance at 5%, *** significance at 1%.

TABLE 11
Internal Rates of Return

                              Teacher's       Teacher's       Small
Intervention                  Aide            Aide            Class
δ                             0.072           0.027           0.126
Source                        τ(1, 1)(0, 0)   Equation (7)    Equation (7)
Internal rate of return for
  0% wage growth              4.9%            1.8%            6.4%
  1% wage growth              6.0%            2.9%            7.5%

Notes: This table shows the estimates of the internal rates of return for full-time teacher's aides and smaller classes from kindergarten through third grade using Equation (6). δ represents the cumulative effect of the intervention of interest on human capital, which is proxied by academic achievement. Internal rates of return are relative to a regular size class which may have the services of a part-time aide. See the text for details.

Author: Penney, Jeffrey
Publication: Economic Inquiry
Date: Apr 1, 2018