Determination of optimum sample size in regression analysis for some hydrologic variables with emphasis on power analysis.
Determining the sample size is a basic aspect of scientific research (Colosimo et al., 2007). Lenth (2001) regards it as an important and difficult step in planning a statistical study. Few papers address sample size for a specific test (Lenth, 2001). In hydrology in particular, only a few sources can be found, inside or outside Iran, even though sample size matters greatly in hydrologic studies because hydrometric stations are scarce and their records are short. Kennard et al. (2009) noted that one of the most important issues in estimating hydrologic metrics is the uncertainty associated with the length of the data record (sample size), the period of record (sampling period) and the overlap among stream gauge records.
Regression analysis is one of the most widely used statistical methods (Shieh, 2007). The purpose of the current study is to determine the sample size for regression analysis of hydrologic variables by means of power analysis, where power is evaluated for the overall fit of the model. The case study is the Dez Basin in Khuzestan Province, Iran.
When a statistical test is performed, the decision is whether to accept or reject the null hypothesis. There are, however, four possible outcomes of a test: correctly or incorrectly accepting or rejecting the null hypothesis (Table 1). Incorrectly rejecting the null hypothesis is called a type I error, and incorrectly accepting it is called a type II error. For every statistical test, the probability of a type I error is denoted $\alpha$ and the probability of a type II error is denoted $\beta$, as shown in Table 1.
By convention, the probability of a type I error is commonly set at 0.05 (Mapstone, 1995; Foster, 2001). A type I error is therefore rare, and significant results can be reported with high confidence. The probability of a type II error ($\beta$) is usually not controlled (Keough and Mapstone, 1997), is often fairly high, and goes unreported in studies across several disciplines (psychology: Cohen 1962; Sedlmeier and Gigerenzer 1989; fisheries and hydrology: Peterman 1990a; environmental impact studies: Fairweather 1991; Mapstone 1995). $\beta$ denotes the maximum acceptable risk of a type II error. There is no conventional value for $\beta$, although some authors recommend 0.2 (e.g. Cohen, 1988). It is still common in the literature to accept the null hypothesis whenever a statistical test is not significant. As Freman et al. (1978), Hayes (1987), Peterman (1990b) and others have shown, the risk of a type II error is often large, especially when the sample size is small.
The power of a statistical test is defined as the probability of detecting a significant difference or relationship between the measured variables if it really exists (Cohen, 1988); that is, the ability of the test to detect an effect that is really present. A powerful test rejects a false null hypothesis with high probability.
Power is defined by:
$$\text{Power} = 1 - \beta \qquad (1)$$
For any statistical test, the power in Eq. (1) depends on $\alpha$, the effect size, the sample size and the sample variance. Although the exact relation between power and these parameters depends on the type of statistical test, it can be described in general as follows (Burgman and Lindenmayer, 1998):
$$\text{Power} \propto \frac{ES \times \alpha \times \sqrt{n}}{\sigma} \qquad (2)$$
where $ES$ is the effect size, $\alpha$ is the probability of a type I error, $n$ is the sample size and $\sigma$ is the standard deviation. According to Eq. (2), power increases as the effect size, the sample size or $\alpha$ increases, and decreases as the variability of the data increases.
A test with low power has a high probability of a type II error, that is, of reporting a non-significant result when a real effect exists. The effect size is the difference between the true value of the test parameter and the value specified under the null hypothesis (Dattalo, 2008). In simple linear regression, the regression model and the related hypotheses are:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)
Therefore, the effect size is the difference between the true value of the coefficient of the independent variable ($b$) and the value specified under the null hypothesis (Faul et al., 2007).
Power analysis can be a priori or a posteriori. In this study, the former is considered, in which the sample size is estimated from an acceptable effect size ($ES$), $\alpha$ and power. Many authors suggest a ratio of 4:1 for $\beta$:$\alpha$ (Cohen, 1988; Hinkle et al., 2003). Thus, if $\alpha$ is 0.05, $\beta$ equals 0.2 and power is 1 - 4(0.05) = 0.8. In this study, power is set at 0.8.
Power analysis has two advantages: 1) the sample size is selected on a logical basis rather than by rule of thumb; 2) the researcher must specify the desired effect size.
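The qualitative relation in Eq. (2) can be illustrated numerically. The following is a minimal sketch only: it assumes a two-sided one-sample z-test as a stand-in (the study itself uses regression), and the function name `z_test_power` and all parameter values are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect_size, n, sigma, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    Illustrative stand-in only: the study itself works with regression,
    not with a z-test.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value for alpha
    shift = effect_size * sqrt(n) / sigma        # standardized effect
    # Probability of landing in either rejection region when the effect is real.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical baseline values, chosen only for illustration.
baseline = dict(effect_size=0.5, n=20, sigma=1.0, alpha=0.05)
variations = {
    "baseline": {},
    "larger effect size": {"effect_size": 0.8},
    "larger sample": {"n": 40},
    "larger alpha": {"alpha": 0.10},
    "larger variability": {"sigma": 2.0},
}
for label, change in variations.items():
    params = {**baseline, **change}
    print(f"{label:20s} power = {z_test_power(**params):.3f}")
```

Under these assumptions, the first three variations raise the power relative to the baseline and the last one lowers it, which is the behaviour Eq. (2) describes.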
Materials & Methods
As mentioned, the purpose of this study is to determine an adequate sample size for regression analysis of hydrologic variables using power analysis. To this end, the regression relation between mean annual discharge ($\mathrm{m^3/s}$) and basin area ($\mathrm{km^2}$) in the Dez Basin, Iran, was studied. Figure 1 shows the location of the Dez Basin in Khuzestan Province, Iran. Various researchers have used the relation between these variables in their studies. Taylor (1967) identified a significant relation between mean annual discharge and several morphometric variables (such as catchment area and maximum basin length) for 12 rivers in New Zealand. Potter (1953) likewise used the relations between peak discharge and variables such as catchment area, topography, rainfall and rainfall frequency. The Dez Basin has 22 sub-basins. The data required for this research were obtained from the Khuzestan Water and Power Authority.
[FIGURE 1 OMITTED]
The calculations in a power analysis depend on the specific model or test used in the research (Dattalo, 2008). The first step is therefore to identify the statistical test required. Because regression is widely used in hydrologic studies, sample size determination by power analysis was examined for this test. Since one independent variable (catchment area) is used to estimate the dependent variable (mean annual discharge), simple linear regression was used. The number of observations used is 22, equal to the number of hydrometric stations in the Dez Basin. Correct application of regression requires attention to crucial points such as its underlying assumptions. The steps of the simple linear regression applied in this research are as follows:
1. Deciding between parametric and non-parametric regression: for the parametric method, both the dependent and independent variables should be normally distributed (Dytham, 1999).
2. Checking the assumptions of simple regression: linear regression has five assumptions (Helsel and Hirsch, 2002):
a. The model form is correct: the dependent variable (Y) is linearly related to the independent variable (X).
b. The values of the independent variable (X) are fixed by the researcher at specific levels and measured without error; at each level of X, the corresponding observations of Y are sampled randomly (Dowdy and Wearden, 1991).
c. The variance of the residuals is constant (homoscedastic); that is, it does not depend on the independent variable (X) or on another variable such as time.
d. The residuals are independent.
e. The residuals are normally distributed.
3. Examining outliers: outliers and extreme values play a significant role in regression analysis (Reimann et al., 2008) and were therefore examined.
4. Determining the regression model: after the above steps, a regression was fitted between basin area and mean annual discharge, and the regression equation was obtained.
5. Applying power analysis to determine the required sample size. The calculations are as follows:
a. Since the null hypothesis tested in this study is $H_0: b = 0$, obtaining the minimum sample size that achieves the required power ($1-\beta$) first requires the correlation coefficient corresponding to $b_0$, denoted $\rho_0$:
$$\rho_0 = b_0 \frac{\sigma_x}{\sigma_y} \qquad (4)$$
where $b_0$ is the minimum regression coefficient (slope) to be detected, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the independent and dependent variables, respectively. The minimum sample size ($n_{min}$) is then given by:
$$n_{min} \geq \left[\frac{Z_{\beta(1)} + Z_{\alpha}}{Z_0}\right]^2 + 3, \qquad Z_0 = 0.5\ln\left(\frac{1+\rho_0}{1-\rho_0}\right) \qquad (5)$$
In the first expression, $Z_{\alpha}$ is the standard normal deviate corresponding to the critical value of the correlation coefficient at the chosen significance level ($\alpha$), and $Z_{\beta(1)}$ is the one-tailed standard normal deviate for $\beta$, taken from the Z table.
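Equations (4) and (5) can be sketched in a few lines of code. This is an illustration under stated assumptions, not the authors' implementation: the bracketed ratio in Eq. (5) is taken to be squared, as in the standard sample-size formula based on Fisher's z-transformation; $Z_{\alpha}$ is taken as two-sided by default; and the function names and the example value $\rho_0 = 0.5$ are hypothetical.

```python
from math import ceil, log
from statistics import NormalDist


def rho_from_slope(b0, sigma_x, sigma_y):
    """Eq. (4): correlation coefficient corresponding to the slope b0."""
    return b0 * sigma_x / sigma_y


def fisher_z(rho):
    """Second part of Eq. (5): Fisher z-transformation of rho."""
    return 0.5 * log((1 + rho) / (1 - rho))


def min_sample_size(rho0, alpha=0.05, power=0.8, two_sided=True):
    """First part of Eq. (5), with the bracketed ratio squared (assumption)."""
    if not 0 < abs(rho0) < 1:
        raise ValueError("rho0 must be non-zero and strictly between -1 and 1")
    std_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha        # two-sided is an assumption
    z_alpha = std_normal.inv_cdf(1 - tail)
    z_beta = std_normal.inv_cdf(power)              # one-tailed deviate for beta
    n = ((z_beta + z_alpha) / fisher_z(abs(rho0))) ** 2 + 3
    return ceil(n)                                  # round up to a whole sample


# Hypothetical example: rho0 = 0.5 is not a value taken from the study.
print(min_sample_size(rho0=0.5, alpha=0.05, power=0.8))
```

Because this is a normal approximation, its result need not coincide with the value returned by dedicated power-analysis software for the same inputs.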
Estimating the sample size by the above method gives the minimum sample size ($n$) from power analysis. In the next step, 5 groups of $n$ observations were selected randomly from the full data set (22 observations) and a regression model was fitted to each group. The root mean square error (RMSE) was then obtained for each of the 5 groups and compared with the RMSE of the full data set using a one-sample t-test.
The results of the simple linear regression and the findings of the power analysis are presented below.
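The resampling step described above can be sketched as follows. The data here are synthetic and generated only for illustration (the station data are not reproduced in this text), the RMSE is computed with divisor $n$ (the text does not state which divisor was used), and the group size of 8 is the value reported later for this case study. All function and variable names are hypothetical.

```python
import random
from math import sqrt


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(x, y, slope, intercept):
    """Root mean square error of the fitted line (divisor n, an assumption)."""
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return sqrt(sum(r * r for r in residuals) / len(residuals))


# Synthetic stand-in for 22 stations in log space (NOT the Dez Basin data).
rng = random.Random(42)
log_area = [rng.uniform(2.0, 4.0) for _ in range(22)]
log_discharge = [0.5 * a - 0.6 + rng.gauss(0.0, 0.25) for a in log_area]

full_rmse = rmse(log_area, log_discharge, *fit_line(log_area, log_discharge))

group_size, n_groups = 8, 5
subsample_rmses = []
for _ in range(n_groups):
    idx = rng.sample(range(len(log_area)), group_size)  # without replacement
    xs = [log_area[i] for i in idx]
    ys = [log_discharge[i] for i in idx]
    subsample_rmses.append(rmse(xs, ys, *fit_line(xs, ys)))

print("full-data RMSE:", round(full_rmse, 3))
print("subsample RMSEs:", [round(v, 3) for v in subsample_rmses])
```

The five subsample RMSE values can then be compared with the full-data RMSE by a one-sample t-test.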
The tests of normality and homogeneity of variance (Table 2) were used to check the data; the raw data did not follow a normal distribution. To normalize such data, a logarithmic transformation can be used (Feng and Wang, 2003; Helsel and Hirsch, 2002; Ries and Friesz, 2000). With the transformed data, the distribution was consistent with a log-normal distribution of the original variables, and the condition of variance homogeneity was satisfied.
All regression assumptions were also examined, and the data used in this study were found to meet the requirements for regression.
After the above steps, the obtained regression model is:
$$\log y = 0.492\log x - 0.602, \qquad R^2 = 0.587, \quad p\text{-value} < 0.001$$
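As a small illustration of how the fitted model can be applied, the sketch below back-transforms the log-log equation to predict discharge from basin area. It assumes that "log" denotes the base-10 logarithm, which the text does not state explicitly; the function name and the example areas are hypothetical.

```python
from math import log10


def predict_discharge(area_km2, slope=0.492, intercept=-0.602):
    """Predict discharge (m^3/s) from basin area (km^2) with the fitted model.

    Assumes base-10 logarithms: log10(y) = slope * log10(x) + intercept.
    """
    if area_km2 <= 0:
        raise ValueError("basin area must be positive")
    return 10 ** (intercept + slope * log10(area_km2))


# Hypothetical basin areas, chosen only to show the shape of the relation.
for area in (100.0, 1000.0, 10000.0):
    print(f"area = {area:8.0f} km^2 -> discharge = {predict_discharge(area):.2f} m^3/s")
```

Since the slope is below 1, the predicted discharge grows less than proportionally with basin area.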
In the next step, power analysis was used to determine the sample size, as mentioned before. Various software packages are available for this analysis, such as G*Power. Using this software, a sample size of 8 was obtained.
Next, 5 groups of 8 observations were randomly selected from the data, and for each group the regression model and the root mean square error (RMSE) were calculated. The results are shown in Table 3.
A one-sample t-test comparing the error for the full data set with the errors for the groups of 8 showed no statistically significant difference between them ($t_{0.05,4} = -1.6$, $P > 0.05$). In other words, reducing the sample size from 22 to 8 has no negative effect on the estimation of the dependent variable, which can be predicted with the same accuracy as before.
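The form of this comparison can be sketched with the RMSE values listed in Table 3. The full-data RMSE is not reported in this text, so the reference value below is a hypothetical placeholder and the printed statistic is not expected to reproduce the reported $t = -1.6$; the function name is likewise an assumption.

```python
from math import sqrt
from statistics import mean, stdev

# RMSE of the five groups of 8, as listed in Table 3.
subsample_rmse = [0.119, 0.317, 0.260, 0.141, 0.278]

# Two-sided 5% critical value of Student's t with 4 degrees of freedom
# (from standard tables).
T_CRIT_DF4 = 2.776


def one_sample_t(values, reference):
    """One-sample t statistic for H0: mean(values) equals reference."""
    n = len(values)
    return (mean(values) - reference) / (stdev(values) / sqrt(n))


# HYPOTHETICAL placeholder: the full-data RMSE is not given in the text.
hypothetical_full_rmse = 0.25

t_stat = one_sample_t(subsample_rmse, hypothetical_full_rmse)
print(f"mean subsample RMSE = {mean(subsample_rmse):.3f}")
print(f"t = {t_stat:.2f}, significant at 5%: {abs(t_stat) > T_CRIT_DF4}")
```

With 5 groups the test has 4 degrees of freedom, so the difference is declared non-significant whenever the absolute value of the statistic is below 2.776.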
Study of relations between statistical power and the sample size
The sample size can be estimated from the probabilities of type I and type II errors, the standard deviations of the independent and dependent variables, and the effect size. The relation between power and sample size (Figure 2) shows that power rises quickly as the sample size first increases. Once $n$ reaches a certain value, power levels off at its maximum (power = 1).
[FIGURE 2 OMITTED]
Power also increases as $\alpha$ increases: raising the risk of a type I error lowers the risk of a type II error (Foster, 2001) (Fig. 3). Furthermore, power increases with the effect size. That is, the larger the difference between the true value of the parameter and the value specified under the null hypothesis, the better the test can reveal the relation between the variables, and so power increases (Foster, 2001) (Fig. 4).
[FIGURE 3 OMITTED]
[FIGURE 4 OMITTED]
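The behaviour described in this subsection can be reproduced approximately with a short calculation. The sketch below uses the normal approximation based on Fisher's z-transformation of the correlation coefficient and ignores the negligible opposite rejection tail; it is not the exact method used by G*Power, and the correlation value 0.7 and the function name are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def approx_power(n, rho, alpha=0.05):
    """Approximate power of a two-sided test of H0: rho = 0 with n observations."""
    if n <= 3 or not 0 < abs(rho) < 1:
        raise ValueError("need n > 3 and a non-zero rho strictly between -1 and 1")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    z_rho = 0.5 * log((1 + abs(rho)) / (1 - abs(rho)))   # Fisher z of rho
    return std_normal.cdf(sqrt(n - 3) * z_rho - z_crit)


# Hypothetical correlation of 0.7: power rises quickly, then flattens near 1.
for n in (5, 8, 12, 22, 50):
    print(f"n = {n:2d}  power = {approx_power(n, rho=0.7):.3f}")

# Power also grows with alpha and with the size of the correlation (effect size).
print(approx_power(10, 0.7, alpha=0.01), approx_power(10, 0.7, alpha=0.10))
print(approx_power(10, 0.4), approx_power(10, 0.8))
```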
Planning the sample size is important, and determining it is almost always difficult (Lenth, 2001). Hydrologic studies are usually limited by the number of hydrometric stations or by the length of the hydrologic record. For this reason, an adequate sample size is often hard, and sometimes impossible, to obtain. On the other hand, in every study the researcher needs to know the required sample size, and this is especially important in hydrologic studies. This study proposes power analysis as a statistically grounded method for determining the minimum required sample size.
The results of statistical methods are probabilistic, and we can never be certain that they are accurate (Stefano, 2001). Before statistics are used to draw conclusions, the risk of statistical error should be identified. Although conventional statistical methods inform researchers about the probability of a type I error, the probability of a type II error mostly remains unknown (Stefano, 2001). A priori power analysis can therefore be used to estimate the probability of a type II error before data are collected. As mentioned in the previous sections, power analysis determines the sample size while taking type II error into account. In other words, for a given level of $\alpha$, a given effect size and an estimate of the population variance, a priori power analysis can be used to determine the sample size required to reach a specified level of statistical power (Stefano, 2001). Without such an analysis, the results may not be sufficiently reliable because of the possibility of a type II error. In this study, 22 samples from the Dez Basin were used to examine the regression relation between mean annual discharge and basin area. The results indicate that, with power analysis, the required sample size can be reduced to 8 in this case, and that regression with 8 observations did not reduce accuracy: the RMSE obtained with 22 observations and with 8 observations did not differ significantly. It can therefore be concluded that power analysis is necessary in hydrologic studies to determine the minimum sample size.
In this study, the probability of a type I error was set at 0.05, and a power of 0.8 was used to determine the sample size. Depending on the purpose of the research, either of these, or other factors that affect power, can be changed within an acceptable range. To increase statistical power, if required, a larger sample can be used or $\alpha$ can be increased (Foster, 2001). When the sample is large, both type I and type II errors can be kept low; however, this was not the purpose of this study. A balance must therefore be struck between them. The 4:1 ratio of Cohen (1988) and Hinkle et al. (2003) was adopted here; in other words, $\alpha$ was set at 0.05 and $\beta$ at 0.2. It should also be noted that, for a test to have high statistical power, the power should be 0.8 or more (Cohen, 1988).
Research of this kind cannot be carried out by one person; at every stage, the guidance and cooperation of specialists, scientists and relevant authorities in different centers are of great assistance.
I therefore express my sincere gratitude to all those who helped me with dedication.
 Burgman MA, Lindenmayer DB (1998). Conservation biology for the Australian environment. Surrey Beatty, Chipping Norton, Sydney, NSW.
 Cohen J (1962). The statistical power of abnormal-social psychological research: a review. J. Abnorm. Soc. Psychol. 65:145-153.
 Cohen J (1988). Statistical power analysis for the behavioural sciences, 2nd Edition. Lawrence Erlbaum, Hillsdale, NJ.
 Colosimo EA, Cruz FRB, Miranda JLO, Van Woensel T (2007). Sample size calculation for method validation using linear regression. Journal of Statistical Computation and Simulation, 77(6): 505-516.
 Dattalo P (2008). Determining Sample Size (Balancing Power, Precision, and Practicality). Oxford University Press, Inc.
 Dowdy S, Wearden S (1991). Statistics for research. 2nd Edition. John Wiley & Sons, New York.
 Dytham C (1999). Choosing and using statistics, Blackwell Science Ltd.
 Fairweather PG (1991). Statistical power and design requirements for environmental monitoring. Aust. J. Mar. Freshwater Res. 42: 555-567.
 Faul F, Erdfelder E, Buchner A, Lang AG (2007). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Production Nr. BSC910.
 Feng CX, Wang XF (2003). Surface Roughness Predictive Modeling: Neural Networks versus regression. IIE Transactions, 35(1): 11-27.
 Foster RJ (2001). Statistical power in forest monitoring. Forest Ecology and Management, 151: 211-222.
 Hayes JP (1987). The positive approach to negative results in toxicology studies. Ecotoxicol. Environ. Safety 14: 73-77.
 Hinkle DE, Wiersma W, Jurs SG (2003). Applied statistics in the social sciences. Boston: Houghton Mifflin.
 Helsel DR, Hirsch RM (2002). Statistical Methods in Water Resources, USGS.
 Kennard MJ, Mackay SJ, Pusey BJ, Olden JD, Marsh N (2009). Quantifying uncertainty in estimation of hydrologic metrics--implications of discharge record length and record period.
 Keough MJ, Mapstone BD (1997). Designing environmental monitoring for pulp mills in Australia. Water Sci. Technol. 35: 397-404.
 Lenth RV (2001). Some Practical Guidelines for Effective Sample Size Determination. The American Statistician, 55(3).
 Mapstone BD (1995). Scalable decision rules for environmental impact studies: effect size, Type I and Type II errors. Ecol. Appl. 5: 401-410.
 Peterman RM (1990a). Statistical power analysis can improve fisheries research and management. Can. J. Fish. Aquat. Sci. 47: 2-15.
 Peterman RM (1990b). The importance of reporting statistical power: the forest decline and acidic deposition example. Ecology 71: 2024-2027.
 Potter WD (1953). Rainfall and topographic feature that affect runoff. Trans. Am. Geophys. Un. 34: 67-73.
 Reimann C, Filzmoser P, Garrett RG, Dutter R (2008). Statistical Data Analysis Explained, John Wiley & Sons, England.
 Ries KG, Friesz PJ (2000). Methods for estimating low-flow statistics for Massachusetts streams: U.S. Geological Survey Water-Resources Investigations Report 00- 4135, 81 P.
 Sedlmeier P, Gigerenzer G (1989). Do studies of statistical power have an effect on the power of studies? Psychol. Bull. 105: 309-316.
 Shieh G (2007). A Unified Approach to Power Calculation and Sample Size Determination for Random Regression Models. Psychometrika, 72(3): 347-360.
 Stefano JD (2001). Power analysis and sustainable forest management. Forest Ecology and Management, 154: 141-153.
 Taylor CH (1967). Relations between geomorphology and stream flow in selected New Zealand river catchments. Journal of hydrology (New Zealand) 6: 106-112.
Mozayyan M. (1), Akhondali A. (2) and Basiri R. (3)
(1) Member of Scientific Board of Behbahan High Educational Complex, Behbahan, Iran. Department of Range and Watershed Management, Faculty of Natural Resources, Behbahan Higher Educational Complex, at The Beginning of Deilam Road, Behbahan, Khoozestan Province, Iran. E-mail: firstname.lastname@example.org
(2) Member of Scientific Board of Shahid Chamran University of Ahwaz, Ahwaz, Iran. Department of Hydrology, Faculty of Water Science, Shahid Chamran University of Ahwaz, Ahvaz. E-mail: email@example.com
(3) Member of Scientific Board of Behbahan High Educational Complex, Behbahan, Iran. Department of Statistical Ecology, Faculty of Natural Resources, Behbahan Higher Educational Complex, at the beginning of Deilam Road, Behbahan, Khoozestan Province, Iran. E-mail: Basiri52@yahoo.com
Table 1: Four possible outcomes of a statistical test.

| Real situation | Statistical decision: acceptance of null hypothesis | Statistical decision: rejection of null hypothesis |
| --- | --- | --- |
| Null hypothesis is correct. | Correct decision ($1-\alpha$). There is no effect and no effect is discovered. | Incorrect decision, type I error ($\alpha$). There is no effect but an effect is discovered. |
| Null hypothesis is incorrect. | Incorrect decision, type II error ($\beta$). There is an effect but it is not discovered. | Correct decision ($1-\beta$). There is an effect and it is discovered. |

Table 2: Results of the tests of normality and variance homogeneity.

| Homogeneity of variance test (Levene test) | Normality test (Shapiro-Wilk), independent variable (area of basin) | Normality test (Shapiro-Wilk), dependent variable (mean annual discharge) |
| --- | --- | --- |
| $L_{0.05,1,26} = 3.326$, $P > 0.05$ | $W_{0.05,14} = 0.97$, $P > 0.05$ | $W_{0.05,14} = 0.973$, $P > 0.05$ |

Table 3: Five regression models and their details.

| Group | Regression model | $R^2$ | p-value | RMSE |
| --- | --- | --- | --- | --- |
| 1 | log y = -0.742 + 0.6 log x | 0.948 | 0.000 | 0.119 |
| 2 | log y = -0.537 + 0.399 log x | 0.421 | 0.041 | 0.317 |
| 3 | log y = -0.555 + 0.484 log x | 0.589 | 0.013 | 0.260 |
| 4 | log y = -0.529 + 0.484 log x | 0.772 | 0.002 | 0.141 |
| 5 | log y = -0.571 + 0.506 log x | 0.697 | 0.005 | 0.278 |