Chapter 2. Summarizing Data
The purpose of data collection is to allow us to make informed decisions about the problem at hand. Through the manipulation and statistical analysis of the data, we obtain information that bears on our problem and use it to make a better decision, hopefully, than we would have made in the absence of that information.
We generally collect raw statistical data in random order, as shown in table 2.1. Since there is no real order to the data, we experience difficulty in obtaining anything of value from them upon inspection. We may array the data as an aid to interpretation (table 2.2). An array reorders the data from the smallest to the largest value. From an array, we readily compute the range by subtracting the smallest from the largest observation. For the cotton example, the range is 373 minus 215, or 158 pounds of lint. An array also indicates something about the distribution of the units between the two extremes and their tendency to cluster toward some central value, such as 290 in the cotton example.
We may further summarize the data into a frequency distribution like that shown in table 2.3. We construct the frequency distribution with a given number of classes. Each class has a certain width or number of units expressed as an interval. For the data in table 2.3, there are eight classes, each with a width of 20 pounds of lint. The frequency column in the table expresses the number of observations that fall within each class. The total number of frequencies must equal the number of data points, or 75 in this example. The variable "per acre yield of cotton lint" is continuous. Thus we arbitrarily break the class intervals at 20 pounds for ease of presentation. We see that the cotton yield intervals in table 2.3 have the upper limit of the first class repeated as the lower limit of the second class. This signifies that we are using real class limits. The values in the first class can approach 235 but cannot equal 235. A data point with the value 235 is a member of the second class, which has 235 as its lower limit. When we use stated classes, they are constructed so that there is a break of one unit between the upper limit of the first class and the lower limit of the second, e.g., 215-234, 235-254, etc. We sometimes use stated classes in presentations because they cause less confusion in interpretation. However, we can use only real classes for computations such as determining the width of the class interval. If we use stated classes, we underestimate the interval width. We get the width of the class interval, 20, by subtracting the real class lower limit from the upper limit, e.g., 235 minus 215.
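The class-counting procedure just described can be sketched in a few lines of Python. The function and the sample yields below are our own illustration (table 2.1 holds the actual 75 observations); the point is the real-limit convention, under which a value equal to an upper limit falls in the next class.

```python
# Build a frequency distribution with real class limits: each class
# includes its lower limit and excludes its upper limit, so a value of
# exactly 235 belongs to the class 235 up to 255.
def frequency_distribution(data, lower, width, n_classes):
    counts = [0] * n_classes
    for x in data:
        k = int((x - lower) // width)   # index of the class containing x
        if 0 <= k < n_classes:
            counts[k] += 1
    limits = [(lower + i * width, lower + (i + 1) * width)
              for i in range(n_classes)]
    return list(zip(limits, counts))

# Hypothetical cotton yields, not the data of table 2.1.
yields = [218, 235, 241, 260, 262, 278, 280, 291, 300, 315, 332, 355, 371]
for (lo, hi), f in frequency_distribution(yields, 215, 20, 8):
    print(f"{lo} up to {hi}: {f}")
```

Note that the yield of exactly 235 is counted in the second class, matching the real-limit rule above.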
Not all frequency distributions have equal width classes. This is especially true when we are dealing with income data in which presentation with equal width classes requires that some classes be empty or contain few observations. We prefer to make the classes of different widths in these instances to both shorten the frequency distribution and to aid in interpretation. With some data sets, we prefer to have the last class open-ended rather than have an upper limit because there are a few data points that are much larger than the planned upper limit of the class. In these cases, we write the last class as 355 and over, or more than 355. A disadvantage of having open-ended classes is that we cannot compute from the frequency distribution some averages and measures of dispersion, such as the arithmetic mean, the midrange, the standard deviation, and the range.
[FIGURE 2.1 OMITTED]
A frequency distribution can also be presented in a bar chart, as shown in figure 2.1. This type of chart is called a histogram. In the histogram, the bars touch and the real class limits appear on the x axis at the end of each bar. If the classes are not all the same width, neither are the bars of the histogram. The number of frequencies is displayed on the y axis to complete the chart.
An advantage of charting a frequency distribution is that we readily see its shape. It is easy to determine its skewness or if it is relatively symmetric in shape and to see the peakedness or kurtosis of the distribution. The cotton data in figure 2.1 are almost symmetrical. Also, the distribution is somewhat peaked or mesokurtic. It is not flat across the top or platykurtic.
The frequency polygon is a line graph that is sometimes used to display data. It shows frequencies on the y axis and class midpoints on the x axis. To make the graph touch the x axis, we plot midpoints of imaginary classes at each end of the distribution against zero frequencies. If we leave these out, the graph "floats" above the axis and looks strange. We compute class midpoints by averaging the upper and lower real class limits. For example, for the class 215 up to 235 of cotton data, we calculate the class midpoint as (215 + 235)/2 = 225, and so on for each class (see table 2.4).
Frequency polygons are even easier to use than histograms to tell the shape of a distribution. We commonly use frequency polygons to plot the relative frequencies on the same graph to compare two distributions, especially when we have a different number of frequencies in each distribution. We determine relative frequencies by dividing the frequency in each class by the total number of frequencies for the distribution and then multiplying by 100 to get percent. As an example of how to compare two distributions, we examine the weekly wage distributions of farm workers and truck drivers, as shown in table 2.5. In these two data sets, farm workers' wages are generally lower; but when we plot them as frequency polygons (figure 2.2), we have difficulty in comparing the distributions because of the relatively small number of farm workers. The last two columns of table 2.5 show relative frequencies for the two distributions, and relative frequency polygons for these data (figures 2.3 and 2.4) show that, in general, the wage distribution for truck drivers is higher than that for farm workers, but they have essentially the same shape.
[FIGURE 2.2 OMITTED]
[FIGURE 2.3 OMITTED]
We decide on the number of classes in a frequency distribution, but as a rule we select a value in the range of 5 to 15. Fewer than five classes provide too few categories to display the data adequately, and more than fifteen are generally too many. Sometimes we use Sturges's rule, which is a formula that solves for the number of classes, K:
[FIGURE 2.4 OMITTED]
K = 1 + 3.322 \log_{10} n
where n represents the total number of frequencies.
The value K will usually not be a whole number, so we must decide whether to round it up or down. Once we select K, we obtain the interval width, I, by dividing the range of the data by K. After choosing the number of classes and the width of the class interval, we check the data to see that the classes work; sometimes a data point falls outside the chosen classes, and we must adjust them. Another important consideration in selecting interval widths is making the class midpoints representative of the data in the classes. For example, if we have data for wages that end in $5 values, we should select class midpoints that also end in $5 values, adjusting the class intervals if necessary. Sturges's formula is only an approximation; the final choice is ours.
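As a short Python sketch of Sturges's rule, using the cotton example's n = 75 and range of 158 from the text (the rounding choices remain ours):

```python
import math

# Sturges's rule: K = 1 + 3.322 log10(n).
def sturges(n):
    return 1 + 3.322 * math.log10(n)

n = 75                         # cotton example: 75 observations
K = sturges(n)                 # about 7.2, so 7 or 8 classes
I = (373 - 215) / round(K)     # range of 158 over the chosen K
print(round(K), round(I, 1))
```

With K rounded to 7, the width comes out near 23; the text's choice of 8 classes of width 20 is an equally valid adjustment.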
We can cumulate a frequency distribution to help understand its content. We cumulate either of two ways--on a "less-than" or "more-than" basis. In the less-than procedure, we cumulate the frequencies beginning with the lowest class and ending with the highest. We write them as less than the upper limit of each class and add them from the top down (table 2.6). We can plot an S-shaped cumulative frequency curve called an ogive from these data with cumulative frequencies on the y axis and the upper class limits on the x axis (figure 2.5). Ogives provide an easy way to divide the frequency distribution into equal parts using percentiles, deciles, quartiles, and other quantiles. Percentiles divide data into 100 equal parts, deciles into ten equal parts, and quartiles into four equal parts. In the more-than procedure, we begin the cumulation with the highest class and end with the lowest. We state the values as higher than the lower class limit (table 2.7). We plot a more-than ogive with the cumulative frequencies on the y axis and the lower class limits on the x axis, and it makes a reverse S-shaped curve (figure 2.6). If we place the two ogives on the same graph, they cross at the middle of the distribution.
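Both cumulations are one-line operations in Python. The frequency column below uses the farm-worker wage counts implied later in the text (3, 7, 13, 5, 2, totaling 30); treat it as an illustration rather than a reproduction of tables 2.6 and 2.7.

```python
from itertools import accumulate

# Frequencies by class, lowest class first (illustrative values).
freqs = [3, 7, 13, 5, 2]

less_than = list(accumulate(freqs))              # cumulate top down
more_than = list(accumulate(freqs[::-1]))[::-1]  # cumulate bottom up
print(less_than)   # [3, 10, 23, 28, 30]
print(more_than)   # [30, 27, 20, 7, 2]
```

The less-than totals pair with upper class limits, the more-than totals with lower limits, exactly as the two ogives are plotted.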
[FIGURE 2.5 OMITTED]
An average is a number used to represent the central value of a data set or a distribution. We use averages more often than any other statistical measure and express them both for ungrouped or raw unsummarized data and for data summarized into a frequency distribution. Computational procedures for the two types of data are generally different because interpolation formulas are often necessary when dealing with frequency distributions. Although the computation of averages from samples may be no different than that for the population or universe, we generally use different notations so there is no confusion as to which data we are manipulating. We recall that a sample is based on a subset of the data pertaining to the problem, while a population contains all of the data in the set. We use ordinary alphabetic characters in the formulas to express the average for sample data and generally employ Greek letters in formulas dealing with population data.
[FIGURE 2.6 OMITTED]
The Arithmetic Mean
The arithmetic mean is the most widely used measure of central tendency. We obtain the mean, [mu], of a population by summing the values of the observations, [X.sub.i], and dividing by the number, N, as in equation 2.1:

\mu = \frac{\sum_{i=1}^{N} X_i}{N}   [2.1]

For example, for the data set

7 3 2 8

the sum of the [X.sub.i] is 20 and the value of N is 4; thus the mean is

\mu = 20/4 = 5.
If we take a sample from this population, say of size n = 2, we can calculate [bar.X], the mean of the sample, by equation 2.2:

\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}   [2.2]

The value of the sample mean depends upon which two items from the population we select. If we choose the first two values, 7 and 3, their sum is 10, and since n = 2, the sample mean is 10/2 = 5, which is a perfect representation of the population mean. However, for several other choices of values from the population, the sample mean will not perfectly represent [mu]. It will be either too small or too large. Thus, the value of the sample mean depends upon which elements of the population we select for the sample, and it may or may not equal [mu]. The larger n becomes, the less effect extreme values in the population have on the value of the sample mean, since they are combined with several other values, and the more closely the sample mean approaches [mu].
In the preceding example, we gave each item from the population selected for the sample the same weight, namely a weight of 1. However, sometimes we want to assign weights other than 1, [w.sub.i], to the values of a set of data. In this case, we calculate the weighted mean by equation 2.3:

\bar{X} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}   [2.3]
For example, the following data for the wages of migrant workers by type of crop also includes the number of workers for each crop. We can compute an average wage by using the number of workers as weights (table 2.8).
The weighted mean is computed as 12480/2570 = $4.86 per hour. It is appropriate because there is a different number of workers involved with each crop. An unweighted average would give a different result. In the case in which the weights are fractions and they sum to 1, the formula can be modified so that we drop division by [summation][w.sub.i], since any value divided by 1 is unchanged.
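A sketch of the weighted-mean computation of equation 2.3; the wages and worker counts below are hypothetical stand-ins for table 2.8, and the final lines show the simplification when the weights are fractions summing to 1.

```python
# Weighted mean: sum of w*x divided by sum of w (equation 2.3).
def weighted_mean(values, weights):
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

wages   = [5.00, 4.50, 5.25]   # hypothetical per-crop hourly wages
workers = [1000, 900, 670]     # hypothetical worker counts as weights
print(round(weighted_mean(wages, workers), 2))

# With fractional weights summing to 1, the denominator is 1 and drops out.
shares = [w / sum(workers) for w in workers]
print(round(sum(s * x for s, x in zip(shares, wages)), 2))
```

Both computations give the same average, as the text's remark about weights that sum to 1 predicts.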
Two properties of the arithmetic mean are: (1) the sum of the deviations from the mean is zero; if we define small x as a deviation from the mean, as in equation 2.4, then [summation]x = 0;

x = X - \bar{X}   [2.4]
and (2) the sum of squares of the deviations from the mean is a minimum. If M = [summation]X/n, then the quantity [summation][(X - M).sup.2] is at least as small as when M is defined in any other way.
When data are in a frequency distribution, often the values of the individual observations are not available and we must consider the distribution class by class. In this procedure, we use the class midpoint as a proxy for the values of all the observations in that class. Also, since each class of the frequency distribution contains a different number of observations, we use the frequencies as weights in calculating the arithmetic mean, as in equation 2.5:

\bar{X} = \frac{\sum_{i=1}^{n} f_i X_i}{\sum_{i=1}^{n} f_i}   [2.5]

Suppose we decide to calculate the average yield of cotton lint from the frequency distribution in table 2.9. In this case, the arithmetic mean for the frequency distribution is 21835/75 = 291.1, or 291 pounds of lint per acre, whereas the arithmetic mean for the raw data (see table 2.1) is 21807/75 = 290.8, or 291 pounds. Thus the formula for the frequency distribution overestimated the mean by 291.1 - 290.8 = 0.3 pounds.
Using the frequency distribution formula generally does not cost much in terms of lost accuracy provided we construct the distribution so that the class midpoints are good proxies for the data in the classes. This is good news because many secondary data sources, for example, most census reports, publish tables containing frequency distributions but generally do not present the raw data. Thus, if we wish to use frequency distribution data as the basis for calculating statistics such as the mean, we can do so with confidence.
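Equation 2.5 amounts to a weighted mean with frequencies as weights and class midpoints as values. The midpoints below follow the cotton classes of table 2.3 (215 up to 235, and so on); the frequencies are hypothetical, chosen only to total 75, so the result differs slightly from the 291.1 computed from table 2.9.

```python
# Mean of a frequency distribution: class midpoints as proxies,
# frequencies as weights (equation 2.5).
midpoints = [225, 245, 265, 285, 305, 325, 345, 365]
freqs     = [4, 8, 13, 20, 15, 9, 4, 2]   # hypothetical, summing to 75

mean = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)
print(round(mean, 1))
```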
The midrange, or center, is the arithmetic mean of the smallest and the largest items in a data set. When we array data, it is the average of the first and last items. Thus, in equation 2.6, if [X.sub.1] represents the smallest and [X.sub.n] the largest, then we calculate the midrange, MR, as
MR = \frac{X_1 + X_n}{2}   [2.6]
For the cotton yield data from table 2.2, the midrange is
(215 + 373)/2 = 294.
We use the midrange most often for computing the daily average temperature or average stock price because we are especially interested in the maximum and minimum values for these data series, and they are among the few values recorded. The midrange is generally unreliable when used as an estimate of the population mean because it is based on the two values in the data set that tend to change significantly from sample to sample, thus its value tends to vary widely in repeated sampling.
Unlike the arithmetic mean, which is based on the values of every observation in the data set, the median is a place average. It is not affected by the values of observations at the extremes of the data and is a better average to use in cases in which we encounter extreme values, as with income and education data. For ungrouped data, the median is the value of the middle observation after the data are arrayed. When we have an even number of observations, we average the middle two to obtain the median. For example, for the population of four values we had earlier:
7 3 2 8
we first array the data:
2 3 7 8
We compute the median as the average of the two middle values, 3 and 7, which is equal to 5. The example we used in computing the median actually contains too few observations. Because the median is a position average, we should not use it with data sets that contain only a few observations. With six or fewer items in the data set, the median is probably a poor measure, but with twenty or more observations in the data set, it is likely to be very reliable. At least half of the observations in the data set are smaller in value than the median, and half are larger. For extremely peaked distributions, the median is the most reliable estimate of the population mean.
For grouped data, the median is determined by an interpolation formula, as in equation 2.7.
Md = L + \frac{n/2 - F}{f} I   [2.7]
where: L is the lower limit of the class containing the median,
n is the total number of frequencies,
F is the cumulative frequencies in the class preceding the class containing the median,
f is the number of frequencies in the class containing the median, and
I is the width of the class containing the median.
To determine the class containing the median, we first cumulate the frequencies on a less-than basis. Since we know the median is the value of the middle observation in the data, we compute n/2 to determine the observation at the middle. We next inspect the cumulative frequency column and determine which class contains the middle observation. That class is the median class. Its lower limit is the value of L; its number of frequencies is the value of f; and its width is the value of I. To determine F, we go up one row and note the value of the cumulative frequencies for the class that precedes the median class. To determine the median, we plug all of the values into the formula and evaluate it.
For example, for the wage data for farm workers, the value of n, the total number of frequencies, is 30 and n/2 is 30/2, or 15 (table 2.10). To determine the class containing the median, we compare the value just computed, 15, to the cumulative frequency, F, column in the table. It falls between the values 10 and 23 for F. The value 10 signifies that the class 110 up to 120 contains seven values beginning with the fourth and ending with the tenth. Similarly, the value 23 means that the class 120 up to 130 contains thirteen values beginning with the eleventh and ending with the twenty-third observation. Thus, the fifteenth observation falls in the class 120 up to 130, which is the class containing the median. So we write L = 120, f = 13, I = 10, and F = 10, which are the cumulative frequencies for the class 110 up to 120, since that class precedes the median class. Plugging these values into the equation gives:
Md = 120 + \frac{15 - 10}{13}(10) = 120 + 3.8 = 123.8
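The median interpolation can be wrapped in a small Python function; the arguments below are exactly the values worked out in the text.

```python
# Interpolated median for grouped data (equation 2.7).
def grouped_median(L, n, F, f, I):
    return L + (n / 2 - F) / f * I

# Farm-worker wages: L = 120, n = 30, F = 10, f = 13, I = 10.
print(round(grouped_median(L=120, n=30, F=10, f=13, I=10), 1))   # 123.8
```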
The mode is another average. It identifies the most common observation in the data set. For ungrouped data, we determine the mode by inspection. We simply examine the data and select the value of the observation that occurs most often. Ungrouped data may not have a mode if all values appear only once in the set; or we may observe several modes if there are many values that appear the same number of times in the data. For ungrouped data, the mode tends to be unreliable because its value changes significantly in repeated sampling. We use the mode when we want to know what is in vogue, or what is most common, such as the most popular television show, i.e., the one watched by the most viewers, or the most common variety of corn planted, etc.
For grouped data or distributions in general, the mode is the value at the highest point of the distribution. The crude mode is thus the midpoint of the class with the largest number of frequencies. For the data in table 2.10, the crude mode is 125 since the class 120 up to 130 contains the most frequencies, 13. We can use an interpolation formula for the mode (equation 2.8).
Mode = L + \frac{d_1}{d_1 + d_2} I   [2.8]
where: L is the lower limit of the modal class,
[d.sub.1] is the first difference: the frequency in the modal class minus the frequency in the preceding class,
[d.sub.2] is the second difference: the frequency in the modal class minus the frequency in the following class, and
I is the width of the modal class.
We can identify the modal class by inspection since it has the largest frequency. For the example in table 2.10, the modal class is 120 up to 130 because 13 is the largest frequency. Thus, L is 120; I is 10; [d.sub.1] is 13 - 7 = 6; and [d.sub.2] is 13 - 5 = 8. Plugging these values into equation 2.8 gives
Mode = 120 + \frac{6}{6 + 8}(10) = 120 + 4.3 = 124.3.
Thus, we get almost the same value as for the crude mode of 125.
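Equation 2.8 in the same style, again with the values from the farm-worker example:

```python
# Interpolated mode for grouped data (equation 2.8).
def grouped_mode(L, d1, d2, I):
    return L + d1 / (d1 + d2) * I

# Modal class 120 up to 130: L = 120, d1 = 13 - 7, d2 = 13 - 5, I = 10.
print(round(grouped_mode(120, 6, 8, 10), 1))   # 124.3
```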
When a distribution is bimodal, we often prefer to divide the data into two groups and analyze each group separately because some variable not accounted for in the analysis may be causing the data to cluster into the two groups.
Characteristics of the Mean, Median, and Mode
We may use the three averages together to determine the relative symmetry or skewness of a distribution. If the distribution is perfectly symmetric in shape, then the values of the three averages are identical. If the distribution has a tail on the right, or is skewed positively, then the arithmetic mean will be the largest, the mode the smallest, and the median about two-thirds of the way in between toward the mean. The mean is largest because it is affected by the few extremely large values in such a distribution. The median is sensitive to the position of the values, but not to their size. When a distribution has a tail on the left, or is skewed negatively, the mode is the largest, the arithmetic mean the smallest, with the median between the two about two-thirds the distance toward the mean. Thus, with any two of the three values, we have enough information to know whether a distribution is skewed negatively, is symmetric, or is skewed positively.
Occasionally we get a distribution with no mode, such as extremely platykurtic distributions as well as rectangular and uniform distributions.
The arithmetic mean is the only average of the three that we can use in algebraic calculations; hence for this reason alone, it is generally the most useful. For example, we can compute a grand mean from a set of sample means by averaging the means and using the number of observations in each sample as weights. However, the mean is difficult, if not impossible, to calculate when the frequency distribution contains open-ended classes. These do not affect the other two averages.
Measures of Dispersion
Dispersion gauges the representativeness or validity of an average. It measures the amount by which data in a distribution, whether an array or a frequency distribution, are dispersed away from the average or clustered around it. The smaller the dispersion relative to the average it accompanies, the more representative the average, and thus the more comfortable we are in using the average as the one number to represent the entire distribution. For this reason, when we present averages for describing data, we should see that they are accompanied by some relevant measure of dispersion so the audience will know how well the average represents the data. For example, in a set of data on corn yields, if every yield is 100 bushels, the average is also 100 whether it is the mean, median, or the mode. The dispersion in this data is zero since the average precisely represents every item in the data set. But as soon as we observe only one yield value different from 100, we will see some dispersion, or scatter, in the distribution. The more different the individual corn yields become, the greater the scatter in the distribution and the less representative the average is as the one number to represent the set.
The range, R, is a measure of dispersion defined as the difference between the beginning and the end of a distribution when the data are arrayed, as in equation 2.9. We can use it with the arithmetic mean, median, or the midrange. We do not use it with the mode. Indeed, the mode has no specific measure of dispersion associated with it, and partly for this reason, it is the least used of all averages. The particular advantage of the mode is for nonmathematical representations.
R = X_n - X_1   [2.9]
The range indicates not only the extreme values of the data but also the overall spread between them. We can use the range as a measure of dispersion for small samples (n [less than or equal to] 12) selected from a normal distribution. However, for large samples (n [greater than or equal to] 30), the range provides a biased estimate of the variability of the population. The range is not our first choice as the measure of dispersion because it is based only on the two extreme values of the data set--no others--and it tends to be inconsistent in repeated sampling from the same population.
The Quartile Deviation
The quartile deviation (QD) is a measure of dispersion used only with the median and indicates the dispersion of the data in the middle half of the distribution. We compute it as in equation 2.10.
QD = \frac{Q_3 - Q_1}{2}   [2.10]
It is one-half the difference between the first and third quartiles. We remember that quartiles divide data into four equal parts, and to do this requires only three dividing lines: the first quartile, the median or second quartile, and the third quartile. Alternatively, we can view the first quartile as the median of the first half of the data, and the third quartile as the median of the last half. And for data in an array, that is how we compute the quartiles. First we determine the median for the data. We next examine the first half of the data, from [X.sub.1] to Md, and determine the median of that set. We call it [Q.sub.1] since it is the first quartile. We employ the same procedure for the last half of the data to obtain [Q.sub.3]. Finally, we compute the quartile deviation and use it with the median. For example, if we have the following years of formal education for a small sample of rural adults:
8 12 6 14 10
we first array the data to obtain:
6 8 10 12 14
The median of the data is 10; [Q.sub.1] is 7; [Q.sub.3] is 13; and thus
QD = (13 - 7)/2 = 3.
The quartile deviation is somewhat similar to the range because it includes measurement of the difference between two values, but the values are in the middle half of the distribution. Because it omits the ends of the data, the quartile deviation is a poor measure when there is wide dispersion in the tails of a distribution. It is quite useful, however, when we must compute an average for a frequency distribution with open-ended classes and we need a measure of dispersion to go with it. With frequency distributions, we use interpolation formulas for the first and third quartiles quite similar to that for the median, as in equations 2.11 and 2.12:

Q_1 = L + \frac{n/4 - F}{f} I   [2.11]

Q_3 = L + \frac{3n/4 - F}{f} I   [2.12]

We define all the terms in formulas 2.11 and 2.12 the same as in the formula for the median, except that we identify the class containing the first quartile by comparing n/4 with the cumulative frequencies, F, and the class containing the third quartile by comparing 3n/4 with them. Once we identify the appropriate class, L, f, and I refer to that class and F refers to the class that precedes it.
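One function serves both quartiles, since only the target count changes (n/4 versus 3n/4). The frequencies below are those of the farm-worker example; the lower class limits are our assumed reconstruction of table 2.10, not figures stated in the text.

```python
# Interpolated quantile for grouped data (equations 2.11 and 2.12):
# find the class whose cumulative frequency reaches the target count,
# then interpolate within it.
def grouped_quantile(target, lowers, freqs, width):
    cum = 0
    for L, f in zip(lowers, freqs):
        if cum + f >= target:
            return L + (target - cum) / f * width
        cum += f

lowers = [100, 110, 120, 130, 140]   # assumed lower class limits
freqs  = [3, 7, 13, 5, 2]
n = sum(freqs)

q1 = grouped_quantile(n / 4, lowers, freqs, 10)      # first quartile
q3 = grouped_quantile(3 * n / 4, lowers, freqs, 10)  # third quartile
qd = (q3 - q1) / 2                                   # equation 2.10
print(round(q1, 1), round(q3, 1), round(qd, 2))
```

Passing n/2 as the target reproduces the median formula, which shows how closely the three interpolations are related.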
The Standard Deviation
We use the standard deviation as a measure of dispersion with the arithmetic mean. Its value is based on all of the observations in the data set. It is sometimes called the root-mean-square deviation because it is computed by taking the square root of the arithmetic mean of the squares of the deviations from the mean.
We compute the standard deviation for a population, denoted [sigma] (sigma), for which we have ungrouped data, as in equation 2.13:

\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}   [2.13]

The population variance, [[sigma].sup.2], is the square of the standard deviation. We must use the variance in mathematical calculations, and then at the end, compute the standard deviation if that is what we want. The standard deviation is the most widely used of all measures of dispersion, in part because the arithmetic mean is widely used as compared to other averages, and in part because the variance is mathematically sound and can be used in further calculations.
The sample variance, [S.sup.2], is an estimate of the population variance, [[sigma].sup.2], and as such is computed from sample data. We know that in repeated sampling, the sample variance computed with divisor n is biased and underestimates the population variance by the factor (n - 1)/n. Therefore, most statisticians revise the sample variance formula by dividing by (n - 1) rather than n, which removes the bias. Thus, we have equation 2.14 for the sample standard deviation:

S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}   [2.14]

We note that this revision is significant when n is small, but not when n is large, say 100 or more.
To illustrate calculation of the sample standard deviation (table 2.11), we consider the following example: Suppose that in a large agribusiness firm, management selects the sick leave records of six employees at mid-year and records days absent from work as follows. Compute the arithmetic mean and sample standard deviation for these data. To calculate S, since the formula contains [bar.X], we sum the first column of the table to get the data total, 60, and divide by the number of observations in the sample, 6, to get 10 for [bar.X]. We next compute the deviations from the mean by subtracting [bar.X] from every value of X, one observation at a time, and recording them in the second column of the table. Finally, we square each value in column two of the table, recording the result in column three; then we total column three. [S.sup.2] is the value of 80/(n - 1), or 80/5 = 16. So S is 4, the square root of the variance. The calculations in this example are relatively simple since the mean and the data values are all whole numbers. Thus, we get whole number values for the deviations from the mean in column two. When such values are not whole numbers, the calculations become somewhat tedious. However, we can use an alternative formula for S (equation 2.15) that does not contain deviations from the mean:

S = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}}   [2.15]

To compute S by the alternate formula, we need two columns, for X and [X.sup.2], as follows:
X       [X.sup.2]
7       49
14      196
8       64
5       25
15      225
11      121
Totals  60      680
If we substitute 680 for [summation][X.sup.2], 60 for [summation]X, and 6 for n in equation 2.15 and solve, we get S = 4, as we did before.
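Both formulas can be checked side by side in Python with the six sick-leave values from the example:

```python
import math

days = [7, 14, 8, 5, 15, 11]
n = len(days)
mean = sum(days) / n     # 60 / 6 = 10

# Definitional formula (equation 2.14): deviations from the mean.
s_def = math.sqrt(sum((x - mean) ** 2 for x in days) / (n - 1))

# Computational formula (equation 2.15): no deviations needed.
s_alt = math.sqrt((sum(x * x for x in days) - sum(days) ** 2 / n) / (n - 1))
print(s_def, s_alt)   # both 4.0
```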
For data grouped into frequency distributions, we use the frequencies as weights. Thus, we modify all of the formulas to incorporate the weights. The formulas for the standard deviation of sample data grouped into a frequency distribution appear in equations 2.16 and 2.17, with the definitional formula in equation 2.16 and the computational formula in equation 2.17:

S = \sqrt{\frac{\sum f (X - \bar{X})^2}{\sum f - 1}}   [2.16]

S = \sqrt{\frac{\sum f X^2 - (\sum f X)^2 / \sum f}{\sum f - 1}}   [2.17]

For an example of computing S, we once again refer to the data for wages of hired farm workers (table 2.12). Notice that the table, in addition to containing the original frequency distribution, has columns for X, fX, [X.sup.2], and f[X.sup.2], which we need to obtain values for the terms in our formula. We must multiply f by X to get fX and then total those values to obtain [summation]fX; we also square X and multiply f by [X.sup.2] to get f[X.sup.2] and then total those values to obtain [summation]f[X.sup.2]. The only other value we need for the formula is [summation]f, which we obtain by totaling the frequency column. Thus
S = \sqrt{\frac{461950 - 3710^2/30}{30 - 1}} = \sqrt{108.5} = 10.4.
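The same computation in Python, from the column totals the text reports for table 2.12:

```python
import math

# Grouped-data standard deviation (equation 2.17), using the totals
# sum f = 30, sum fX = 3710, sum fX^2 = 461950 from the text.
sum_f, sum_fx, sum_fx2 = 30, 3710, 461950
S = math.sqrt((sum_fx2 - sum_fx ** 2 / sum_f) / (sum_f - 1))
print(round(S, 1))   # 10.4
```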
Two Properties of S and [bar.X]
We find two properties of the mean and standard deviation that are noteworthy, especially when we must make transformations to the data. For example, if we must add a constant to every element in the data set, what happens to the value of S? According to the first property, the value of the mean is increased by the value of the constant, while S is unchanged by the addition of a constant to each value. A second property is that multiplying each value of X by a constant multiplies the mean by the constant, S by the absolute value of the constant, and the variance by the square of the constant. We find these properties useful because we do not have to make the actual calculations for the mean or variance of the transformed data. We just apply the property to the mean or variance already calculated from the original data to obtain the value we want for the transformed data.
We can also standardize data by transforming the mean to a value of zero and the standard deviation to a value of 1. To obtain a mean of zero, we subtract the mean itself from every element in the data set one value at a time. The transformed data then have a mean of zero, but the standard deviation is not affected. To change the standard deviation to 1, we divide each element in the data by S (multiply by 1/S). This calculation also divides the mean by S, but zero divided by any number except zero returns a value of zero. We generally denote the standardized data for a population by the variable Z, where Zᵢ = (Xᵢ − μ)/σ.
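The standardization can be sketched as follows, again using the days-absent data from table 2.11. Treating the six values as a population (so that σ divides by N rather than n − 1) is our assumption for illustration.

```python
# Standardizing the days-absent data from table 2.11; treating the six
# values as a population is our assumption for this illustration.
from math import sqrt

data = [7, 14, 8, 5, 15, 11]
N = len(data)
mu = sum(data) / N                                  # 10.0
sigma = sqrt(sum((x - mu) ** 2 for x in data) / N)  # population S.D.

z = [(x - mu) / sigma for x in data]                # Z_i = (X_i - mu) / sigma

z_mean = sum(z) / N
z_sigma = sqrt(sum((v - z_mean) ** 2 for v in z) / N)
print(z_mean, z_sigma)  # mean is 0 and S.D. is 1, up to rounding error
```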
Coefficient of Variation
When we want a measure of the variability associated with a set of data, we use the standard deviation along with the mean. The standard deviation is in the same units as the mean and, hence, we prefer it to the variance, which is in squared units. To measure the relative variability in a set of data, we often compute the coefficient of variation. It is the ratio of the standard deviation to the mean, multiplied by 100 percent, i.e., V = (S/X̄)(100). Thus, the coefficient of variation states how large the standard deviation is in comparison to the mean, in percentage terms. A V of 100 percent indicates that S and the mean are equal. In that case, the data are highly variable and the mean is not a useful measure of the center of the distribution. The smaller the value of V, the better the mean represents the data set. As a rule of thumb, we use caution in representing a data set by its mean when V exceeds 50 percent.
We cannot compare the standard deviations of two distributions because the S values belong to the means of those distributions, i.e., the standard deviation has no interpretation apart from the mean (except perhaps in the unusual case in which the means are equal). Thus, we compare the V values of the distributions instead. The distribution with the lowest V has the least variability.
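A brief sketch of such a comparison follows. The two data sets are invented for illustration (they are not from the text); they have the same standard deviation but different means, so their V values differ sharply.

```python
# Coefficient of variation V = (S / mean) * 100, compared across two
# illustrative data sets with equal S but different means.
from statistics import mean, stdev

def coef_var(data):
    """Coefficient of variation, in percent."""
    return stdev(data) / mean(data) * 100

low_mean = [8, 12, 10, 14, 6]           # mean 10
high_mean = [108, 112, 110, 114, 106]   # mean 110, same spread

print(round(coef_var(low_mean), 1))   # 31.6
print(round(coef_var(high_mean), 1))  # 2.9
```

Although both sets have S ≈ 3.16, the second set has far less relative variability, so its mean represents the data much better.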
1. Given the following sample of data: -10, 2, 3, 2, -4, 2, 5, calculate:
a. Arithmetic mean b. Median c. Mode d. Midrange e. Range f. Standard deviation
2. For the data in exercise 1, calculate the first and third quartiles and the quartile deviation.
3. Given the following set of numbers: 8, 5, 2, 6, 4, 5,
a. Find the mean, median, and mode of this population.
b. Find the variance and standard deviation.
c. Calculate the range and midrange.
4. The following beginning salaries were offered to sixteen recent agriculture graduates:
$26,500 $20,400 $24,600 $23,600 $19,900 $21,400 $22,600 $28,400 $31,200 $21,800 $24,800 $23,400 $31,400 $25,500 $27,000 $29,100
a. Compute the arithmetic mean and median for these data.
b. Use a class interval of size $2,500 to form a frequency distribution for the data. Construct a histogram for this distribution and draw the frequency polygon connecting the class midpoints (begin with the class $19,000-$21,500).
c. Form a less-than relative cumulative frequency distribution and plot the ogive. What percentage of the graduates earn less than $24,000? More than $29,000?
d. Use the frequency distribution you obtained in part b to compute the arithmetic mean and standard deviation. What percentage of the salaries fall within one standard deviation of the mean? Two standard deviations?
e. From the frequency distribution in part b, calculate the median, the quartile deviation, and the mode.
5. Given the following frequency distribution of the amount of calories in milk with 4 percent butterfat for a sample of 100 cows, compute the:
a. Mean b. Standard deviation c. Median d. Mode
Calories        Frequency
 90 up to 100       12
100 up to 110       55
110 up to 120       25
120 up to 130        8
Total              100
e. Draw a histogram of the frequency distribution and comment on its symmetry.
6. How are the mean, median, and mode related in a symmetrical frequency distribution? How are they related in a frequency distribution that is skewed right? Skewed left? Sketch several distributions to clarify your answer.
Use Excel or a comparable computer program to complete the following exercise.
7. Given the following 205-day weaning weights of twenty-four Black Brangus calves, compute:
482 505 467 521 550 534 485 542 470 511 490 517 545 476 487 530 496 504 558 536 463 470 553 512
a. The mean and standard deviation of the raw data using the Tools, Data Analysis, Descriptive Statistics module.
b. A frequency distribution with intervals of 20 pounds beginning with 460 and ending with 560 using the Tools, Data Analysis, Histogram module. Note: In an adjacent column, construct the bin values 460, 480, 500, 520, 540, 560 prior to clicking Tools on the Menu Bar. Place this column range in the Bin Field on the Histogram definition table that appears. Also check the graph box so a graph will be generated.
(1) If S² = Σ(X − X̄)²/(n − 1) and we add a constant k to every X, the mean becomes X̄ + k, so the variance of the transformed data is Σ[(X + k) − (X̄ + k)]²/(n − 1) = Σ(X − X̄)²/(n − 1), which is the original formula for the variance.
(2) If we have kX, the mean becomes kX̄, so the variance of the transformed data is Σ(kX − kX̄)²/(n − 1) = k²Σ(X − X̄)²/(n − 1) = k²S² and, hence, |k|S for the standard deviation.
TABLE 2.1 Cotton Lint Yields in Pounds Per Acre from Seventy-Five Farms in the Blackland Prairie

255 373 242 257 305 285 358 279 261 312 288 297 303 290 283
215 299 275 260 284 311 217 295 260 316 314 290 294 280 326
249 256 276 292 286 333 274 250 295 318 292 283 274 336 325
290 309 272 296 346 254 235 268 299 352 334 251 367 309 315
228 259 354 283 306 234 258 365 281 312 342 268 278 291 288

TABLE 2.2 Array of Cotton Lint Yields in Pounds Per Acre from Seventy-Five Farms in the Blackland Prairie

215 217 228 234 235 242 249 250 251 254 255 256 257 258 259
260 260 261 268 268 272 274 274 275 276 278 279 280 281 283
283 283 284 285 286 288 288 290 290 290 291 292 292 294 295
295 296 297 299 299 303 305 306 309 309 311 312 312 314 315
316 318 325 326 333 334 336 342 346 352 354 358 365 367 373

TABLE 2.3 Frequency Distribution of Cotton Yields in Pounds of Lint Per Acre for Seventy-Five Farms in the Blackland Prairie

Cotton Yield     Number of Farms
215 up to 235           4
235 up to 255           6
255 up to 275          13
275 up to 295          21
295 up to 315          15
315 up to 335           7
335 up to 355           5
355 up to 375           4
Total                  75

TABLE 2.4 Frequency Distribution with Class Midpoints for Cotton Yield Data, Blackland Prairie Farms

Cotton Yield      Class Midpoints   Number of Farms
Imaginary class         205                0
215 up to 235           225                4
235 up to 255           245                6
255 up to 275           265               13
275 up to 295           285               21
295 up to 315           305               15
315 up to 335           325                7
335 up to 355           345                5
355 up to 375           365                4
Imaginary class         385                0
Total                    --               75

TABLE 2.5 Frequency Distributions of Weekly Wages for Farm Workers and Truck Drivers

                 Class       Frequencies:   Frequencies:
Weekly Wages     Midpoints   Farm Workers   Truck Drivers
100 up to 110      105            3               0
110 up to 120      115            7               0
120 up to 130      125           13              20
130 up to 140      135            5              48
140 up to 150      145            2              72
150 up to 160      155            0              44
160 up to 170      165            0              16
Total               --           30             200

                 Relative Frequencies:   Relative Frequencies:
Weekly Wages     Farm Workers            Truck Drivers
100 up to 110          10                       0
110 up to 120          23                       0
120 up to 130          43                      10
130 up to 140          16                      24
140 up to 150           8                      36
150 up to 160           0                      22
160 up to 170           0                       8
Total                 100                     100

TABLE 2.6 Less-Than Cumulative Frequency Distribution of Weekly Wages for Farm Workers

Wages           Cumulative Frequencies, F
Less than 100              0
Less than 110              3
Less than 120             10
Less than 130             23
Less than 140             28
Less than 150             30

TABLE 2.7 More-Than Cumulative Frequency Distribution of Weekly Wages for Farm Workers

Wages           Cumulative Frequencies, F
More than 100             30
More than 110             27
More than 120             20
More than 130              7
More than 140              2
More than 150              0

TABLE 2.8 Hourly Wages and Number of Migrant Workers by Type of Crop

Crop        Hourly Wage, X   Number of Workers, w      wX
Cucumbers        4.50                950              4,275
Melons           4.75                600              2,850
Onions           5.25              1,020              5,355
Totals            --               2,570             12,480

TABLE 2.9 Calculation of the Arithmetic Mean for Cotton Yield Data in a Frequency Distribution

Cotton Yield    Frequency, f   Class Midpoint, X      fX
215 up to 235        4               225               900
235 up to 255        6               245             1,470
255 up to 275       13               265             3,445
275 up to 295       21               285             5,985
295 up to 315       15               305             4,575
315 up to 335        7               325             2,275
335 up to 355        5               345             1,725
355 up to 375        4               365             1,460
Total               75                --            21,835

TABLE 2.10 Frequency Distribution of Weekly Wages of Farm Workers Used to Calculate the Median

Weekly Wages    Frequencies, f   Cumulative Frequencies, F
100 up to 110         3                     3
110 up to 120         7                    10
120 up to 130        13                    23
130 up to 140         5                    28
140 up to 150         2                    30
Total                30

TABLE 2.11 Example Data for Calculating the Sample Standard Deviation

Days Absent, X   (X − X̄)   (X − X̄)²
      7             -3          9
     14              4         16
      8             -2          4
      5             -5         25
     15              5         25
     11              1          1
Totals: 60           0         80

TABLE 2.12 Calculation of the Standard Deviation for a Frequency Distribution of Farm Wages

Weekly Wages    Frequencies, f   Class Midpoints, X      X²       fX        fX²
100 up to 110         3                105            11,025     315      33,075
110 up to 120         7                115            13,225     805      92,575
120 up to 130        13                125            15,625   1,625     203,125
130 up to 140         5                135            18,225     675      91,125
140 up to 150         2                145            21,025     290      42,050
Total                30                 --                --   3,710     461,950
Publication: Introduction to Agricultural Statistics. Date: Jan 1, 2000.