Research design, hypothesis testing, and sampling.
Statistical applications are an essential element of the practicing appraiser's tool kit. The extent to which statistics are employed in appraisal practice depends on a number of circumstances, such as the scope of work, the statistical competence of the analyst, and the availability of data. This article discusses essential theory necessary for competence in statistical inference--that is, reaching a conclusion about an unknown characteristic of a population based on sample data. It explains how to design and implement inferential statistical research and examines the topics of hypothesis construction, reliability and validity of research, sampling, setting significance levels, and sample size.
Research Design and Hypothesis Testing
What Is the Question?
At its most elementary level, the application of inferential statistics boils down to answering questions. For example, we might ask, "Does the theory of diminishing marginal utility hold for this property type in this market area?" Or, "How does this particular rental market react to proximity to public transportation?" Or, "What, if any, is the influence of nearby street noise on home prices in the subject subdivision?" Ultimately, the question may be as fundamental as, "What is my opinion of market value for this property, and can the credibility of my opinion be bolstered by the application of inferential methods?"
Research questions like these may represent the entire scope of a valuation services assignment, or they may be small but important aspects of a more comprehensive study. In either case, the effective application of inferential methods requires a clear understanding of the relevant questions, explicit or implicit formulation of testable hypotheses, appropriate data, credible analysis, and valid interpretation of the analytical results.
In much the same way that the appraisal process serves as a systematic and organized way to design a work plan consistent with the scope of a specific assignment, statistical analysis research design provides a road map for moving from research question to insight.
From Research Question to Testable Hypotheses
Hypothesis testing relies on a principle often referred to as "Popperian falsification." Early twentieth-century philosopher Karl Popper held that inferential statistics cannot prove anything with absolute certainty. However, inferential methods can cast doubt on the veracity of an assertion of "truth." When sufficient doubt can be raised, an assertion of truth can be "falsified," at least to some degree. The degree of certainty associated with labeling a statement as false is related to "statistical significance" or merely "significance" in the language of statistics.
The process of forming and testing a hypothesis (i.e., a theory) is as follows:
1. Determine an appropriate expected outcome based on theory and experience. This is generally referred to in inferential statistics as a "research hypothesis."
2. Formulate a pair of testable hypotheses related to the research hypothesis: a "null hypothesis" and an "alternative (research) hypothesis." The testable hypotheses must be mutually exclusive and collectively exhaustive. The hypothesis testing goal is to falsify or reject the statement of truth implied by the null hypothesis, leaving the research hypothesis as the only reasonable alternative.
3. Formulate a conclusion that falsifies (or fails to falsify) the null hypothesis.
Hypothesis Testing in the Real World
While the three-step process of forming and testing a hypothesis is easy to outline, it is generally much more difficult to apply. Let's consider a simple example and think about the complications that may arise.
Consider the effect of street noise on housing prices. This simple residential valuation issue illustrates the complications encountered in formulating and testing real-world hypotheses. An appropriate research hypothesis could be a general statement like, "Exposure to street noise affects home price." Or, if the analyst makes a more specific supposition concerning the direction of the effect, the research hypothesis could be, "Exposure to street noise reduces home price." Depending on the scope of the assignment, the appropriate research question could be much more specific than this. More refinement might be required, resulting in research hypotheses such as "Exposure to street noise in excess of 'X decibels' above the ambient noise level reduces home price" or "Exposure to street noise reduces home price, but the size of the reduction decreases with distance from an abutting street and becomes negligible at 'X feet' from the abutting street." Based on these simple examples, it should be apparent that the testable hypotheses must be customized to the underlying research question.
For simplicity's sake, assume that the appropriate research hypothesis is "Exposure to street noise reduces home price in the subject property's market area." This statement becomes the alternative hypothesis--the hypothesis you believe the data will support. Remember that the null hypothesis and the alternative hypothesis are mutually exclusive and collectively exhaustive. The null hypothesis (the statement of truth you are attempting to falsify) would therefore be "Exposure to street noise either increases home price or has no effect on home price in the subject property's market area." These two statements are mutually exclusive (only one of them can be true), and they are collectively exhaustive (home price must go up, down, or stay the same).
In summary, the relevant hypotheses for this example are
* Research hypothesis: Exposure to street noise reduces home price in the subject property's market area.
* Testable hypotheses:
* Null hypothesis (H₀): The street noise price effect is ≥ 0.
* Alternative hypothesis (Hₐ): The street noise price effect is < 0.
Once these hypotheses have been formulated, a research plan must be devised that allows the analyst to credibly test the veracity of the null hypothesis.
If the null hypothesis can be falsified with sufficient certainty, then the analyst can conclude that the alternative hypothesis is likely to be true.
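To make the mechanics concrete, here is a minimal sketch of this one-tailed test in Python, using entirely hypothetical sale prices for noise-exposed and quiet homes (the figures, sample sizes, and the large-sample z approximation are illustrative assumptions, not data from the article):

```python
# Hedged sketch: one-tailed test of H0 (noise effect >= 0) against
# Ha (noise effect < 0), using hypothetical sale prices.
import math
import statistics as stats

quiet = [310_000, 295_000, 305_000, 320_000, 300_000, 315_000]  # hypothetical
noisy = [285_000, 290_000, 280_000, 295_000, 275_000, 288_000]  # hypothetical

diff = stats.mean(noisy) - stats.mean(quiet)        # estimated noise effect
se = math.sqrt(stats.variance(noisy) / len(noisy) +
               stats.variance(quiet) / len(quiet))  # std. error of the difference
z = diff / se                                       # large-sample z approximation
p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))    # one-tailed P(Z <= z)

alpha = 0.05
if p_value < alpha:
    print(f"Reject H0: estimated effect {diff:,.0f}, p = {p_value:.4f}")
else:
    print(f"Fail to reject H0: p = {p_value:.4f}")
```

With such a small sample a t distribution would ordinarily be more appropriate than the normal approximation; the z form is used here only to keep the sketch self-contained.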
It is important to recognize, however, that inferential statistical methods are not intended as means of supporting illogical, unreasonable, or atheoretical suppositions. The research and alternative hypothesis statements should be well reasoned and logical, keeping in mind that inferential methods are designed to support valid research hypotheses.
Validity and Reliability
Tests of the veracity of the null hypothesis are of no use unless the tests are credible (i.e., worthy of belief). Two concepts--validity and reliability (1)--are paramount to credible hypothesis testing. These concepts are deeply rooted in research design and scientific inquiry.
[FIGURE 1 OMITTED]
The concepts of reliability and validity can be confusing at first, but they are actually quite simple. For example, consider Figure 1, which illustrates the idea that reliability is analogous to clustering shots on a target. Shots that are scattered all over the target, as in the left panel, are unreliable. Shots that are tightly clustered but off center, as in the right panel, are reliable but invalid. Only shots that are tightly clustered and centered on the target are reliable (consistent) and valid (accurate).
Because threats to reliability and validity erode credibility, credible research and valuation-related opinions are more likely to occur when analysts understand and assess the extent to which the methods employed were both reliable and valid. Paying attention to a few simple criteria can go a long way toward ensuring credible results. For example, logical research designs, controlling measurement error, standardizing interview protocols, using representative data, and applying appropriate analytical tools are basic and essential elements of using statistical methods to support credible valuation opinions.
Strictly speaking, validity is the extent to which a statistical measure reflects the real meaning of what is being measured. Consider a scale that consistently indicates weights that are 95% of true weight. Obviously the result is not valid when the intent is to measure true weight. Although this sort of measurement error is correctable if the error is consistent and known, in practice measurement error is usually neither consistent nor known.
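The scale example can be reduced to one line of arithmetic. Assuming, as the article does, that the bias factor is both consistent and known, the correction is simply division by that factor (the reading used here is hypothetical):

```python
# A consistent, known measurement bias is correctable: a scale that reads
# 95% of true weight can be calibrated out by dividing by the bias factor.
bias_factor = 0.95
reading = 190.0                      # hypothetical reading from the faulty scale
true_weight = reading / bias_factor  # recovers the actual weight
print(true_weight)                   # 200.0
```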
Lack of research validity stems from many sources, and assessing the validity of research involves numerous considerations such as
* Logical validity
* Construct validity
* Internal validity
* External validity
* Statistical conclusion validity
A research design consists of several parts, such as a problem statement, a research hypothesis, selection and definition of variables, implementation of design and procedures, findings, and conclusions. Logical validity is satisfied when each part of the overall design flows logically from the prior step. If the overall design isn't logical, then the results aren't likely to be valid. Appraisers should already be familiar with the elements and logical flow of research design because the valuation process is a similar algorithm.
Construct validity deals with how well actual attributes, characteristics, and features are being measured. For example, although tall people generally weigh more than short people, use of a weight scale is not a valid construct for measuring height. As this example illustrates, construct validity is a simple concept, but it can be quite nuanced in practice.
Concerns about construct validity are particularly applicable to the use of interviews and questionnaires. Precise definitions of variables and the elimination of ambiguity are important in ensuring that questions are not misinterpreted by respondents or researchers. Meanings assigned by respondents should be consistent with the meanings intended by the researcher. Furthermore, meaning should be consistent from respondent to respondent.
Construct validity is especially problematic when respondents must interpret technical or scientific language, as is the case in many interviews related to real property transactions (e.g., sales confirmation). Do not assume that persons being interviewed fully understand the meanings of technical terms such as capitalization rate, internal rate of return, net operating income, effective gross income, obsolescence, and the like.
Internal validity requires that all alternative explanations for causality have been ruled out. Ruling out threats to internal validity is a laborious task because it requires explicit identification of each alternative explanation for causation along with the rationale for rejecting it. If all reasonable alternative causes cannot be ruled out, the research may be inconclusive and invalid.
External validity exists when findings and conclusions can be generalized from a representative sample to a larger or different population. Random selection from a target population is the best means of obtaining a representative sample, subject to the vagaries of sampling error, which is ubiquitous. Therefore, when a random sample has been obtained, the analyst should assess the extent to which the characteristics of the sample match the characteristics of the target population.
Statistical Conclusion Validity
Statistical conclusions will not be valid if the statistical tests being applied are inappropriate for the data being analyzed. The researcher should be aware of the assumptions underlying each statistical test and how robust the test is if those assumptions are violated.
Bias occurs when there is a systematic error in research findings. Bias can come from several sources, and it can be classified into two categories: nonsampling error and sampling error. (2) Nonsampling error includes nonresponse bias, sample selection bias, and systematic measurement error. Sample selection bias may be encountered in real property studies, which often rely on observational samples (e.g., the occurrence of a comparable sale cannot be assumed to have been a random event). Sampling error stems from the fact that a random sample can differ from the underlying population simply by chance.
Reliability is the extent to which "the same data would have been collected each time in repeated observations [measurements] of the same phenomenon." (3) A reliable model would produce results that can be thought of as consistent, dependable, and predictable.
As an example, assume that six appraisers are asked to measure the same 1,400-sq.-ft. house and calculate its improved living area. A set of estimates consisting of 1,360 square feet, 1,420 square feet, 1,400 square feet, 1,450 square feet, 1,350 square feet, and 1,340 square feet would not be reliable, even though they tend to bracket the true floor area. However, in comparison, a set of estimates consisting of 1,440 square feet, 1,435 square feet, 1,445 square feet, 1,445 square feet, 1,440 square feet, and 1,435 square feet would be more reliable, despite not bracketing the true floor area. Although the second set of living area estimates is more reliable (i.e., predictable and consistent), the floor area calculations exhibit a systematic, upward bias, making this set of estimates invalid. An ideal set of estimates would be highly consistent (reliable) and accurate (valid), such as 1,398 square feet, 1,402 square feet, 1,400 square feet, 1,395 square feet, 1,405 square feet, and 1,400 square feet.
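The three sets of estimates can be summarized numerically: the sample standard deviation measures reliability (spread) and the distance of the sample mean from the true area measures validity (bias). A small sketch using the article's own figures:

```python
# Reliability = low spread; validity = small bias relative to the true value.
import statistics as stats

true_area = 1400
scattered = [1360, 1420, 1400, 1450, 1350, 1340]  # unreliable
biased    = [1440, 1435, 1445, 1445, 1440, 1435]  # reliable but invalid
ideal     = [1398, 1402, 1400, 1395, 1405, 1400]  # reliable and valid

for label, sample in [("scattered", scattered),
                      ("biased", biased),
                      ("ideal", ideal)]:
    spread = stats.stdev(sample)           # low spread -> reliable
    bias = stats.mean(sample) - true_area  # near zero -> valid
    print(f"{label:9s} spread={spread:5.1f} bias={bias:+6.1f}")
```

Running this shows the "biased" set has the tight spread of a reliable measure but a mean roughly 40 square feet above truth, while the "ideal" set combines tight spread with a mean at the true area.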
Reliability can be difficult to attain, especially when data come from sources beyond the analyst's control. For example, subjective assessments of condition, construction quality, and curb appeal provided by third parties may be unreliable, especially if more than one person is rendering opinions. What appears to be "excellent" to one person may be viewed as being "above average" or merely "average" to another.
Because reliability can be difficult to assess and control, it is good practice to think about possible threats to reliability that may be encountered. When data comes from an outside source, ask if a standardized measurement or categorization protocol was employed. Find out if more than one person was involved in making quality or condition assessments. Think about how errors in scoring or measurement may occur, and make random checks for measurement error. Look for the use of ambiguous questions, ambiguous instructions, or idiosyncratic (technical) language that might be difficult for respondents to comprehend.
Sampling
A sample is a subset of a larger population selected for study. When the research goal is to better understand the larger population, the sample should be as similar to the larger population as possible.
Statisticians use the term "representative" to indicate the similarity of a sample to the larger population. When a sample is not representative, it is difficult to assert that the characteristics of the sample are indicative of the characteristics of the larger target population.
While sampling is a simple concept, it can be a challenging process in application. The first challenge is obtaining a sample frame, which is a list of items in, or members of, the population you want to study. Sometimes full or partial lists exist. Often they do not exist at all, or the compilers of the lists are unwilling to allow access to them. For example, if you were interested in knowing what percentage of lake homes in your state are serviced by central sewer systems and how many have on-site septic systems, you could develop a representative sample of lakeshore properties to obtain an estimate of the population proportions. However, obtaining a comprehensive list of lakeshore properties (the sample frame) in order to draw the sample could be difficult. Compiling an owner's list yourself from county records is one option, but it would be time-consuming.
Samples can be broadly divided into two categories--probability samples and nonprobability samples. Probability samples are characterized by knowledge of the probability that an item in the population will be selected. As you would expect, the probability that an item will be selected is unknown in a nonprobability sample. Statistical inferences formed through the analysis of probability samples are preferred because inferences drawn from nonprobability samples may be unreliable and inaccurate.
Numerous probability sampling methods exist, and the most common include
* Simple random samples
* Stratified random samples
* Systematic random samples
* Cluster samples
Simple Random Samples
In a simple random sample, every item in a population has the same probability of selection. A simple random sample may be selected either with replacement or without replacement. When sampling with replacement, the probability of selection for each member of the population is 1/N each time a selection is made, where N represents total population size.
When sampling without replacement, the probability of selection increases as items are selected. The probability of selection for the first item selected is 1/N, rising to 1/(N - 1) for the second item selected, 1/(N - 2) for the third item selected, and so forth as the unsampled population shrinks through the sample selection process.
Think of sampling with replacement as picking a card from a full deck, replacing the card, shuffling the deck, and picking another card. In contrast, think of sampling without replacement as being dealt a hand of poker, with each new card in your hand being dealt from a smaller and smaller deck.
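The card-deck analogy translates directly into Python's standard library, where `random.choice` in a loop samples with replacement and `random.sample` samples without replacement (the deck and draw count here are illustrative):

```python
# Sampling with vs. without replacement from a 52-item "deck."
import random

population = list(range(1, 53))  # stand-in for a 52-card deck
random.seed(1)                   # fixed seed so the sketch is reproducible

with_replacement = [random.choice(population) for _ in range(5)]  # repeats possible
without_replacement = random.sample(population, 5)                # all distinct

# Per-draw selection probabilities without replacement: 1/52, 1/51, 1/50, ...
probs = [1 / (len(population) - k) for k in range(5)]
```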
Because each item in a population has an equal probability of selection on a given draw, simple random samples are considered to be highly representative. Nevertheless, it is still possible to randomly select a nonrepresentative sample merely by chance. This possibility is referred to as sampling error. Although sampling error cannot be totally eliminated, it can be minimized through the selection of larger samples.
Stratified Random Sample
Creating a stratified random sample begins by dividing the population into subgroups (known as strata) based on one or more essential characteristics. Once this has been done, you can select random samples from each stratum. Stratified samples ensure that the sample proportion for the stratifying characteristic is identical to the population proportion, reducing sampling error and improving the accuracy of inferences.
As a simple example of the value of stratified random sampling, assume that you want to use sampling to make a statement about the mean apartment rent in a market area. Assume also that the apartment population contains many floor plans with different bedroom and bath counts. If a simple random sample were used, you would have no assurance (due to sampling error) that the floor plan mix of the sample would be identical to the floor plan mix of the population. Although the mix would, on average, be the same with repeated random samples, the mix is apt to differ from the population in any single sample. Use of a stratified sample allows you to control the proportion of the sample being drawn from each apartment unit type, thereby controlling for unit-mix sampling error. When this is done, sample mean rent is a more accurate estimate of population mean rent.
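A proportional-allocation version of this apartment example can be sketched as follows. The frame, the 60/30/10 floor-plan mix, and the sample size are all hypothetical assumptions chosen for illustration:

```python
# Stratified random sampling with proportional allocation by floor plan.
import random
from collections import Counter

random.seed(7)
# Hypothetical frame: 600 one-bedroom, 300 two-bedroom, 100 three-bedroom units.
frame = ([(i, "1BR") for i in range(600)] +
         [(i, "2BR") for i in range(600, 900)] +
         [(i, "3BR") for i in range(900, 1000)])

n = 100
by_plan = {}
for unit in frame:
    by_plan.setdefault(unit[1], []).append(unit)

sample = []
for plan, units in by_plan.items():
    k = round(n * len(units) / len(frame))  # proportional allocation per stratum
    sample.extend(random.sample(units, k))  # simple random sample within stratum

print(Counter(plan for _, plan in sample))  # mirrors the 60/30/10 population mix
```

Because each stratum's share of the sample is fixed in advance, the unit-mix component of sampling error is eliminated by construction.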
When an important population proportion is known, a stratified random sample mirroring the population proportion usually provides the most accurate inferences. You might be wondering, "If stratified random sampling improves the accuracy of inferences, why isn't it done more often?" The primary reason is insufficient understanding of the population proportion for one or more important characteristics. For instance, the simple stratification in the preceding paragraph could not be done if the population unit mix proportions were unknown.
Systematic Random Sample
A systematic sample is just what its name implies--a system employed to select the sample from the frame. Systematic sampling typically involves sorted data such as accounting records filed by date or medical records filed alphabetically. For example, if you want to sample 1,000 files out of a population of 30,000 files, you could decide to select every 30th file. You could then randomly select a file from the first 30 files as a starting point and then select every 30th file after the starting point. If you randomly chose to start with file 14, your sample would consist of files 14, 44, 74, 104, and so forth.
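The every-30th-file selection described above can be sketched in a few lines (the population and sample sizes come from the example; the random start makes it a probability sample):

```python
# Systematic sampling: 1,000 files from 30,000, every 30th file after a random start.
import random

random.seed(3)
N, n = 30_000, 1_000
k = N // n                          # sampling interval: every 30th file
start = random.randrange(1, k + 1)  # random starting file in the first interval
sample = [start + k * i for i in range(n)]
```

If `start` happened to be 14, the sample would be files 14, 44, 74, and so on, exactly as in the text.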
While systematic sampling may seem convenient, it can pose problems when there is a systematic pattern associated with how the data were sorted. If this is the case, the sample could be biased. Say, for example, you are auditing your company and randomly choose to look at accounting records from the 4th and 23rd day of each month. You would not be happy to learn, after the fact, that a part-time employee who helped out on the 14th and 15th of each month had been embezzling money. Because you randomly chose the wrong days to audit, the theft would have gone undiscovered. Had a random sample been drawn from each month of the year, there would have been an 80% probability of picking the 14th or 15th day of at least one month. (4) The pattern in the data, along with the systematic sample's unfortunate starting point, biased the sample by inadvertently excluding all of the dates when criminal activity occurred.
Use of a systematic sample requires an assessment of the likelihood of the existence of a pattern in the data in the sample frame that could bias the sample. When in doubt, use a different sampling method.
Cluster Samples
Cluster sampling is often used for geographic data such as real estate where clusters are naturally occurring. City blocks, subdivisions, census tracts, and zip codes are examples of naturally occurring geographic clusters. Random selection of clusters and of items within each selected cluster constitutes a random sample.
Consider the apartment sample referred to in the earlier discussion of stratified random samples. If there were no available sample frame, you could draw a sample by randomly selecting geographic clusters (e.g., census tracts) within the study area, identifying all of the apartments within each selected cluster, and randomly selecting a sample from the identified apartments in each cluster. The resulting sample would be representative of the population if the selected clusters were representative of the market and the properties chosen from each cluster were representative of their cluster.
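The two-stage selection just described can be sketched as follows. The cluster counts, sizes, and sample sizes are hypothetical assumptions, not figures from the article:

```python
# Two-stage cluster sampling: pick clusters at random, then units within each.
import random

random.seed(11)
# Hypothetical frame: 20 census-tract clusters, each with its own apartment list.
clusters = {t: [f"tract{t}-apt{i}" for i in range(random.randint(40, 80))]
            for t in range(20)}

chosen_tracts = random.sample(sorted(clusters), 5)  # stage 1: random clusters
sample = []
for t in chosen_tracts:
    sample.extend(random.sample(clusters[t], 10))   # stage 2: random units within
```

Note that randomness enters at both stages, which is precisely why representativeness failures can compound, as the next paragraph explains.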
The problem with this method is one that appraisers are familiar with from other contexts--namely, compounding error. Achieving the state of "representativeness" becomes a multilayered construct in the use of cluster sampling. If the coarser selection layer--the clusters--is not representative, then the sample will not be representative regardless of how well the selected properties represent their clusters. If the coarser, cluster layer is representative, the more focused granular selection layer--properties within each cluster--may still not be fully representative if some or all of the selected properties do not represent their cluster. Due to these issues, sample size in terms of number of clusters and items selected from each cluster should be greater than the sample size required for a simple random sample or stratified sample.
When a sample frame is unavailable, cluster sampling may be the only alternative. Care should be taken, however, to ensure that the clusters are as representative of the population as possible. With geographic data this often entails selecting clusters that incorporate all of a market area's important geographic variables. Depending on the situation, important geographic variables might include
* School districts
* Age of neighborhoods
* Relative household incomes
* Length of commutes
Self-Selection and the Appraiser's Quandary
Recall from the earlier discussion of validity that sample selection bias may be encountered in real property studies, which often rely on observational samples because a self-selection process separates properties that are offered for sale from those that are not offered for sale. Property owners are not randomly chosen to sell their homes each month. Therefore, a sample of homes "for sale" or "sold" may not be representative of the population of all similar properties in a market.
Self-selection may or may not be a problem, depending on how and if sold properties differ from unsold or not-for-sale properties. Generally speaking, in broader and more active markets self-selection is less likely to be a relevant issue. For example, the housing market is more active than the shopping center market and is less likely to exhibit systematic differences between properties offered for sale and properties not for sale. Nevertheless, some residential neighborhoods could be affected by a localized externality such as an environmental hazard, plant closing, or change in access. (5) In such cases data from an affected location might not be representative of properties in unaffected locations.
The retail sector provides a good example of how self-selection can affect real property transaction data. Suppose a prominent and common anchor tenant is ceasing operation or reorganizing through bankruptcy, and several of the market area's shopping centers occupied by this anchor tenant are offered for sale. If and when these properties sell, they probably would not be representative of the remaining shopping centers in the market that were not affiliated with this tenant. Statistical analysis of market transaction data biased by inclusion of these sales might misrepresent the segment of the retail population unaffected by the store closings. The same logic applies to comparable rents associated with retail centers having dark anchor stores.
Because real property offered for sale or rent is a self-selected sample rather than a random sample, appraisers should take care to ensure that the transaction data being analyzed is truly representative of the subject property's competitive market. Experienced appraisers should be able to determine the influence, if any, of self-selection in a market that may preclude some data from inclusion in a given analysis or study. Furthermore, competent appraisers know that unfamiliarity with a market and an inability to assess the existence of self-selection bias within it require the assistance of someone who understands the market in order to credibly assess transaction data.
Nonprobability Samples
Nonprobability samples are less useful for inference than the sorts of probability samples we have been talking about so far because the conclusions we can reach through statistical analysis of a nonprobability sample are sample-specific. Information obtained from the sample data may not be applicable to the larger population because there is no guarantee that the sample data are representative of the population.
For example, Internet surveys where users of a site are asked their opinion on matters as varied as election outcomes, results of sports contests, or whether or not an economic recession is looming are nonprobability samples. The results of such a survey only tell us how the proportion of a Web site's users who responded felt about the issue. We do not know if survey respondents are representative of all of the site users. Nor do we know if the opinions of the respondents mirror the opinions of the general population. The survey results could be applicable to the general population, but no statistical measure has been provided of the relationship of such a nonprobability sample to the general population.
This is why expert, professional appraisal judgment is necessary when applying statistical analysis of comparable sale or rental data to a subject property or subject market. Because the generation of comparable data is largely a self-selection process, valuation expertise is required to assess whether or not comparable data items are representative of the subject of a study. When unrepresentative data items are identified, the items can either be removed from the analysis or flagged for later treatment (i.e., attempting to statistically control and adjust for the aspects of the transactions that cause them to be unrepresentative of the subject market).
A decision to exclude data or to apply some form of statistical control depends on the amount of available data and the reason for conducting the study. When a reduced data set excluding unrepresentative data is large enough for a valid study, the unrepresentative data may be excluded. For example, in a residential context it is usually preferable to exclude luxury home sales and entry-level home sales from a study of mid-priced home values. While it is possible to statistically control for the differences between luxury homes, entry-level homes, and mid-priced homes, figuring out which controls to employ may not be a simple task. However, if we needed to know the effects of an externality such as street noise or power line proximity on an entire residential market, then we would most likely want to understand the effect of the externality across all price categories--entry-level, mid-priced, and luxury. The study of the entire residential market might also include apartments, condominiums, and townhomes in the sample, depending on the scope of work applicable to the assignment.
Sampling error occurs when the sample differs from the population. In hypothesis testing, sampling error may result in rejection of a null hypothesis that is actually true or it may result in failure to reject a null hypothesis that is actually false. Either of these results will lead to an inappropriate conclusion. In the first case, the research hypothesis is flawed, but the data indicate that it is not. In the second case, the research hypothesis is correct, but the analysis doesn't support it.
The outcomes of hypothesis testing can be reduced to four possibilities:
* H₀ is true, fail to reject H₀: correct result
* H₀ is false, reject H₀: correct result
* H₀ is true, reject H₀: erroneous result
* H₀ is false, fail to reject H₀: erroneous result
Rejecting a true null hypothesis is referred to as Type I error. With Type I error the study supports the research hypothesis even though it is based on a false supposition. The probability of rejecting a true null hypothesis is symbolized by the lowercase Greek letter alpha (α), which is called the significance level. The probability of not making a Type I error (1 - α) is referred to as the confidence level. The probability of Type I error can be controlled by selecting the significance level α prior to performing a statistical test of the null hypothesis. The researcher decides on an acceptable probability of rejecting a true null hypothesis and rejects the null hypothesis if the statistical results are at or better than the predetermined α threshold. For example, if α is set at 5% and the statistical test returns a p-value of 3%, the result is considered to be significant and the null hypothesis is rejected.
If this seems confusing, look at it from a confidence level perspective. Setting α at 5% is the same as saying, "If the data allow me to be 95% confident that my research supposition is correct, then I am going to reject the null hypothesis and accept the research hypothesis." For example, if the analysis results in a p-value of 3%, the corresponding confidence level is 97%. In this case we have exceeded the 95% confidence level threshold, supporting the validity of our research hypothesis.
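The decision rule is mechanical once α is fixed in advance, as this small sketch shows (the 5% threshold and the example p-values mirror the figures in the text):

```python
# Decision rule: reject H0 when the p-value is at or below the preset alpha.
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare a test's p-value to the significance level chosen in advance."""
    confidence = 1 - p_value
    if p_value <= alpha:
        return f"reject H0 (confidence {confidence:.0%})"
    return f"fail to reject H0 (confidence {confidence:.0%})"

print(decide(0.03))  # 3% result vs. 5% threshold
print(decide(0.08))  # above the threshold
```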
One way to guard against the erroneous rejection of a null hypothesis that is actually true is to take care in the construction of the research hypothesis. Note that the null hypothesis is true only if the research hypothesis is false. Better reasoning, logic, and understanding of underlying phenomena will help guard against flawed research designs that attempt to support false suppositions.
The erroneous failure to reject a null hypothesis that is actually false is known as Type II error. In this case the study fails to support the research hypothesis even though it is based on a true supposition. The probability of Type II error is symbolized by the lowercase Greek letter beta ([beta]). Unfortunately, [beta] cannot be known with certainty unless you know the true population parameter you are attempting to infer. (And if you already knew that parameter, why would you be attempting to infer it?)
Consider the example of the effect of traffic noise on home price. If the effect of noise is substantial, then the probability of failing to support the research hypothesis is small. If the effect does exist but is not substantial, then we will have more difficulty demonstrating the effect statistically, meaning that [beta] will be relatively large. When the effect of traffic noise is small, the statistical analysis must be more "powerful," increasing the probability of demonstrating the effect. The power of a statistical test is symbolized by 1 - [beta].
Statistical power can be increased in three ways:
1. Relax [alpha]. This choice may not be very satisfactory if the logic behind the initial decision on [alpha] has not changed. (6)
2. Increase the size of the sample. Small effects are much easier to uncover with more data.
3. Eliminate confounding effects. In the street noise example, the street noise effect may be masked if lots abutting a thoroughfare are generally larger than lots in the interior of the same subdivision. Controlling for lot size in the analysis should improve a model's ability to detect the effect of street noise.
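The first two levers can be made concrete with a short Python sketch of power for a one-sided z-test with known [sigma]. The effect size, [sigma], and sample sizes below are illustrative assumptions, not figures from the article:

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_one_sided(effect: float, sigma: float, n: int,
                    alpha: float = 0.05) -> float:
    """Power (1 - beta) of a one-sided z-test to detect a true mean
    shift of `effect` units, given population sigma and sample size n."""
    z_crit = Z.inv_cdf(1 - alpha)     # critical value for this alpha
    se = sigma / sqrt(n)              # standard error of the sample mean
    return 1 - Z.cdf(z_crit - effect / se)

# Item 2: more data raises power for the same effect and alpha.
print(power_one_sided(effect=5, sigma=20, n=30))
print(power_one_sided(effect=5, sigma=20, n=120))

# Item 1: relaxing alpha from 0.05 to 0.10 also raises power.
print(power_one_sided(effect=5, sigma=20, n=30, alpha=0.10))
```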
Relating Choice of Significance Level [alpha] to the Standard Normal Distribution
Choice of [alpha] is a way of stating how far--in statistical distance--the sample mean must be from what the population mean would be if the null hypothesis were true in order to reject the null hypothesis. Consider a simple pair of statistical test hypotheses:
[H.sub.0]: [mu] = 0
[H.sub.a]: [mu] [not equal to] 0
If we select [alpha] = 0.05 and the population to which [mu] applies is normally distributed, then we are saying, "If the sample mean is 1.96 standard deviations or more from 0, I will reject the null hypothesis that the population mean is 0."
Why 1.96 standard deviations? If we look up Z = 1.96 in the standard normal table, we will find a probability of Z [less than or equal to] 1.96 = 0.975. Additionally, when we look up -1.96 on the standard normal table, we will find a probability of Z [less than or equal to] -1.96 = 0.025. Therefore,
P(-1.96 [less than or equal to] Z [less than or equal to] 1.96) = 0.975 - 0.025 = 0.95.
The confidence level of 95% is associated with [alpha] = 0.05 (5%). To be 95% confident that the null hypothesis of [mu] = 0 is false, [bar.x] must be at least 1.96 standard deviations from 0. This concept is illustrated pictorially in Figure 2.
Figure 2 shows the standard normal curve along with the locations of 0 standard deviations (the middle of the curve) and [+ or -] 1.96 standard deviations. Recall that the area under the curve is equal to 1 (100%) and the area under a portion of the curve is equal to the probability of a sample mean ([bar.x]) being in that location when the true population mean is 0. By looking up -1.96 in the standard normal table we find that the area under the curve to the left of -1.96 is 0.025 (or a 2.5% probability of being in this location if [mu] = 0). By looking up 1.96 in the standard normal table we find that the area under the curve to the left of 1.96 is 0.975 (97.5%), leaving 0.025 to the right of 1.96 (2.5% probability of [bar.x] being in this location if [mu] = 0). Therefore, with [alpha] = 0.05, we can reject the null hypothesis that [mu] = 0 if [bar.x] [less than or equal to] -1.96 standard deviations from 0 or if [bar.x] [greater than or equal to] 1.96 standard deviations from 0.
[FIGURE 2 OMITTED]
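The table lookups described above can be checked with Python's standard library NormalDist, a minimal sketch of the two-tailed critical value calculation:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

alpha = 0.05
# Two-tailed critical value: put alpha/2 (2.5%) in each tail.
z_crit = Z.inv_cdf(1 - alpha / 2)
print(round(z_crit, 2))                       # 1.96

# Area between -1.96 and +1.96 is the 95% confidence level.
print(round(Z.cdf(1.96) - Z.cdf(-1.96), 2))   # 0.95
```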
Hypothesis testing is a skill that is developed through practice. So, let's work through an example problem. Suppose we interviewed a representative of a fast food franchise and were told that the chain's average restaurant floor area is 2,400 square feet. Other sources indicate that the average floor area for this particular fast food concept has grown over time as menus have expanded and adjusted to new consumption patterns. This fast food concept is fairly new to our state, and we suspect that the average floor area here exceeds 2,400 square feet. We decide to use a random sample of floor areas to test our hypothesis, and we decide also that if we can be 90% confident, we will conclude that the average floor area in this state exceeds 2,400 square feet.
First we state the research, null, and alternative hypotheses and the significance level required to reject the null hypothesis with 90% confidence.
Research Hypothesis: Average floor area exceeds 2,400 square feet.
[H.sub.0]: [mu] [less than or equal to] 2,400 square feet
[H.sub.a]: [mu] > 2,400 square feet
Notice that the null hypothesis in this example contains the "[less than or equal to]" symbol rather than the "=" symbol because the research hypothesis is stated as "exceeds." Remember that the null and alternative hypotheses must be mutually exclusive and collectively exhaustive; therefore, [H.sub.0] must cover all of the possibilities that differ from [H.sub.a].
Next, we calculate the sample mean and assume for now that we know the population standard deviation [sigma].
[bar.x] = 2,560 square feet
[sigma] = 114 square feet
Z = ([bar.x] - [mu])/[sigma] = (2,560 - 2,400)/114 = 1.40
Is the sample mean of 2,560 far enough from 2,400 in statistical terms to reject the hypothesis that the mean floor area is 2,400 square feet or less? We can address this question in one of two ways:
1. Select the Z value associated with [alpha] = 0.10 and compare 1.40 to this significance level threshold.
2. Assess the probability of [bar.x] being 1.40 standard deviations or more from the hypothesized mean, and compare this result to the required [alpha] level of 10%.
Let's do it both ways.
The Z value associated with [alpha] = 0.10 is the value--call it "B"--where P(Z [less than or equal to] B) = 0.90. The standard normal table indicates that this occurs with a value of approximately 1.28 standard deviations. The value of 1.28 is called the critical value of Z because an [bar.x] result of this amount or more is required to reject the null hypothesis. (7) Because 1.40 is greater than the critical value of 1.28, you can reject the null hypothesis.
Alternatively, the standard normal table shows that the probability of Z being less than or equal to 1.40 is 0.919. Therefore, the significance level indicated by the sample is 0.081, which is less than 0.10, so we can reject the null hypothesis and state with at least 90% confidence (or, more precisely, 91.9% confidence) that the mean floor area in this state is greater than 2,400 square feet. In statistics the Z value probability of 0.081 is referred to as a p-value, which is the probability of an [bar.x] result 1.40 or more standard deviations above 2,400, assuming the null hypothesis is true.
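The floor-area test can be reproduced in a few lines of Python. This sketch follows the article's setup, treating [sigma] = 114 as the dispersion of the sample mean; the small differences from the table-based figures come from carrying full precision rather than rounding Z to 1.40:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

x_bar, mu0, sigma = 2560, 2400, 114   # sample mean, hypothesized mean, sigma
z = (x_bar - mu0) / sigma             # test statistic, about 1.40
p_value = 1 - Z.cdf(z)                # one-tailed (right-tail) p-value

print(round(z, 2))        # 1.4
print(round(p_value, 2))  # 0.08, near the table-based 0.081
print(p_value < 0.10)     # True: reject H0 at alpha = 0.10
```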
In this example we rejected the null hypothesis based on what is called a one-tailed test. The null hypothesis contains a [less than or equal to] statement, so we need only be concerned with the right tail of the Z distribution to test the validity of the null hypothesis. Similarly, if the null hypothesis contained a [greater than or equal to] statement, then we would only concern ourselves with the left tail of the Z distribution (also a one-tailed test). When the null hypothesis contains an = statement, it can be rejected at either tail of the Z distribution, which is referred to as a two-tailed test.
As a practical matter this exercise, though quite simple in statistical terms, could be useful in assessing whether or not an old floor plan is significantly smaller than new store requirements in support of an assessment of functional obsolescence. Or it could support a highest and best use analysis of a pad site, adjusting requisite floor area ratio to the current trend in store size.
Once you decide to gather sample data for a statistical study, you are immediately confronted with the issue of how much data you need. The resolution of this issue can be simple or complex, depending on the situation. If you intend to study sample means or sample proportions, the calculation of sample size may be a straightforward result of selecting the accuracy you expect to achieve and plugging that information into a simple equation. If the study involves data collection by survey, the sample size will have to be adjusted for nonresponders and inappropriate responders. Of course, you may not know how many of these you will encounter until you have completed the survey. (8)
However, be aware that if you plan to employ a regression model, the sample will have to be large enough to accommodate all of the variables you may need to include in the model. Unfortunately, you may not know in advance how many variables the model will require, which complicates the determination of sample size.
Sample Size for Estimating Means
Suppose you want to estimate the mean rent for one-bedroom apartments from a sample representative of all one-bedroom apartments in your city. Required sample size can be calculated once you make three decisions:
1. Level of confidence you require
2. Degree of accuracy you expect to achieve
3. An estimate of the standard deviation of one-bedroom rents in the city
The level of confidence you require is 1 - [alpha]. Therefore, this decision determines [alpha], which is required to estimate sample size. The degree of accuracy you expect to achieve is stated in terms of units of measure. For instance, if you are estimating mean rent, the degree of accuracy is stated in dollars. Degree of accuracy is called sampling error, which is symbolized as e. The standard deviation ([sigma]) of the variable being estimated will be unknown and must be estimated. Methods of estimating [sigma] include referencing prior studies, conducting a small pilot study, or investigating the range of the variable of interest (the range will often be approximately 6 times [sigma] for a normal distribution).
The equation for estimating sample size required to estimate a population mean is
n = [Z.sup.2][[sigma].sup.2]/[e.sup.2]
Picking up the one-bedroom apartment rent example again, let's assume you decide on a 95% confidence level, expect to be accurate within [+ or -]$10.00, and estimate the range of monthly rent for one-bedroom apartments in the market area at $120 ($650 to $770). Based on these factors, you select Z = 1.96 based on [alpha] = 0.05 and the standard normal distribution, e = 10 and [sigma] = 20 ($120 / 6). The required sample size is
n = [Z.sup.2][[sigma].sup.2]/[e.sup.2] = [1.96.sup.2] x [20.sup.2]/[10.sup.2] = 15.36
Sample size calculations are generally rounded up, so you would want to draw a random sample of at least 16 one-bedroom apartment rents.
Suppose you require more precision than an estimate of mean rent [+ or -]$10. For example, you may need more statistical power to compare mean rents for two types of one-bedroom apartment (say, those with and without a private balcony). Assume you need to decrease sampling error from $10 to $5 in order to have enough statistical power to detect the effect of private balconies. What does this requirement do to sample size?
n = [Z.sup.2][[sigma].sup.2]/[e.sup.2] = [1.96.sup.2] x [20.sup.2]/[5.sup.2] = 61.47
Sample size essentially quadruples to 62. This emphasizes an important point:
Increases in statistical power are "costly" when cost is stated in terms of sample size.
Cutting sampling error in half quadruples sample size, and reducing sampling error to one-quarter of the amount illustrated in this example ($2.50) would increase sample size 16-fold. The relationship between sample size and sampling error follows an inverse-square law due to the [e.sup.2] term in the denominator of the sample size equation.
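The sample size equation for means and its inverse-square behavior can be sketched as follows (the function name sample_size_mean is ours; the figures reproduce the rent example above):

```python
from math import ceil
from statistics import NormalDist

def sample_size_mean(confidence: float, sigma: float, e: float) -> int:
    """n = Z^2 * sigma^2 / e^2, rounded up to a whole observation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-tailed Z
    return ceil((z * sigma / e) ** 2)

print(sample_size_mean(0.95, sigma=20, e=10))   # 16: the $10 example
print(sample_size_mean(0.95, sigma=20, e=5))    # 62: halving e quadruples n
print(sample_size_mean(0.95, sigma=20, e=2.5))  # 246: quartering e is ~16x
```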
Sample Size for Estimating Proportions
As Americans, we are accustomed to reading about proportion estimates at election time. The following was reported by Reuters on the eve of the January 8, 2008, New Hampshire presidential primary election:
A Reuters/C-SPAN/Zogby poll showed Obama with a 10-point edge on Clinton in the state, 39 percent to 29 percent, as he gained a wave of momentum from his win in Iowa.
The margin of error (e) for the statement above was reported elsewhere to have been [+ or -]4.4%. Assuming a significance level of 0.05, the poll taker was 95% confident that Obama's proportion of the vote was between 34.6% and 43.4% and Clinton's proportion was between 24.6% and 33.4%. Based on this information we can deduce that there were approximately 496 respondents, as we will see shortly.
The equation for the sample size required to estimate a population proportion is
n = [Z.sup.2]p(1 - p)/[e.sup.2]
where Z is the standard normal value associated with the confidence level, e is the margin of error, and p is an estimate of the population proportion. For most proportion estimates p is set at 0.50 because this proportion maximizes the value of p(1 - p), ensuring that the sample is sufficiently large regardless of the true population proportion. Returning to the New Hampshire presidential primary poll, we can apply the equation for sample size to deduce the number of respondents:
n = [Z.sup.2]p(1 - p)/[e.sup.2] = [1.96.sup.2] x 0.50(1 - 0.50)/[0.044.sup.2] = 496
Now let's look at a more practical problem for an appraiser. Suppose you want to estimate the proportion of recent in-migrants to a city opting for rental housing rather than home ownership during their first year of residency. Assuming you could obtain a list of recent in-migrants from which to draw a sample (e.g., from electric company records), you could determine the number of respondents you would need by deciding on a confidence level and the margin of error. If you set [alpha] = 0.05 and e = 2%, the number of randomly chosen responses you would need is:
n = [1.96.sup.2] x .50(1 - 0.50)/[0.02.sup.2] = 2,401
Comparing the presidential primary poll to the housing tenure choice sample above, we can see that reductions in the margin of error (4.4% to 2%) dramatically increase sample size and related costs. Therefore, we must carefully consider the question of what is a sufficient margin of error and the associated amount of time and money to devote to data gathering.
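The proportion calculations above can be sketched in the same way (the function name sample_size_proportion is ours; the exact-precision Z rounds the poll figure up to 497 rather than the article's approximate 496):

```python
from math import ceil
from statistics import NormalDist

def sample_size_proportion(confidence: float, e: float,
                           p: float = 0.50) -> int:
    """n = Z^2 * p * (1 - p) / e^2, rounded up to a whole respondent.
    p = 0.50 maximizes p(1 - p), giving a conservative sample size."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-tailed Z
    return ceil(z ** 2 * p * (1 - p) / e ** 2)

print(sample_size_proportion(0.95, e=0.044))  # ~497: the primary poll
print(sample_size_proportion(0.95, e=0.02))   # 2401: the in-migrant survey
```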
Internet resources suggested by the Lum Library
Appraisal Institute Education Courses, "Real Estate Finance, Statistics, and Valuation Modeling"
American Statistical Association
Bureau of Labor Statistics
Commercial Real Estate Research, National Association of Realtors
Data and Statistics--General Data Resources, U.S. General Services Administration Reference Center
Field Guide to Quick Real Estate Statistics, National Association of Realtors Library
Trends and Statistics--Real Estate, Internal Revenue Service
U.S. Census Bureau
The World Wide Web Virtual Library-Statistics, University of Florida, Department of Statistics
by Marvin L. Wolverton, PhD, MAI
The material in this article originally was published as chapter 6 in Marvin L. Wolverton, An Introduction to Statistics for Appraisers (Chicago: Appraisal Institute, 2009).
(1.) A recommended source for a discussion of validity and reliability in research design and implementation is Mary L. Smith and Gene V. Glass, Research and Evaluation in Education and the Social Sciences (Boston: Allyn and Bacon, 1987).
(2.) David M. Levine, Timothy C. Krehbiel, and Mark L. Berenson, Business Statistics:A First Course, 3rd ed. (Upper Saddle River, N.J.: Prentice Hall, 2003), 23-25.
(3.) Earl Babbie, The Practice of Social Research, 6th ed. (Belmont, Calif.: Wadsworth, 1992).
(4.) Assuming a 30-day month, the probability of randomly picking the 14th or 15th each month is 2/30, or 1/15th. Over 12 months this sums to 12/15ths, or 80%.
(5.) Localized externalities differ from marketwide externalities affecting all properties. For example, the 2008/2009 residential foreclosure wave determines the market in many locales, and a representative sample would legitimately be expected to include foreclosed properties and foreclosure price effects, if any.
(6.) Because the choice of a high level of significance reduces [beta], the effect on [beta] should have been considered in the initial selection of [alpha].
(7.) Try calculating the critical value associated with 0.90 using Excel's NORMINV function (e.g., =NORMINV(0.9, 0, 1)) to derive a more precise critical value of 1.2816.
(8.) Perhaps the best reference for survey sampling design and maximizing response rate is by Don Dillman, who has written a series of books on the topic and is a highly regarded expert. His latest book is Mail and Internet Surveys: The Tailored Design Method (Hoboken, N.J.: John Wiley & Sons, 2007). If you are not anticipating conducting a Web-based survey, then one of his older books would be sufficient.
Marvin L. Wolverton, PhD, MAI, is a practicing real property valuation theorist and consultant currently employed as a senior director in the national Dispute Analysis and Litigation Support practice at Cushman & Wakefield, where he engages in litigation consulting and expert witness services. Wolverton is also an emeritus professor, and the former Alvin Wolff Distinguished Professor of Real Estate, at Washington State University. He is a state-certified general appraiser and has been a member of the Appraisal Institute since 1985. Wolverton is a current member of the Appraisal Journal Review Panel. He has also served as editor of the Journal of Real Estate Practice and Education and on the editorial boards of the Journal of Real Estate Research and The Appraisal Journal. He has authored more than forty articles in refereed and professional journals, including the Journal of Real Estate Research, Real Estate Economics, Journal of Real Estate Finance and Economics, Assessment Journal, Journal of Real Estate Portfolio Management, Journal of Property Valuation and Investment, Journal of Property Research, and The Appraisal Journal. He has edited and written books and chapters of books on valuation theory and specialized appraisal topics, and he teaches appraisal courses on behalf of the Appraisal Institute. His formal education includes a bachelor of science in mining engineering from New Mexico Tech, a master of science in economics from Arizona State University, and a doctor of philosophy, specializing in real estate and decision science, from Georgia State University. Contact: firstname.lastname@example.org