Methodological techniques for dealing with missing data.

When analyzing data, it is commonplace to observe that data are not always complete for each case. Rather, some data are usually missing. In some cases the amount of missing data may be minimal; in others it may be significant. This article deals with the methodological issues related to missing data. Specific questions addressed are: What are missing data? Why are missing data important? What are the major reasons for missing data? How might missing data be prevented or minimized? How do you detect missing data? What are the types of missing data? What do you do with missing data? And, how do common software packages handle missing data?



Missing data are data you intended to collect but that never made it into your database for subsequent analysis.


There are several reasons for being concerned about missing data. Missing data can reduce your effective sample size, which results in a loss of statistical power. That is, you may have initiated your study with a sample size of 300 people, but missing data may effectively reduce the sample size to 250 cases. As sample size decreases, the sampling variability of your estimates increases and the confidence you can place in them decreases. As sample size decreases, your data may also no longer be representative. Missing data may introduce bias. For example, if you find that some sensitive health behavior questions, such as those on sexual activity or drug behavior, or just a basic demographic question such as annual income, are not answered by many people, then the responses you do get may not be representative of the population. That is, people whose behavior was socially acceptable or whose incomes were neither very low nor very high may have responded, whereas those having less socially acceptable responses or very low or very high incomes may have been more likely not to respond. Thus, when you analyze the health behaviors, you have a built-in bias due to missing data.
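
The loss of precision in the 300 versus 250 example can be made concrete with a small calculation. The sketch below is only illustrative: it assumes the quantity of interest is a simple proportion and picks p = 0.5 as a hypothetical value, neither of which comes from the article.

```python
import math

def standard_error_of_proportion(p: float, n: int) -> float:
    """Standard error of a sample proportion p estimated from n cases."""
    return math.sqrt(p * (1.0 - p) / n)

# Planned sample versus the effective sample left after missing data.
# p = 0.5 is an assumed value chosen only for illustration.
se_planned = standard_error_of_proportion(0.5, 300)
se_effective = standard_error_of_proportion(0.5, 250)

print(f"n = 300: standard error = {se_planned:.4f}")
print(f"n = 250: standard error = {se_effective:.4f}")
print(f"relative increase = {se_effective / se_planned - 1:.1%}")
```

Because the standard error shrinks with the square root of the sample size, losing 50 of 300 cases widens it by a factor of the square root of 1.2, whatever value of p is assumed.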

Similarly, missing data make it difficult to measure effects. Using the same example, you may have been interested in assessing the effects of income on health behaviors. However, if people at the ends of the income spectrum were more likely not to respond than those in the middle of the range, you may conclude there was no relationship between income and health behaviors when in fact one may exist. Thus, missing data can influence both the analysis and interpretation of your data.
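
A small simulation can show how nonresponse at the ends of the income range can hide a real relationship. Everything in this sketch is hypothetical: the standardized income scores, the strength of the relationship (0.3), and the rule that only respondents near the middle of the range answer. That rule is a deliberately extreme version of "more likely not to respond".

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical standardized income and a health behavior score related to it.
income = [rng.gauss(0.0, 1.0) for _ in range(5000)]
behavior = [0.3 * x + rng.gauss(0.0, 1.0) for x in income]

# Assumed response pattern: only people near the middle of the range answer.
responders = [(x, y) for x, y in zip(income, behavior) if abs(x) < 0.5]
income_mid = [x for x, _ in responders]
behavior_mid = [y for _, y in responders]

r_full = pearson_r(income, behavior)
r_mid = pearson_r(income_mid, behavior_mid)

print(f"all {len(income)} cases: r = {r_full:.3f}")
print(f"{len(responders)} responders only: r = {r_mid:.3f}")
```

The correlation among responders comes out much weaker than in the full group, which is the situation described above: the relationship exists, but the cases needed to see it are missing.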


There are many reasons for missing data. Common reasons are presented below and ways to address them are presented in the following section. A common form of missing data is the subject's refusal to answer an item. Often this is because of question sensitivity. Sensitive questions about health behaviors, income, and illegal activities are examples. Another common reason is that the respondent simply doesn't know the answer. This may be because of memory problems, for example not being able to remember the date of the last physical check-up, or comprehension problems, such as not understanding the words or constructs in the question. Sometimes the data desired are simply not applicable. For example, a questionnaire item asking a person how long ago they had a tetanus shot may result in a nonresponse by someone who never had one. Or, asking enrollees of an HMO to evaluate their most recent visit during the past twelve months will not be applicable to those who haven't been to the HMO facility during that period.

Missing data may be the result of questionnaire programming errors if computer-assisted interviewing is used. For example, with reference to the previous question about the most recent visit, the respondent who wasn't at the HMO in the past year shouldn't have been presented with the question in the first place. Data processing errors also may be responsible for missing data. The data may not have been entered, or may have been entered incorrectly. Missing data are also commonplace in studies where data are collected over time. For example, in measuring a health promotion weight reduction intervention program, data on knowledge, attitudes, eating behaviors and weight may be collected at the beginning and end of the program as well as six and twelve months later. At each data collection point, subject attrition is possible and cumulative, with attrition increasing with time.


An understanding of the reasons for missing data can help reduce the problem and thus strengthen your data. The best approach to avoiding or minimizing missing data is prevention. That is, missing data can be prevented at the outset by developing a well-designed instrument with clear directions and unambiguous and answerable items. Another strategy is, at the time of data collection via phone or personal interview, checking that all applicable data are collected before ending the interview. Data returned by mail questionnaire can be checked for missing data and follow-up done accordingly, although this can be a time-consuming and costly process.

Refusals can be reduced by attempting to develop a better rapport with the respondent, including explaining the purpose of the study and how it can be beneficial to the respondent, assuring confidentiality, reducing question sensitivity or asking sensitive questions in a different way. Asking Questions (Sudman, 1982) is an excellent reference for ways to increase response rates to sensitive questions.

"Don't knows" can be reduced by being realistic in your data gathering instrument or providing cues or techniques of aided recall for the respondent. For example, instead of asking whether or not people have eaten fruits or vegetables high in fiber during the past week, you might provide a list of fruits and vegetables high in fiber with a yes/no response after each. This not only identifies what is considered high in fiber but provides a cue to aid in recall. Alternatively, for some questions, don't know can be a legitimate response to be included in your instrument and analyzed as valid data. Data processing errors can be reduced by having a properly designed questionnaire that facilitates data entry or by setting up a data entry program that directs you through the data entry process.


There are several ways to detect missing data. Techniques are described in greater detail in a recent paper by O'Rourke (2000). These include visual scanning of the data, and utilizing a data entry program such as QPL or SPSS® and then listing missing data for each case in the database. Another is identifying missing data by running a frequency program that counts the frequency of responses for each variable. Another possibility is doing a bivariate analysis, which is simply a crosstabulation of one variable by another. Any respondent who didn't answer one or both of the items would be counted as missing. For example, let's assume there was a health behavior question with a "yes" or "no" response, an income question with three response possibilities, and 300 respondents. Of those respondents, let's say that 20 answered neither question. That reduces the number of respondents to 280. Let's also say that an additional 10 cases did not answer the health behavior question and another 15 respondents did not provide their income group. This would result in a reduction of an additional 25 cases, resulting in a final count of 255 cases having data for both variables. A listing of any cases having any missing data for the variables being analyzed could then be generated, and efforts to collect the missing data or strategies to deal with them could be initiated.
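
The frequency run, crosstabulation and case listing described above can be sketched in a few lines. This is a minimal illustration, not a substitute for a statistical package: the data layout, the use of None as the missing marker, and the way the 255 complete cases are split across response categories are all assumptions made for the example.

```python
from collections import Counter

def build_cases():
    """Create 300 hypothetical respondents; None marks an unanswered item."""
    pattern = [
        (20, None, None),       # answered neither item
        (10, None, "middle"),   # skipped the health behavior item only
        (15, "yes", None),      # skipped the income item only
        (85, "yes", "low"),     # the remaining 255 answered both items
        (85, "no", "middle"),
        (85, "yes", "high"),
    ]
    cases = []
    for count, behavior, income in pattern:
        for _ in range(count):
            cases.append({"id": len(cases) + 1, "behavior": behavior, "income": income})
    return cases

cases = build_cases()

# Frequency run: count every response, including missing, for each variable.
behavior_freq = Counter(case["behavior"] for case in cases)
income_freq = Counter(case["income"] for case in cases)

# Crosstabulation: only cases with both items answered can be tabulated.
complete = [c for c in cases if c["behavior"] is not None and c["income"] is not None]
crosstab = Counter((c["behavior"], c["income"]) for c in complete)

# Listing of cases with any missing data, for follow-up.
incomplete_ids = [c["id"] for c in cases if c["behavior"] is None or c["income"] is None]

print("missing on behavior:", behavior_freq[None])
print("missing on income:", income_freq[None])
print("cases in the crosstabulation:", sum(crosstab.values()))
print("cases needing follow-up:", len(incomplete_ids))
```

Note that the frequency runs report 30 missing on the health behavior item and 35 on income, yet only 45 cases drop out of the crosstabulation, because the 20 respondents who answered neither item are counted in both frequencies.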


Another important decision is determining the type of missing data. Three types will be discussed. The first type is when the data are missing completely at random. That is, whether a value is missing is unrelated to either the independent variable X or the dependent variable Y. In these cases the cases with complete data are a random sub-sample of the original sample. To clarify further, in this instance there are no significant differences you can identify between cases with and without missing values on either the X or Y variable. This situation is the least problematic because the missing data are not related to the outcome being studied. In this instance, the available data can be analyzed and reliable statistics obtained.

The second type is where the data are missing at random, though not completely at random. In this instance the probability that Y is missing depends on X but not on Y itself. For example, if poorly educated respondents (education = X) have trouble answering a question about drug attitude (Y), then data are missing at random, since it is the trouble answering the question and not the drug attitude per se that accounts for the missing data. This type of missing data is also of less concern than the final type, which involves nonrandom missing data. Nonrandom missing data are very much of concern and can jeopardize the entire analysis. In this instance the probability that Y is missing depends on both X and Y. For example, if depression (Y) and income (X) are both related to giving a response, then the missing data are nonrandom and ignoring them is not warranted, because the data will be biased unless compensations are made. Basically, missing data are problematic if the independent variable X is related to the dependent variable Y but the missing values do not allow for testing of that association or relationship.
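
The three types can be compared with a small simulation. The sketch below is hypothetical throughout: X and Y are simulated scores, the probabilities of being missing are invented for illustration, and the labels mcar, mar and mnar are shorthand for "missing completely at random", "missing at random" and "nonrandom" rather than terms used in the article.

```python
import random

def observed_mean(values):
    """Mean of the values that were actually observed (None = missing)."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen)

rng = random.Random(7)  # fixed seed so the sketch is reproducible
n = 20000
x = [rng.gauss(0.0, 1.0) for _ in range(n)]        # e.g. an education score
y = [0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]   # e.g. an attitude score

def delete(values, prob_missing):
    """Blank out each value with its own probability of being missing."""
    return [None if rng.random() < p else v for v, p in zip(values, prob_missing)]

# Completely at random: the same chance of being missing for everyone.
y_mcar = delete(y, [0.3] * n)
# At random: the chance depends on X (low X, more missing) but not on Y.
y_mar = delete(y, [0.6 if xi < 0 else 0.1 for xi in x])
# Nonrandom: the chance depends on the value of Y itself.
y_mnar = delete(y, [0.7 if yi > 0 else 0.1 for yi in y])

print(f"mean of Y, no missing data:       {observed_mean(y):.3f}")
print(f"mean of Y, completely at random:  {observed_mean(y_mcar):.3f}")
print(f"mean of Y, at random (overall):   {observed_mean(y_mar):.3f}")
print(f"mean of Y, nonrandom:             {observed_mean(y_mnar):.3f}")

# Within one level of X, the cases observed under "at random" still
# resemble the full group at that level.
low_x_full = [yi for xi, yi in zip(x, y) if xi < 0]
low_x_seen = [yi for xi, yi in zip(x, y_mar) if xi < 0 and yi is not None]
print(f"low-X group, all cases:      {sum(low_x_full) / len(low_x_full):.3f}")
print(f"low-X group, observed cases: {sum(low_x_seen) / len(low_x_seen):.3f}")
```

In this setup the completely-at-random mean stays close to the full-data mean and the nonrandom mean is clearly pulled down. The at-random case sits in between: the overall mean shifts because low-X respondents are under-represented, but within a level of X the observed cases still look like the full group, which is why taking X into account can compensate.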


Depending upon the situation, missing data may be dealt with in a variety of ways. One strategy is simply to ignore them and analyze the remaining data. This approach is reasonable if the missing data are minimal and/or the results are clear. For example, if you had a data set of 100 cases and the results of one variable showed that 5 cases had missing data while 90% of the remaining respondents gave the same response, then the missing data could not influence the overall analysis significantly and ignoring them would be appropriate. However, if there were 100 cases total, 40 cases that did not respond, and those responding were divided approximately evenly in their responses, the missing data would be problematic for the interpretation.
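
One way to see why the first case is clear and the second is not is to ask how far the overall result could move if every missing case had answered one way or the other. This best-case and worst-case check is an illustrative addition, not a procedure taken from the article, and the function name is hypothetical.

```python
def proportion_bounds(p_observed: float, n_total: int, n_missing: int):
    """Lowest and highest possible overall proportion giving a response,
    depending on how the missing cases would have answered."""
    n_answered = n_total - n_missing
    n_with_response = p_observed * n_answered
    lowest = n_with_response / n_total                 # no missing case gives the response
    highest = (n_with_response + n_missing) / n_total  # every missing case gives it
    return lowest, highest

# 100 cases, 5 missing, 90% of those answering give the same response.
clear_case = proportion_bounds(0.90, 100, 5)
# 100 cases, 40 missing, those answering split about evenly.
unclear_case = proportion_bounds(0.50, 100, 40)

print("5 missing:  overall proportion between %.3f and %.3f" % clear_case)
print("40 missing: overall proportion between %.3f and %.3f" % unclear_case)
```

With 5 missing cases the overall proportion stays between roughly 86% and 91%, so the conclusion holds either way. With 40 missing cases it could be anywhere from 30% to 70%, so the missing responses could reverse the finding.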

Another way of dealing with missing data is to edit or use imputation, which involves providing a value for the missing data. Imputation needs to be considered carefully since you are inserting a value that will be treated as real in place of a missing one. What is most important in imputation is choosing the most appropriate value. Using an inappropriate measure can increase error and distort findings. Several methods of imputation may be utilized. One is to use the mean or average value of the entire sample for the missing variable. For example, if the respondent did not provide an annual income, then the mean value of the entire group might be imputed. This may be appropriate if the data were not skewed. However, if the income variable data were skewed, the median or modal value would be more appropriate. In either case the most appropriate value to describe the entire sample would be used. In another instance one might choose the mode as the most appropriate measure. For example, if a survey question had two responses (Yes=1, No=2) and a frequency count of 100 respondents indicated that 90 of them said "Yes", 5 said "No" and 5 were "Missing", then entering a "Yes" value of 1 for each of the five missing cases would be reasonable. Alternatively, one could simply analyze the data of the 95 respondents answering the question. In either case the results are clear.
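
Mean, median and mode substitution differ only in which summary of the observed values is inserted. The following sketch assumes missing values are stored as None; the income figures are invented to show the effect of skew, and the yes/no counts follow the example above.

```python
import statistics

def impute(values, summary):
    """Replace each None with summary(observed values); leave the rest as is."""
    observed = [v for v in values if v is not None]
    fill = summary(observed)
    return [fill if v is None else v for v in values]

# Hypothetical skewed incomes: one very high value pulls the mean upward.
incomes = [20000, 25000, 30000, None, 35000, 400000]
mean_filled = impute(incomes, statistics.mean)
median_filled = impute(incomes, statistics.median)
print("imputed with the mean:  ", mean_filled[3])
print("imputed with the median:", median_filled[3])

# The yes/no item from the text: 90 "Yes" (1), 5 "No" (2), 5 missing.
answers = [1] * 90 + [2] * 5 + [None] * 5
mode_filled = impute(answers, statistics.mode)
print("'Yes' responses after mode imputation:", mode_filled.count(1))
```

In the income example the mean of the observed values is 102,000, far above what four of the five respondents reported, while the median is 30,000. That gap is the reason the median is preferred when the data are skewed.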

At times, a conditional mean (or median, or mode) substitution may be more appropriate. In this instance you would impute the mean value for a respondent missing data but that value would not be based on the value for the entire group. Rather, the imputed value would be conditional upon the characteristics of the respondent. For example, if the mean income of white males was $50,000 and that of white females was $40,000, then the missing income for a white female would be imputed as $40,000. If you were doing a study of youth smoking, had missing data for age of onset for a male smoker, and knew through a review of the literature that males had an earlier age of onset than females, it would be more appropriate to use the age of onset of other males in your study rather than that of the entire group. If the data were skewed then the median value of those groups, not the average of all groups, could be used. Another imputation possibility is a conditional regression estimation. In this case, if you have missing data for a dependent variable of a respondent, you would run a regression of the dependent variable on the independent variable(s) for the complete cases. If you find the independent variables are predictive, then you can generate an imputed predicted value for the respondent's missing data.
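
Both conditional approaches can be sketched briefly. The helper names, the record layout and the data below are hypothetical, and the regression uses a single independent variable fitted by ordinary least squares, which is one simple way to carry out the idea rather than the only one.

```python
from collections import defaultdict

def impute_group_mean(records, group_key, value_key):
    """Fill a missing value with the mean of the respondent's own group."""
    observed = defaultdict(list)
    for rec in records:
        if rec[value_key] is not None:
            observed[rec[group_key]].append(rec[value_key])
    group_means = {g: sum(v) / len(v) for g, v in observed.items()}
    # Assumes every group has at least one observed value.
    return [
        {**rec, value_key: group_means[rec[group_key]]} if rec[value_key] is None else dict(rec)
        for rec in records
    ]

def impute_by_regression(xs, ys):
    """Fill missing ys from a least-squares line fitted to the complete cases."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(complete)
    mean_x = sum(x for x, _ in complete) / n
    mean_y = sum(y for _, y in complete) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in complete)
    sxx = sum((x - mean_x) ** 2 for x, _ in complete)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None else y for x, y in zip(xs, ys)]

# Conditional mean: group means are 50,000 (male) and 40,000 (female).
people = [
    {"sex": "male", "income": 45000},
    {"sex": "male", "income": 55000},
    {"sex": "female", "income": 38000},
    {"sex": "female", "income": 42000},
    {"sex": "female", "income": None},
]
filled_people = impute_group_mean(people, "sex", "income")
print("imputed income:", filled_people[4]["income"])

# Regression estimate: the complete cases lie on the line y = 2x.
filled_y = impute_by_regression([1, 2, 3, 4, 5], [2, 4, None, 8, 10])
print("imputed y at x = 3:", filled_y[2])
```

The regression helper does not check whether the independent variable is actually predictive. As noted above, that should be confirmed on the complete cases before any predicted value is used.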

The overall approach for imputation is to decide on the preferable approach for different data scenarios prior to analyzing any data. Next, review your collected data and, based on that, choose the preferable approach. A more detailed description of these and other more sophisticated methodologies for dealing with missing data can be found in Kim & Curry (1977); Cox & Cohen (1985); Kalton & Kasprzyk (1986); Little & Rubin (1987); Little (1992); and Afifi & Clark (1996).


Most widely used statistical software packages such as SPSS® or SAS® can handle missing data in several ways. One is what is called "listwise" deletion, which deletes any case having any missing values for variables being analyzed in the data set. Thus only cases having complete data are analyzed. This is usually the default mode of analysis in statistical software packages. While listwise deletion ensures analysis of only complete data sets, its disadvantage is that it reduces effective sample size. If you have a very large sample, listwise deletion is usually not problematic. However, it can be very problematic for small samples. For example, if you had 300 respondents to 50 variables, but 200 respondents had missing data on one or more variables being analyzed, then data would be analyzed for only the 100 respondents having complete data. The result is that the data from the 100 respondents with complete data may be different from and less representative than the data from the 300 respondents. Listwise deletion is generally considered a very conservative approach to data analysis.

Alternatively, software packages allow for "pairwise" deletion of missing data. In this case the respondent's data will be used except when data for a given variable are missing. That is, the entire case is not thrown out of subsequent analysis due to missing data for one or more variables; it is excluded only from the analyses involving the variables for which data are missing. The respondent will be included in other analyses for which data were reported or imputed. While this approach uses a broader set of respondents, its disadvantage is that different subsets of cases are analyzed for different questions, depending upon the extent of missing data.
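
The difference between the two deletion strategies comes down to which cases are counted for which analysis. The sketch below uses a tiny invented data set with None as the missing marker; it illustrates the logic only and does not reproduce how any particular package implements it.

```python
from itertools import combinations

# Five hypothetical respondents; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "weight": 180},
    {"age": 41, "income": None, "weight": 165},
    {"age": None, "income": 38000, "weight": 150},
    {"age": 29, "income": 45000, "weight": None},
    {"age": 55, "income": 61000, "weight": 172},
]
variables = ["age", "income", "weight"]

def listwise(rows, variables):
    """Keep only the cases with no missing value on any analyzed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

def pairwise_counts(rows, variables):
    """For each pair of variables, count the cases having data on both."""
    return {
        (a, b): sum(1 for r in rows if r[a] is not None and r[b] is not None)
        for a, b in combinations(variables, 2)
    }

complete_rows = listwise(rows, variables)
pair_n = pairwise_counts(rows, variables)

print("cases kept under listwise deletion:", len(complete_rows))
for pair, count in pair_n.items():
    print("cases available for", pair, ":", count)
```

Here listwise deletion keeps 2 of the 5 cases, while each pair of variables can be analyzed on 3 cases. Those three-case subsets are not the same respondents from one pair to the next, which is the disadvantage of pairwise deletion noted above.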

It is also possible to impute a value for any variable that is missing before proceeding with any data analysis. This is particularly useful and appropriate if the amount of missing data is relatively minor and a logical imputation is used as previously described. More detailed information about statistical software and missing data can be found by reviewing the SPSS and SOLAS web sites found in the references.


Missing data in studies are commonplace and can range from minimal to problematic. Addressing the issue of missing data is important because it can impact subsequent analyses, discussions and conclusions. This article identifies multiple reasons for missing data and how they can be prevented or minimized. It also identifies several techniques to detect missing data and strategies to deal with them. It describes several approaches to imputation that may be utilized depending on an analysis of the data. Finally, it describes how statistical software programs can handle missing data and the advantages and disadvantages of each approach.


Afifi, A. & Clark, V. (1996). Computer-Aided Multivariate Analysis (3rd ed.). London: Chapman & Hall, Chap. 9, 197-202.

Cox, B. & Cohen, S. (1985). Methodological Issues for Health Care Surveys. New York: Marcel Dekker, Inc., Chap. 8, 214-237.

Kalton, G. & Kasprzyk, D. (1986). The Treatment of Missing Survey Data. Survey Methodology, 12, 1-16.

Kim, J. & Curry, J. (1977). The Treatment of Missing Data in Multivariate Analysis. Sociological Methods & Research, 6, 215-240.

Little, R. (1992). Regression with Missing X's: A Review. Journal of the American Statistical Association, 87, 1227-1237.

Little, R. & Rubin, D. (1987). Statistical Analysis with Missing Data. New York: John Wiley & Sons.

O'Rourke, T. (2000). Data Analysis: The art and science of coding and data entry. American Journal of Health Studies, 16, 164-166.

O'Rourke, T. (2000). Techniques for screening and cleaning data for analysis. American Journal of Health Studies, 16, 217-219.

SAS Statistical Software:

SPSS missing data analysis white paper:

SPSS Statistical Software:

Sudman, S. (1982). Asking Questions. San Francisco: Jossey-Bass.


Responsibility IV--Evaluating Effectiveness of Health Education Programs

Competency A--Develop plans to assess achievement of program objectives.

Sub-competency 4--Select appropriate methods for evaluating program effectiveness.

Thomas W. O'Rourke, Ph.D., MPH, CHES is a Professor in the Department of Community Health and College of Medicine at the University of Illinois at Urbana-Champaign, IL 61820. Address all correspondence to Thomas W. O'Rourke, Ph.D., MPH, CHES, 120 Huff Hall, University of Illinois, MC-588, 1206 South Fourth Street, Champaign, IL, 61820, PHONE: 217.333.3163, FAX: 217.333.2766, E-MAIL:
COPYRIGHT 2003 University of Alabama, Department of Health Sciences
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2003, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Author: O'Rourke, Thomas W.
Publication: American Journal of Health Studies
Date: Mar 22, 2003