Techniques for screening and cleaning data for analysis.


Health professionals in the area of health promotion/education frequently need to collect and analyze data for a variety of reasons, such as needs assessment, program planning and implementation, evaluation, budget justification, trend identification, projections or assessment of knowledge, attitudes and/or behaviors. Data may be collected in many settings, including schools, colleges/universities, community agencies, worksites and healthcare settings. They may be collected in a variety of ways, including from students in a classroom, clients in a community agency, or patients/consumers in a healthcare setting. Data also may be collected via mail survey, phone or personal interview, or from school, worksite, community agency or healthcare records.

After the data have been collected, they need to be entered into a format ready for analysis. This is generally done by entering the data into a data file on a computer. Each person or record for which data are collected is known as a case. Data files are made up of variables (each of the items you wish to analyze) and values for each variable, for each case you are analyzing. However, before analysis, data should be screened for errors and corrected. This article focuses on techniques for screening and cleaning data for analysis.

ENTERING DATA

As an example, let's say you had a data set having 11 variables (ID#, gender, age in years, current marital status, ethnicity, height in inches, weight in pounds, cholesterol count, yes-no responses for two variables [such as, "Do you generally wear a seat belt when in an automobile?" and "Have you had a physical exam during the previous twelve months?"] and one attitude variable ["Do you favor sex education being taught in the schools?"] with response codes of either strongly agree, agree, disagree, or strongly disagree). For some of these variables (age in years, height in inches, weight in pounds, and cholesterol count), the values would be the actual data. Other variables such as gender, marital status, ethnicity, answers to the yes/no items and the attitude item would require coding in a number for each possible value, for each variable, for each case. Ideally, a coding scheme would be developed before collecting the data in order to facilitate data entry (see O'Rourke, 2000, article "Data Analysis: The Art and Science of Coding and Entering Data" in a previous issue of this Journal). For example, for the hypothetical data set just mentioned, you may have coded the values for gender as male = 1, female = 2; current marital status as married = 1, separated = 2, divorced = 3, widowed = 4, never married = 5; ethnicity as Caucasian = 1, African-American = 2, Hispanic = 3, Asian = 4, Native American = 5, Other = 6; yes/no variables as yes = 1, no = 2; and the attitude item as strongly agree = 1, agree = 2, disagree = 3, strongly disagree = 4.
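
A coding scheme like this one can be written down as a lookup table before any data are entered. The following is a minimal Python sketch and is not part of the original article; the names CODEBOOK and encode, and the variable names, are hypothetical, while the numeric codes are those given in the text.

```python
# Hypothetical codebook: maps each categorical variable to its numeric codes.
# The codes themselves are the ones listed in the article's example.
CODEBOOK = {
    "gender": {"male": 1, "female": 2},
    "marital_status": {
        "married": 1, "separated": 2, "divorced": 3,
        "widowed": 4, "never married": 5,
    },
    "ethnicity": {
        "Caucasian": 1, "African-American": 2, "Hispanic": 3,
        "Asian": 4, "Native American": 5, "Other": 6,
    },
    "seat_belt": {"yes": 1, "no": 2},
    "physical_exam": {"yes": 1, "no": 2},
    "sex_ed_attitude": {
        "strongly agree": 1, "agree": 2,
        "disagree": 3, "strongly disagree": 4,
    },
}


def encode(variable, response):
    """Return the numeric code for a response.

    Raises KeyError if the variable or response is not in the codebook,
    which surfaces an uncoded answer at entry time.
    """
    return CODEBOOK[variable][response]


print(encode("gender", "female"))            # 2
print(encode("marital_status", "divorced"))  # 3
```

Keeping the scheme in one place means the same table can later be reused to decide which codes are legal for each variable.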

As a suggestion, you should include a "don't know" possibility for any variable where "don't know" is a distinct possibility, such as cholesterol count and knowing if you had a physical exam during the last twelve months or even weight, if an actual measurement was not used. In each of these cases, as a convention, "don't know" may be coded as an 8 if the variable has a single-digit response possibility (as in the case of a yes/no or attitude variable) or as an 888 for cholesterol count (whose values are three digits). You also should code in a 9 for missing data or for a refusal for any variable having a single-digit response possibility, a 99 if the variable has a 2-digit response possibility (such as for age), and 999 for 3-digit variables such as weight or cholesterol count. Again, the reader is encouraged to review the O'Rourke, 2000, article "Data Analysis: The Art and Science of Coding and Entering Data" that appeared in an earlier issue of this Journal.
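
The convention can be expressed as a rule: repeat the digit 8 (don't know) or 9 (missing or refused) to fill the width of the variable. The sketch below is illustrative only; the function names are hypothetical, and the 2-digit "don't know" code of 88 is an extrapolation of the convention rather than something stated in the text.

```python
def dont_know_code(width):
    """Code for "don't know": the digit 8 repeated to the variable's width.

    The text gives 8 (1 digit) and 888 (3 digits); 88 for 2-digit
    variables is an assumed extension of the same convention.
    """
    return int("8" * width)


def missing_code(width):
    """Code for missing or refused: the digit 9 repeated (9, 99, 999)."""
    return int("9" * width)


print(dont_know_code(1), dont_know_code(3))             # 8 888
print(missing_code(1), missing_code(2), missing_code(3))  # 9 99 999
```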

TECHNIQUES FOR SCREENING AND CLEANING DATA

Once you have entered the data, they should be "screened" and "cleaned" before subsequent analysis. Screening is a process that identifies real or potential errors in your data entry. The errors need to be corrected ("cleaned") to the maximum extent possible before analyzing your data. For example, for the gender variable, all values should be either a 1 for male, 2 for female or 9 for missing data. No other values are possible for this variable. For the yes/no items on wearing a seat belt or having a physical exam, the possibilities could be 1 for yes, 2 for no, 8 for "don't know," and 9 for missing or refused. The possible values for the attitude item would be 1, 2, 3 or 4 for strongly agree through strongly disagree, 8 for "don't know" and 9 for missing or refused. When cleaning the size variables (height and weight), the data analyst may wish to check any height over 76 inches or weight above 250 pounds or so, or check each variable against another. That is, it may be entirely possible for a person 6'6" (78 inches) to weigh 250 pounds; but it is not very likely for a third grader at 3'6" (42 inches).
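
These rules can be sketched as a screening function for a single case. The example below is a hedged illustration, not a method from the article: the variable names and the function screen_record are hypothetical, the legal codes come from the paragraph above, and the height and weight cutoffs are assumed review thresholds (a flagged value is a prompt to check the original record, not proof of an error).

```python
# Legal codes for each categorical variable, as described in the text.
LEGAL_VALUES = {
    "gender": {1, 2, 9},
    "seat_belt": {1, 2, 8, 9},
    "physical_exam": {1, 2, 8, 9},
    "sex_ed_attitude": {1, 2, 3, 4, 8, 9},
}

# Assumed review thresholds for the size variables.
MAX_HEIGHT_IN = 76
MAX_WEIGHT_LB = 250

# Sentinel codes that must not be mistaken for real measurements.
HEIGHT_SENTINELS = {99}        # missing (2-digit variable)
WEIGHT_SENTINELS = {888, 999}  # don't know, missing (3-digit variable)


def screen_record(record):
    """Return a list of human-readable problems found in one case."""
    problems = []
    for variable, legal in LEGAL_VALUES.items():
        if record[variable] not in legal:
            problems.append(f"{variable}: illegal code {record[variable]}")
    height, weight = record["height"], record["weight"]
    if height not in HEIGHT_SENTINELS and height > MAX_HEIGHT_IN:
        problems.append(f"height: {height} inches, check original record")
    if weight not in WEIGHT_SENTINELS and weight > MAX_WEIGHT_LB:
        problems.append(f"weight: {weight} pounds, check original record")
    return problems


example = {
    "gender": 4, "seat_belt": 1, "physical_exam": 2,
    "sex_ed_attitude": 6, "height": 96, "weight": 157,
}
for problem in screen_record(example):
    print(problem)
```

Sentinel codes are excluded from the range checks so that a missing weight of 999 is not reported as an implausibly heavy respondent.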

So how does one go about cleaning data? There are several ways, from low tech to high tech. Basically, choose the one that best meets your needs. A low tech approach is simple visual scanning of the data. If you have only a few variables (let's say fewer than 30) and fewer than 300 cases, you can quickly scan a printout or computer screen of the data and look for any impossible or highly implausible data, such as a code 5 for gender, a height of 8'0" (96 inches), a weight of 725 pounds, or a cholesterol count of 940. Upon identifying a real or potential error, you can go back to check the original data. To facilitate this you should make the first variable on the data set a unique ID number for each case. The ID number links the original data record to the data that are entered into the computer record. In this way when you find a data entry error you can quickly go back to the original data record and make the correction.

Table 1 provides a sample data set from an adult survey having eleven variables and ten cases. Review the data for each column and you will notice a number of incorrect data entries, which are flagged with [A]. Each reflects a datum that needs to be cleaned before proceeding with your analysis. To accomplish this you would first identify the incorrect data and then go back to the case ID number of the original data in order to make the necessary corrections.

While a visual check is useful, it is often not realistic to do in terms of time or accuracy when you have many variables and/or many cases. There are several ways to clean larger data sets. One is to use a program such as SPSS® Data Entry or Questionnaire Programming Language (QPL). These programs are used to edit and prepare the collected data for analysis. In these programs you set the possible acceptable ("legal") values for each variable before entering your data. Any data entered outside of the values you set are flagged immediately. The data entry program also can provide for automatic skipping of a question or questions. Let's say a person answered "no" to a survey question asking him whether he had ever used a university health center. Let's also say that the next five questions on the survey dealt with his evaluation of his last visit to the health center. Since only people who had answered "yes" to having used the university health center could have answered these questions, the data entry program would not allow you to enter data for these questions for any person indicating they had never used the health center. Rather, the program would skip you to the next question that the respondent could logically answer. A caveat is that data entry programs will not indicate an error so long as the value is a legal value. For example, if a person answered a yes/no item with a 1, but you entered a 2, then the error would not be detected. Only responses outside the range you indicate would be noted. The advantage of these programs is that they not only reduce data entry errors but also can generate frequency counts for each variable and data files ready for subsequent analysis.
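
The skip pattern in the health center example can be sketched as follows. This is an illustration only and does not reproduce how SPSS Data Entry or QPL work internally; the question names, the function enter_case, and the choice to store None for skipped questions are all assumptions.

```python
YES, NO = 1, 2

# Hypothetical names for the five follow-up questions about the last visit.
FOLLOW_UPS = ["visit_q1", "visit_q2", "visit_q3", "visit_q4", "visit_q5"]

# Assumption: skipped questions are stored as None rather than a numeric code.
SKIPPED = None


def enter_case(used_health_center, follow_up_answers=None):
    """Build one case, enforcing the legal values and the skip pattern."""
    if used_health_center not in (YES, NO):
        # Values outside the legal set are flagged immediately.
        raise ValueError(f"illegal code {used_health_center!r} for used_health_center")
    case = {"used_health_center": used_health_center}
    if used_health_center == NO:
        # The follow-up questions cannot logically apply, so nothing is
        # accepted for them, even if answers were supplied.
        for question in FOLLOW_UPS:
            case[question] = SKIPPED
    else:
        for question in FOLLOW_UPS:
            case[question] = follow_up_answers[question]
    return case


never_used = enter_case(NO, {q: 3 for q in FOLLOW_UPS})
print(never_used["visit_q1"])  # None: the entry was skipped

# Caveat from the text: a legal but wrong value is not detected.
# If the respondent said yes (1) and the operator types 2, this is accepted.
mistyped = enter_case(NO)
print(mistyped["used_health_center"])  # 2
```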

Whereas data entry programs are designed to check data as they are being entered, you can utilize several techniques after data entry to detect errors if you do not have a data entry program. One technique is to use a "list cases" type program that is available in statistical software programs like SPSS®. Here you designate the possible legal values for each variable. The list cases program then identifies the ID number with any illegal data for that case. You can then go back to the original data for that case and make the appropriate correction(s).
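
The idea behind a "list cases" run can be sketched in a few lines. This is not SPSS syntax; list_cases is a hypothetical helper, and the gender values are taken from the ten cases in Table 1.

```python
def list_cases(cases, variable, legal_values):
    """Return the ID numbers of cases whose value for `variable` is illegal."""
    return [case["id"] for case in cases if case[variable] not in legal_values]


# Gender codes for the ten cases in Table 1 (1 = male, 2 = female).
gender_by_id = {
    "01": 1, "02": 2, "03": 2, "04": 1, "05": 1,
    "06": 4, "07": 2, "08": 3, "09": 3, "10": 1,
}
cases = [{"id": case_id, "gender": code} for case_id, code in gender_by_id.items()]

print(list_cases(cases, "gender", {1, 2, 9}))  # ['06', '08', '09']
```

The returned ID numbers are what link each illegal value back to the original data record.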

Another possibility is to use a statistical software program such as SPSS® or SAS and run a frequency program. A frequency program provides a count for the values for each variable. Unexpected codes in the table can indicate errors in data entry or coding. For example, if you have a data set of 300 cases and run a frequency on gender, which was coded as male = 1 and female = 2, you might get the following results: 1 (males) = 148, 2 (females) = 149, two values of 3, and one value of 4. The values of 3 and 4 reflect incorrectly entered data. You could then either use a "find" command to locate the 3s and the 4 in the gender column of the data set or use the "list cases" command previously described for the gender variable. You would then review the results of the frequency program for each variable, looking for any incorrect values, and use the same process to identify the case number and correct the data for each case.
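
The gender example can be reproduced with a simple frequency count. The sketch below uses Python's standard collections.Counter rather than SPSS or SAS; the function names are hypothetical and the data are constructed to match the counts quoted in the text.

```python
from collections import Counter

LEGAL_GENDER_CODES = {1, 2, 9}


def frequency_table(values):
    """Count how many times each code occurs in a column."""
    return Counter(values)


def unexpected_codes(freq, legal_values):
    """Return the codes in the frequency table that are not legal."""
    return sorted(code for code in freq if code not in legal_values)


# 300 cases matching the example: 148 males, 149 females, two 3s, one 4.
gender_column = [1] * 148 + [2] * 149 + [3, 3, 4]

freq = frequency_table(gender_column)
for code in sorted(freq):
    print(code, freq[code])
print("Unexpected codes:", unexpected_codes(freq, LEGAL_GENDER_CODES))
```

As with data entry programs, a frequency table only reveals codes outside the legal set; a 1 entered in place of a 2 still looks valid.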

Once the data are cleaned you can proceed with the analysis. The next article will describe the basics of analyzing your data using a frequency table and basic statistics for each variable.

Table 1. Sample Data Set from Adult Survey

ID#   Age       Gender    Marital status   Ethnicity   Height (in.)

01    32        1         1                2           72
02    47        2         7                1           14 [A]
03    23        2         5                1           65
04    03 [A]    1         3                7           68
05    53        1         4                3           66
06    44        4 [A]     1                4           67
07    36        2         1                1           96 [A]
08    22        3 [A]     6 [A]            2           74
09    38        3 [A]     1                5           60
10    42        1         3                1           68

ID#   Weight (lb)  Cholesterol   Seat belt   Physical exam   Attitude

01    210          162           3 [A]       2               3
02    118          436 [A]       1           1               1
03    130          155           2           1               2
04    180          012 [A]       2           8               2
05    163          185           1           5 [A]           4
06    157          176           4 [A]       2               6 [A]
07    728 [A]      888           2           1               1
08    020 [A]      175           1           3 [A]           2
09    098          999           2           2               5 [A]
10    176          210           1           1               2

[A] = incorrect data entry (shown in bold in the original)


REFERENCES

Norusis, M. (1982). SPSS® introductory guide: Basic statistics and operations. New York: McGraw-Hill Book Company, 7.

O'Rourke, T. (2000). Data analysis: The art and science of coding and entering data. American Journal of Health Studies, 16, 164-166.

Thomas W. O'Rourke, MPH, Ph.D., CHES, is a professor in the Department of Community Health and College of Medicine, University of Illinois at Urbana-Champaign, IL, 61820.
COPYRIGHT 2000 University of Alabama, Department of Health Sciences
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2000, Gale Group. All rights reserved. Gale Group is a Thomson Corporation Company.

Article Details
Author:O'Rourke, Thomas W.
Publication:American Journal of Health Studies
Geographic Code:1USA
Date:Sep 22, 2000