Techniques for screening and cleaning data for analysis.

Health professionals in the area of health promotion/education frequently need to collect and analyze data for a variety of reasons, such as needs assessment, program planning and implementation, evaluation, budget justification, trend identification, projections, or assessment of knowledge, attitudes and/or behaviors. Data may be collected in many settings, including schools, colleges/universities, community agencies, worksites and healthcare settings. They may be collected in a variety of ways, including from students in a classroom, clients in a community agency, or patients/consumers in a healthcare setting. Data also may be collected via mail survey, phone or personal interview, or from school, worksite, community agency or healthcare records.

After the data have been collected, they need to be entered into a format ready for analysis. This is generally done by entering the data into a data file on a computer. Each person or record for which data are collected is known as a case. Data files are made up of variables (each of the items you wish to analyze) and values for each variable, for each case you are analyzing. Before analysis, however, the data should be screened for errors and corrected. This article focuses on techniques for screening and cleaning data for analysis.

ENTERING DATA

As an example, let's say you had a data set having 11 variables: ID#, gender, age in years, current marital status, ethnicity, height in inches, weight in pounds, cholesterol count, yes-no responses for two variables (such as, "Do you generally wear a seat belt when in an automobile?" and "Have you had a physical exam during the previous twelve months?") and one attitude variable ("Do you favor sex education being taught in the schools?") with response codes of strongly agree, agree, disagree, or strongly disagree. For some of these variables (age in years, height in inches, weight in pounds, and cholesterol count), the values would be the actual data. Other variables, such as gender, marital status, the yes/no items and the attitude item, would require coding in a number for each possible value, for each variable, for each case. Ideally, a coding scheme would be developed before collecting the data in order to facilitate data entry (see O'Rourke, 2000).

As a suggestion, you should include a "don't know" possibility for any variable where "don't know" is a distinct possibility, such as cholesterol count, whether you had a physical exam during the last twelve months, or even weight if an actual measurement was not used. As a convention, "don't know" may be coded as an 8 if the variable has a single-digit response possibility (as in the case of a yes/no or attitude variable) or as 888 for cholesterol count (whose values are three digits). You also should code in a 9 for missing data or for a refusal for any variable having a single-digit response possibility, a 99 if the variable has a 2-digit response possibility (such as age), and 999 for 3-digit variables such as weight or cholesterol count. Again, the reader is encouraged to review the O'Rourke, 2000, article "Data Analysis: The Art and Science of Coding and Entering Data" that appeared in an earlier issue of this Journal.

TECHNIQUES FOR SCREENING AND CLEANING DATA

Once you have entered the data, they should be "screened" and "cleaned" before subsequent analysis. Screening is a process that identifies real or potential errors in your data entry. The errors need to be corrected ("cleaned") to the maximum extent possible before analyzing your data. For example, for the gender variable, all values should be either a 1 for male, 2 for female or 9 for missing data. No other values are possible for this variable.
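
The coding conventions above can be captured in a small codebook. The following is a minimal sketch in Python and is not part of the original article; the names `sentinel`, `DONT_KNOW`, `MISSING`, `GENDER_LEGAL` and `is_legal` are hypothetical, and the two-digit "don't know" code of 88 is an assumption that simply extends the pattern the text gives for one and three digits.

```python
def sentinel(digit: int, width: int) -> int:
    """Return a code made of `digit` repeated `width` times, e.g. (9, 3) -> 999."""
    return int(str(digit) * width)


# "Don't know" is a run of 8s and missing/refused is a run of 9s,
# sized to the number of digits the variable can take.
# Assumption: 88 for two-digit variables is not stated in the text.
DONT_KNOW = {width: sentinel(8, width) for width in (1, 2, 3)}
MISSING = {width: sentinel(9, width) for width in (1, 2, 3)}

# Legal values for gender as given in the text:
# 1 = male, 2 = female, 9 = missing data.
GENDER_LEGAL = {1, 2, MISSING[1]}


def is_legal(value: int, legal_values: set) -> bool:
    """True when `value` is one of the codes allowed for the variable."""
    return value in legal_values


print(DONT_KNOW[3], MISSING[2], is_legal(5, GENDER_LEGAL))
```

Keeping the special codes in one place makes it harder for a stray 8 or 9 to be mistaken for real data when the file is screened later.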
For the yes/no items on wearing a seat belt or having a physical exam, the possibilities could be 1 for yes, 2 for no, 8 for "don't know," and 9 for missing or refused. The possible values for the attitude item would be 1, 2, 3 or 4 for strongly agree through strongly disagree, 8 for "don't know" and 9 for missing or refused. When cleaning the size variables, the data analyst may wish to check any height over 76 inches or weight above 250 pounds or so, or check each variable against another. That is, it may be entirely possible for a person 6'6" (78 inches) to weigh 250 pounds, but it is not very likely for a third grader at 3'6" (42 inches).

So how does one go about cleaning data? There are several ways, from low tech to high tech. Basically, choose the one that best meets your needs. A low tech approach is simple visual scanning of the data. If you have only a few variables (let's say fewer than 30) and fewer than 300 cases, you can quickly scan a printout or computer screen of the data and look for any impossible data, such as a code 5 for gender, a height of 8'0" (96 inches), a weight of 725 pounds, or a cholesterol count of 940. Upon identifying a real or potential error, you can go back to check the original data. To facilitate this, you should make the first variable in the data set a unique ID number for each case. The ID number links the original data record to the data that are entered into the computer record. In this way, when you find a data entry error you can quickly go back to the original data record and make the correction.

Table 1 provides a sample data set from an adult survey having eleven variables and ten cases.
Review the data for each column and you will notice a number of incorrect data entries that are bolded. Each reflects a datum that needs to be cleaned before proceeding with your analysis. To accomplish this, you would first identify the incorrect data and then use the case ID number to go back to the original data and make the necessary corrections.

While a visual check is useful, it is often not realistic in terms of time or accuracy when you have many variables and/or many cases. There are several ways to clean larger data sets. One is to use a program such as SPSS® Data Entry or Questionnaire Programming Language (QPL). These programs are used to edit and prepare the collected data for analysis. In these programs you set the possible acceptable ("legal") values for each variable before entering your data. Any data entered outside of the values you set are flagged immediately. The data entry program also can provide for automatic skipping of a question or questions. Let's say a person answered "no" to a survey question asking him whether he had ever used a university health center.
Let's also say that the next five questions on the survey dealt with his evaluation of his last visit to the health center. Since only people who had answered "yes" to having used the university health center could have answered these questions, the data entry program would not allow you to enter data for these questions for any person indicating they had never used the health center. Rather, the program would skip you to the next question that the respondent could logically answer. A caveat is that data entry programs will not indicate an error so long as the value is a legal value. For example, if a person answered a yes/no item with a 1, but you entered a 2, the error would not be detected. Only responses outside the range you indicate would be noted. The advantage of these programs is that they not only reduce data entry errors but also can generate frequency counts for each variable and data files ready for subsequent analysis.

Whereas data entry programs are designed to check data as they are entered, you can utilize several techniques after data entry to detect errors if you do not have a data entry program. One technique is to use a "list cases" type program that is available in statistical software programs like SPSS®. Here you designate the possible legal values for each variable. The list cases program then identifies the ID number of any case with illegal data. You can then go back to the original data for that case and make the appropriate correction(s).
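
The "list cases" idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SPSS® procedure itself: the names `LEGAL_VALUES`, `CASES` and `list_cases` are hypothetical, the values are three columns transcribed from Table 1, and the assignment of those columns to gender, seat belt use and the attitude item follows the order of variables in the text.

```python
# Legal values per variable, following the codes given in the text.
LEGAL_VALUES = {
    "gender": {1, 2, 9},
    "seat_belt": {1, 2, 8, 9},
    "attitude": {1, 2, 3, 4, 8, 9},
}

# Three columns transcribed from Table 1, keyed by case ID.
# Assumption: the column-to-variable mapping follows the text's variable order.
CASES = {
    "01": {"gender": 1, "seat_belt": 3, "attitude": 3},
    "02": {"gender": 2, "seat_belt": 1, "attitude": 1},
    "03": {"gender": 2, "seat_belt": 2, "attitude": 2},
    "04": {"gender": 1, "seat_belt": 2, "attitude": 2},
    "05": {"gender": 1, "seat_belt": 1, "attitude": 4},
    "06": {"gender": 4, "seat_belt": 4, "attitude": 6},
    "07": {"gender": 2, "seat_belt": 2, "attitude": 1},
    "08": {"gender": 3, "seat_belt": 1, "attitude": 2},
    "09": {"gender": 3, "seat_belt": 2, "attitude": 5},
    "10": {"gender": 1, "seat_belt": 1, "attitude": 2},
}


def list_cases(cases, legal_values):
    """Return (case ID, variable, value) for every value outside the legal set."""
    flagged = []
    for case_id, record in cases.items():
        for variable, value in record.items():
            if value not in legal_values[variable]:
                flagged.append((case_id, variable, value))
    return flagged


for case_id, variable, value in list_cases(CASES, LEGAL_VALUES):
    print(f"ID {case_id}: {variable} = {value}")
```

Each flagged line gives the ID needed to pull the original record. As with a data entry program, a wrong but legal value (a 2 keyed in place of a 1) passes this check unnoticed.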

Another possibility is to use a statistical software program such as SPSS® or SAS and run a frequency program. A frequency program provides a count of the values for each variable. Unexpected codes in the table can indicate errors in data entry or coding. For example, if you have a data set of 300 cases and run a frequency on gender, which was coded as male = 1 and female = 2, you might get the following results: 1 (males) = 148, 2 (females) = 149, two values of 3, and one value of 4. The values of 3 and 4 reflect incorrectly entered data. You could then either use a "find" command to locate the 3s and the 4 in the gender column of the data set or use the "list cases" command previously described for the gender variable. You would then review the results of the frequency program for each variable, looking for any incorrect values, and use the same process to identify the case number and correct the data for each case.

Once the data are cleaned you can proceed with the analysis. The next article will describe the basics of analyzing your data using a frequency table and basic statistics for each variable.

Table 1.
Sample Data Set from Adult Survey

ID#   Age      Gender   Marital status   Ethnicity   Height (in)
01    32       1        1                2           72
02    47       2        7                1           14 [A]
03    23       2        5                1           65
04    03 [A]   1        3                7           68
05    53       1        4                3           66
06    44       4 [A]    1                4           67
07    36       2        1                1           96 [A]
08    22       3 [A]    6 [A]            2           74
09    38       3 [A]    1                5           60
10    42       1        3                1           68

ID#   Weight (lb)   Cholesterol   Seat belt   Physical exam   Sex education
01    210           162           3 [A]       2               3
02    118           436 [A]       1           1               1
03    130           155           2           1               2
04    180           012 [A]       2           8               2
05    163           185           1           5 [A]           4
06    157           176           4 [A]       2               6 [A]
07    728 [A]       888           2           1               1
08    020 [A]       175           1           3 [A]           2
09    098           999           2           2               5 [A]
10    176           210           1           1               2

[A] Bold numbers = incorrect data entry

REFERENCES

Norusis, M. (1982). SPSS® introductory guide: Basic statistics and operations. New York: McGraw-Hill Book Company, 7.

O'Rourke, T. (2000). Data analysis: The art and science of coding and entering data. American Journal of Health Studies, 16, 164-166.

Thomas W. O'Rourke, MPH, Ph.D., CHES, is a professor in the Department of Community Health and College of Medicine, University of Illinois at Urbana-Champaign, IL, 61820.
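
The frequency check described under TECHNIQUES FOR SCREENING AND CLEANING DATA can be sketched in Python using the article's own gender example (148 ones, 149 twos, two 3s and one 4 among 300 cases). This is a minimal illustration, not the output of SPSS® or SAS; the names `frequency_table`, `unexpected_codes` and `LEGAL_GENDER` are hypothetical.

```python
from collections import Counter

# The 300-case gender column from the article's example.
gender = [1] * 148 + [2] * 149 + [3, 3, 4]

# Legal codes for gender as given in the text (9 = missing data).
LEGAL_GENDER = {1, 2, 9}


def frequency_table(values):
    """Count how often each code occurs, sorted by code."""
    return dict(sorted(Counter(values).items()))


def unexpected_codes(values, legal_values):
    """Keep only the codes (and their counts) that are not legal."""
    return {
        code: count
        for code, count in frequency_table(values).items()
        if code not in legal_values
    }


print(frequency_table(gender))                 # {1: 148, 2: 149, 3: 2, 4: 1}
print(unexpected_codes(gender, LEGAL_GENDER))  # {3: 2, 4: 1}
```

The frequency table shows that something is wrong but not where; locating the three offending cases still requires a "find" or "list cases" step keyed on the ID number.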