PRACTICAL SAMPLING FOR HEALTH PROFESSIONALS.

Ever wonder how a researcher can collect data from just 1,000 Americans and make accurate inferences about the entire population? Ever wonder how a researcher can collect data from a few hundred school children, employees, or patients and generalize to an entire school district, company, industry, or health care institution? In each case, the researcher generalizes about a population (a set of persons having a common characteristic, such as voters, students, employees, or patients) from a sample that is a subset of that population. How that sample is selected is of critical importance (Sudman, 1976).

Statistical sampling is based on the premise that if even a small number of units are randomly selected from a much larger population, that small sample of units should reflect the characteristics and/or opinions of the larger population. The larger the sample, the more likely it is to represent that population (Levy & Lemeshow, 1991). This likelihood, the sampling variance, is often referred to as the sampling "error" (e.g., ±5%). For example, at the conventional 95% confidence level, obtaining data from 1,000 cases brings with it a sampling error of roughly ±3%; 800 cases, roughly ±3.5%; 400 cases, roughly ±5%. (These variances can increase if other than simple random sampling is used. They can decrease if the sample is an appreciable percentage of the total population or if results are more definitive than, for example, 50% yes and 50% no.) It is important to note, however, that this sampling error can pale in comparison to the potential error inherent in getting a low response rate. See an earlier article on increasing response rates (O'Rourke, 1999).
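The relationship between sample size and sampling error can be sketched with the usual normal-approximation formula for a proportion under simple random sampling. The following is an illustration only: the function name margin_of_error, the 95% confidence level (z = 1.96), the worst-case proportion of 0.5, and the optional finite population correction are assumptions made for this example, not details given in the article.

```python
import math


def margin_of_error(n, p=0.5, z=1.96, population_size=None):
    """Approximate margin of error for a proportion from a simple random sample.

    n: sample size
    p: expected proportion (0.5 is the worst case, i.e. 50% yes / 50% no)
    z: critical value (1.96 assumes a 95% confidence level)
    population_size: if given, apply the finite population correction
    """
    moe = z * math.sqrt(p * (1 - p) / n)
    if population_size is not None:
        # The correction shrinks the error when the sample is an
        # appreciable share of the total population.
        moe *= math.sqrt((population_size - n) / (population_size - 1))
    return moe


if __name__ == "__main__":
    for n in (1000, 800, 400):
        print(f"n = {n}: +/- {margin_of_error(n) * 100:.1f}%")
```

Passing a more lopsided proportion (for example p=0.9) or a population_size close to the sample size gives a smaller error, which is the behavior described in the parenthetical note above.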

Selecting a Sample

Selecting a random sample can be done by assigning a number to each case and then selecting X number of cases. For example, if we had a population of 2,000 cases and desired a sample of 500, then we would assign a number to each case starting at 0001 and ending at 2000. We would then proceed to a table of random numbers, which is commonly found at the back of most statistics or survey methods books, and select the first 500 four-digit numbers ranging from 0001 to 2000, skipping any number that falls outside that range or repeats.
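Where a computer is available, the same draw can be made without a printed table. The sketch below is one possible way to do it, assuming Python's standard random module stands in for the table of random numbers; the function name simple_random_sample and the seed parameter are hypothetical and used only for illustration.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Return sorted case numbers (1..population_size) drawn without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    # range(1, N + 1) mirrors numbering the cases 0001 through 2000;
    # sample() never returns the same case twice.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


if __name__ == "__main__":
    cases = simple_random_sample(2000, 500, seed=42)
    print(len(cases), "cases selected; first few:", cases[:5])
```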

Alternatively, we could do a systematic random sample, whereby we divide the population size (2,000) by the desired sample size (500) to get an interval. In this case 2,000 / 500 = 4. We then would choose a "random start" by randomly selecting a number between 1 and 4. Let's say we selected the number 2. We would then select case 2, 6 (2 + 4), 10 (6 + 4), 14 (10 + 4), etc. By the time we get to the end of the list, we would have selected 500 cases. A word of caution: while systematic random sampling may be easier, you need to review the list from which you are sampling. Let's say we were sampling from a list of students. If, for example, the list were ordered by ID number and the ID numbers contained a gender code (last digit odd = male, even = female), the previous example would result in a sample of all females. There are a number of excellent books or chapters in books about sampling.
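The systematic procedure, and the periodicity problem it can run into, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name systematic_sample is hypothetical, the interval is taken as the whole-number quotient of population size and sample size, and cases are assumed to be numbered in list order.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every k-th case after a random start, where k = N // n."""
    interval = population_size // sample_size
    if start is None:
        # random start between 1 and the interval, inclusive
        start = random.Random(seed).randint(1, interval)
    # start, start + k, start + 2k, ... for exactly sample_size cases
    return [start + j * interval for j in range(sample_size)]


if __name__ == "__main__":
    cases = systematic_sample(2000, 500, start=2)
    print(cases[:4], "...", cases[-1])
    # Periodicity check: with an even start and an even interval,
    # every selected case number is even.
    print(all(case % 2 == 0 for case in cases))
```

If even case numbers happened to correspond to one gender, a check like the last line would reveal that the sample contains only that group.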

Avoid Convenience Samples, Use Probability Sampling

Under all circumstances, avoid convenience sampling. Convenience sampling means selecting people simply because they are available, or letting people select themselves. Convenience samples are notoriously biased because the cases are self-selected rather than randomly selected (Bell, 1995). For example, surveying college students by selecting them from a campus organization will likely result in an unrepresentative and useless sample because students in a given organization may be atypical. Surveying students about campus life while they are attending a campus sporting event will exclude those who are not interested in sporting events. With probability sampling, by contrast, all units in a population have a known, nonzero probability of being selected (although there may be good reasons not to select units with equal probability).

Practical Sampling - Three Examples

Sometimes a listing of the population isn't readily available. Sometimes the listing doesn't include the data of interest. For example, in doing a mail survey of college students, there may be a listing but not one including mailing address or a phone number for follow-up.

Let's say you're in a community, school, worksite, or medical care setting and wish to draw a sample from hundreds or thousands of files, but don't have a listing or don't wish to thumb through and count every nth record. In this instance you can get an idea of how many records are in one file drawer and extrapolate that to the other file drawers. Let's say there are approximately 100 records per drawer and 10 drawers, for a total of 1,000 records. If you wish to sample 50 cases, calculate the total number of record inches (e.g., 10 drawers × 20" = 200") and divide by 50 to get your interval (200 / 50 = 4), that is, one record every 4 inches. Then select a random start between 1 and 4. If the random start was 3, you would select the 50 records found at 3", 7" (3 + 4), 11" (7 + 4), etc. You could then use a ruler to pull out the selected files. (If some files are much thicker than others, the precision will be reduced, but for most purposes the selections are still adequate.)
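The ruler method amounts to a systematic sample measured in inches rather than in case numbers. The sketch below works out which drawer, and how far into it, each selection falls. It is an illustration only: the function name drawer_positions is hypothetical, and it assumes every drawer holds the same number of inches of records and that positions are measured in whole inches.

```python
import random


def drawer_positions(num_drawers, inches_per_drawer, sample_size, start=None, seed=None):
    """Return (drawer, inches_from_front) pairs for a ruler-based systematic sample."""
    total_inches = num_drawers * inches_per_drawer
    interval = total_inches // sample_size  # e.g. 200 // 50 = 4 inches
    if start is None:
        start = random.Random(seed).randint(1, interval)
    picks = []
    for j in range(sample_size):
        # position measured along all drawers laid end to end
        position = start + j * interval
        drawer = (position - 1) // inches_per_drawer + 1
        offset = (position - 1) % inches_per_drawer + 1
        picks.append((drawer, offset))
    return picks


if __name__ == "__main__":
    for drawer, offset in drawer_positions(10, 20, 50, start=3)[:6]:
        print(f"drawer {drawer}, {offset} inches from the front")
```

With a random start of 3 and an interval of 4, the first drawer yields the records at 3, 7, 11, 15, and 19 inches, and the sixth selection falls 3 inches into the second drawer.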

As another practical example, let's say you wish to do either a mail or phone survey and wish to select a sample of several hundred people from a large directory (e.g., telephone directory, membership directory). By counting the number of pages, the number of columns on a page, and the average number of listings per column, one can set up a method of selecting cases without having to number all of them. For example, we recently conducted a study for a large university of over 35,000 students by pulling a sample of approximately 600 cases from the campus phone directory.

In this case the directory was approximately 300 pages with four columns per page and 30 names per column. To reach about 600 cases we therefore needed to select two names per page (600 / 300 = 2), but how? We randomly selected a column number from one to four and then another column number (which could be the same as or different from the first). Let's say we selected columns 2 and 3. We then selected the case within each column by drawing a random number from 1 to 30 (the number of listings per column). Let's say the first random number was 12 and the second was 23. Using a ruler, we then marked where those listings typically fell on the page and made a template to be used on all the other pages. In this way we were able to pull a systematic sample without having to do a lot of counting. While the template didn't guarantee that the same listing numbers were chosen on each page (since some listings were longer or shorter than others), it did guarantee that we would be close and systematic. The potential for bias was minimal. In this instance we were fortunate to be able to compare the basic demographic characteristics of our sample (gender, ethnicity, year in school) to those of our population. Results indicated that the sample and population were similar, thus validating the sample.
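The template approach can be sketched in a few lines. This is only an illustration under simplifying assumptions: every page is treated as having the same number of columns and listings, and the function names make_page_template and apply_template are hypothetical.

```python
import random


def make_page_template(columns_per_page, rows_per_column, picks_per_page, seed=None):
    """Choose the (column, row) slots that will be reused on every page."""
    if picks_per_page > columns_per_page * rows_per_column:
        raise ValueError("more picks requested than listings on a page")
    rng = random.Random(seed)
    template = []
    while len(template) < picks_per_page:
        slot = (rng.randint(1, columns_per_page), rng.randint(1, rows_per_column))
        if slot not in template:  # a column may repeat, but not the exact slot
            template.append(slot)
    return template


def apply_template(num_pages, template):
    """List every (page, column, row) selection produced by the template."""
    return [(page, col, row)
            for page in range(1, num_pages + 1)
            for col, row in template]


if __name__ == "__main__":
    pages, columns, rows, target = 300, 4, 30, 600
    picks_per_page = round(target / pages)  # 600 / 300 = 2 names per page
    template = make_page_template(columns, rows, picks_per_page, seed=1)
    selections = apply_template(pages, template)
    print("template:", template, "total selections:", len(selections))
```

Using the slots from the text, apply_template(300, [(2, 12), (3, 23)]) returns 600 selections, two on each page.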

Another practical example is a sampling method called "PPS," sampling with probabilities proportionate to size. This method is used primarily when sampling establishments such as hospitals, universities, and schools. Because there are typically a large number of small establishments and only a few very large ones, if one were to select a simple random sample (where all units have an equal probability of selection), that sample would include lots of the small units and only a few big ones. It would be likely that most of the very big (important) units would not be selected.

The "measure of size" (MOS) is whatever variable is deemed important: for schools it is typically enrollment; for companies it would be number of employees or budget; for hospitals it could be number of patients seen in a year, number of patient days per year, employees, or budget. Which MOS to use depends on the goals of the study and the information available.

To select a PPS sample, each unit is listed along with its MOS. The size measures are cumulatively summed. A sampling interval is determined as the total cumulative sum divided by "m," the number of selections desired. For example, if we wanted to select a PPS sample of universities, we would list each university with its enrollment, sum all those enrollments, and divide by the number of cases we want to survey. (If total enrollment is 5,000,000 and we wish to select 500 cases, our interval (i) would be 10,000.) As we did in a previous example, we select a random number (r) between 1 and i (in this case between 1 and 10,000). The selection numbers are then r, r + i, r + 2i, r + 3i, ..., r + (m-1)i, where r is the random start and i is the sampling interval. A unit is selected if a selection number falls into its sequence of numbers; that is, the selection number is greater than the cumulative sum of all previous units but less than or equal to the cumulative sum including the designated unit.

Here's an example, where r = 5,000 and i = 10,000:

Selections would be those including the numbers: 5,000, 15,000, 25,000, 35,000, 45,000, 55,000 etc.
School  Measure of size (MOS)  Cumulative MOS  Selected or not
A                       2,000           2,000  Not selected
B                      10,000          12,000  Selected
C                       8,000          20,000  Selected
D                      30,000          50,000  Selected
E                       1,000          51,000  Not selected
F                       3,000          54,000  Not selected
G                       1,000          55,000  Selected

Utilizing this method ensures that each unit's chance of selection is proportionate to its size.
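The selection rule in the example can be sketched as a short routine that walks down the list, keeps a running cumulative MOS, and records a hit whenever a selection number falls in a unit's range. This is a minimal sketch, not a definitive implementation: the function name pps_select is hypothetical, and the demonstration uses only the seven schools listed above with r = 5,000 and i = 10,000 supplied directly.

```python
def pps_select(units, interval, start):
    """Systematic PPS selection.

    units: list of (name, measure_of_size) pairs, in list order
    interval: sampling interval i
    start: random start r, between 1 and i
    Returns a dict mapping each selected unit to its number of hits.
    """
    hits = {}
    cumulative = 0
    number = start  # next selection number: r, r + i, r + 2i, ...
    for name, size in units:
        previous = cumulative
        cumulative += size
        # A unit is hit when previous < selection number <= cumulative.
        while previous < number <= cumulative:
            hits[name] = hits.get(name, 0) + 1
            number += interval
    return hits


if __name__ == "__main__":
    schools = [("A", 2000), ("B", 10000), ("C", 8000), ("D", 30000),
               ("E", 1000), ("F", 3000), ("G", 1000)]
    print(pps_select(schools, interval=10000, start=5000))
```

Under these assumptions, schools B, C, D, and G are selected, and school D, whose size is three times the interval, is hit by three selection numbers (25,000, 35,000, and 45,000). How to handle a unit that is hit more than once is a design decision the article does not address.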

Summary

Proper sampling is a crucial element in conducting any study. This article highlights the importance of sampling and explains the premise upon which statistical sampling is based. Methods of selecting a random sample, either simple or systematic, are described, as well as the rationale for avoiding convenience sampling. Finally, three examples of practical sampling, which can be utilized by health professionals in a variety of work settings, are described and illustrated. Adherence to sound sampling principles will improve any study design and enhance subsequent data analyses.

REFERENCES

Bell, F. (1995). Basic Biostatistics. Boston, WCB/McGraw Hill.

Levy, P. & Lemeshow, S. (1991). Sampling of Populations: Methods and Applications. New York, John Wiley & Sons.

O'Rourke, T. (1999). The Importance of an Adequate Survey Response Rate and Ways to Improve It. American Journal of Health Studies, 15(2), 107-109.

Sudman, S. (1976). Applied Sampling. New York, Academic Press.

Thomas W. O'Rourke is a Professor in the Department of Community Health and School of Clinical Medicine, University of Illinois at Urbana-Champaign, IL 61820.
COPYRIGHT 2000 University of Alabama, Department of Health Sciences
No portion of this article can be reproduced without the express written permission from the copyright holder.