
Frequent Monitoring of Clinical Outcomes: Research and Accountability for Clinical Practice.


This and the eight articles that follow it comprise a special issue on the advantages of conducting what we will call "Level 1 research," a labor-nonintensive, free-wheeling kind of accountability that should improve one's effectiveness as a clinician, whether one's clinical work is done under the title of special education teacher, psychologist, counselor, school psychologist, school social worker, psychiatrist, or something else. In addition, Level 1 research meets some of the needs of managed care (Hawkins, Mathews, & Hamdan, 1999). Basically, Level 1 research consists of systematically monitoring clinical outcomes--one's effects on client behavior--without any need to scientifically prove what is causing those effects. The present article introduces such research, compares it with two other types, and explains its advantages to clinicians. It then describes simple methods for doing such research. The subsequent eight articles illustrate it with actual clinical cases in which Level 1 research was employed to good effect.

Levels of Research on Clinical Effects

Although research is of fundamental importance in both basic and applied science, it is seldom conducted by clinical practitioners such as special education teachers, clinical psychologists, psychiatrists, clinical social workers, or counselors. In fact, many clinicians are averse to research, probably because the most familiar research is very academic, labor-intensive, and perfectionistic. Some people even oppose research, although it is simply "systematic experiencing": arranging experience systematically so that it yields information more reliable than what one gets through casual impression, reliance on authority, stereotype, myth, or lore.

Not all research is academic and perfectionistic. There is a kind of research that is free-wheeling, highly relevant to a client's individual case, and much less demanding than the perfectionistic type. We will call it "Level 1" research (cf. Hawkins & Hursh, 1992; Hawkins, Mathews, & Hamdan, 1999). It and two other levels of research comprise a continuum. Each level can be conducted with either an individual (or a few individuals) or a group, and each has a valuable role to play in improving clinical services. We will describe the three levels, beginning at the more academic, perfectionistic end of the continuum.

Level 3: Scientific Research

Description. This is the level of research with which all scientists are familiar. Its audiences are other scientists, a very skeptical group who demand the best evidence because their role is to build a large body of knowledge with a solid basis in empirical fact. Numerous steps are required to assure that the data collected in scientific research are credible and that any effects obtained are truly attributable only to the independent variable--the intervention one employs--and not to maturation or to some undetected event in the participants' (subjects') lives.

If an individual-subject design is used, some form of replication is required, such as using an ABAB withdrawal design, a multiple-baseline design, or an alternating treatments design (cf. Barlow, Hayes, & Nelson, 1984; Barlow & Hersen, 1984). The study would usually be replicated with at least three subjects, depending on several factors. Usually inferential statistics are not necessary when using such designs, because the constant monitoring of outcomes makes it possible to detect any substantial effects by simple visual inspection of the graphs (cf. Baer, 1977; Hawkins, 1989).

If a group-comparison design is used, random assignment of the participants to the various groups is typically necessary, although analysis of covariance can sometimes be substituted. Complex inferential statistics are necessary to evaluate whether the mean performances of the groups differ to a reliable degree.

Illustration of Level 3 research. Suppose that a scientist hypothesizes that a particular form of treatment for depression will be more effective than another popular form of treatment and, of course, more effective than no treatment at all. S/he may locate several clinicians willing to participate by applying treatment A to certain depressed clients, applying treatment B to certain others, and letting certain others remain on a waiting list. Which condition a given client receives would be decided by some random procedure, to avoid bias.

The participants would likely be evaluated for depression prior to treatment, using some objective measure that is administered in a standard, reliable way. The intervention (or waiting list) would be applied for a certain, fixed time or until certain criteria had been met; then the evaluation would be re-administered to see how much change the intervention had produced. Thus, there would be assessment pre- and post-treatment, although sometimes only post-treatment assessment is used.

Because there would not be a pool of dozens of clients with depression waiting to participate in such an experiment, the study would probably take several years to complete, require extensive coordination, and cost at least hundreds of thousands of dollars. The results would usually be generalizable to a variety of cases of depression, though they might not apply to certain individuals.

An individual-subject evaluation of the same question about treating depressed clients might involve using an alternating treatments design (Barlow & Hersen, 1984). In this design, the client's status would be measured frequently while treatment A is applied for a few days or weeks, then treatment B, then treatment A again, then B, and so on. If, on the average, improvement is more rapid during one treatment than during the other, their differential effectiveness is demonstrated.
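One simple way to make "on the average, improvement is more rapid during one treatment" concrete is to credit each session's change in score to the treatment in effect at that session and then average those changes per treatment. The following is only a minimal sketch under stated assumptions: the function name `mean_improvement_by_treatment`, the scores, and the convention that lower scores mean less depression are all hypothetical, and in practice the comparison in an alternating treatments design is usually made by visually inspecting the separate data paths.

```python
from statistics import mean


def mean_improvement_by_treatment(sessions):
    """Average session-to-session improvement under each treatment.

    sessions: list of (treatment_label, depression_score) in time order.
    Improvement for a session is the previous score minus the current score,
    credited to the treatment in effect at the current session.
    Assumption: lower scores mean less depression.
    """
    gains = {}
    for (_, previous), (label, score) in zip(sessions, sessions[1:]):
        gains.setdefault(label, []).append(previous - score)
    return {label: mean(values) for label, values in gains.items()}


# Hypothetical weekly scores while alternating treatments A and B.
sessions = [("A", 30), ("B", 29), ("A", 25), ("B", 24), ("A", 20), ("B", 19)]
print(mean_improvement_by_treatment(sessions))
```

With these made-up numbers, treatment A averages an improvement of 4 points per session and treatment B only 1, which is the kind of difference that would also be obvious on the graph.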

Level 2: Semi-scientific Evaluation of Procedures or Programs

Description. Like Level 3 research, Level 2 involves gathering data about the effectiveness of a particular intervention or service program. But this research is unlikely to be theory-oriented and is often less perfectionistic about its methods. Level 2 research, along with Level 1, is what Carter (1983) called for. The researcher may wish to convince others--colleagues, a board of directors, a funding source, etc. (cf. Hawkins, Fremouw, & Reitz, 1982; Hutchinson, 1982)--that his/her program or form of treatment is effective, though s/he is not in a position to conduct a study of scientific quality to do so. S/he uses some sort of comparison condition or group, though it falls short of the ideal condition or group that an audience of scientists usually demands.

If an individual-subject design is used, the experimenter would need a credible baseline on each participant, usually consisting of at least 5 data points that are not sloping in the direction of the change intended by the intervention. Then, by getting further data daily or weekly while intervening, the experimenter has conducted an "AB design." This design is not a fully adequate experiment by itself because it contains no replication of the effect, leaving the possibility that what appears to be an effect is actually due to accidental factors. While an AB design will not usually convince a skeptical, scientific audience, it is usually adequate for convincing boards of directors, funding agencies, or the general public.
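The baseline criterion above (at least five data points that are not sloping in the direction of the intended change) can be screened with an ordinary least-squares slope. This is a minimal sketch, not a substitute for inspecting the graph; the function names, the `tolerance` parameter, and the example counts are hypothetical.

```python
def least_squares_slope(values):
    """Ordinary least-squares slope of values plotted against 0, 1, 2, ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def baseline_is_credible(values, intended_direction, min_points=5, tolerance=0.0):
    """Screen a baseline: enough points, and no trend toward the intended change.

    intended_direction: +1 if the intervention should increase the behavior,
    -1 if it should decrease it. tolerance is a hypothetical allowance for a
    slight trend in the intended direction (0.0 allows none).
    """
    if len(values) < min_points:
        return False
    return least_squares_slope(values) * intended_direction <= tolerance


# Hypothetical daily tantrum counts before an intervention meant to reduce them.
print(baseline_is_credible([6, 7, 5, 6, 7], intended_direction=-1))  # True
print(baseline_is_credible([9, 8, 7, 6, 5], intended_direction=-1))  # False
```

The second baseline is rejected because the behavior is already declining before intervention, so any later decline could not be credited to the AB comparison.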

If the experimenter is evaluating a service that is applied to numerous clients, daily or weekly data may not be needed; a pre-post assessment of a large sample of the clients may suffice. In that case, however, the assessment must also be applied to some comparison group, since there is no baseline against which to compare the results. The comparison group could be one that receives no service or that gets some other kind of service, such as from another agency or program. Although the participants are not randomly assigned to the two groups, the experimenter strives to see that they are as comparable as feasible. Sometimes a normative sample can be used as a comparison group, such as a clinician's showing that children in his/her special tutoring program advance more than one "grade level"--a normative index--in one year's time (e.g., see Johnson & Layng, 1992).

As with Level 3 research, the measurement must be conducted in a way that is not biased to favor either the experimenter's group or the comparison group. If direct observation is being used, this would include evaluating the interobserver reliability of the data collection. If some sort of test or questionnaire is used, it must be administered in a standardized way.

Illustration of Level 2 research. A group design at Level 2 is illustrated by a followup evaluation being done at a private child care agency, The Pressley Ridge Schools (Fabry, Hawkins, & Luster, 1992). Each youngster who has been treated in one of their programs and discharged is contacted by telephone within a year and interviewed informally. In the process, the answers to a series of questions are obtained. These answers tell the agency how well the youngster is doing in many areas, including education, drug use, employment, and independence. Where possible, each datum is checked for accuracy by talking with one or more other persons who are part of the youngster's life. Comparison of data across several Pressley programs or with other agencies qualifies this as Level 2 research.

An individual-subject design might be used to evaluate any of these same programs. It would, however, require obtaining repeated measures of at least a few youngsters' behavior--perhaps using the Child and Adolescent Functional Assessment Scale (Hodges, 1990; Hodges, Bickman, & Kurtz, 1991) or a similar device--prior to admission, then continuing while the youngsters are in the program, and perhaps in a followup. If comparison with another program were desired, an individual-subject design would be impractical, because such designs use the subject as his own control. This would mean that repeated measures of a youngster's behavior would have to be obtained while the youngster was in another program and then again while the youngster was in the program being evaluated, which is seldom feasible because of the difficulty of predicting which youngsters will move from program A to program B.

Level 1: Accountable Service Delivery

Description. Level 1 research is the focus of this special issue. Unlike Level 3 research, its purpose is not to understand why the participants' behavior changes; it is simply to facilitate and document such change. Thus its purpose is highly practical, in an immediate sense.

Level 1 research is highly flexible; it does not require a group design, control over extraneous variables that might influence the behavior, or the use of inferential statistics. It involves merely collecting frequent data on behaviors that are targeted for change, then graphing those data and examining the graphs to see how much the client's behavior is changing. As we will argue shortly, such data not only help in clinical decision-making, they are often of direct therapeutic value in themselves (Martin & Pear, 1996). Furthermore, as Barlow (1980) predicted long ago, managed care agencies are increasingly requiring evidence of the outcomes achieved by clinicians, and the kinds of data suggested here provide especially credible, precise evidence (cf. Carter, 1983).

The data usually involve direct, quantitative measures of specific client behaviors, recorded immediately after the behavior occurs. These are called "direct measures of behavior" (cf. Cone & Hawkins, 1977). Sometimes the data can be obtained from a product of the behavior, rather than observing the behavior itself, such as a mother's inspecting her son's homework--a product of his behavior--each evening and then making a record to show whether or not he had completed it. Further, sometimes the data will be measures of events in the client's environment. For example, if the primary client is a child, an environmental event that might be recorded daily by the parent or child is whether they had a reading-aloud session together. Besides parent behavior, other environmental events might be sibling behavior, teacher behavior, staff behavior, tasks assigned, rewards given, provocations by others, or activities offered.

In Level 1 research, the independent variable--the clinical intervention--is whatever the clinician and client agree to try, whether that is some form of counseling, a change in diet or exercise, a change in activities, a change in the consequences for certain behavior, or some other treatment. Again, since the purpose is not to prove that a particular method of intervention has a particular effectiveness, there is no need to restrict the intervention to a particular, named type. Nor is there need to keep the intervention constant throughout its application; the clinician can vary it as seems appropriate and "throw in the kitchen sink" if that promises to help the client.

The frequency of getting data on the dependent variables, the target behaviors, is very important. Level 1 research usually involves getting data daily or at least once a week, unless such frequent measurement would be irrelevant or infeasible, just as in most individual-subject Level 3 research. Such measurement can be called "continuous" and is like the monitoring of a patient's pulse or blood pressure in intensive care. On the other hand, it is not necessary to be as perfectionistic about the quality of data collection; quite a bit of error can be tolerated so long as it is not systematically misrepresenting the amount or speed of behavior change. The data must be quantitative and usually objective; but in some cases, such as the measurement of pain, they may necessarily be subjective ratings.

The crux of the process is in how the data are then used. Because the primary purpose is to maximize the effectiveness of intervention, the data must be plotted on a simple graph where each day's performance (sometimes each week's) is shown as a data point, and then inspected to determine how much progress is occurring. If the data show a good rate of progress, the clinician should continue the same intervention; if the data show poor progress, the clinician should change the intervention in some way. In addition to helping in such decision-making, the graph of data is usually explained to the client, which often serves a therapeutic purpose that will be explained shortly.
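The decision rule in this paragraph ("good rate of progress" means continue, "poor progress" means change something) depends on what counts as good. One common convention, borrowed here as an assumption rather than taken from this article, is an "aim line": a straight line from the first intervention data point to the goal level on the day the goal should be reached. The sketch below compares the observed least-squares trend with the aim line's slope; the function names, the threshold of matching the aim line, and the whining data are hypothetical.

```python
def least_squares_slope(values):
    """Ordinary least-squares slope of values plotted against day 0, 1, 2, ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def progress_decision(values, goal, goal_day, min_points=3):
    """Compare the observed trend with a hypothetical aim line.

    values: daily data points collected during intervention.
    goal: the target level; goal_day: days after the first point by which
    the goal should be reached.
    """
    if len(values) < min_points:
        return "keep collecting data"
    required = (goal - values[0]) / goal_day  # slope of the aim line
    if required == 0:
        return "continue"  # already at the goal level
    observed = least_squares_slope(values)
    # Same sign and at least as steep as the aim line -> progress is on track.
    return "continue" if observed / required >= 1 else "revise"


# Hypothetical whining episodes per observed hour; goal of 5 within 30 days.
print(progress_decision([20, 18, 19, 16, 15, 14, 12], goal=5, goal_day=30))  # continue
print(progress_decision([20, 21, 19, 20, 20, 21, 19], goal=5, goal_day=30))  # revise
```

A rule like this only supplements visual inspection; a brief setback within an otherwise steep trend still yields "continue," which is the kind of persistence the graph is meant to support.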

One final characteristic is that the data collected can be changed at any time. Because the purpose is to get data that are useful in evaluating one's effects, when the data being collected are found to be irrelevant or insensitive, the clinician simply changes to a more relevant or sensitive measurement. This is very different from either Level 3 or Level 2 research. Table 1 summarizes the ten characteristics of Level 1 research.

Why Measure Continuously and Graph the Data?

There are two main benefits from doing Level 1 research routinely. First, such data collection and usage enhances the clinician's effectiveness. Second, the data are useful in accountability, whether that be accountability to a managed care company, to one's supervisors, to the clients and their families, or to oneself. After elaborating on these benefits, we will discuss the methods that are appropriate.

Enhancing the Clinician's Effectiveness

Anything that affects a client produces data, in the sense that there are results that could, theoretically, be measured. Unless the clinician collects at least some of those data, s/he will often be deceived regarding what the results really were. Further, unless the clinician collects the data frequently, to see how much effect s/he is having, the data cannot influence what the clinician does with that client today and tomorrow.

Why do continuous measurement and graphic analysis of individual clinical outcomes have this facilitating effect? The answer is that the process has six influences:

1) To get data, one is forced to specify behaviors that need to be learned, increased in frequency, or decreased. When a clinician and client specify, they focus behavior-change efforts more precisely, less diffusely. The subsequent graphing and discussing of data also maintain that focus. A focused, consistent effort has a better chance of achieving results than an unfocused or shifting effort.

As any experienced professional can attest, clients can overwhelm a clinician with myriad new problems and issues weekly, so that the clinical efforts drift vaguely, never getting any issue resolved. Unfocused efforts not only prevent progress, they often lead clients to terminate prematurely, due to lack of progress.

2) When the needed behavior changes are put in writing, it is much clearer to the client what must be achieved in order for things to improve. That clear specification itself often produces therapeutic gains, and a client's gathering continuous data keeps it constantly clear what the current goals are and even what the client is supposed to do today.

3) In the process of getting data, the clinician usually finds out more about the environment than s/he had previously learned, and knowing the environment is often crucial to achieving therapeutic outcomes. The clinician will learn the most if s/he observes personally in the client's environment; but even someone else's data promote discussion of specific events that will reveal a great deal.

4) The graphed data guide the clinician's behavior. The graph provides visible, durable stimuli that guide the clinician by reinforcing or redirecting his/her therapeutic efforts. The graph is like a report card, except it is much more frequent and thus more helpful. Graphed data tell the clinician and client how serious the problem currently is and how fast they are approaching their goal. Premature termination of an intervention that actually is working is common in most clinical practice (Martin and Pear, 1996, p. 235); but when a graph shows that the overall trajectory of results is satisfactory, the clinician and client will not give up prematurely, even when there are brief setbacks. At the opposite extreme, clinicians often persist with an ineffective treatment for months or even years; but if a graph kept telling them that their results were not satisfactory, they and the client would be cued to revise the intervention.

5) When a clinician and client are deciding what to target and measure, and when they define targets, collect data, and discuss those data, their analytic thinking and creative problem-solving are encouraged. Specific and detailed data cue them to get information on possible causes of the specific behavior, that is, to perform a "functional analysis." A functional analysis is a description of current stimulus contexts that may be causing the behavioral excesses and deficits that comprise the problem, plus some description of relevant learning history (Hawkins, 1986). The value of a functional analysis is that it typically leads the clinician rather directly to possible interventions, because the analysis will probably emphasize such learning principles and processes as antecedent stimulus control, differential reinforcement/punishment, response effort (task difficulty), stimulus discrimination-generalization, response differentiation-generalization, shaping, chaining, conditioned reinforcement/punishment, or the absence of such learning (Hawkins, 1986). To a clinician familiar with learning processes, a functional analysis immediately suggests ways that the behavior might be changed by changing the environment so that it teaches more adaptive behaviors. Thus the graphed data are a form of feedback that leads the clinician to maximize his/her effectiveness through repeated cycles of "tinkering" (Barlow, Hayes, & Nelson, 1984; Parsonson & Baer, 1986) that could be summarized thus: define specific target behaviors, get data, graph the data, inspect the graphs, evaluate level and trajectory of performance, possibly gather other information, persist with or revise intervention, then repeat the cycle.

6) A final reason to get and graph data is that, for many of us, the graphed data differentially reinforce our work, the client's work, and that of others in the client's life, including the data collector. The better the effect, the more reinforcing is the graph. This keeps everyone enthusiastic and working to achieve ever better results.

Unfortunately, clinical practice in an office is often done in a manner comparable to driving a car in a heavy fog and without a clear destination. Clinicians who neglect to define specific, measurable goals, design and implement plans to reach those goals, measure progress continuously, and graph the data are not providing the best clinical services they can. The retrospective, subjective, global reports on which clinicians often rely are subject to various kinds of error, such as the client's failure to notice that important events occurred, forgetting, being unduly influenced by the most recent events, being unduly influenced by one or two salient events, attempting to look better than they really are, and trying to please the clinician. Further, the clinician typically gathers the impressions infrequently, such as once a week, whereas the target behaviors may be occurring many times a day. Such casually collected reports are not adequate data and should not be accepted by either supervisors or managed care.

Making Clinical Services Accountable

In addition to promoting better clinical decision-making, the use of graphed, objective data makes a clinician more accountable to any managed care company involved, to supervisors, to clients and their families, and to him/herself. The clinician is more accountable to him/herself in that s/he makes a firm commitment to change the measured behaviors and then documents the degree to which each objective is achieved.

The clinician is more accountable to his/her supervisor, employer, managed care, and client in the same way. Even though some employers, insurers, case managers, welfare departments, and such may not ask for detailed data, the presentation of such data from cases certainly makes a favorable impression. Quantitative data are much clearer and more convincing than only the clinician's subjective, potentially biased judgement of "much improved," "slightly improved," or "somewhat worse." Thus, graphs of data are, at the very least, good public relations.

Research Design for Level 1 Research

Level 1 research involves getting data on individual behaviors of individual clients or, occasionally, treatment groups. Designs for individual-subject research are fairly well known (Barlow, Hayes, & Nelson, 1984; Barlow & Hersen, 1984) and will not be reviewed here. What is important to realize, however, is that those designs are primarily for Level 3 experiments, which are much more demanding than Level 1 research. In order to achieve the purposes of enhancing treatment and providing accountability, it is not necessary to carry out an experiment, because it is not necessary to prove that the intervention employed is causing the changes. It is only necessary to see that satisfactory change is, indeed, occurring. Thus a clinician often does not need a baseline of five or ten days, as is typical in scientific research, or perhaps any baseline at all. All the clinician and client absolutely need to know is that progress occurs during and, preferably, after treatment.

Measurement only during intervention is often called a "B-only" or "treatment-only" design (cf. Martin & Pear, 1996; for examples, see Moxley, Lutz, Ahlborn, Boley, & Armstrong, 1995; Williams, 1959). For example, if a parent says "He wets the bed every night," a clinician might have the parent collect nightly data as a baseline, but s/he could also accept the parent's statement as roughly accurate and proceed with both the measurement of bedwetting and the intervention to decrease it. The same applies if a teacher says "She can only read about 5 or 10 words," for example.

However, it is usually desirable to get at least one or two baseline data points. This is especially true when the person's verbal report is likely to be seriously flawed. For example, if a spouse says "He starts an argument every time I talk to him!," it would be naive to assume that the baseline is anywhere near 100% of their conversations, and any clinician should be curious to see roughly how often arguments really do occur before intervening. The same applies if a teacher says "He never plays with anyone," "She never does her homework," or "He is always interrupting."

Sometimes the mere collection of baseline data changes the complaints of the person who referred the client, because they had grossly overestimated or underestimated the problem (Martin & Pear, 1996). Other times collecting baseline data can even have a lasting effect on the target behaviors. But the important thing to recognize here is that Level 1 research need not involve a baseline; its primary requirement is that the clinician document the degree to which change is taking place during intervention. If baseline or followup measurement is also done, that is extra.

How Can the Clinician and Client Measure Outcomes?

Whatever goals and objectives the clinician and client select, a method for measuring some of them should be used. It is not necessary to measure all of the desired outcomes; measuring one to three of the most critical ones at a time is usually all that can be managed.

The data should reflect one of the following natural dimensions of behavior: frequency (which converts to rate, when divided by time available for the behavior), duration, latency, quality (effectiveness, accuracy), magnitude (intensity, severity, amplitude), and cue effectiveness (responding or non-responding at discrete opportunities). Simple methods of recording each of those will be discussed briefly, and several will be illustrated in the eight articles that follow in this issue. In most cases, the data will be recorded by someone who is part of the client's daily life, even the client him/herself, since that is where the behaviors of importance are occurring.

Frequency of Behaviors (or Other Events)

Recording. This is one of the simplest and most valid dimensions to record. It consists simply of observing and recording each response onset (or offset), using tally marks on a pad of paper, counts on a counter, a count of the number of products completed, or other such means. Usually when frequency is recorded, the target behavior is one that can occur quite freely, and special opportunities are not necessary for it to occur.

An example of frequency recording would be a distressed parent counting the frequency of her son's fighting with his sister by making tallies in the squares that represent the days on a wall calendar. Another example would be a depressed college student counting (perhaps by writing a sentence or two describing the event) the frequency of engaging in activities that he and the clinician have listed as enjoyed and important, such as working out at the gym, calling a friend or relative, getting an assignment finished, participating in a sport, or going out with a friend. In each case the person doing the recording (the data collector) can also write the time that each response occurred, which could be helpful in understanding what factors influence the behavior. Other examples of behaviors to which frequency recording might be applied would include calling a friend, finishing a task from a "to do" list, complimenting someone, thinking of a positive future event, or soiling pants.

The recording task must be kept simple, so it is not always wise to expect the data collector to record data all day every day. The clinician and data collector should agree what would be a tolerable intrusion into the data collector's life. For example, although a parent might decide to record a daughter's getting into the parent's bed every time it occurs, a parent with a whining child might only record the numerous whining instances from 9:00 AM to 10:00 AM and from 3:00 PM to 4:00 PM. As long as the amount of time for doing the recording is kept fairly constant, the data should adequately show any significant changes in the behavior.

Another example of sampling just part of the day would be if a teenager who is having conflicts with her parents agrees to make an entry in her diary the first three times, each day, that the parents do something that irritates her. She starts at the same time each day, and each entry is to include what time it occurred, what the parent did, which parent it was, what preceded it, what she then did, and why it irritates her. Since the girl records the time of each event, the clinician can calculate how much time passed between the beginning of the day and the third irritating event; thus it is possible to calculate and graph the rate of the irritations each day, at least up to the time of the third irritation.
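The rate calculation described for the diary example is simply the number of recorded irritations divided by the hours from the start of recording to the third entry. The sketch below assumes one more detail that the example does not specify: on a day with fewer than three irritations, the rate is computed over a fixed end-of-recording time instead. The function name, the date, and the times are hypothetical.

```python
from datetime import datetime


def irritation_rate_per_hour(start, entry_times, end_of_recording, cap=3):
    """Rate of irritating events per hour, up to the cap-th diary entry.

    start: when recording begins that day.
    entry_times: datetimes of the diary entries.
    end_of_recording: assumed cutoff used only when fewer than `cap`
    entries occurred that day (a hypothetical convention).
    """
    recorded = sorted(entry_times)[:cap]
    stop = recorded[-1] if len(recorded) == cap else end_of_recording
    hours = (stop - start).total_seconds() / 3600
    return len(recorded) / hours


day = datetime(2024, 3, 4)  # hypothetical date
start = day.replace(hour=7)
entries = [day.replace(hour=8), day.replace(hour=9, minute=30), day.replace(hour=11)]
print(irritation_rate_per_hour(start, entries, day.replace(hour=22)))  # 0.75
```

Here three irritations occurred in the four hours after 7:00 AM, giving 0.75 per hour; each day's value becomes one point on the graph.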

Sometimes opportunities (situational cues) for the target behavior come only at discrete moments, such as when a spouse makes a request, when a phone call is received, or when someone speaks rudely to the client. Sometimes the opportunities are in the form of discrete tasks, such as math problems to work or chores to do. The recording of behavior under such circumstances will be discussed later under "cue control" recording.

Sometimes frequency--or other dimensions, such as quality--can be recorded most efficiently from products of a behavior rather than from directly observing the movements while they occur. For example, most household chores make a change in something that a data collector could later check, to see if the chore had been done.

Graphing. What is usually graphed is the total frequency for the day. If rate is calculated, then of course that is graphed for each day. Usually it is best to graph data for each day, because that will be most sensitive to change, but there can be exceptions. For example, sometimes there is no opportunity for the behavior on certain days, such as on a weekend or holiday. A good practice is to simply graph data only from days when the behavior is possible. If opportunity did exist but the behavior did not occur, of course the data point is placed at zero for that day. After entering a data point, the clinician draws a line (the "data line") from the last data point to the new one.
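The graphing rules in this paragraph (plot only days on which the behavior was possible, and plot zero when an opportunity existed but the behavior did not occur) amount to building the data series from the list of opportunity days rather than from the calendar. This is a small sketch with hypothetical names and data: a child's tantrums at school drop-off, which can only occur on school days. The text rendering is just a quick check and does not draw the connecting data line.

```python
def daily_series(tallies, opportunity_days):
    """Build the points to graph.

    tallies: dict mapping day -> count of tally marks (days without
    marks may be absent). opportunity_days: ordered days on which the
    behavior was possible. Days without opportunity are omitted; days
    with opportunity but no tallies are plotted at zero.
    """
    return [(day, tallies.get(day, 0)) for day in opportunity_days]


def text_graph(points):
    """Print a rough one-row-per-day view of the series."""
    for day, count in points:
        print(f"{day:<4}| {'*' * count} {count}")


# Hypothetical tallies; the weekend is omitted because there is no drop-off.
tallies = {"Mon": 2, "Tue": 1, "Thu": 3}
school_days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
points = daily_series(tallies, school_days)
text_graph(points)
```

With a plotting library such as matplotlib, passing the days and counts to `pyplot.plot` with a marker would draw the data points joined by the data line, in the order given.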

Duration of Behaviors

Recording. This is used when you are interested in how much of the time the participant engages in the behavior. For example, a failing high school student might record the amount of time she spends studying for each of her courses each day; a parent might record how many half-hours of television his child watches; or a client who wishes to exercise more might record the time s/he begins and ends an exercise routine each day. This last is an example where the graph can easily be kept by the client and is likely to be therapeutic in itself.

Graphing. For each day that the behavior is possible, the total duration of the behavior is graphed, even if it is zero. If there are two or more episodes of the behavior in a day, it is sometimes of interest to graph the average duration per episode instead of, or in addition to, the total duration.
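For the exercise example, the daily total and the average per episode can be computed directly from the recorded start and end times. This is a minimal sketch under assumptions of our own: times are written as "HH:MM", no episode crosses midnight, and the function names and times are hypothetical. On a day with no episodes the total is 0 (and is still graphed), while the average per episode is undefined and returned as `None`.

```python
from datetime import datetime


def minutes_between(start, end):
    """Minutes from start to end, both "HH:MM" on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


def daily_duration_summary(episodes):
    """Return (total minutes, average minutes per episode) for one day.

    episodes: list of ("HH:MM", "HH:MM") start/end pairs.
    """
    durations = [minutes_between(start, end) for start, end in episodes]
    total = sum(durations)
    average = total / len(durations) if durations else None
    return total, average


print(daily_duration_summary([("07:00", "07:25"), ("18:10", "18:40")]))  # (55.0, 27.5)
print(daily_duration_summary([]))  # (0, None)
```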

Momentary Time Samples of Behavior

Recording. Duration can often be estimated fairly well through a method of recording called "momentary time sampling" (Hawkins, Axelrod, & Hall, 1977). For example, the wife in a distressed couple might claim that the husband ignores her most of the time. The two might agree that the wife will set her watch to go off at two appropriate times each evening and, when it goes off, the husband will mark "+" or "-" on a card he carries in his pocket, indicating whether he was interacting pleasantly with her at that moment. This method is very labor-nonintensive. It is like taking a snapshot every so many minutes or hours to see whether the behavior was "on" or "off" at that precise moment, then recording, for each snapshot, a symbol meaning "Yes," it was "on," or "No," it was not. Between these samples, the data collector is free to do whatever tasks or recreation s/he wishes. The further apart the samples are in time, the more time is left free for other things; but that also reduces the representativeness of the data, so it is important to decide whether the recording should be done every 10 minutes, 30 minutes, hour, or even further apart. A precise timer is crucial; it is unwise to expect the data collector to watch a clock or wristwatch, as that will usually bias the data.

Momentary time sampling is obviously poor for very short-duration or infrequent responses (e.g., tics, swearing, setting fires, starting fights), because it is so easy to miss them altogether and thus get very unrepresentative data. However, it is useful for frequent or longer responses, such as a preschooler's playing with peers, a school child's being in seat or on task, a mental hospital patient's conversing or engaging in games, or a rehabilitation patient's practicing walking.

Graphing. With momentary time sampling, the information to be graphed is percent of samples showing "Yes," the behavior was occurring.
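As a minimal sketch of that calculation, assuming the husband's card from the example above is transcribed as one list of "+"/"-" marks per evening (the data are hypothetical), the daily point is simply the share of samples marked "+":

```python
from typing import Optional


def percent_yes(marks: list[str]) -> Optional[float]:
    """Percent of momentary time samples marked '+' (behavior occurring then)."""
    if not marks:
        return None  # no samples that day, so no point to graph
    return 100 * marks.count("+") / len(marks)


# Hypothetical card: two samples per evening, '+' = interacting pleasantly.
card = {"Mon": ["-", "-"], "Tue": ["+", "-"], "Wed": ["+", "+"]}
for day, marks in card.items():
    print(day, percent_yes(marks))
```

With only two samples per evening the daily value can move only in 50-point steps, which illustrates the trade-off between convenience and representativeness discussed above.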

Frequency and Duration, Mixed: Interval Recording

Recording. A method that is often both convenient and relevant is one that resembles the snapshot-taking of momentary time sampling, except that it is more like taking very brief movies. Instead of asking whether the behavior is "on" at precise moments in time, the data collector asks whether the behavior is "on" during any part of an interval of time; therefore it is called "interval recording." For example, a very shy teenage boy may rarely interact with peers, so his teacher might record whether or not he interacts at any time during each of eight successive 15-minute intervals each day. In this case, the intervals are 15 minutes long, and the question is whether the behavior occurred at all during each interval.

Of course the more frequently the behavior occurs or the longer it lasts, the more likely it is to be seen in several intervals. Thus the method reflects both frequency and duration. Yet it reflects neither of them perfectly, because the behavior could occur more than once in an interval, yet the record would show only that it occurred at some time during the interval; and the behavior could completely fill the interval or fill only the first or last .5 second of the interval, yet the data would look the same.
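The point that interval recording reflects frequency and duration only imperfectly can be checked with a small sketch. It assumes, hypothetically, that the shy teenager's interactions are logged as (start, end) minutes within a 120-minute observation divided into eight 15-minute intervals. An interval is scored "Yes" if any episode overlaps any part of it.

```python
def interval_record(episodes, session_minutes, interval_minutes):
    """Score each interval True if the behavior occurred during any part of it.

    episodes: (start, end) minutes from the start of the session, end > start.
    """
    n = session_minutes // interval_minutes
    record = []
    for i in range(n):
        lo, hi = i * interval_minutes, (i + 1) * interval_minutes
        # An episode overlaps [lo, hi) if it starts before hi and ends after lo.
        record.append(any(s < hi and e > lo for s, e in episodes))
    return record


# Hypothetical day A: two brief interactions.
day_a = [(2, 3), (20, 21)]
# Hypothetical day B: five brief interactions, then one lasting 14 minutes.
day_b = [(1, 2), (4, 5), (7, 8), (10, 11), (13, 14), (16, 30)]

rec_a = interval_record(day_a, 120, 15)
rec_b = interval_record(day_b, 120, 15)
print(rec_a == rec_b, 100 * sum(rec_a) / len(rec_a))
```

Both days yield the same record (the first two intervals scored "Yes," 25% of intervals), even though day B contains three times as many interactions and far more interaction time.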

Then what is the value of interval recording? First, it is useful when you are measuring behavior that occurs too fast to count each response, such as talking. It would be very hard to count all the words a person speaks (each word being a single response), but you can easily mark whether the person spoke at any time during each successive 5-minute interval. Second, interval recording is useful when the behavior is a state of being, such as a student's being "on task," "attending," or "engaged in interaction." A clinician is usually interested not in how many times a client gets on task, but in how long she stays on task. Of course duration recording could be used, but that would require more intense observing than interval recording.

The third value of interval recording is that as soon as a "Yes" is recorded, indicating that the behavior occurred during some part of the interval, no further observing is needed during that interval, and the data collector can do something else for the rest of it. The question that interval recording answers is "Did the behavior occur at all during the interval?", so a "Yes" ends the need for observing. In fact, the intervals can be made so large, as large as a whole day, that the recording job is extremely easy. For example, in their Parent Daily Report, Patterson, Reid, Jones, and Conger (1975) used whole-day intervals to track how aggressive children were progressing. The clinicians had parents record daily, on an individualized recording sheet, simply whether each of the child's several problem behaviors occurred that day. The clinicians even retrieved the data from the parents daily by phone, since the parents were not well enough organized to be expected to record for a full week without prompting and reinforcement.

Depending on the behavior, whole day intervals may be much too large to provide a sensitive indicator of progress. For example, suppose one goal is for a socially isolated child to play more with peers; and suppose that initially the child plays with peers about 2 minutes on most days. If whole-day intervals were used, the data would not be able to show much improvement, because the child's behavior would already appear to be 100% on most days.
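The ceiling problem can be illustrated numerically. The sketch below assumes a hypothetical 12-hour (720-minute) observing day and compares whole-day intervals with 30-minute intervals, first for about 2 minutes of peer play and then after improvement to about 20 minutes spread over four episodes. All times are invented for illustration.

```python
def percent_intervals(episodes, day_minutes, interval_minutes):
    """Percent of intervals in which any part of any episode occurred."""
    n = day_minutes // interval_minutes
    hits = 0
    for i in range(n):
        lo, hi = i * interval_minutes, (i + 1) * interval_minutes
        if any(s < hi and e > lo for s, e in episodes):
            hits += 1
    return 100 * hits / n


baseline = [(200, 202)]                                 # about 2 minutes of play
later = [(60, 65), (200, 205), (400, 405), (600, 605)]  # about 20 minutes

for label, eps in [("baseline", baseline), ("later", later)]:
    whole_day = percent_intervals(eps, 720, 720)
    half_hour = round(percent_intervals(eps, 720, 30), 1)
    print(label, whole_day, half_hour)
```

Whole-day intervals score 100% both before and after improvement, while 30-minute intervals move from about 4% to about 17%, so only the shorter intervals can show progress.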

Intervals can be as short as 10 seconds without becoming too difficult for a naive data collector, but such short intervals require the data collector to give constant attention to the client. For most Level 1 research, intervals of 30 minutes or more will be practical and yet sensitive enough. It is sometimes even reasonable to simply divide the day into three crudely comparable segments, such as "before lunch," "after lunch," and "after supper."

Another way to make interval recording easy is to ask that the data be taken for only a fraction of each day, as was illustrated earlier with recording the frequency of a child's whining during just two hours each day. Of course it is best to select a time of day when the data will be especially reflective of the problem being addressed.

Graphing. As with momentary time samples, the thing to graph is the percent of intervals showing the behavior.

Latency of Behavior

Recording. This usually means recording the time elapsed between a cue for the response and the onset of the response; but it can also mean the time elapsed between the cue and the completion of the response. For example, suppose a father gets impatient with his eight-year-old son's "dawdling and daydreaming" when he is supposed to be getting ready for school. The father might start a timer when he calls the son in the morning and record how many minutes it takes for the boy to get up, toileted, dressed, groomed, and to the breakfast table. The father, of course, will give the boy a reward each morning if the boy meets a certain criterion that they have agreed upon, a criterion that keeps getting a bit more stringent each week, as the boy improves.

Other illustrations would be a school principal's recording how quickly each class responds to a fire alarm; a teacher's recording how long it takes a student to begin an assignment that the teacher has just given; or a parent's recording how long it takes a child to comply with certain requests or commands. Depending on the precision needed, timing can be done by a stopwatch or by simply writing down the time of the cue and the time that the response is begun or completed.

Graphing. What would usually be graphed is the average time required per cue for each day. Of course if the cues occur only once a day, no averaging is needed.
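A minimal sketch of the daily latency calculation, assuming each cue and response is written down as a clock time ("HH:MM") as suggested above. The morning times and the criterion value are hypothetical; the same function averages several cues on days when more than one is given.

```python
from datetime import datetime
from typing import Optional


def latency_minutes(cue: str, response: str) -> float:
    """Minutes from the cue to the response, both given as 'HH:MM'."""
    fmt = "%H:%M"
    delta = datetime.strptime(response, fmt) - datetime.strptime(cue, fmt)
    return delta.total_seconds() / 60


def mean_latency(pairs: list[tuple[str, str]]) -> Optional[float]:
    """Average latency per cue for one day; None if no cue was given."""
    if not pairs:
        return None
    return sum(latency_minutes(c, r) for c, r in pairs) / len(pairs)


# Hypothetical mornings: (time father calls, time boy reaches the table).
mornings = {"Mon": [("06:45", "07:20")], "Tue": [("06:45", "07:12")]}
criterion = 30  # hypothetical agreed criterion (minutes), tightened weekly

for day, pairs in mornings.items():
    m = mean_latency(pairs)
    print(day, m, "met" if m <= criterion else "not met")
```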

Quality (Effectiveness, Accuracy) of Behavior

Recording. Here the interest is in either the precise topography of the movements or the quality of the product that the movements produce. Martin and Pear (1996) give an example of recording the precise topography. They list the several criteria that must be met in a swimmer's performing an optimal backstroke, criteria that constitute a task analysis of effective backstroke swimming. These criteria or components can be listed on a checklist and a coach can hold that checklist while watching a swimmer practice, checking off whether the swimmer meets each of the criteria. This same sort of procedure could be applied to workers' performance of various jobs or to social skills.

Quality recording has been used for evaluating a teenager's conversation skills (Minkin et al., 1976) and for evaluating teens' interviewing for a job (Braukmann, Maloney, Fixsen, Phillips, & Wolf, 1974).

If it is not feasible to do the task analysis necessary to create a quality-of-performance checklist, an alternative is to simply have the data recorder rate the behavior on, say, a five-point scale. While this does produce quantitative data, it is less objective and of less teaching value, because it does not show which aspects of an effective performance the subject does well and which s/he needs help with. However, a rating can be satisfactory if the rater also writes daily notes about which aspects were done well and which need to improve.

As Gilbert (1978) has pointed out, measuring the component movements themselves may not be the most valid or efficient strategy. The adage "there are many ways to skin a cat" is relevant here. What appears to be the best way to accomplish a certain end may not be the best for certain situations or persons. Thus, instead of noting on a checklist whether a youngster cleans up his/her room in a certain order, one can measure only the final product, the cleanliness of the room. Exactly what movements went into the job is not important.

Of course, if the final product is not good, effective intervention often does require observation of the specific movements. For example, a good mathematics teacher will ask a student who is making errors to do some problems in front of him/her, thinking aloud, so the teacher can detect what specific steps are being skipped or done incorrectly.

Graphing. If a checklist has been constructed showing the components of a good performance or the characteristics of a good product, the graph would probably show "percent of components done correctly," and each attempt could be represented by one data point. If the quality is measured from a product, as with most school assignments, the graph can show what percent of the items were correct each day. If the client performs the task multiple times in a day, probably it will be best to graph the average percent correct each day.
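As a sketch of the checklist approach applied to a product, suppose a hypothetical room-cleaning checklist (the items are invented for illustration, not taken from the article). Each attempt gets a percent-correct score, the day's point is the average, and the unmet items are listed to preserve the teaching value of the checklist.

```python
# Hypothetical product checklist for a cleaned room (items are illustrative).
def percent_correct(checklist: dict[str, bool]) -> float:
    """Percent of checklist items met on one attempt."""
    return 100 * sum(checklist.values()) / len(checklist)


attempts_today = [
    {"bed made": True, "clothes put away": False, "floor clear": True, "desk clear": True},
    {"bed made": True, "clothes put away": True, "floor clear": True, "desk clear": True},
]

scores = [percent_correct(a) for a in attempts_today]
daily_point = sum(scores) / len(scores)  # average when there are several attempts
needs_work = sorted({item for a in attempts_today for item, ok in a.items() if not ok})
print(scores, daily_point, needs_work)
```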

Magnitude (Intensity, Amplitude) of the Behavior

Recording. An aspect of quality that can be recorded in its own right is the size of the response, its magnitude or intensity. For example, if a physical therapist or nurse is trying to develop a better range of movement in a patient's leg, s/he can measure how high the patient lifts the leg on five successive trials, as an index of progress. Schwarz and Hawkins (1970) measured the loudness of a shy girl's voice by noting the deflection of a loudness meter on an audiotape recorder when she spoke.

Sometimes no objective measurement is feasible, but this should not deter the professional who wants to be accountable. The article by Stokes (1999) in this issue describes how his client, who was often in severe pain, recorded his subjective pain at several specified times each day by simply rating its severity, at first on a four-point scale and later on a two-point scale. Obviously the only person who could directly observe the pain's intensity was the boy himself, so a rating was the only feasible method.

Graphing. In each of these cases, a graph would typically show the average of the several measurements taken per day. However, if one wants to study the behavior more closely and perhaps discover causal factors, it may be useful to graph the magnitude for each recorded occasion.
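Both graphing options can be produced from the same raw record. The sketch below assumes hypothetical leg-lift heights in centimetres from five trials per session, as in the physical-therapy example; it computes one mean per day and also keeps every trial as its own point for closer study.

```python
from statistics import mean

# Hypothetical leg-lift heights (cm), five trials per session.
sessions = {"Mon": [22, 25, 24, 20, 23], "Tue": [26, 27, 25, 28, 24]}

# One point per day: the average of that day's trials.
daily_means = {day: float(mean(trials)) for day, trials in sessions.items()}
print(daily_means)

# One point per occasion: (day, trial number, height), for closer study.
per_occasion = [
    (day, i + 1, h) for day, trials in sessions.items() for i, h in enumerate(trials)
]
print(per_occasion[:3])
```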

Cue Effectiveness

Recording. Here the question is whether, given the opportunity or cue to respond, the participant does respond. For example, suppose that a clinician is interested in whether a child follows his/her mother's commands within a certain number of seconds. The cue in this case is the command. The mother and clinician would have to agree on what constitutes following the command--or, more likely, beginning to follow the command--and how soon it must occur to be counted as compliance (cf. Clark, 1985; Hembree-Kigin & McNeil, 1995). What is recorded is simply that a command occurred and whether compliance followed within the specified time. The occurrence of the command might be recorded by writing down the time of day, which can be useful information in later discussing the instances with the clinician. The compliance might be recorded by circling that time, with noncompliance being shown by putting an X over it.

Other examples of behavior that might be recorded in such a manner are whether a teen shows up at an appointed place and time, whether a student brings in the assigned homework, or whether a student raises his/her hand when a question is asked of the class. In each case the number of cues or opportunities must be recorded so that one can calculate the percent of cues that were responded to in the desired way.

Graphing. What is graphed is the percent of cues responded to in the desired way each day. If there were no cues given on a particular day, it would not be proper to graph a percent for that day, since there was no opportunity for the behavior. One can either leave that day off the abscissa (horizontal axis of the graph) or, if that day is left on, run the data line right past it to the next data point.
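The rule that a day without cues gets no data point can be made explicit in a small sketch. It assumes the mother's log is transcribed as (time of command, complied?) pairs, where a circled time is True and an X is False; the entries are hypothetical.

```python
from typing import Optional


def percent_complied(cues: list[tuple[str, bool]]) -> Optional[float]:
    """Percent of cues followed by the desired response; None if no cues."""
    if not cues:
        return None  # no opportunity for the behavior, so no data point
    return 100 * sum(ok for _, ok in cues) / len(cues)


# Hypothetical log: (time of command, True if circled, False if X'd).
log = {
    "Mon": [("08:10", True), ("12:30", False), ("17:45", True), ("19:00", False)],
    "Tue": [],
    "Wed": [("07:50", True), ("18:20", True)],
}
points = {day: percent_complied(c) for day, c in log.items()}
# Days without cues are skipped; the data line runs from Mon straight to Wed.
plotted = [(day, p) for day, p in points.items() if p is not None]
print(plotted)
```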

Difficulties Doing Level 1 Research

There are at least two difficulties in doing Level 1 research. First, most clinicians have no experience in obtaining simple, continuous data and are thus not facile at deciding what to record and how to record it. Some clinicians are not very facile at the earlier step of identifying important behavior (or other events) to change. Such clinicians should probably do one of two things: (a) tell their clients that they are learning a new procedure and ask them to be patient with the likely false starts; or (b) find a colleague with whom to collaborate, frequently sharing data and methods so as to get more ideas. It should be remembered that it is no tragedy to change the target behaviors or the recording methods midstream. If the client is informed at the outset that all goal-setting and measurement decisions are tentative and likely to change, the client can accept such changes.

The second problem is getting the potential data-collector--who is often the identified client or someone in the family--to do the recording. This is solved in two general ways: making the recording task realistic (which is facilitated by involving the person in deciding how and when to record) and convincing the person that the recording is a critical part of his/her responsibilities.

In order to make it clear to clients that data are part of the tasks of assessment and treatment, it is best to introduce them to the need for data during the first contact. A clinician can say something like "Of course we need to know just how significant the problem is, just as a physician measures your pulse, temperature, and blood pressure when you complain of some physical problem. Also, when we begin some kind of intervention, we will need to know whether we're making progress or not. Since I cannot be in your home (or school, or wherever), I will have to rely on you to do the 'pulse-taking' by measuring some of the important behaviors involved."

It is wise to have the person begin recording data immediately, just to get accustomed to it, even if those particular data are not what you will eventually be interested in. In addition, it is important to (a) either have various data-collection forms already prepared or develop them on the spot, with the person's participation; (b) ask the participant to restate, at the close of each meeting, what his or her data-collecting and other responsibilities are until the next meeting; (c) begin each new meeting with a review and graphing of the data brought in by the data collector; and (d) if the person fails to bring in the expected data, take it very seriously by laboriously reviewing the reasons for the data collection or, better, asking him/her to do so. This can be followed by having him/her role-play data recording while you role-play the person whose behavior is to be recorded.

Besides data-collection forms, it will often be useful to have some inexpensive equipment, such as an audiotape recorder that you may want a client to turn on at times of the day when critical events occur (e.g., when child arrives home from school or spouse from work); or inexpensive kitchen timers that the subject can set in order to be cued to record data or cued to interact with a certain person.


With managed care, clinical practitioners face greater demands for accountability. In addition, any clinician can improve the effectiveness of his/her treatment by obtaining continuous measures of "the bottom line" purpose of clinical services, changes in target behaviors, and graphing those data, using the graphs in deciding what to do next. Several methods of measurement that address several different dimensions of behavior were presented here; more examples are illustrated in the eight articles that follow.


Baer, D. M. (1977). Perhaps it would be better not to know everything. Journal of Applied Behavior Analysis, 10, 167-172.

Barlow, D. H. (1980). Behavior therapy: The next decade. Behavior Therapy, 11, 315-328.

Barlow, D. H., Hayes, S. C., & Nelson, R. O. (1984). The scientist practitioner: Research and accountability in clinical and educational settings. New York: Pergamon.

Barlow, D. H., & Hersen, M. (1984). Single case experimental designs: Strategies for studying behavior change (2nd ed.). New York: Pergamon.

Braukmann, C. J., Maloney, D. M., Fixsen, D. L., Phillips, E. L., & Wolf, M. M. (1974). Analysis of a selection interview training package for pre-delinquents at Achievement Place. Criminal Justice and Behavior, 1, 30-42.

Carter, R. K. (1983). The accountable agency. Beverly Hills, CA: Sage.

Clark, L. (1985). SOS!: Help for parents. Bowling Green, KY: Parents Press.

Cone, J. D., & Hawkins, R. P. (1977). Introduction. In J. D. Cone & R. P. Hawkins (Eds.), Behavioral assessment: New directions in clinical psychology. New York: Brunner/Mazel.

Fabry, B. D., Hawkins, R. P., & Luster, W. C. (1994). Monitoring outcomes of services to youths with severe emotional disorders: An economical followup procedure for mental health and child care agencies. Journal of Mental Health Administration, 21, 271-282.

Gilbert, T. F. (1978). Human competence: Engineering worthy performance. New York: McGraw-Hill.

Hawkins, R. P. (1986). Selection of target behaviors. In R. O. Nelson & S. C. Hayes (Eds.), Conceptual foundations of behavioral assessment (pp. 331-385). New York: Guilford.

Hawkins, R. P. (1989). Developing potent behavior-change technologies: An invitation to cognitive behavior therapists. The Behavior Therapist, 12, 126-131.

Hawkins, R. P., Aljazireh, L., & Mathews, J. R. (1994). Monitoring outcomes of psychological interventions with simple data: Improving effectiveness scientifically. The West Virginia Journal of Psychological Research and Practice, 3, 27-40.

Hawkins, R. P., Axelrod, S., & Hall, R. V. (1976). Teachers as behavior analysts: Precisely monitoring student performance. In T. A. Brigham, R. Hawkins, J. Scott, & T. F. McLaughlin (Eds.), Behavior analysis in education: Self-control and reading. Dubuque, IA: Kendall-Hunt.

Hawkins, R. P., Fremouw, W. J., & Reitz, A. L. (1982). A model useful in designing or describing evaluations of planned interventions in mental health. In A. J. McSweeny, W. J. Fremouw, & R. P. Hawkins (Eds.), Practical program evaluation in youth treatment (pp. 24-48). Springfield, IL: Charles C. Thomas.

Hawkins, R. P., & Hursh, D. E. (1992). Levels of research for clinical practice: It isn't as hard as you think. The West Virginia Journal of Psychological Research and Practice, 1, 61-71.

Hawkins, R. P., Mathews, J. R., & Hamdan, L. (1999). Measuring behavioral health outcomes: A practical guide. New York: Plenum.

Hembree-Kigin, T. L., & McNeil, C. B. (1995). Parent-child interaction therapy. New York: Plenum.

Hodges, K. (1990). Manual for the Child and Adolescent Functional Assessment Scale. Unpublished manuscript, Department of Psychology, Eastern Michigan University.

Hodges, K., Bickman, L., & Kurtz, S. (1991). Multidimensional measure of level of functioning for children and adolescents. In A. Algarin & R. M. Friedman (Eds.), A system of care for children's mental health: Expanding the research base (pp. 149-154). Tampa, FL: Florida Mental Health Institute, University of South Florida.

Hutchison, W. R. (1982). Fitting evaluation form to its function. In A. J. McSweeny, W. J. Fremouw, & R. P. Hawkins (Eds.), Practical program evaluation in youth treatment (pp. 49-60). Springfield, IL: Charles C. Thomas.

Johnson, K. R., & Layng, T. V. J. (1992). Breaking the structuralist barrier: Literacy and numeracy with fluency. American Psychologist, 47, 1475-1490.

Martin, G., & Pear, J. (1996). Behavior modification: What it is and how to do it (5th ed.). Upper Saddle River, NJ: Prentice Hall.

Minkin, N., Braukmann, C. J., Minkin, B. L., Timbers, G. D., Timbers, B. J., Fixsen, D. L., Phillips, E. L., & Wolf, M. M. (1976). The social validation and training of conversation skills. Journal of Applied Behavior Analysis, 9, 127-140.

Moxley, R. A., Lutz, P. A., Ahlborn, P., Boley, N., & Armstrong, L. (1995). Self-recorded word counts of freewriting in grades 1-4. Education and Treatment of Children, 18, 138-157.

Parsonson, B. S., & Baer, D. M. (1986). The graphic analysis of data. In A. Poling & R. W. Fuqua (Eds.), Research methods in applied behavior analysis: Issues and advances. New York: Plenum.

Patterson, G. R., Reid, J. B., Jones, R. R., & Conger, R. E. (1975). A social learning approach to family intervention, Vol. 1: Families with aggressive children. Eugene, OR: Castalia.

Schwarz, M. L., & Hawkins, R. P. (1970). Application of delayed reinforcement procedures to the behavior problems of an elementary school child. Journal of Applied Behavior Analysis, 3, 85-96.

Stokes, T. R. (1999). Psychotherapy for some sequelae of leukemia. Education and Treatment of Children, 22, 179-188.

Williams, C. D. (1959). The elimination of tantrum behavior by extinction procedures. Journal of Abnormal and Social Psychology, 59, 269.

Table 1

Ten Characteristics of Level 1 Research

1. The goal is behavior change, not explanation of that change. Thus, a baseline is often not needed. Nor is "control" over (constancy of) extraneous variables or constancy of the intervention across time necessary, again because explanation is not the primary purpose.

2. The intervention is whatever you hope will work, not necessarily some specific "package" that someone claims is the solution.

3. Data are obtained "continuously," usually daily; but sometimes less often, depending on such factors as the opportunities for the behavior.

4. The data are direct measures of target behavior or its products.

5. The data are quantitative.

6. The data are graphed daily or, at least, weekly.

7. The graphed data are evaluated for slope (direction and speed of change, if any) and level (problematic or satisfactory).

8. Data are used as a basis for action, for deciding "what to do now." The graph is usually shared with the client.

9. When one measurement outlives its usefulness, it is dropped and another is begun. The whole idea is to get useful data.

10. The intervention is changed whenever the outcomes suggest it needs improvement.
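The article relies on inspecting the graph for slope and level (item 7 above). For clinicians who want a numeric supplement, the sketch below is one option, not a method the article prescribes: an ordinary least-squares slope over day index (change per day) and the mean of the most recent few points compared with a goal. The data, the goal of 80%, and the choice of three recent points are all hypothetical.

```python
def slope_per_day(values: list[float]) -> float:
    """Ordinary least-squares slope of daily values against day index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    x_bar = (n - 1) / 2
    y_bar = sum(values) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def recent_level(values: list[float], k: int = 3) -> float:
    """Mean of the last k points, to compare against a goal level."""
    recent = values[-k:]
    return sum(recent) / len(recent)


on_task = [20, 25, 30, 35, 40]  # hypothetical daily percent of intervals on task
goal = 80                        # hypothetical goal level
print(slope_per_day(on_task), recent_level(on_task), recent_level(on_task) >= goal)
```

Here the slope is positive (about 5 points per day) but the level is still well below the goal, which in Level 1 terms suggests continuing, or strengthening, the intervention rather than stopping.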
COPYRIGHT 1999 West Virginia University Press, University of West Virginia
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 1999 Gale, Cengage Learning. All rights reserved.

Article Details
Author:Hawkins, Robert P.; Mathews, Judith R.
Publication:Education & Treatment of Children
Geographic Code:1USA
Date:May 1, 1999

