
Determining the effectiveness of the usability problem inspector: a theory-based model and tool for finding usability problems.

INTRODUCTION

Usability evaluation has grown to be part of every good effort to produce a software product that consumers can easily learn and use. The motivation to deliver products that satisfy consumer demand for usable applications has led to the development of large usability testing environments and associated methods to develop effective interaction designs. Despite increased focus on usability and on the processes and methods used to increase usability, a substantial amount of software is unusable and poorly designed. This is attributable, in part, to the lack of a sufficiently broad set of cost-effective usability evaluation tools. For example, traditional lab-based usability evaluation is often too expensive and can be less applicable early in the development cycle.

Our goal was to develop and evaluate a usability inspection method and supporting tool that has a theory-based framework for usability problem analysis and classification, usability data management, and design guidelines. This paper reports on the development of this usability evaluation method and its evaluation in a comparison study with two other usability evaluation methods (UEMs). Although other tools exist for supporting usability development activities (e.g., tools to support low-fidelity or paper prototyping, high-fidelity or programmed prototyping, usability problem classification and analysis, and usability problem data maintenance), these are beyond the scope of this paper.

The Need for More Useful Metrics

Practitioners wishing to select a UEM best suited for their needs look to comparison studies of UEMs. However, researchers performing those studies have not always provided commonly used and understood metrics for knowing if a method is effective. Lund (1998) pointed out that no single standard exists for direct comparison, resulting in a multiplicity of different measures used in UEM studies, capturing different data defined in different ways. Because usability is highly contextual, it is unlikely that standards can exist except at a very high level of abstraction.

Practitioners and researchers still need measures that can be commonly used and understood for determining whether a usability method is effective. However, few studies have clearly identified the target criteria against which success of a UEM is measured. As a result, the body of literature reporting UEM comparison studies does not support accurate or meaningful assessment of, or comparisons among, UEMs. Gray and Salzman (1998) highlighted this problem when documenting specific validity concerns about five popular UEM comparison studies. A key concern noted by Gray and Salzman is the issue of using the right measure (or measures) to compare UEMs in terms of effectiveness. Part of our goal in this work was to contribute to the development of more useful metrics for UEM comparison studies.

Model-Based Approaches

As an integrating framework and theory for our usability inspection tool and our other usability engineering support tools, we adapted and extended Norman's stages-of-action model of interaction (Norman, 1986). Our work is not the first to use Norman's model as a basis for usability inspection, classification, and analysis. Kahn and Prail (1994) used a task performance model that is similar to Norman's stages of action in that it has steps for planning, selecting, acting, perceiving, and evaluating with respect to the user's goal. Several approaches (e.g., Cuomo & Bowen, 1994; Garzotto, Matera, & Paolini, 1998; Lim, Benbasat, & Todd, 1996; Rizzo, Marchigiani, & Andreadis, 1997; Sutcliffe, Ryan, Springett, & Doubleday, 1996) have used Norman's model and found it helpful for communicating about usability problems, identifying frequently occurring problems within the model, and providing guidance for diagnosis of usability problems. In particular, Cuomo (1994) used Norman's stages-of-action model with some success to assess the usability of graphical, direct-manipulation interfaces. Cuomo concluded that the model shows promise, especially for problem classification in the usability testing environment, but that more work is needed to modify Norman's stages-of-action model for use as an inspection technique.

USER ACTION FRAMEWORK AND USABILITY PROBLEM INSPECTOR

Out of the needs expressed by practitioners and researchers, we developed an inspection tool and method, called the usability problem inspector (UPI), to support usability practitioners within the interaction development process for ensuring usability. We began by extending Norman's (1986) stages-of-action model into what we call the interaction cycle. Like Norman's model, the interaction cycle describes cognitive, perceptual, and physical actions users make as they interact with any kind of machine, especially computers. Our extension of Norman's model primarily involved some terminology changes as well as more detail and focus on user physical actions. Norman's model focuses primarily on cognitive and perceptual actions and has less emphasis on physical actions.

We then built the user action framework (UAF), a hierarchically structured knowledge base of usability issues and concepts residing in a relational database, using the parts of the interaction cycle (planning, translation, physical actions, and assessment) as the top-level UAF categories. Under the planning category, for example, we developed a structure of successively more detailed subcategories about how interaction designs support user planning (e.g., the user's system model, goal decomposition, user's knowledge of system state and modalities, user and work context). The UAF content came from interaction design guidelines, human-computer interaction issues in the literature, and from our own experience with more than 1000 real-world usability problems. Figure 1 shows how the interaction cycle is used as the top-level organizing structure for usability concepts and issues to form the UAF.

[FIGURE 1 OMITTED]

The UPI is part of a suite of usability engineering support tools built on the same UAF content and structure. Each tool is a mechanism for applying the UAF contents for a particular purpose; it governs how the UAF contents are navigated, the context in which the tool user sees UAF content, and the expression of that content. In the case of the UPI, for example, the UAF content is presented as a set of usability inspection questions. The UPI also supports reporting of answers that indicate possible usability problems, along with full usability problem classification within the UAF structure.

Using the underlying UAF to provide the inspection questions for the UPI ensures that the inspection questions cover the full breadth of cognitive and physical actions a user performs while interacting with a computer. These include questions on how well the interaction design being inspected supports the user in planning what to do, determining how to do it, doing it, and assessing whether the outcome was successful. The UAF as a foundation for UPI questions also ensures that the inspection questions cover the full range of issues, from the most general to the most specific.

The UPI is implemented as a Web-based tool using HTML and Active Server Pages to connect to a back-end relational database. The choice of which inspection questions in the UAF to present to the inspector to drive an inspection is based on answers to questions at previous nodes in the framework. The inspector has a choice between two modes: task-based inspection or free-exploration inspection; in the latter, the inspector has no particular task in mind. Table 1 shows some of this structure in a graphical map of the UAF, including the major parts of the interaction cycle (i.e., planning, translation, physical actions, and assessment) and one or two levels under each part of the interaction cycle.

This map is implemented as a "fast-access tree" in each UAF tool. The fast-access tree allows the user to traverse the UAF hierarchy by using an expanding directory structure. During the inspection process, the UPI presents the inspector with questions from each UAF node visited about how well the target (being inspected) interaction design supports usability from the specific perspective of that node content. For example, suppose the current UAF node content is about the meaning of cognitive affordances for determining actions to carry out an intention. The inspector will be asked to evaluate the effectiveness of, for example, visual cues such as button labels and menu choice wordings with respect to how well they convey the meaning of the functionality behind the button or menu choice in the context of the task being considered. For the UPI tool user, there is additional information to explain the meaning of each UAF node. In addition to the node name (e.g., "task structure and interaction control"), each node contains much more information to help users understand and distinguish concepts to make choices, including node content, choice descriptions at each node, and "look-ahead" information previewing lower-level contents of each subtree.

As the user of the UPI traverses the hierarchical UAF structure node by node, the inspector is thus presented with a series of questions about a broad range of usability issues in the target system. The question traversal process starts with the inspection session screen, where the first node of the UAF (i.e., planning) is presented to the inspector, as shown in the bottom half of the screen in Figure 2. The example shown in Figure 2 is a sample screen shot of a version of the UPI used to evaluate an electronic address book application. The top half of the screen displays the task so that the inspector can easily remember the context of the inspection. Inspectors examine the current UAF node content (labeled in the screen of Figure 2 as "problem statement") as an inspection question and decide whether a potential issue exists in the context of the current task.

[FIGURE 2 OMITTED]

As an example of how control of the flow of inspection questions is based on inspector answers, consider a question from the UAF node about the meaning of a feedback message (under "assessment > issues about feedback > content, meaning of feedback" in Table 1). The question would ask if the inspector thinks there could be problems with users understanding the meaning of the message. If the inspector says "yes," questions from nodes at the next lower level are considered. Questions from lower-level nodes are related to the topic of their "parent" nodes but are more detailed and more specific. For example, the inspector may see more detailed questions about possible causes of problems in understanding a feedback message. These include a sequence of questions about clarity, precision, completeness, correctness, relevance, user-centeredness, and consistency of that feedback message. If the inspector says "yes" to the question about a problem with clarity, the UPI will present questions about possible causes of problems with clarity, such as precision and conciseness of wording.

If the inspector gives a "no" answer to a question--for example, to the question about understanding the meaning of the feedback message--that topic is abandoned and the subtree containing more specific issues about that node is pruned off. The next question, instead, comes from moving laterally to a "sibling" node (e.g., the next topic about feedback messages). If there are no siblings left, the UPI moves back up to the parent and over to its next sibling, until a usability problem is detected and the answer to the inspection question is "yes."
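To make this control flow concrete, the following Python sketch (an illustration only; the node names, questions, and structure are invented and are not taken from the actual UAF database) shows a minimal depth-first traversal with prune-on-"no" behavior: a "yes" answer descends to the more specific child questions, a "no" answer prunes the subtree, and end nodes trigger a problem report.

    # Minimal sketch of the UPI question-flow control described above.
    # Node names and questions are illustrative, not actual UAF content.
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class UAFNode:
        name: str                                   # e.g., "Assessment > Feedback > Meaning"
        question: str                               # inspection question shown to the inspector
        children: List["UAFNode"] = field(default_factory=list)

    def inspect(node: UAFNode,
                ask: Callable[[UAFNode], bool],
                report: Callable[[UAFNode], None]) -> None:
        if not ask(node):
            return                                  # "no": abandon topic, prune its subtree
        if not node.children:
            report(node)                            # end node: open a problem report form
            return
        for child in node.children:                 # "yes": present the more detailed
            inspect(child, ask, report)             # questions at the next lower level

    # Illustrative fragment of a feedback-meaning subtree.
    clarity = UAFNode("Clarity", "Could the feedback wording be unclear or imprecise?")
    completeness = UAFNode("Completeness", "Could the feedback omit needed information?")
    meaning = UAFNode("Meaning of feedback",
                      "Could users have trouble understanding this feedback message?",
                      [clarity, completeness])

    inspect(meaning,
            ask=lambda n: input(f"{n.name}: {n.question} [y/n] ").strip().lower() == "y",
            report=lambda n: print(f"Report a problem classified under: {n.name}"))

When a "no" answer prunes a subtree, the recursion's return to the parent loop naturally produces the lateral move to the next sibling (or back up to the parent's sibling) described above.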

When a problem is identified, inspectors are presented with a problem report form, as shown in Figure 3. When a problem is detected and the inspector has traversed the full depth of the corresponding area in the UAF database, the inspector reaches one or more end nodes, where very specific usability attributes are listed for selection. As shown in Figure 3, inspectors select one or more usability attributes relevant to the problem and provide a problem name and narrative description to complete the report form. The problem report form also records relevant information, such as the current task and the inspection path taken by the user to reach the end node, as shown in the top portion of Figure 3.

[FIGURE 3 OMITTED]

Once inspectors document a problem, they continue traversing the remaining structure of the UAF database, answering inspection questions while examining potential usability issues related to each node applied to the current task. After inspectors traverse the UAF database by selecting "yes" or "no" for each usability problem question, they can then return to the beginning of the UAF to investigate the next task or they can finish the inspection. Inspectors can also use the UPI in a free-exploration mode to investigate an interface design.

The process in the free-exploration mode remains the same as with the task-based approach, except that task information is not recorded. Free-exploration mode allows the inspector to report on potential usability issues without a specific task in mind. In addition, free-exploration mode allows the inspector to review in more detail interface areas that were encountered only briefly on the way to completing a task.

Our approach to the UAF and UPI shares some similarities with the work done by Grammenos, Akoumianakis, and Stephanidis (2000) and by Henninger (2000). Grammenos et al. and Henninger have found a hierarchical structure of guidelines to be useful for providing context-specific usability problem descriptions. Grammenos et al. provided a limited automated inspection of the static components of a user interface design using the Visual Basic Integrated Development Environment. Henninger (2000) implemented a hierarchical structure of guidelines using cases and rules to examine the object attributes of the user interface. Both of these approaches show significant value in the use of guidelines, especially if they are combined with a theory-based approach using Norman's stages-of-action model of interaction (Norman, 1986).

THE THREE USABILITY INSPECTION METHODS OF THE COMPARISON STUDY

Expert-based usability inspection methods have emerged as an alternative to lab-based usability testing, being applicable earlier in the development process as well as less costly (Nielsen & Mack, 1994). Usability inspection methods (e.g., heuristic evaluations, cognitive walkthroughs, formal usability inspections, guideline reviews, and usability walkthroughs) can also be used in circumstances where lab-based usability testing is impractical or to help focus later lab-based testing efforts.

Of the usability inspection methods, heuristic evaluation (Nielsen & Molich, 1990) and cognitive walkthrough (Polson, Lewis, Rieman, & Wharton, 1992; Wharton, 1992; Wharton, Rieman, Lewis, & Polson, 1993) are the ones for which researchers have provided the most data in comparative studies. However, the results from many of these studies are mixed in terms of identifying a method that best helps the practitioner conduct a thorough and valid inspection (Andre, Williges, & Hartson, 1999; Jeffries, Miller, Wharton, & Uyeda, 1991; Karat, 1994). The results do agree, however, on a need to improve current UEMs in terms of both theoretical foundations and effectiveness for helping evaluators find problems with a more structured approach.

Heuristic Evaluation

Heuristic evaluation is probably the most widely known method in the usability community. Nielsen and Molich (1990) popularized the heuristic technique in the early 1990s by developing it as a cheap, fast, and easy-to-use method for inspection of user interfaces. Nielsen (1992) defined heuristic evaluation as a method for finding usability problems in a user interface design by having a small set of evaluators examine the interface and judge its compliance with recognized usability principles (the "heuristics"). Heuristic evaluation is meant to be used as a discount usability engineering technique to find usability problems in a product design (Nielsen & Molich, 1990).

The major strengths of heuristic evaluation are that it is relatively cheap, finds many problems, is intuitive, is easy to motivate people to use, is less dependent on detailed planning, and can be used early in the development process when only an immature prototype is available (Nielsen & Molich, 1990). Conclusions from Doubleday, Ryan, Springett, and Sutcliffe (1997) indicate specific weaknesses of the heuristic evaluation technique. For example, heuristic evaluation often identifies a large number of specific, one-time, and low-priority problems. In addition, because heuristics represent general usability principles, evaluators can be easily led to false alarms (Sears, 1997). For example, a heuristic such as "provide help" can lead some evaluators to interpret the lack of an explicit help button to be a problem, regardless of the purpose of the screen (Sears, 1997).

Cognitive Walkthrough

Cognitive walkthrough (Polson et al., 1992; Wharton, 1992; Wharton et al., 1993) is a popular theory-based usability inspection method that focuses on evaluating a design for ease of learning, particularly by exploration. In contrast to the heuristic evaluation method, the cognitive walkthrough focuses on a user's cognitive activities--specifically, the user's goals and knowledge while performing a specific task (Wharton, Bradford, Jeffries, & Franzke, 1992). Experts using the cognitive walkthrough evaluate each step necessary to perform a task, attempting to uncover design errors that would interfere with learning by exploration. The method finds mismatches between users' and designers' conceptualization of a task, poor choices of wording for menu titles and button labels, and inadequate feedback about the consequences of an action (Wharton, Rieman, Lewis, & Polson, 1994).

The primary strength of cognitive walkthrough is its task-based approach to evaluating user interaction designs. The focus on specific user tasks is intended to help designers assess how well the features of their design fit together to support users' work. Because cognitive walkthrough is based on a cognitive theory of exploration, it requires more knowledge of cognitive science terms, concepts, and skills than do most other usability evaluation methods (Lewis, Polson, Wharton, & Rieman, 1990; Wharton et al., 1992). Some of the documented concerns about cognitive walkthrough include the need for a background in cognitive psychology, the tedious nature of the technique, the types of problems identified, and the extensive time necessary to apply the technique (Desurvire, 1994; Lewis et al., 1990; Rowley & Rhoades, 1992).

UPI

As an inspection tool, the UPI brings together aspects of both heuristic evaluation and cognitive walkthrough, possessing the ease of use of heuristic evaluation and providing interaction-based structure to guide the inspection process, as in cognitive walkthrough. Ease of use of the heuristic method is based on a partial abstraction of hundreds of specific design guidelines down to a small number of easy-to-remember representative ones. The UPI (through the UAF) also uses abstraction to simplify usability issues and provide ease of use at its higher levels, giving a small number of large categories for early decision making. Unlike heuristic evaluation, however, once a problem is detected at a higher level, the UPI refers the practitioner to the structure of relevant details at lower levels to provide more specific explanations of problems in terms of problem types and causes within the interaction design.

Just as very general guidelines are difficult to interpret and apply, so are usability issues presented in only a general way in the heuristic evaluation method. In the UPI we have taken these general concepts (e.g., consistency), particularized them to very specific interaction design situations, and distributed them throughout the UAF structure. Practitioners find these specific attributes at lower levels, already applied to more detailed usability situations. Thus, for example, the usability issue of consistency becomes a large number of more specific consistency issues (e.g., consistency of button label wording, consistency of menu choice wording, consistency of feedback message placement). Further, the UPI is based on a model of the way users interact with the system, as is cognitive walkthrough, aiding and guiding the inspector through a structured and systematic inspection process. Unlike cognitive walkthrough, the UPI goes well beyond the ease-of-learning issues of cognitive affordances by helping the evaluator to consider the full range of physical usability issues (e.g., Fitts's law, object design, disability accommodation), addressing ease-of-use issues for expert users.

ESTABLISHING MEASURES USED IN THE COMPARISON STUDY

Measuring UEM Effectiveness

It is incumbent on developers of new usability inspection methods or tools to evaluate their effectiveness. In particular, developers must determine how well the method helps inspectors produce a list of problems related to real problems in an interaction design, real meaning that the problem would impact the task performance and/or satisfaction of users in the field. Answering such a question with any level of reliability is often difficult because of the variation in metrics and definitions used by researchers and practitioners (Hartson, Andre, & Williges, 2001), and it is especially difficult because developers have no standard criterion for assessing the "realness" of a candidate usability problem identified by the method. If a UEM evaluator had a complete baseline list of all the real usability problems that exist in a given target interaction design, that evaluator could ascertain the realness of candidate usability problems found by a given UEM by determining whether each such usability problem is in the baseline list. That determination will involve some kind of comparison of a candidate usability problem with each usability problem in the baseline list.

Acquiring that baseline comparison list of real problems, however, is problematic. One technique for determining realness requires the experimenters to establish a "standard" UEM that can be applied to the target interaction design to generate, as a comparison standard, a set of usability problems deemed to be "the real usability problems" existing in the target interaction design of the study. Such a list would reflect any weaknesses and biases built into that UEM but would be a consistent yardstick for computing various performance measures of each other UEM being evaluated. Hartson et al. (2001) identified three basic ways researchers can produce a standard-of-comparison usability problem set for a given target interaction design:

1. seeding with known usability problems

2. lab-based usability testing, regular and asymptotic

3. union of usability problem sets over UEMs being compared

Among these approaches, traditional lab-based usability testing is used most often to provide a baseline set of "real" usability problems in studies of UEM performance. Lab-based testing is a UEM that produces high-quality usability problem sets, but it is expensive. Researchers also use the union of all individual usability problem sets to produce a baseline set of usability problems. This approach has the advantage of requiring no effort beyond applying the UEMs being studied, but it has the drawback of eliminating the possibility of considering validity as a UEM measure, because the basis for metrics is not independent of the data. The essence of this drawback is that all false positives (i.e., candidate problems detected but not real) from each UEM are included in the baseline set (John & Marks, 1997; Sears, 1997). This eliminates validity as a possible measure, as explained further in Hartson et al. (2001).

Most lab tests by practitioners for the usual formative usability evaluation in an iterative development process are deliberately designed with an objective of cost effectiveness, stopping the design iteration before diminishing returns at an acknowledged penalty of missing some usability problems. In contrast, asymptotic lab-based testing by researchers for UEM studies is an extension of traditional lab-based usability testing, using a higher number of participants in order to identify as many potential problems in the interface as possible (Hartson et al., 2001). Here the term asymptotic means the testing is done beyond the point of diminishing returns, to the point where adding more users yields almost no new usability problems identified.

The idea of using measures such as thoroughness and validity to compare UEMs matured with Sears's (1997) study. Sears provided definitions of thoroughness and validity that we use here in equation form to indicate specifically how these measures are calculated. In simple terms, thoroughness is about the completeness of the results, whereas validity is about the correctness of the results. Hartson et al. (2001) added another measure, known as effectiveness: combining thoroughness and validity into an overall figure of merit that forces both metrics to be considered in a comparison. Hartson et al. (2001) used the following equation to calculate the thoroughness of a UEM:

(1) Thoroughness = |P ∩ A| / |A|

in which P is the set of candidate usability problems (including both real and false) detected by some UEM_P (P stands for "detected by UEM_P"), and A identifies the baseline usability problem set representing the real problems (A stands for "actual") that exist in the design. Thus the intersection of P and A is the set of real usability problems detected by UEM_P. The vertical bars indicate cardinality (size) of a set. Thus |A| is the number of elements in A, or the number of real problems existing in the design (according to this baseline set). Therefore, Equation 1 says that the thoroughness of a UEM_P is the number of real problems found by UEM_P as a proportion of real problems existing in a design.

Thoroughness, as defined in Equation 1, does not penalize for the irrelevant problems (or "false positives"; i.e., elements of P - A) that may occur from a method that overpredicts potential problems in the interaction design, but that is where the validity measure comes in. In general terms, validity is a measure of how well a method does only what it is intended to do. More specifically, it is a measure of the "correctness" (or realness) of the UEM output. According to Sears (1997), a technique is valid if evaluators are capable of focusing on relevant issues (and rejecting irrelevant issues). Validity is a necessary measure because evaluators using UEMs are known to identify a certain number of problems that are not relevant (not real) or important (i.e., false positives). Hartson et al. (2001) developed the following equation to calculate validity:

(2) Validity = |P ∩ A| / |P|

Equation 2 says that the validity of UEM_P is the proportion of problems found by UEM_P that are real (that exist in the design). Thus thoroughness is a measure of how well a UEM finds real problems, whereas validity is a measure of how well a UEM rejects false positives.

Calculating either thoroughness or validity alone, without consideration for the other, is not always sufficient for characterizing UEM effectiveness. For example, high thoroughness alone allows for inclusion of problems that are not real, possibly wasting analysis effort, and high validity alone can still miss real problems, allowing them to remain undetected. Following the concept of a figure of merit in information retrieval (Salton & McGill, 1983), a "figure of merit" for UEM effectiveness can be defined as the product of thoroughness and validity (Hartson et al., 2001):

(3) Effectiveness = Thoroughness x Validity.

Effectiveness can range from 0 to 1, reflecting the same ranges in thoroughness and validity. Where either thoroughness or validity is low, effectiveness will be low also. The effectiveness metric prevents claims that a UEM is good just because it has good thoroughness or just because it has good validity. This balanced overall metric forces both thoroughness and validity to be considered at the same time in a comparison and downgrades the net result if either one is low. Because of the factors they have in common, thoroughness, validity, and effectiveness are not completely independent. They are related measures that emphasize different aspects of the same data. John and Marks (1997) have taken effectiveness one more step by looking at it in terms of the persuasive power of predicted problems from expert-based inspections--that is, by examining how many predicted problems actually lead to changes in the design. This kind of effectiveness measure may prove to be very beneficial to the overall evaluation of UEMs.
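As a worked illustration of Equations 1 through 3, the short Python sketch below computes thoroughness, validity, and effectiveness for a hypothetical candidate problem set P against a baseline set A; the problem identifiers and counts are invented for the example and do not come from this study's data.

    # Illustrative computation of Equations 1-3 for one hypothetical UEM.
    def thoroughness(P: set, A: set) -> float:
        # Equation 1: real problems found as a proportion of real problems existing.
        return len(P & A) / len(A)

    def validity(P: set, A: set) -> float:
        # Equation 2: real problems found as a proportion of all problems reported.
        return len(P & A) / len(P)

    def effectiveness(P: set, A: set) -> float:
        # Equation 3: figure of merit combining thoroughness and validity.
        return thoroughness(P, A) * validity(P, A)

    A = {f"real-{i}" for i in range(1, 40)}                             # 39 baseline problems
    P = {f"real-{i}" for i in range(1, 10)} | {"fp-1", "fp-2", "fp-3"}  # 9 hits, 3 false positives

    print(f"Thoroughness  = {thoroughness(P, A):.3f}")   # 9/39 = 0.231
    print(f"Validity      = {validity(P, A):.3f}")       # 9/12 = 0.750
    print(f"Effectiveness = {effectiveness(P, A):.3f}")  # 0.231 * 0.750 = 0.173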

Comparing UEMs

To examine the effectiveness of the UPI, we conducted a comparative study with two other, popular inspection methods--the heuristic evaluation and the cognitive walkthrough--in a three-part process:

1. establishing a baseline set of real usability problems from an asymptotic lab-based usability test of an address book program

2. performing independent inspections of the system being inspected using expert usability practitioners, each assigned to one of three inspection methods (UPI, heuristic evaluation, or cognitive walkthrough)

3. using the baseline set of usability problems from the lab-based usability test to conduct a comparative analysis of the inspection methods using measures such as thoroughness, validity, and effectiveness

Data from previous UEM studies helped us to anticipate potential differences among the three expert-based inspection methods considered in this research. Based on these data, we predicted that methods using a theory-based approach (i.e., UPI and cognitive walkthrough) would produce significantly higher validity and effectiveness scores than would the ad hoc guidelines-based approach used by the heuristic evaluation technique. We also expected that heuristic evaluation would produce a higher thoroughness score because it is relatively easy to use and the literature indicates that this technique often finds problems quickly. Finally, we predicted that the UPI and heuristic evaluation should be more cost effective in terms of number of evaluators required, because these methods have the potential to cover a broader range of interaction actions than does cognitive walkthrough. This prediction is based on using the thoroughness measure to estimate how many evaluators are required to identify a certain percentage of the problems existing in the design.

LAB-BASED USABILITY TEST

We performed an asymptotic lab-based usability test to generate a set of real usability problems known to affect users in the lab interacting with an address book program. We then used usability problems collected from this lab-based usability test in a comparative analysis of problems identified from the expert-based inspection methods.

Asymptotic Design

Nielsen (1994), Virzi (1990, 1992), and Wright and Monk (1991) have shown that the equation

(4) P = 1 - (1 - p)^n

predicts the proportion or fraction of the total existing problems (P) that one would expect to find, given the average detection rate (p) of each iteration/session with a user and the number of iterations/users (n). The detection rate is the fraction of existing problems found in one iteration (e.g., with one user in lab testing or by one inspector in an inspection). Equation 4 can also be used to judge how many participants (n) are required in an evaluation study to identify a specific percentage of problems (P) in the interface (Virzi, 1992). Based on Virzi's (1992) recommendation, we used this formula to judge how many participants we would need in the lab-based usability test. In this study, we estimated that 20 participants would be adequate to discover most of the problems in the interface.
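As an illustration of how Equation 4 can be used to size a study, the following Python sketch computes the expected discovery proportion for a given number of participants and the smallest number of participants needed to reach a target proportion; the detection rates used here are assumed values for the example, not measurements from this study.

    # Illustrative use of Equation 4, P = 1 - (1 - p)^n, for sizing an evaluation.
    import math

    def discovery(p: float, n: int) -> float:
        # Expected fraction of existing problems found after n users/inspectors.
        return 1.0 - (1.0 - p) ** n

    def users_needed(p: float, target: float) -> int:
        # Smallest n whose expected discovery fraction reaches the target.
        return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))

    for p in (0.20, 0.25, 0.33):                      # assumed individual detection rates
        print(f"p = {p:.2f}: {users_needed(p, 0.95)} users for 95% discovery; "
              f"20 users give {discovery(p, 20):.2%}")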

Equation 4 is simply a statement of probability and not, for example, a result of a study in the usability field. It assumes each problem is equally likely to be found and applies to any process with a "hit" rate. For example, it could be used to predict the expected number of correct detections when screening patients for a certain disease with a test of known detection rate, or when testing for production flaws in a manufacturing process.

Based on the literature and our own experience, we estimated that our detection rate would be in the range illustrated in Figure 4. For example, Lewis (1994) and Virzi (1992) found that average problem detection rates can range from .16 to .42 for any one individual. Figure 4 illustrates the probability formula in Equation 4 for an individual detection rate (p) in the range of .20 to .33. The individual detection range of .20 to .33 is based on our own experience with evaluators inspecting relatively simple interaction design applications. The figure shows how the problem discovery probability (P, the fraction of existing problems expected to be found) rises with each added user up to where it levels off asymptotically in the range of 13 to 20 users (n), depending on the individual detection rate (p). The remaining unknown was the individual detection rate (p). The data we collected validated our estimate, as discussed in the Results section. To be conservative, we decided to use the full 20 user participants (n = 20) to approximate asymptotically the full set of problems existing in the interface.

[FIGURE 4 OMITTED]

Method

Participants. As a result of the analysis for asymptotic testing discussed earlier, 20 students (18 men, 2 women) enrolled in an introductory human-computer interaction course at Virginia Polytechnic Institute and State University (Virginia Tech) participated in the lab-based usability test. Students volunteered and received extra credit for their participation in the lab-based usability test. One participant was a graduate student, and the other 19 were undergraduates majoring in computer science, computer engineering, or industrial and systems engineering. As required by the study, none of the participants had any experience with the address book program being evaluated.

Materials and equipment. An address book program, implemented on a Macintosh computer, was used as the target application for collecting usability problems from the participants in the lab-based usability test. The address book program was a mature commercial product with a large range of functionality. This address book program was selected on the basis of its relatively simple interface, corresponding relationship to a paper-based address book, limited fielding in the consumer market, and known usability issues documented in a previous evaluation at Virginia Tech (McCreary, 2001). The features and interface of the address book program approximated most home and office software products within the scope of user experience, interest, and applicability. The lab-based usability test took place in the Usability Methods Research Laboratory at Virginia Tech, where we used a video camera, videocassette recorder, and monitor to record both audio and video during the test session. Live screen images were mixed with the audio and video to include the information (audio, video, and computer screen image) required for further analysis of usability problems.

Procedures. Each participant began the session with a tour of the Usability Methods Research Laboratory, followed by an explanation on how the data collection would occur during the study. After reading a description of the study, participants completed a pretest questionnaire and read a description of the address book interface and the tasks they would perform during the session. The participants performed the following six tasks:

1. insert a record

2. save a file under a different name

3. find a specific record

4. sort a file

5. make a new group

6. import data from a file

These six tasks were selected from a previous study of the interface showing that these tasks tested most of the functionality of the program. Each participant performed the six tasks using the address book program while thinking aloud during task performance. After completing each task, the participant expressed any issues not noted during the task and then took a short break before beginning the next task.

The experimenter recorded usability problems based on the critical incidents identified during the participant's performance and verbal protocol taken during each task. The experimenter recorded a critical incident if the participant did one of the following: (a) gave up on task completion and asked the evaluator for help, (b) made an error that required recovery in order to proceed with successful task completion, (c) verbalized confusion and/or difficulty when performing the task, or (d) visibly showed a delay in accomplishing a part of the task.

Participants were asked to work at a comfortable pace through the six tasks and were given no time limit for task completion. Participants were allowed to give up on a task when they verbally expressed that they had no idea on how to proceed any further. Because each successive task was not dependent on the previous task, participants who gave up were allowed to move on to the next task after a short break.

Results

Each test participant produced a list of problems that the evaluator observed and recorded. Based on the expert judgment of the experimenter, problems that appeared on multiple occasions across participants received the same label and description in order to generate an accurate list of common and unique problems. The experimenter examined videotapes of the recorded sessions if more information was needed on a specific problem following the lab-based test. A total of 39 unique usability problems were identified from the lab-based usability test, as shown in Table 2. Table 2 lists problems by identification number (1-39), relevant task number (1-6), description, interface location, and frequency of occurrence (out of 20 users). The list is also sorted by frequency of occurrence, from highest to lowest.

Based on the 39 unique problems, the mean number of usability problem instances recorded for each user was 11.25 (SD = ±1.62). Five users encountered as many as 13 usability problems, whereas two users encountered as few as 8 usability problems.

The mean detection rate for problems identified in the lab-based usability test was .29 (mean problem identification of 11.25 divided by 39 problems). Thus, as it turns out, for the data from the lab-based usability test the probability formula in Equation 4 predicts that as few as 13 users (at p = .29) would have been needed to find 99% (P = .99) of the problems existing in the address book application. In any case, our data more than validated the decision to use 20 participants in the lab-based usability test.

The 39 real usability problems identified in this lab-based study formed the basis for calculating thoroughness, validity, and effectiveness metrics of alternative UEMs. We compared the UPI with two other UEMs used by professional usability evaluators in the expert-based comparison study.

EXPERT-BASED INSPECTION COMPARISON STUDY

The primary purpose of the comparison study was to determine the effectiveness of the UPI when compared with two other expert-based inspection methods: heuristic evaluation and cognitive walkthrough. The study also provided an opportunity to apply carefully considered measures that can be commonly used, criteria for computing metrics, and metrics to demonstrate effectiveness (Hartson et al., 2001) within a UEM comparison study.

Method

Participants. All participants were recruited from industry in order to form a relatively homogeneous group of usability professionals with real-world experience in usability testing, interface design, and/or the use of expert-based inspection methods. Doubleday et al. (1997) have shown that the most proficient inspectors are those with usability experience and formal training in areas such as human factors, computer science, human-computer interaction, and cognitive psychology. We randomly assigned 30 participants (14 men, 16 women) to one of the three expert-based inspection methods (10 for each method). These 30 participants came from seven different companies, with six companies each supplying 4 participants and one company supplying 6 participants.

All participants had a minimum of 2 years' experience (M = 8.9 years) in the field, supporting activities such as test and evaluation, design, research, and teaching. The participants came from organizations where usability engineering (design, test, or evaluation) was a formal part of their daily experience. From a demographic questionnaire using a scale of 1 (never) to 6 (daily), participants reported having the most experience in interface design (M = 4.03), followed by usability testing (M = 3.20) and then the heuristic evaluation technique (M = 3.17), and they had the least experience with the cognitive walkthrough technique (M = 2.20). All participants possessed at least a bachelor's degree in computer science, human factors, psychology, or industrial engineering. A majority (26 of 30) of the participants possessed an advanced degree (master's or Ph.D.). The average age of the participants was 36 years, ranging from 28 to 50 years.

Materials and equipment. The same address book program from the lab-based test was used for the comparison study. The relatively simple interface of the address book application allowed participants to quickly understand the interface issues without previous experience with this specific address book program. None of the participants was familiar with this application. The address book application was hosted on a Macintosh PowerBook 520c so that it could be transported to each participant's work site. Materials for the UPI method included a Microsoft Access database, Netscape Communicator, Microsoft Personal Web Server, and a series of Active Server Pages that linked the UPI HTML pages to the back-end UAF database. Participants in the heuristic evaluation method used materials from Nielsen's (1993) description of the method, with the addition of task scenarios, in accord with the way the method is commonly used in the usability evaluation community (Virzi, 1997). Participants in the cognitive walkthrough method used materials described in Wharton et al. (1993).

Procedures. The comparative study used a between-groups design, with evaluators randomly assigned to one of three groups: UPI, cognitive walkthrough (CW), or heuristic evaluation (HE). To compare the effectiveness of the methods under similar conditions, training and evaluation time was limited to a total of 2 hr for each method. The limited training and evaluation time was justified by three reasons. First, all participants were usability practitioners and did not require extensive training, given that all had at least a working knowledge of expert-based inspection methods. Second, as volunteers for this study, these usability practitioners had limited time to offer for the evaluation session. Third, the short evaluation time approximated what is now becoming a typical situation in product development cycles: limited time to complete evaluation activities.

Participants in all three groups completed a 30-min training program on their respective method. Training materials for the UPI method consisted of briefing slides that explained the content and structure of the UPI, an information sheet on the address book interface, instructions on how to conduct the evaluation, a listing of the six tasks used in the lab-based usability test, and a user-class definition (on whose behalf the inspection was to be conducted). The six tasks from the lab-based usability test, the information sheet, and the user-class definitions were the same for all three methods. UPI participants used the Web-based tool to document the problems they identified with the interface. UPI participants had 1 hr to complete their inspection, using the six representative tasks from the lab-based usability test. After this task-based inspection, participants had 15 min to perform a free-exploration inspection, which was not task dependent. During the free-exploration session, UPI participants could review and inspect any part of the address book interface.

The training materials for the CW participants consisted of a tutorial based on Wharton et al. (1993), the information sheet on the address book interface, instructions on how to conduct the evaluation, a listing of the six tasks used in the lab-based usability test, and a user-class definition to focus the inspection. CW participants used paper forms to document the problems they identified with the interface. Each form included the task description and the step needed to successfully perform the relevant portion of the task. CW participants were given 75 min to complete their inspection using the six representative tasks from the lab-based usability test. The cognitive walkthrough method is defined only for a task-based approach, so the inspection did not include a free-exploration portion.

The training materials for the HE participants consisted of a tutorial adapted from Nielsen (1993), including the concept of task scenarios to provide context for the inspection. The task-scenario approach is a known variant of the heuristic evaluation often used in industry to provide assistance in completing the evaluation session. Participants also received an information sheet on the address book interface, instructions on how to conduct the evaluation, and a user-class definition to guide the inspection. HE participants used paper forms to document the problems they identified with the interface. Each form included a place to record the problem description, how the problem was found, and the heuristic label relevant to the problem. HE participants were given 75 min to complete their inspection using an approach based on task usage scenarios. The task usage scenarios provided to the participants were based on the original six tasks from the lab-based usability test. Although HE participants received a list of the typical tasks users perform, these tasks conveyed only what users do, not the specifics of how to do a particular task.

Following completion of the inspection session, each participant completed the posttest questionnaire and was thanked for participating in the study. The entire inspection session for each method, including training and evaluation, lasted 2 hr.

Data analysis. Each inspection session produced a list of problems for each participant. All problems were tagged with participant and condition identifiers and combined into one list. Each problem fell into one of two categories: (a) the problem was the same as one identified from the lab-based usability test or (b) the problem was found by one or more of the tested methods but not by the lab-based test (and, therefore, was not counted as "real"). Duplicate problems identified by the same evaluator were removed from the list. The list of problems was normalized by merging problem descriptions of duplicate problems across conditions into one common description. That is, problem descriptions from different evaluators describing what, in the experimenter's expert judgment, was the same problem were merged into one, taking the attributes of each and forming a complete description of the problem. If a method identified a problem from the lab-based problem set, the lab-based description of the problem was used because it had already been well defined before the comparison study. All the analysis discussed in the following sections used normalized problem sets. This normalized problem set defines our "problem types." Calculations for thoroughness, validity, and effectiveness scores are based on the aggregated mean of evaluator scores.
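Because the reported scores are aggregated means of per-evaluator scores, the following sketch (with invented evaluator problem sets) shows one plausible way such an aggregation could be computed against the baseline set; it is an illustration of the approach, not the actual analysis code used in the study.

    # Hypothetical per-evaluator scoring and aggregation for one method.
    from statistics import mean

    baseline = {f"real-{i}" for i in range(1, 40)}    # 39 real problems from the lab test

    # Each evaluator's normalized set of reported problem types (invented data).
    evaluators = [
        {"real-1", "real-2", "real-3", "fp-1"},
        {"real-2", "real-4", "fp-2", "fp-3"},
        {"real-1", "real-5", "real-6"},
    ]

    thoroughness_scores  = [len(s & baseline) / len(baseline) for s in evaluators]
    validity_scores      = [len(s & baseline) / len(s) for s in evaluators]
    effectiveness_scores = [t * v for t, v in zip(thoroughness_scores, validity_scores)]

    print(f"Mean thoroughness:  {mean(thoroughness_scores):.3f}")
    print(f"Mean validity:      {mean(validity_scores):.3f}")
    print(f"Mean effectiveness: {mean(effectiveness_scores):.3f}")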

Results

Participants using the expert-based inspection methods identified a total of 96 problem types. Of these 96 problem types, 62 were new problems and 34 came from the original 39 real problems already identified from the lab-based usability test. None of the other 5 problem types identified by the lab test was detected by any of the inspection methods.

Figure 5 shows the problem types divided into seven partitions according to problems unique to each method and those common to two or more methods. The area of each circle graphically reflects the total number of problem types identified in each method. The number in each part of the diagram represents the number of problem types found by the combination of tested UEMs corresponding to that section of the diagram. The number in parentheses indicates the number of problems that were also originally discovered by users in the lab-based usability test. For example, the "15 (4)" in the part of the UPI circle that does not intersect another circle means that 15 problem types were identified by the UPI and not by the other UEMs or the lab test. It also means that 4 problem types were identified by the UPI and the lab test but not by the other UEMs. Similarly, the "14 (6)" in the intersection between the UPI and HE circles means that 14 problem types were found by both the UPI and the HE but not by the CW or the lab test. It also means that 6 problems were found by both the UPI and HE, as well as by the lab test, but not by the CW.

[FIGURE 5 OMITTED]

Measures of thoroughness, validity, and effectiveness. All thoroughness, validity, and effectiveness scores were calculated using Equations 1, 2, and 3 based on the 39 real user problems isolated in the lab-based usability test. Results of the calculations for thoroughness, validity, and effectiveness are shown in Table 3, and a side-by-side comparison is illustrated in Figure 6.

[FIGURE 6 OMITTED]

Correlations among the three measures of thoroughness, validity, and effectiveness are shown in the correlation matrix in Table 4. As expected, the results showed the measures of thoroughness, validity, and effectiveness to be highly correlated (p < .01), given that each of the measures shares a common numerator. Based on the significant correlation among thoroughness, validity, and effectiveness, a multivariate analysis of variance (MANOVA) was used to test the differences among the three inspection methods. Results from the MANOVA indicated a significant difference among the three inspection methods using Wilks's lambda as the test statistic (lambda = .405), F(6, 50) = 4.74, p < .01. Based on the significance from the MANOVA results, univariate tests determined whether significant differences existed for each measure.

Subsequent one-way analyses of variance (ANOVAs) on thoroughness, validity, and effectiveness showed a significant difference for thoroughness, F(2, 27) = 3.49, p < .05, validity, F(2, 27) = 16.60, p < .001, and effectiveness, F(2, 27) = 11.94, p < .001. The Bonferroni t test was used to conduct a post hoc analysis for multiple comparisons across the three inspection methods. Table 5 summarizes the statistics from the Bonferroni t test. The results of the post hoc analysis indicate

* the UPI method (M = .233) was significantly more thorough at identifying problems found in the lab-based usability test than was the heuristic evaluation method (M = .179), p < .05;

* the HE method (M = .482) was significantly less valid than either the UPI (M = .785) or CW (M = .699) methods, p < .01;

* the HE method (M = .089) was significantly less effective than either the UPI (M = .182) or CW (M = .146) methods, p < .05; and

* the UPI and CW methods were not significantly different with respect to thoroughness (M_UPI = .233 vs. M_CW = .202), validity (M_UPI = .785 vs. M_CW = .699), or effectiveness (M_UPI = .182 vs. M_CW = .146).

We calculated problem detection rates, p, for each method using the mean thoroughness scores from Table 3. We used the probability function of Equation 4, P = 1 - (1 - p)^n, to compute a curve for each method, relating the fraction of problems expected to be found to the number of participants in each evaluation, as was done for Figure 4. Figure 7 shows the detection probability (fraction of problem types expected to be found) using the mean thoroughness score as the detection rate, p, for each method. The result in Figure 7 predicts that 6 inspectors would be needed in the UPI method to detect 80% of the problems identified in the lab-based usability test. The CW method would require 7 inspectors, and the HE method would require 8 inspectors to detect the same level (80%) of problems from the lab-based usability test.
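The following sketch reproduces this kind of projection, using the mean thoroughness scores reported above as the per-inspector detection rates; the 0.79 threshold is a choice made here to capture values that round to approximately 80%, so the resulting inspector counts are an illustration of the calculation rather than output from the study's analysis.

    # Expected discovery fraction by number of inspectors, using each method's
    # mean thoroughness as the per-inspector detection rate p in Equation 4.
    detection_rates = {"UPI": 0.233, "CW": 0.202, "HE": 0.179}

    for method, p in detection_rates.items():
        curve = [(n, 1.0 - (1.0 - p) ** n) for n in range(1, 11)]
        # First n whose expected fraction reaches roughly 80% (threshold chosen here).
        n_80 = next(n for n, frac in curve if frac >= 0.79)
        print(f"{method}: reaches ~80% at about {n_80} inspectors "
              f"(e.g., n=5: {curve[4][1]:.2f}, n=8: {curve[7][1]:.2f})")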

[FIGURE 7 OMITTED]

Expert evaluator survey. After completing their inspection sessions, the expert evaluators completed a posttest questionnaire to assess attributes of their respective inspection method. Figure 8 displays the results of the posttest questionnaire graphically, indicating general trends with respect to participant ratings. All questions used a Likert-type scale from 1 (strongly disagree) to 5 (strongly agree), with 3 as the neutral midpoint. A MANOVA was used to test the difference among the three inspection methods in terms of the five questions on the posttest. Results from the MANOVA did not reveal a significant difference among the three inspection methods, Wilks's lambda = .578, F(10, 46) = 1.45, p > .10.

[FIGURE 8 OMITTED]

GENERAL DISCUSSION

Data from the expert-based inspection comparison study point to several findings that are important considerations for conducting UEM comparison studies. We discuss considerations for both the lab-based usability study and the expert-based inspection study.

Lab-Based Usability Test

The lab-based usability test generated a baseline problem set to be used as a standard-of-comparison set of real usability problems known to affect users. Through asymptotic user testing with 20 participants, 39 problem types were documented for the address book interface. The data also indicated that fewer than 20 users would have been adequate to uncover most of those problems.

Expert-Based Inspection Comparison Study

The baseline set of real problems from the lab-based usability test formed the key component of the criteria for comparing expert-based inspection methods. The theory-based approaches of the UPI and cognitive walkthrough proved to be more valid and effective than the ad hoc guidelines-based approach used in the heuristic evaluation technique. Because the analysis focused on one particular interface application, the results are limited to similar home and office applications that do not require extensive training or substantial experience. Although other interface applications were available for analysis, the address book program was representative of many applications that home and business users currently use. In addition, the address book program was essentially a "walk-up-and-use" product, at which expert-based inspection methods are primarily directed. Finally, the tasks and usage scenarios selected to drive the expert evaluation had some impact on the types of problems found by each method. Although we provided the same task context to all evaluators, a different set of tasks or task descriptions might produce a different set of results.

Thoroughness, Validity, and Effectiveness

Calculating thoroughness, validity, and effectiveness values highlights important differences among expert-based inspection methods. Even though these measures emphasize different aspects of the same data, they are not three completely independent ways of assessing the same hypothesis. Depending on the goals of the researcher or practitioner, one of these measures may be the most relevant.

In this study the heuristic evaluation method was found to be significantly less thorough than the UPI method, contrary to our original hypothesis that heuristic evaluation would produce a higher thoroughness score because it often finds problems quickly. Other researchers (e.g., Jeffries et al., 1991; Sears, 1997) have confirmed the conclusion that the heuristic evaluation method finds a large number of "nonreal" problems. Despite this tendency to find false positives, researchers have often reported the heuristic method as being productive (or thorough) because of the large number of problems it does identify and its relative ease of use (Dutt, Johnson, & Johnson, 1994; Virzi, Sorce, & Herbert, 1993). We, too, expected that the heuristic evaluation, even with its propensity to identify false positives, would perform better in terms of thoroughness than it did here. The results in this study highlight the importance of the thoroughness measure over measures that focus only on problem counts, as we will discuss in a later section. Thoroughness accounts for the intersection of identified problems and a known set of real problems, not merely the total count of problems a method identifies. The low thoroughness result for the heuristic evaluation method is most likely related to its ad hoc guidelines-based approach, which does not have a theoretical basis.

Similar results were obtained for both the validity and effectiveness measures, supporting the hypothesis that methods using a theory-based approach (i.e., UPI and cognitive walkthrough) would produce significantly higher validity and effectiveness scores than would an ad hoc guidelines-based approach such as heuristic evaluation. Overall, the heuristic evaluation method was less valid and less effective than either the UPI or the cognitive walkthrough method. Validity indicates how much of the inspection effort goes to real problems and, by its complement, how much is wasted on issues that are not important. As an example, the heuristic evaluation method produced a validity score of .482, meaning that 51.8% of the inspection effort was wasted in finding problems that were not part of the real set. In contrast, inspectors using the UPI identified 78.5% valid problems, on average, wasting only 21.5% of their effort on problems that turned out to be unrelated to the lab-based usability problem set. The cognitive walkthrough inspectors identified 69.9% valid problems on average and spent 30.1% of their effort identifying problems outside of the lab-based usability problem set.

We also predicted that the UPI and heuristic evaluation would be more cost effective, in terms of the number of evaluators required, because these methods have the potential to cover a broader range of interaction actions than does cognitive walkthrough. This hypothesis was supported for the UPI but not for the heuristic evaluation; the prediction rested on the thoroughness measure, on which heuristic evaluation scored poorly. The effectiveness measure combines thoroughness and validity into a single figure of merit, which compensates for the fact that a method scoring high in thoroughness could conceivably generate a large number of invalid problems. Because the heuristic evaluation method scored low in both thoroughness and validity, its effectiveness score was also the lowest.
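
These three measures can be computed directly from problem sets. The Python sketch below follows the definitions used in this discussion (thoroughness as the proportion of real problems found, validity as the proportion of reported problems that are real, and effectiveness as their product); the problem identifiers are hypothetical and serve only as an illustration.

    def thoroughness(found, real):
        """Proportion of the real problems that the method identified."""
        return len(found & real) / len(real)

    def validity(found, real):
        """Proportion of the method's reported problems that are real."""
        return len(found & real) / len(found)

    def effectiveness(found, real):
        """Figure of merit combining both measures."""
        return thoroughness(found, real) * validity(found, real)

    # Hypothetical problem identifiers, for illustration only.
    real_problems = {f"P{i}" for i in range(1, 40)}        # a 39-problem baseline set
    method_report = {"P1", "P2", "P5", "P9", "X1", "X2"}   # four real hits, two false positives

    print(round(thoroughness(method_report, real_problems), 3))   # 4/39 = 0.103
    print(round(validity(method_report, real_problems), 3))       # 4/6  = 0.667
    print(round(effectiveness(method_report, real_problems), 3))  # 0.068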

One important finding is that the UPI and cognitive walkthrough were roughly equivalent on all three measures. Although the UPI had a slight advantage in the raw numbers for all three measures, the advantage was not significant. Both the UPI and the cognitive walkthrough are built on theory-based approaches, and both have a task-based focus. The heuristic evaluation method used in this study included task usage scenarios to give context to the inspection. Sears (1997) used task-based inspection methods and found them to be much more valid than a free-exploration method (i.e., heuristic evaluation without any task scenarios). However, in one case, Sears found a task-based approach (cognitive walkthrough) to be less thorough than a free-exploration approach (heuristic evaluation without task scenarios). Task-based approaches, especially those with a theoretical basis, generally take more time than does free exploration, but they expose the inspector to issues that are likely to affect real users. Notably, the UPI inspection session did include a free-exploration portion, completed after the task-based portion. This free-exploration part of the UPI may have affected the results: 12 problems were identified during this part of the evaluation, 7 of them in common with problems from the lab-based usability test.

Overall, task-based methods with an interaction-based framework (i.e., the UPI and cognitive walkthrough) tend to identify fewer false alarms and miss fewer important problems than do free-exploration methods that do not integrate a task component. Even though the heuristic evaluation in this study used task scenarios, these scenarios were not built into a method or framework, as were the tasks for the cognitive walkthrough and the UPI. For example, cognitive walkthrough uses specific questions related to each task to drive the inspection. The UPI also uses the tasks to drive the inspection, asking specific questions about the task in relation to usability issues. From the Sears (1997) study and the data reported here, it appears that a task-based versus free-exploration dimension does not by itself reliably account for the differences between methods. A more important variable appears to be how closely the method links the tasks to specific inspection questions and to a theoretical framework that guides the inspection.

Results Based on the Union of Problem Sets

One way researchers can produce a standard-of-comparison usability problem set for a given target interaction design is to use the union of all problem sets. Very different results are obtained when the lab-based usability data are discarded and replaced with the union of problem sets as the standard comparison group when calculating thoroughness. For example, the heuristic evaluation method emerges as the leader because of the larger number of problems it identified. In fact, the heuristic method contributes many usability problems (otherwise considered false positives) to the union that the other methods then "fail" to detect. A one-way ANOVA for thoroughness using the union of problem sets from the expert-based inspection methods showed a significant difference among the three groups, F(2, 27) = 5.78, p < .01. Thoroughness for the heuristic evaluation method (M = .156) was significantly higher than the thoroughness scores for the UPI (M = .123) and cognitive walkthrough (M = .117) methods, p < .05.

We chose not to base our results on the union of problem sets because it potentially introduces more nonreal problems (false positives) and cannot be used to calculate the validity measure. Calculating validity against the union of all problem sets guarantees that the intersection of each UEM usability problem set and the standard usability problem set (the union) is always the UEM usability problem set itself. This means that all usability problems detected by each method are in the criterion set and are therefore considered real, yielding a validity of 100% for every participating method (see Hartson et al., 2001, for a detailed explanation). Consequently, we based the measures of thoroughness, validity, and effectiveness on the lab-based usability test as the comparison standard for identifying "real" problems rather than basing conclusions merely on the union of all problems. In addition, we wanted to use a standard comparison set of usability problems and metrics to avoid the weakness Lund (1998) pointed out concerning current UEM studies--that is, no single standard for direct comparison and a multiplicity of different measures with various definitions.
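
A small sketch makes the point concrete. With hypothetical problem sets, scoring against the union inflates the thoroughness of whichever method reports the most problems unique to itself and forces validity to 1.0 for every method.

    # Hypothetical reported problem sets for three methods (identifiers are illustrative).
    he  = {"P1", "P2", "X1", "X2", "X3"}   # many reports unique to this method
    upi = {"P1", "P3", "P4"}
    cw  = {"P1", "P3", "P5"}

    union_criterion = he | upi | cw        # 8 problems; every report is "real" by construction

    for name, found in (("HE", he), ("UPI", upi), ("CW", cw)):
        thoroughness = len(found & union_criterion) / len(union_criterion)
        validity = len(found & union_criterion) / len(found)
        print(f"{name}: thoroughness={thoroughness:.3f}, validity={validity:.1f}")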

Expert Evaluator Reports

Results from the posttest questionnaire did not show significant differences among the three methods but did reveal some interesting subjective opinions. For example, participants did not view the UPI as a method they could use to perform a quick evaluation. This opinion is most likely attributable to the large volume of the question database (from the underlying UAF). Inspectors may find it difficult to appreciate that large portions of the database are quickly pruned by a "no" answer to certain questions. In addition, the current version of the UPI involves a careful traversal process with several checks and balances to ensure that the user is tracking to the desired point. Evaluators quickly learn certain parts of the database and often want to return directly to a node they have visited before. The current implementation of the method still requires them to traverse the structure, but future implementations will make this more efficient.

Some of the participants in the study found the cognitive walkthrough easy to learn and apply. These opinions contradict previous research on the cognitive walkthrough, in which it was viewed as the most difficult method to learn and apply (Lewis et al., 1990; Rowley & Rhoades, 1992; Wharton et al., 1992). However, some research on cognitive walkthrough has been conducted with undergraduate and graduate students, who may not have had the experience necessary to fully understand the cognitive aspects involved with the method. Participants in the present research had considerable experience in the field and appropriate academic training, which may have been better matched to the requirements of cognitive walkthrough.

Another interesting observation is that participants in the heuristic evaluation group were more willing to recommend their method to their organization, even though the comparison study showed it to be the least effective method. This result could be attributable to the popularity of heuristic evaluation and its ease of use (Rosenbaum, Rohn, & Humburg, 2000). Evaluators may also recognize that, given the realities of a development environment, additional effort and precision are not necessarily useful because not everything gets fixed anyway. Another explanation is that these evaluators had more experience with the heuristic evaluation technique in their own usability evaluation activities and might see a conflict in not recommending it.

Limitations of the Study

As with any study, this one has limitations that restrict us from generalizing to other domains. We conducted one study involving one application; additional research is needed to determine whether these results generalize to other applications. Not all techniques are equally effective for all types of applications.

One limitation of this study concerns the use of clearly defined tasks to structure the inspection, a common criticism of similar UEM studies. Clearly defined tasks often produce a less realistic interaction for the user or inspector and, therefore, may bias the types of problems uncovered in a lab-based usability test. Brief lab-based testing with clearly defined tasks, as was the case in this study, generally favors ease-of-learning problems. The cognitive walkthrough is known to be oriented toward identifying ease-of-learning problems and to be less effective for ease of expert use (Polson et al., 1992; Virzi, 1997; Wharton et al., 1994). As a result, the clearly defined tasks in this study may have biased results in favor of the cognitive walkthrough. At a minimum, predefined tasks are not as generalizable as real usage data; however, they do make direct comparison of results possible. UEM comparison studies based on real usage data are starting to occur through remote evaluation, but they will require more data and time before similar comparisons can be made.

Another limitation of this study concerns the process of problem extraction and matching (i.e., identifying two different descriptions of the same problem). Lavery, Cockton, and Atkinson (1997) found the problem matching process difficult for programming experts who were given two sets of usability problem statements: one from a task analysis and one from user testing. Lavery et al. concluded that for reliable matching, problem reports must be of comparable granularity and there must be explicit matching rules.

Although we did have report formats for the lab-based usability test and the expert-based inspection study, these formats were specific to each method, introducing the issue of how to match usability problems. Higher-level problems involving overall task flow, for example, were easier to match. Lower-level problems usually involved a user interface object (e.g., button, menu choice, dialogue box). A key criterion for matching these problems was a similar effect of the problem on the user relating to a given attribute of a given user interface object. For example, if two different users had trouble understanding a graphical icon because the image was too abstract, we counted that as the same problem. This allowed us to identify multiple occurrences of the same usability problem, even across users and tasks. For example, one user might be confused by a button label on Task 1, whereas another user might be confused by the same label on Task 4 but not notice it on Task 1; we counted this as one problem about the confusing label. In the expert-based inspection study, we used the problem descriptions directly from the experts and applied our own judgment to match descriptions across inspectors. Again, the relevant interface object and its attributes were used as the key discriminators in determining whether a problem was unique or matched the description from another inspector.

Even with criteria for problem matching, the results from this study are limited by the judgment we applied during the problem extraction and matching stages. This limitation is inherent in every expert-based inspection method because no standard method exists to extract usability problems, code them with specific attributes, and match them. Lavery et al. (1997) recommended a common report format to reduce these potential problems, but this is nearly impossible to achieve because methods are often defined by how they describe problems. Although various methods exist for problem extraction, all of them use a subjective process to identify attributes of the problems, even when standard report formats are provided. The UAF-based usability problem classifier tool (Hartson, Andre, Williges, & Van Rens, 1999) is designed to help produce usability problem descriptions with "standard" structure and vocabulary to support matching and to force more complete problem reports.
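
The matching criterion described above can be viewed as keying each report on the interface object, the attribute involved, and the effect on the user. The following sketch is a simplified, hypothetical illustration of that idea; in practice, matching required human judgment rather than exact comparison of fields.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ProblemReport:
        """One usability problem description from a participant or inspector."""
        ui_object: str   # e.g., "Sort dialog box"
        attribute: str   # e.g., "button label wording"
        effect: str      # e.g., "user unsure whether the sort was applied"

    def match_key(report):
        # Two reports are treated as the same problem when they concern the same
        # object and attribute and describe a similar effect on the user.
        return (report.ui_object.lower(), report.attribute.lower(), report.effect.lower())

    def merge_reports(reports):
        """Collapse duplicate descriptions of the same underlying problem."""
        unique = {}
        for r in reports:
            unique.setdefault(match_key(r), []).append(r)
        return unique

    reports = [
        ProblemReport("Save button", "label wording", "unsure what is saved"),
        ProblemReport("Save button", "label wording", "unsure what is saved"),
        ProblemReport("Sort dialog box", "default value", "unexpected sort order"),
    ]
    print(len(merge_reports(reports)))  # 2 distinct problems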

INFORMAL COMPARISON OF EXPERT-BASED INSPECTION METHODS

One of the criticisms that Gray and Salzman (1998) leveled at reports of UEM comparison studies concerned the practice of mixing study results with discussion based on expert opinion, conjecture, and other narrative that goes beyond what the data can directly support. The foul was not necessarily the opinions offered--in fact, the conductors of studies may be in the best position to offer such opinions, and readers may find them among the most interesting parts of study reports. However, the integrity of the science demands that this kind of discussion, however well informed, be kept separate, and labeled separately, from claims and conclusions that stem from the data, so that the reader can see where one ends and the other begins. Based on our own use of the UPI, cognitive walkthrough, and heuristic evaluation, we offer some informal comparisons of the UPI with the other methods.

Qualitative Comparison of UPI with Heuristic Evaluation

Strengths and weaknesses of the heuristic evaluation are well documented in the literature (e.g., Doubleday et al., 1997; Nielsen & Molich, 1990; Sears, 1997). Based on our experience with usability evaluation, the UPI approach has two primary advantages over heuristic evaluation: it has a foundation in a theory-based model of user interaction, and it is not an abstraction of guidelines.

The main advantage of the UPI over heuristic evaluation is one shared with the cognitive walkthrough method. Because the UAF is based on an extension and adaptation of Norman's (1986) stages-of-action model, a usability inspection done with the UPI is based on the cognitive and physical actions a user makes while performing a task. The underlying UAF framework gives structure and guidance to the way UPI inspection questions are asked, targeting them directly at the ways the design does or does not support user needs in performing cognitive and physical actions during each stage of interaction--as the user plans goals and intentions, determines actions to carry out intentions, performs physical actions, and assesses the outcomes with respect to the goals. The heuristic evaluation method has no interaction-based model to guide the inspection process.
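
As a rough illustration of what an interaction-based inspection structure looks like, the sketch below enumerates the four stages of the Interaction Cycle with one paraphrased prompt per stage; the wording is ours and is not the UPI's actual question text.

    from enum import Enum

    class Stage(Enum):
        PLANNING = "planning"
        TRANSLATION = "translation"
        PHYSICAL_ACTIONS = "physical actions"
        ASSESSMENT = "assessment"

    # Paraphrased, illustrative prompts only; not the UPI's actual question wording.
    SAMPLE_PROMPTS = {
        Stage.PLANNING: "Does the design help the user form an appropriate goal for this task?",
        Stage.TRANSLATION: "Can the user determine which action will carry out the intention?",
        Stage.PHYSICAL_ACTIONS: "Can the user perform the action efficiently and without error?",
        Stage.ASSESSMENT: "Does feedback let the user judge whether the outcome met the goal?",
    }

    for stage in Stage:
        print(f"{stage.value:>16}: {SAMPLE_PROMPTS[stage]}")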

The second UPI advantage concerns the abstraction of guidelines: the heuristics are essentially abstracted guidelines. One problem is that the small number of categories forces too many items into too few bins for classification and analysis, a condition criticized by John and Mashyna (1997) on the grounds that funneling usability problem classifications into a small number of categories yields too much apparent agreement in the classification results.

Although abstraction helps to control complexity by limiting the number of guidelines the analyst has to consider, it also works against the ability to be specific about a usability situation. The heuristics often turn out to be too vague and too general (i.e., already too abstract), requiring a significant amount of interpretation to apply them to a specific design situation. Consider, for example, the concept of consistency. The guidelines say to "be consistent" in interaction design, defining that as doing the same kinds of things in the same way each time. Consistency is a useful guideline, but our experience has been that by the time most designers get into design details, or inspectors into inspection details, they often do not know when the consistency guideline applies, when to consider it, or how to interpret it for a specific situation. This is partly because consistency issues can take on many forms (Grudin, 1989), with differences in what it means to "be consistent."

In the UAF and its tools, guidelines such as those for consistency are not "factored out" but are distributed and particularized over the entire UAF structure, appearing as individualized guidelines specific to various detailed design situations. The UAF guides the designer or inspector to the detailed usability situation first and then gives a small number of guidelines that apply, each of which is made specific to this situation. Thus the UAF content about consistency appears at low levels of abstraction in specific expressions of how consistency applies, for example, to the wording of a button label or to the way a dialogue box is exited.

Qualitative Comparison of UPI with Cognitive Walkthrough

As we noted in the Limitations of the Study section, the bias toward ease of learning that shows up in brief lab testing would favor the cognitive walkthrough method in this study. However, because the UPI method is designed to support both ease of learning and long-term ease of use, it should perform better than cognitive walkthrough in a test involving expert users over a longer term. The physical actions part of the UAF is tailored to support expert users through attention to perceptual and physical affordances that make for the most efficient physical manipulation of user interface objects to accomplish tasks (e.g., Fitts's law, physical fatigue, number of keystrokes, awkwardness of moving back and forth between mouse and keyboard). In addition, according to John and Mashyna (1997), cognitive walkthrough appears to detect only usability problems of commission, not those of omission. This argument led us to develop the "right" questions to help the inspector scrutinize how well an interface supports users with respect to both kinds of problems.

The cognitive walkthrough method is similar to a portion of the highest level of the UAF in that both are based on similar models of user interaction. The cognitive walkthrough, however, mainly emphasizes planning and translation; it gives little attention to physical actions or assessment. Also, significantly, the cognitive walkthrough method does not embody an underlying knowledge base (as found in the UAF) that spells out subquestions about the myriad details of how users' actions are, or should be, supported within each of those broad top levels. In applying the cognitive walkthrough, the inspector is asked whether the user is likely to do the right thing in making an action, such as clicking on the correct button. The reasoning the evaluator must perform to answer the question is structured and draws on a good deal of cognitive complexity theory (Kieras & Polson, 1985; Polson & Lewis, 1990)--one reason that the cognitive walkthrough method requires significant training to use.

The UPI tool, through the UAF database content, leads the evaluator through these details. Moreover, whereas cognitive walkthrough focuses on predicted user performance, the UPI additionally guides the evaluator to trace connections to the causes of potential failures in user performance, expressed as flaws in the interaction design. For example, if the inspector believes a user will not be able to determine what action to take to carry out an intention, the inspector is led, through a hierarchy of increasingly specific questions, to narrow down the problem and identify increasingly specific causes. If the inspector believes that some users will not understand a prompt or a button label, for example, the UPI leads the inspector to seek increasingly specific causes of the problem by asking whether the cause of poor understanding is unclear wording, incomplete wording, incorrect wording, and/or wording that is not user centered.

Each of the possibilities just mentioned can have a similar end effect on the user (e.g., confusion, inability to complete task), but each is a different cause in the interaction design, calling for a different solution in redesign. Also, usability problems often have multiple causes in the interaction design. In such cases, identifying and fixing only one cause does not entirely solve the problem. The UPI will lead the inspector to each possible cause.

The UPI also offers more support than does cognitive walkthrough for complete problem reporting. Because of the hierarchical structure, and because each node in the UAF represents a usability attribute, the path from the root to a specific node can be thought of as an "encoded" representation of a problem type and its cause, containing a series of increasingly detailed descriptors. This encoding contains all the information necessary for building a complete and specific usability problem report couched in the language of problem causes as flaws in the interaction design, which is exactly what the practitioner or usability engineer needs in order to consider redesign solutions. The cognitive walkthrough does not directly provide any of this additional analysis.
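
As a rough sketch of this idea, the following hypothetical UAF-like node structure shows how the root-to-node path can serve as an encoded, increasingly specific problem descriptor; the node labels loosely mirror a fragment of Table 1, and the implementation is illustrative, not the UPI's.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class UAFNode:
        """Hypothetical node in a UAF-like hierarchy; each node names a usability attribute."""
        label: str
        parent: Optional["UAFNode"] = None
        children: List["UAFNode"] = field(default_factory=list)

        def add(self, label):
            child = UAFNode(label, parent=self)
            self.children.append(child)
            return child

        def path(self):
            """Root-to-node labels: an 'encoded' problem type of increasing specificity."""
            node, labels = self, []
            while node is not None:
                labels.append(node.label)
                node = node.parent
            return list(reversed(labels))

    # Illustrative fragment loosely mirroring Table 1 (labels abbreviated).
    root = UAFNode("Interaction cycle")
    translation = root.add("Translation")
    content = translation.add("Content, meaning of cognitive affordance")
    clarity = content.add("Clarity, precision, predictability of meaning")

    # The path doubles as the skeleton of a structured problem report.
    print(" > ".join(clarity.path()))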

CONCLUSIONS

The purpose of performing a lab-based usability test and combining it with an expert-based inspection comparison study was to determine whether our theory-based framework and usability inspection tool could be used effectively to find important usability problems in an interface design. Overall results from this research showed the UPI to be an effective tool for usability inspection. Task-based methods such as the UPI and cognitive walkthrough provide the practitioner with a more effective problem list than do methods without an integrated task focus. We have also concluded that it is important, for both researchers and practitioners, that UEM comparison studies use clearly understood and well-defined metrics for measuring effectiveness, based on an independently generated set of problems as the criterion for what counts as "real."

In usability analysis, there is not yet a single standard of comparison for UEMs. Comparison studies reported in the literature have used a variety of measures, and not all have explained their comparison criteria clearly. The choice of measures can make a difference, and flawed approaches can produce misleading conclusions. For example, some studies have used measures based on raw problem counts and have used the union of problem sets as the comparison group for calculating measures such as thoroughness. Had we used a thoroughness measure based on the union of problem sets, our conclusions would have been different and, we believe, less valid.

In sum, the UPI offers a cost-effective alternative or supplement to lab-based formative usability evaluation at any stage of development (e.g., design sketches, prototypes, and fully implemented systems). The niche for the UPI lies between the heuristic method, with its low cost of learning and use, and the cognitive walkthrough method, with its completeness and task-driven structure built on extensive modeling and analysis. Further research is needed to test all the potential advantages of the UPI.
TABLE 1: Sample of the Hierarchically Organized
Content of the UAF below the Interaction Cycle

Planning
  User's model of system
  Goal decomposition
  Supporting planning for error avoidance
  User's knowledge of system state, modalities, and especially active
    modes
  User and work context
  User's ability to keep track of how much is done
Translation
  Existence
    Existence of a way
    Existence of a cognitive affordance
  Presentation of cognitive affordance
    Perceptual issues (of cognitive affordance)
    Layout, grouping by function
    Consistency and compliance of cognitive affordance presentation
    Preferences and efficiency for presentation of cognitive
      affordances
    Distracting presentation technique
  Content, meaning of cognitive affordance
    Clarity, precision, predictability of meaning (of cognitive
      affordance)
    Completeness and sufficiency of meaning (of cognitive affordance)
    Distinguishability (of cognitive affordances)
    Relevance of content (of cognitive affordance)
    Consistency and compliance of cognitive affordance meaning
    Layout and grouping (of cognitive affordances)
  Task structure and interaction control
    Real loss of user control attributable to arbitrary system action
    Error avoidance/recovery (in task structure)
    Supporting human memory limitations
    Consistency and compliance of task structure
    Directness of interaction
Physical actions
  Perception of manipulable and manipulated objects
    Legibility, readability
    Noticeability
    Timing
  Manipulating objects
    Physical control
    Physical layout
    Physical complexity of interaction
Assessment
  Issues about feedback (about interaction for task)
    Existence of feedback
    Presentation of feedback
    Content, meaning of feedback
  Issues about information displays (results for task)
    Existence of information displays, results
    Presentation of information displays, results
    Content, meaning of information displays, results

TABLE 2: Usability Problems Identified During
Lab-Based Usability Testing

Problem   Task                               Interface
   #       #     Description                 Location            Freq.

   1       2     Expected that any actions   Menubar              19
                 dealing with the entire
                 address file would be
                 under the File menu

   2       4     Confused by search model    Sort dialog box      19
                 because of the labels
                 Last Word and First Word

   3       6     Import does not provide     Import dialog box    19
                 prompt for where to put
                 group members

   4       5     Group Search option is      Menubar              17
                 not an obvious choice for
                 adding members to a group

   5       5     The New command under the   Menubar              12
                 InTouch menu is
                 misleading

   6       1     No labels provided to       Main screen          11
                 help differentiate
                 between top and bottom
                 entry fields

   7       4     Unnecessary prompt to       Sort dialog box      11
                 save the file when
                 sorting

   8       3     Find dialog uses            Find dialog box      11
                 "Address" and "Notes"
                 fields, but these fields
                 are not identified on the
                 interface

   9       1     No Add or Do It button      Main screen          10
                 for adding records

  10       6     No global Undo provided     Main screen           9

  11       6     The purpose of the lower    Main screen           7
                 left list box is not
                 apparent

  12       3     System provides poor        Main screen           6
                 feedback for indicating
                 that the entry has been
                 saved

  13       6     Poor feedback showing       Main screen           6
                 current Group selected

  14       5     Add button in Group         Edit Groups           6
                 Editing dialog box is not
                 collocated with entry
                 field

  15       1     Insert button does not      Main screen           6
                 provide enough
                 information to tell user
                 what it does

  16       5     Find dialog box has to be   Find dialog box       6
                 dismissed after
                 completing Group Search

  17       5     Edit Groups options does    Edit Groups           6
                 not allow user to
                 manipulate group
                 membership

  18       3     User has trouble locating   Menubar               5
                 Find command in the
                 menubar; not located with
                 Edit

  19       6     Expected that any actions   Menubar               4
                 dealing with importing
                 files would be under the
                 File menu

  20       4     Results from Sort           Main screen           3
                 operation are hard to
                 interpret because of the
                 free-form nature of the
                 address line

  21       4     Sort dialog box does not    Sort dialog box       3
                 go away after the sort is
                 complete

  22       6     User cannot select          Main screen           3
                 multiple items to move to
                 another group

  23       6     Unnecessary confirmation    Main screen           2
                 message for deleting
                 individual records

  24       4     No Cancel button provided   Sort dialog box       2
                 for Sort

  25       5     Edit Groups is not an       Menubar               2
                 obvious choice for adding
                 or making new groups

  26       5     New Group button at         Group Search          2
                 bottom of Group Search
                 dialog box is difficult
                 to notice

  27       5     In Edit Groups, Done is     Edit Groups           1
                 not an obvious choice
                 after entering new group

  28       1     Clicking on Insert          Main screen           1
                 highlights a blank line
                 outside of the user's
                 focus

  29       6     Shortcuts not provided      Main screen           1
                 for rapid deletion of
                 records

  30       4     The difference between      Sort dialog box       1
                 the Sort and Done buttons
                 is not clear; inconsistent
                 with OK and Cancel
                 conventions

  31       6     Unnecessary extra step      Menubar               1
                 required to make new
                 group before importing

  32       4     Default value for Sort      Sort dialog box       1
                 should be Last Word, and
                 it should be at top of
                 list

  33       6     Import does not provide     Import dialog box     1
                 option to look at file
                 before importing

  34       5     The difference between      Group Search          1
                 the Search and Done
                 buttons is not clear;
                 inconsistent with OK and
                 Cancel conventions

  35       2     Not sure if the Save        Main screen           1
                 command saves the whole
                 address book or just the
                 record

  36       5     Feedback after a group      Group Search          1
                 search is hard to notice

  37       5     Boxes on left side of       Main screen           1
                 main screen are not
                 labeled

  38       1     Adding a new record         Menubar               1
                 cannot be accomplished
                 through menus as expected

  39       4     System does not provide     Sort dialog box       1
                 example of Ascending and
                 Descending order options

TABLE 3: Mean Values of Thoroughness, Validity, and Effectiveness of
Expert-based Inspection Methods Using User Test Data as the Standard
Set of Real Usability Problems

                          Inspection Method

                      UPI          HE            CW

Measure          M      SD     M      SD     M      SD

Thoroughness    .233   .025   .179   .032   .202   .068
Validity        .785   .111   .482   .132   .699   .119
Effectiveness   .182   .028   .089   .036   .146   .065

TABLE 4: Correlation Matrix for Thoroughness, Validity,
and Effectiveness Measures

                Thoroughness   Validity   Effectiveness

Thoroughness       1.000         .570 *      .876 *
Validity            .570 *      1.000        .881 *
Effectiveness       .876 *       .881 *     1.000

* p < .05.

TABLE 5: Bonferroni t Test Summary for Thoroughness,
Validity, and Effectiveness

Dependent Variable   Comparison   Mean Difference    SE

Thoroughness           UPI/HE         .054 *         .02
                       UPI/CW         .031           .02
                       HE/CW         -.023           .02

Validity               UPI/HE         .303 **       .054
                       UPI/CW         .086          .054
                       HE/CW         -.217 **       .054

Effectiveness          UPI/HE         .094 **       .019
                       UPI/CW         .037          .019
                       HE/CW         -.057 *        .019

* p < .05, ** p < .01.


ACKNOWLEDGMENTS

This research was based, in part, on a doctoral dissertation completed by the first author while he was a graduate student at Virginia Tech under sponsorship from the Air Force Institute of Technology (AFIT). AFIT provided travel funding to collect comparison data at various expert sites for the comparison study. We wish to thank the editor and reviewers for the great amount of time and effort they spent on the manuscript. Their reviews were certainly helpful in revising our paper, and we wish we could thank them individually by name.

REFERENCES

Andre, T. S., Williges, R. C., & Hartson, H. R. (1999). The effectiveness of usability evaluation methods: Determining the appropriate criteria. In Proceedings of the Human Factors and Ergonomics Society 43rd Annual Meeting (pp. 1090-1094). Santa Monica, CA: Human Factors and Ergonomics Society.

Cuomo, D. L. (1994). A method for assessing the usability of graphical, direct-manipulation style interfaces. International Journal of Human-Computer Interaction, 6, 275-297.

Cuomo, D. L., & Bowen, C. D. (1994). Understanding usability issues addressed by three user-system interface evaluation techniques. Interacting with Computers, 6, 86-108.

Desurvire, H. W. (1994). Faster, cheaper! Are usability inspection methods as effective as empirical testing? In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 173-202). New York: Wiley.

Doubleday, A., Ryan, M., Springett, M., & Sutcliffe, A. (1997). A comparison of usability techniques for evaluating design. In Designing Interactive Systems (DIS '97) Conference Proceedings (pp. 101-110). New York: Association for Computing Machinery.

Dutt, A., Johnson, H., & Johnson, P. (1994). Evaluating evaluation methods. In G. Cockton, S. W. Draper, & G. R. S. Weir (Eds.), People and computers IX (pp. 109-121). Cambridge, UK: Cambridge University Press.

Garzotto, F., Matera, M., & Paolini, P. (1998). Model-based heuristic evaluation of hypermedia usability. In Proceedings of the Working Conference on Advanced Visual Interfaces--AVI '98 (pp. 135-145). New York: Association for Computing Machinery.

Grammenos, D., Akoumianakis, D., & Stephanidis, C. (2000). Integrated support for working with guidelines: The Sherlock guideline management system. Interacting with Computers, 12, 281-311.

Gray, W. D., & Salzman, M. C. (1998). Damaged merchandise? A review of experiments that compare usability evaluation methods. Human-Computer Interaction, 13, 203-261.

Grudin, J. (1989). The case against user interface consistency. Communications of the ACM, 32, 1164-1173.

Hartson, H. R., Andre, T. S., & Williges, R. C. (2001). Criteria for evaluating usability evaluation methods. International Journal of Human-Computer Interaction, 13, 373-410.

Hartson, H. R., Andre, T. S., Williges, R. C., & Van Rens, L. (1999). The user action framework: A theory-based foundation for inspection and classification of usability problems. In H. Bullinger & J. Ziegler (Eds.), Human-computer interaction: Ergonomics and user interfaces (Vol. 1, pp. 1058-1062). Mahwah, NJ: Erlbaum.

Henninger, S. (2000). A methodology and tools for applying context-specific usability guidelines to interface design. Interacting with Computers, 12, 225-243.

Jeffries, R., Miller, J. R., Wharton, C., & Uyeda, K. M. (1991). User interface evaluation in the real world: A comparison of four techniques. In CHI '91 Conference Proceedings (pp. 119-124). New York: Association for Computing Machinery.

John, B. E., & Marks, S. J. (1997). Tracking the effectiveness of usability evaluation methods. Behaviour and Information Technology, 16, 188-202.

John, B. E., & Mashyna, M. M. (1997). Evaluating a multimedia authoring tool. Journal of the American Society for Information Science, 48, 1004-1022.

Kahn, M. J., & Prail, A. (1994). Formal usability inspections. In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 141-171). New York: Wiley.

Karat, C. (1994). A comparison of user interface evaluation methods. In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 203-233). New York: Wiley.

Kieras, D. E., & Polson, P. G. (1985). An approach to the formal analysis of user complexity. International Journal of Man-Machine Studies, 22, 365-394.

Lavery, D., Cockton, G., & Atkinson, M. P. (1997). Comparison of evaluation methods using structured usability problem reports. Behaviour and Information Technology, 16, 246-266.

Lewis, C., Polson, P., Wharton, C., & Rieman, J. (1990). Testing a walkthrough methodology for theory-based design of walk-up-and-use interfaces. In CHI '90 Conference Proceedings (pp. 235-242). New York: Association for Computing Machinery.

Lewis, J. R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36, 368-378.

Lim, K. H., Benbasat, I., & Todd, P. (1996). An experimental investigation of the interactive effects of interface style, instructions, and task familiarity on user performance. ACM Transactions on Computer-Human Interaction, 3, 1-37.

Lund, A. M. (1998). The need for a standardized set of usability metrics. In Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting (pp. 688-691). Santa Monica, CA: Human Factors and Ergonomics Society.

McCreary, F. A. (2001). InTouch usability evaluation (Tech. Report TR-01-11). Blacksburg, VA: Virginia Polytechnic Institute and State University, Department of Computer Science.

Nielsen, J. (1992). Finding usability problems through heuristic evaluation. In CHI '92 Conference Proceedings (pp. 373-380). New York: Association for Computing Machinery.

Nielsen, J. (1993). Usability engineering. Boston: Academic.

Nielsen, J. (1994). Heuristic evaluation. In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 25-62). New York: Wiley.

Nielsen, J., & Mack, R. L. (Eds.). (1994). Usability inspection methods. New York: Wiley.

Nielsen, J., & Molich, R. (1990). Heuristic evaluation of user interfaces. In CHI '90 Conference Proceedings (pp. 249-256). New York: Association for Computing Machinery.

Norman, D. A. (1986). Cognitive engineering. In D. A. Norman & S. W. Draper (Eds.), User centered system design: New perspectives on human-computer interaction (pp. 31-61). Hillsdale, NJ: Erlbaum.

Polson, P., & Lewis, C. (1990). Theory-based design for easily learned interfaces. Human-Computer Interaction, 5, 191-220.

Polson, P., Lewis, C., Rieman, J., & Wharton, C. (1992). Cognitive walkthroughs: A method for theory-based evaluation of user interfaces. International Journal of Man-Machine Studies, 36, 741-773.

Rizzo, A., Marchigiani, E., & Andreadis, A. (1997). The AVANTI project: Prototyping and evaluation with a cognitive walkthrough based on the Norman's model of action. In Designing Interactive Systems (DIS '97) Conference Proceedings (pp. 305-309). New York: Association for Computing Machinery.

Rosenbaum, S., Rohn, J. A., & Humburg, J. (2000). A toolkit for strategic usability: Results from workshops, panels, surveys. In CHI 2000 Conference Proceedings (pp. 337-344). New York: Association for Computing Machinery.

Rowley, D. E., & Rhoades, D. G. (1992). The cognitive jogthrough: A fast-paced user interface evaluation procedure. In CHI '92 Conference Proceedings (pp. 389-395). New York: Association for Computing Machinery.

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.

Sears, A. (1997). Heuristic walkthroughs: Finding the problems without the noise. International Journal of Human-Computer Interaction, 9, 213-234.

Sutcliffe, A., Ryan, M., Springett, M., & Doubleday, A. (1996). Model mismatch analysis: Towards a deeper evaluation of users' usability problems (School of Informatics Report). London: City University.

Virzi, R. A. (1990). Streamlining the design process: Running fewer subjects. In Proceedings of the Human Factors and Ergonomics Society 34th Annual Meeting (pp. 291-294). Santa Monica, CA: Human Factors and Ergonomics Society.

Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-468.

Virzi, R. A. (1997). Usability inspection methods. In M. G. Helander, T. K. Landauer, & P. V. Prabhu (Eds.), Handbook of human-computer interaction (2nd ed., pp. 705-715). Amsterdam: Elsevier Science.

Virzi, R. A., Sorce, J., & Herbert, L. B. (1993). A comparison of three usability evaluation methods: Heuristic, think-aloud, and performance testing. In Proceedings of the Human Factors and Ergonomics Society 36th Annual Meeting (pp. 309-313). Santa Monica, CA: Human Factors and Ergonomics Society.

Wharton, C. (1992). Cognitive walkthroughs: Instructions, forms, and examples (Tech. Report CU-ICS-92-17). Boulder: University of Colorado.

Wharton, C., Bradford, J., Jeffries, R., & Franzke, M. (1992). Applying cognitive walkthroughs to more complex user interfaces: Experiences, issues, and recommendations. In CHI '92 Conference Proceedings (pp. 381-388). New York: Association for Computing Machinery.

Wharton, C., Rieman, J., Lewis, C., & Polson, P. (1993). The cognitive walkthrough method: A practitioner's guide (Tech. Report CU-ICS-93-07). Boulder: University of Colorado.

Wharton, C., Rieman, J., Lewis, C., & Polson, P. (1994). The cognitive walkthrough method: A practitioner's guide. In J. Nielsen & R. L. Mack (Eds.), Usability inspection methods (pp. 105-140). New York: Wiley.

Wright, P., & Monk, A. (1991). A cost-effective evaluation method for designers. International Journal of Man-Machine Studies, 35, 891-912.

Terence S. Andre is the deputy department head for Cadet Operations in the Department of Behavioral Sciences and Leadership at the U.S. Air Force Academy. He received his Ph.D. in 2000 in industrial and systems engineering from Virginia Polytechnic Institute and State University.

H. Rex Hartson is a professor of computer science at Virginia Polytechnic Institute and State University. He received his Ph.D. in 1975 in computer and information science from the Ohio State University.

Robert C. Williges is the Ralph H. Bogle Professor of Industrial and Systems Engineering, professor of psychology, professor of computer science, and director of the Human-Computer Interaction Laboratory at Virginia Polytechnic Institute and State University. He received his Ph.D. in 1968 in engineering psychology from the Ohio State University.

Address correspondence to Terence S. Andre, HQ USAFA/DFBL, 2354 Fairchild Dr., Suite 6L 101, USAF Academy, CO 80840-6228; terence.andre@usafa.af.mil.

Date received: January 18, 2001

Date accepted: January 3, 2003