Speech-based interaction in multitask conditions: impact of prompt modality.
Even simple driving tasks can become complex for drivers who are also dealing with in-vehicle interactive systems, such as navigation systems that provide real-time route guidance (e.g., Schraagen, 1993). Interacting with such a system, which may take place while driving, requires the user to enter a destination; the driver can thus be concurrently engaged in both the driving task and the task of entering parameters into the navigation system.
The driving context can be characterized generically as a hands-busy, eyes-busy situation. A designer of a navigation system for such situations is often faced with an interaction design challenge: How should parameters be entered? If the answer is via a keyboard or other direct manipulation device, then such a solution potentially imposes increased workload on the user because of the possible conflict between the demands of the situation (driving and entering parameters) and the available human resources (hands and eyes). This can become detrimental to both driving performance and safety.
Speech-based user interfaces are considered appropriate as a complementary interaction channel when there are real hands-busy, eyes-busy multitask situations (e.g., Lee, Caven, Haake, & Brown, 2001; Marshall, 1992; G. L. Martin, 1989; T. B. Martin, 1976; McCallum, Campbell, Richman, & Brown, 2004; Simpson, McCauley, Rolan, Ruth, & Williges, 1985; White, 1997). However, the benefits of using speech-based interaction in multitask situations are not obvious, and research findings are not conclusive. Some studies have shown that assigning speech-based interaction to one task, in dual-task situations, benefited the overall dual-task performance (e.g., Graham & Carter, 2001; Hapeshi & Jones, 1989; Murata, 1998; Wickens, Sandry, & Vidulich, 1983). However, other studies showed that using speech-based interaction in multitask contexts did not affect the overall performance (e.g., Damos & Lyall, 1986) or even degraded performance (e.g., Baber, Mellor, Graham, Noyes, & Tunley, 1996; Lee et al., 2001; Linde & Shively, 1988).
Various factors play a role in the success or failure of speech-based interaction in multitask situations. Among those is the automated speech recognition (ASR) technology itself and its recognition accuracy, the ambient noise, the domain expertise of the user, and the user's familiarity with the system. In addition, several basic issues are involved in the design of dialogue in speech-based interaction (Bradford, 1995). Among those are the dialogue styles or strategies, the design of system prompts and feedback, and the approaches to error handling (Rudnicky, 1995; Schmandt, 1994; Yankelovich, 1996; Yankelovich, Levow, & Marx, 1995).
One of the aspects that make the design of speech-based interaction such a challenge is that speech is short term and sequential in nature and consequently may take a heavy toll on human working memory (Bradford, 1995). A strategy used often in order to reduce memory load is to guide the user in a question-and-answer dialogue, prompting the user with the appropriate commands (Baber, 1993; Hansen, Novick, & Sutton, 1996; Waterworth, 1982). A question associated with this strategy is which modality should deliver the prompts to the user.
Resource theory (e.g., Moray, 1967; Kahneman, 1973) is a theoretical framework within which one can address this question empirically. Simply put, the performance of tasks requires allocation of resources from a limited pool. The increased allocation of resources (or "effort," in Kahneman's terms) to one task will improve that task's performance, but at the same time it will decrease the resources available to other concurrent tasks, and their performance will consequently be degraded. In a hands-busy, eyes-busy situation, resource theory would predict that the use of visual prompts in speech dialogue will interfere with performance in situations in which the visual channel is engaged in a primary task such as driving. However, using visual prompts can be more beneficial as part of a speech-based dialogue because it will probably take less time to read the prompt than to listen to the sequential spoken prompt, thus improving overall dialogue efficiency.
The objective of the study reported here was to empirically explore this trade-off in a specific multitask situation--interacting with an in-vehicle navigation system with speech as an input channel--while comparing visual and auditory modalities for delivering prompts. The general experimental approach employed was based on a paradigm incorporating a primary tracking task and a secondary speech-based data input task (Baber et al., 1996; Gawron, 2000). The rationale for employing this approach was to simulate a real-life situation in which the human operator is engaged in an eyes-busy, hands-busy task (driving) and at the same time is interacting with another system (data entry into an in-vehicle navigation system). A similar approach was used by Graham and Carter (2001). This paradigm enabled performance measurement of both tasks and the identification of the trade-off between them.
The experiment sample consisted of 60 participants (36 men and 24 women). The ages of the men ranged from 17 to 46 years, and the age range of the women was from 21 to 39 years. All participants were students from the Engineering Faculty of Tel Aviv University, Israel. Participants were assigned randomly to one of the three conditions of this experiment. All participants were proficient with MS Windows and mouse operations. All participants, according to their self-reports, had normal speech and no known manual dexterity or eye-hand coordination problems. All participants were native Hebrew speakers.
The experimental conditions were based on combinations of the following two tasks.
Primary task: Tracking. The basic tracking task required the participant to keep a cross-shaped cursor on a fixed-speed moving circular target. The cursor and the target were displayed in a window fixed on the center of the screen. Tracking was performed with a joystick. A simple numerical counter was displayed at the top-right corner of the window (see Figure 1) in order to present a more difficult primary task. The participant was required to press the trigger of the joystick when the display in the counter exceeded a prescribed value. This additional requirement increased the demands of the primary task and enabled the measurement of the impact of different workloads on the use of speech-based user interfaces.
[FIGURE 1 OMITTED]
Secondary task: Data entry. This task required the participant to enter two destinations per trial into a hypothetical navigation system. Each destination was entered in a different format: either the name or the zip code of the destination. Dialogue prompts were presented as either a spoken prompt or a visual prompt displayed on the screen, next to the tracking window. The guided speech dialogue was designed as in the following partial example:
System: Enter name or zip?
Participant: Name.
System: Enter destination name.
Participant: Tel Aviv.
When a participant entered the first destination by name (according to trial instructions), then the second destination was entered by zip code, and vice versa. If a participant encountered a recognition error, he or she was required to say "error," and the system repeated the last prompt. After three consecutive errors, the dialogue was terminated. In addition, the participant could not interrupt while the system played back a voice prompt.
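The dialogue control logic described above can be sketched as follows. This is an illustrative reconstruction only (function and variable names are invented; the original system was implemented in C++ against the L & H recognition engine): the system issues each prompt in turn, a recognition error makes it repeat the last prompt, and three consecutive errors terminate the dialogue. The sketch also reflects that the participant could not barge in: `recognize` returns only after the prompt has been fully delivered.

```python
# Hypothetical sketch of the guided-dialogue loop: repeat the last
# prompt on a recognition error, terminate after three consecutive
# errors. All names are illustrative.

MAX_CONSECUTIVE_ERRORS = 3

def run_dialogue(prompts, recognize):
    """prompts: ordered list of prompt strings.
    recognize: callable that delivers one prompt (spoken or visual)
    and returns the recognized utterance, or "error" on a
    recognition failure reported by the user."""
    errors = 0
    answers = []
    i = 0
    while i < len(prompts):
        utterance = recognize(prompts[i])
        if utterance == "error":
            errors += 1
            if errors >= MAX_CONSECUTIVE_ERRORS:
                return None  # dialogue terminated after 3 consecutive errors
            continue         # repeat the last prompt
        errors = 0           # the error count is *consecutive*, so reset
        answers.append(utterance)
        i += 1
    return answers
```

A single successful recognition resets the error counter, so only three errors in a row end the dialogue, matching the procedure described above.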
The experimental design was a fully factorial design based on two factors: tracking and prompt modality. Tracking was a between-participant factor and consisted of three conditions: a control condition, with data entry only and no tracking; a basic tracking task with data entry; and a difficult tracking task with data entry. Prompt modality was a within-participant factor and consisted of two conditions: visual prompts and spoken prompts. All participants performed each of these two conditions in two separate trials in a counterbalanced order. It can be assumed that participants did not become familiar with the dialogue sequence, regardless of the condition, because they were exposed to it only twice. Consequently, it can be assumed that there were similar short-term memory demands in all conditions.
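The mixed factorial assignment can be sketched as follows. This is an illustration under stated assumptions: the report specifies random between-participant assignment and a counterbalanced within-participant order, but not the exact counterbalancing mechanism, so alternating the modality order by participant index is assumed here.

```python
import random

# Illustrative sketch of the 3 (tracking, between) x 2 (prompt
# modality, within) design. The alternation scheme for
# counterbalancing is an assumption, not taken from the report.

TRACKING = ["no_tracking", "basic_tracking", "difficult_tracking"]
ORDERS = [("visual", "spoken"), ("spoken", "visual")]

def assign(participant_index, rng=random):
    tracking = rng.choice(TRACKING)        # between-participant, random
    order = ORDERS[participant_index % 2]  # within-participant, counterbalanced
    return {"tracking": tracking, "order": order}
```

Each participant thus receives one tracking condition for the whole session and performs both prompt-modality trials in one of the two orders.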
The speech recognition apparatus used in the two experiments was based on a Lernout & Hauspie (L & H) speaker-dependent, continuous-speech recognition development suite. The system included the following components: ASR 1500 Version 5.1.2, a speech recognition engine; ASRAPI Version 5.1.2, a command library for applications development; Evaluator Version 4.11.007, a program for running the application with the speech recognition; and Lexicon Toolkit Version 4.02.004, a program for creating and maintaining the speaker-dependent data.
A special application was developed with MS C++ in order to run the speech dialogue (interfacing with the recognition engine), run the tracking program, display messages to the participant, play back spoken prompts, and record and store the participants' performance data.
The experiments were run on a personal computer with MS Windows. The configuration included a 15-inch (58-cm) color monitor, a standard keyboard, and a standard zero-order (displacement) control Logitech joystick. All speech prompts were prerecorded with a male voice and saved in a *.wav format. The experiment program used those files when spoken prompts were required. Spoken prompts were played back via a pair of desktop speakers connected to the computer.
The basic procedure of the experiment was as follows.
General instructions. The participant received a general description of the experiment and its objective.
Experimental condition assignment. The experimental program determined randomly the assignment of the participant to one of the between-participant experimental conditions.
Condition-specific instructions. The participant received the condition-specific instructions, which included explanations about the tracking task and the speech-based interaction (depending on the experimental condition assignment).
Tracking training. Each participant performing a tracking task received 20 s of tracking training.
Speech training. Speech training was based on the recommended procedure with the L & H speech recognition system. The participant repeated each word at least three times until the speech recognition program indicated that the word was trained. Each word was taken from a list of words related to the experiment, which included the digits zero through nine and six place names. In addition, the following commands were trained: error, finish, add, and reenter (as was described previously). It should be noted that the vocabulary for recognition was in Hebrew.
Experiment performance. Once the participant finished speech training, the actual experiment began. Each participant performed two trials, one with spoken prompts and one with visual prompts, entering two destinations in each trial. Participants received trial instructions, the required destination name, destination zip code, and the order of entering these data. A given trial began by having the participant click on a button on the screen. A trial ended when the participant said "finish."
Performance measurements were recorded in real time for both the primary and the secondary tasks.
Data entry performance. The performance of the secondary task was measured in terms of task duration. Task duration was measured from the beginning of a trial until both destinations were entered and the command "finish" was recognized.
Tracking performance. For measuring performance on the primary tracking task, the root mean square (RMS) of the distance between the target and the tracking cursor was computed across the duration of each trial. The exact calculation is presented in Equation 1:

(1) \mathrm{RMS} = \sqrt{\frac{1}{T}\int_{0}^{T} e^{2}(t)\,dt},

in which T = total trial time and e(t) = tracking error (the distance between the cursor center and the target center) at time point t.
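With tracking error sampled at a fixed interval, Equation 1 reduces to the square root of the mean squared error. A minimal sketch of the discrete computation (function name and sampling interval are illustrative):

```python
import math

def rms_tracking_error(errors, dt):
    """Discrete form of Equation 1: `errors` holds sampled
    cursor-to-target distances e(t), one sample every `dt` seconds,
    so the total trial time is T = len(errors) * dt."""
    T = len(errors) * dt
    integral = sum(e * e for e in errors) * dt  # approximates the integral of e^2(t)
    return math.sqrt(integral / T)
```

With uniform sampling the dt factors cancel, so the result is simply the square root of the mean squared tracking error over the trial.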
In addition, the time between the appearance of a numeric value exceeding the prescribed value and the time the participant pressed on the trigger of the joystick was measured. This response time was taken as an additional index for performance of the more difficult primary task.
Data Entry Performance
The mean duration of the navigation data entry task for each of the six experimental conditions was computed. The means are presented in Figure 2.
[FIGURE 2 OMITTED]
A two-way analysis of variance (ANOVA; 2 prompt modalities x 3 tracking conditions) with repeated measures on one factor was performed on the data. In general, data entry was longer, F(1, 47) = 156.93, p < .001, with speech prompts than with visual prompts. In addition, a significant two-way interaction was found, F(2, 47) = 21.29, p < .001, indicating that these differences were dependent on the tracking task. With speech prompts, task duration remained constant and was not affected by the tracking task. However, with the visual prompts, task duration was affected by the tracking task. Data entry duration was longer with difficult tracking than with basic tracking and no tracking. In other words, with the difficult tracking task, data entry duration was similar regardless of prompt modality, spoken or visual.
Mean RMS of the tracking performance was computed for the two tracking conditions (basic and difficult) for each prompt modality (spoken or visual). The means are presented in Figure 3.
[FIGURE 3 OMITTED]
A two-way ANOVA (2 prompt modalities x 2 tracking conditions) with repeated measures on one factor was performed on the data. In general, tracking performance was significantly better (a lower mean RMS) with the spoken prompts, F(1, 38) = 7.08, p = .01. No significant interaction was found between the two factors. In other words, within a given prompt modality, tracking performance was similar for both basic and difficult tracking.
Tracking performance as a function of prompt modality was further examined in the difficult tracking condition. Mean response time to the appearance, in the tracking window, of a value exceeding the prescribed value was computed for each prompt modality. The mean response time for a dialogue with spoken prompts was 0.8 s (SD = 0.27 s), and the mean response time for a dialogue with visual prompts was 1.82 s (SD = 0.3 s). These two means were significantly different, F(1, 19) = 222.42, p < .001. In other words, response times were significantly shorter when spoken prompts were used.
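For a two-level within-participant contrast such as this one, the repeated-measures F(1, n - 1) is the square of the paired t statistic computed on the per-participant differences. A standard-library sketch of that equivalence (function name and data are illustrative, not the study's data):

```python
import math
from statistics import mean, stdev

def paired_f(cond_a, cond_b):
    """Paired comparison of two within-participant conditions.
    Computes the paired t on per-participant differences; for a
    two-level repeated-measures factor, F(1, n - 1) = t**2."""
    d = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))  # paired t statistic
    return t, t * t                          # (t, equivalent F)
```

Applied to each participant's mean response time under the two prompt modalities, this yields the same F value that the repeated-measures ANOVA reports for the modality effect.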
Summary of the Results
Overall, the duration of speech-based data entry with spoken prompts was longer than with visual prompts. Moreover, data entry with spoken prompts was consistently longer and similar regardless of the presence and difficulty of the primary tracking task. It seems that this measure was primarily affected by the presence and duration of the spoken prompts themselves and does not necessarily reflect the impact of spoken prompts in the multitask situation that was examined here. In contrast, entry durations with visual prompts, although consistently shorter than with spoken prompts, were affected by tracking task difficulty. Durations in the control (no tracking) and basic tracking conditions were significantly shorter than with spoken prompts. However, data entry duration with visual prompts in the difficult tracking task condition was as long as data entry with spoken prompts. This reflects a significant decrease in the benefit of visual prompts as a function of an increase in the visual workload characteristics of this multitask situation.
Tracking performance was affected by prompt modality in the speech-based data entry condition. Performance was significantly better when data entry was with spoken prompts, as compared with when the prompts were visual. Similar findings were reported by Graham and Carter (2001) with respect to feedback modality in voice dialing during performance of a tracking task. Tracking performance in this study was not affected by the tracking difficulty itself. Furthermore, the additional task in the difficult tracking condition was also affected by prompt modality: There were shorter response times with spoken prompts.
Taken together, the findings here confirmed the predictions based on resource theory. The use of visual prompts as part of a speech-based dialogue interfered with the performance of visually based tracking. Wickens (1980) developed the resource notion further in multiple resource theory, saying that the competition for resources is not straightforward and is dependent on several factors, such as reception and response modalities and cognitive processing codes and stages. The account for the dialogue performance with visual prompts is as follows: There was a spatial-visual primary task (tracking) concurrent with a verbal secondary task (data entry) with visual reception (prompts displayed on the screen) and verbal response (user's speech data entry). This in turn may have created competition in the visual reception channel between the primary tracking task and the visual aspect in the data entry secondary task.
Resource allocation theory can also account for the degraded performance of data entry with visual prompts and difficult tracking as compared with no tracking or basic tracking. In this case, there was an increased demand for resources in the spatial-visual channel because of the more difficult primary tracking task. This may have resulted in significantly fewer resources available for speech-based data entry with visual prompts, which was associated with longer data entry durations.
Practical and Research Implications
The practical question of this study was which modality to use for guiding the user with a speech-based user interface in a multitask, hands-busy, eyes-busy context. The findings here suggest that the use of spoken prompts, as compared with visual prompts, can better support overall performance of a visual primary task (e.g., driving). Consequently, the most important practical implication is to design a speech-based system with spoken prompts when the performance of a primary task such as driving is critical and must not be interrupted in any way. However, the findings also suggest that the use of spoken prompts can degrade the performance of the speech-based interaction itself (e.g., data entry). This can be problematic when performance of the secondary task is important as well. Thus more attention should be given to the design of effective and efficient spoken prompts.
These findings support the recommendation that some of the beneficial characteristics of visual prompts be applied to the design of speech prompts. The primary characteristic of the visual prompt is that it supports preattentive processes and thus enhances rapid pattern recognition. For example, one does not need to read each character and word in order to recognize and comprehend the complete prompt. To apply this principle in the design of speech prompts, one needs to make them brief and adaptive, interruptible, and contextually predictable in a well-structured dialogue. For example, it is sensible to assume that novice users may need more guidance at the beginning but that less guidance will be required as they become more experienced with the system. Consequently, the length of spoken prompts can become shorter as the dialogue progresses. In addition, as the user becomes more familiar with the dialogue, the user should be allowed to interrupt the spoken prompt by proceeding with the dialogue. Another direction is to design the interaction in a way that will progressively reduce the user's dependence on system guidance, sometimes referred to as a mixed-initiative dialogue strategy (e.g., Walker, Fromer, Di Fabbrizio, Mestel, & Hindle, 1998). Yet another is to design the structure of the dialogue and the content of the prompt as a predictable pattern so that the user does not need to listen to the entire prompt before recognizing what it is. For example, design a consistent structure and sequence for the various dialogue components; for instance, start with guidance to the user, follow with feedback, and then deal with error messages if needed.
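The "brief and adaptive" recommendation above can be illustrated with a small sketch. Everything here is hypothetical (prompt wordings, the experience threshold, and the function names are invented for illustration): the same dialogue step has a long novice prompt and a short expert prompt, and the system shortens its prompts once the user has completed enough dialogues successfully.

```python
# Hypothetical illustration of adaptive prompt length: long guidance
# for novices, terse prompts for experienced users. The texts and
# threshold are invented, not taken from the study.

PROMPTS = {
    "destination": {
        "novice": "Please say the name of your destination city now.",
        "expert": "Destination?",
    },
}

EXPERT_THRESHOLD = 3  # successful dialogues before prompts shorten

def prompt_for(step, completed_dialogues):
    """Select the prompt variant for a dialogue step based on how
    many dialogues the user has already completed successfully."""
    level = "expert" if completed_dialogues >= EXPERT_THRESHOLD else "novice"
    return PROMPTS[step][level]
```

The same table-driven scheme extends naturally to the other recommendations: an expert-level entry can also enable barge-in, or omit guidance entirely in a mixed-initiative dialogue.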
One cannot simply conclude that a speech-based user interface in multitask situations should always have spoken prompts as part of the dialogue. Other factors will influence performance in multitask situations, such as the auditory condition of the environment (e.g., noise), the accuracy of the speech recognition technology, and the extent to which the user is familiar with the system. Further research should explore the possible interactions among these various factors in order to further generalize the applicability of the present findings.
This study was performed while the author was an adjunct researcher in the Department of Industrial Engineering, Engineering Faculty, Tel Aviv University, Israel.
Baber, C. (1993). Designing interactive speech technology. In C. Baber & J. M. Noyes (Eds.), Interactive speech technology (pp. 1-17). London: Taylor & Francis.
Baber, C., Mellor, B., Graham, R., Noyes, J. M., & Tunley, C. (1996). Workload and the use of automatic speech recognition: The effects of time and resource demands. Speech Communication, 20, 37-53.
Bradford, J. H. (1995). The human factors of speech-based interfaces: A research agenda. SIGCHI Bulletin, 27(2), 61-67.
Damos, D. L., & Lyall, E. A. (1986). The effect of varying stimulus and response modes and asymmetric transfer on the dual-task performance of discrete tasks. Ergonomics, 29, 519-553.
Gawron, V. J. (2000). Human performance measures handbook. Mahwah, NJ: Erlbaum.
Graham, R., & Carter, C. (2001). Voice dialing can reduce the interference between concurrent tasks of driving and phoning. International Journal of Vehicle Design, 26(1), 30-47.
Hansen, B., Novick, G. D., & Sutton, S. (1996). Systematic design of spoken prompts. In Proceedings of the CHI '96 Conference on Human Factors in Computing Systems (pp. 157-164). New York: ACM Press.
Hapeshi, K., & Jones, D. M. (1989). Concurrent manual tracking and speaking: Implications for automatic speech recognition. In M. J. Smith & G. Salvendy (Eds.), Work with computers: Organizational, stress and health aspects (pp. 412-418). Amsterdam: North-Holland.
Kahneman, D. (1973). Attention and effort. Englewood Cliffs, NJ: Prentice Hall.
Lee, J. D., Caven, B., Haake, S., & Brown, T. L. (2001). Speech-based interaction with in-vehicle computers: The effect of speech-based E-mail on driver's attention to the roadway. Human Factors, 43, 631-640.
Linde, A., & Shively, R. (1988). Field study of communication and workload in police helicopters: Implications for cockpit design. In Proceedings of the Human Factors Society 32nd Annual Meeting (pp. 237-241). Santa Monica, CA: Human Factors and Ergonomics Society.
Marshall, J. P. (1992). A manufacturing application of voice recognition for assembly of aircraft wire harness. In Proceedings of Speech Tech/Voice Systems Worldwide (n.p.). New York: Media Dimensions.
Martin, G. L. (1989). The utility of speech input in user-computer interfaces. International Journal of Man-Machine Studies, 30, 355-375.
Martin, T. B. (1976). Practical applications of voice input to machines. Proceedings of the IEEE, 64(4), 487-501.
McCallum, M. C., Campbell, J. L., Richman, J. B., & Brown, J. L. (2004). Speech recognition and in-vehicle telematics devices: Potential reductions in driver distraction. International Journal of Speech Technology, 7, 25-33.
Moray, N. (1967). Where is attention limited? A survey and a model. Acta Psychologica, 27, 84-92.
Murata, A. (1998). Effectiveness of speech response under dual task situations. International Journal of Human-Computer Interaction, 10(3), 283-292.
Rudnicky, A. I. (1995). The design of spoken language interfaces. In A. Syrdal, R. Bennett, & S. Greenspan (Eds.), Applied speech technology (pp. 403-428). Boca Raton, FL: CRC Press.
Schmandt, C. (1994). Voice communication with computers: Conversational systems. New York: Van Nostrand Reinhold.
Schraagen, J. M. C. (1993). Information presentation in in-car navigation systems. In A. M. Parkes & S. Franzen (Eds.), Driving future vehicles (pp. 171-185). London: Taylor & Francis.
Simpson, C. A., McCauley, M. E., Rolan, E. F., Ruth, J. C., & Williges, B. H. (1985). System design for speech recognition and generation. Human Factors, 27, 115-141.
Walker, M. A., Fromer, J., Di Fabbrizio, G., Mestel, C., & Hindle, D. (1998). What can I say?: Evaluating a spoken language interface to Email. In Proceedings of the CHI '98 Conference on Human Factors in Computing Systems (pp. 582-589). New York: ACM Press.
Waterworth, J. A. (1982). Man-machine speech dialogue acts. Applied Ergonomics, 13, 203-207.
White, R. G. (1997). Lessons from flight research relevant to voice operated driver systems. In 4th World Congress on Intelligent Transport Systems (Paper No. 2001; pp. 21-24). Berlin: ICC.
Wickens, C. D. (1980). The structure of attentional resources. In R. Nickerson (Ed.), Attention and performance VIII (pp. 239-257). Mahwah, NJ: Erlbaum.
Wickens, C. D., Sandry, D. L., & Vidulich, M. (1983). Compatibility and resource competition between modalities of input, central processing and output. Human Factors, 25, 227-248.
Yankelovich, N. (1996). How do users know what to say? Interactions, 3(6), 32-43.
Yankelovich, N., Levow, G.-A., & Marx, M. (1995). Designing SpeechActs: Issues in speech user interfaces. In Proceedings of the CHI '95 Conference on Human Factors in Computing Systems (pp. 569-576). New York: ACM Press.
Avi Parush is a professor of psychology at Carleton University. He received his Ph.D. in experimental psychology in 1984 from McGill University, Montreal, Quebec, Canada.
Date received: March 31, 2003
Date accepted: November 22, 2004
Address correspondence to Avi Parush, Department of Psychology, Carleton University, B552 Loeb Building, 1125 Colonel By Dr., Ottawa, ON, Canada, K1S 5B6; firstname.lastname@example.org.
Date published: September 22, 2005