Linguistic Cues and Memory for Synthetic and Natural Speech.
The field of machine speech generation has witnessed phenomenal growth over the past 30 years. Today speech output systems are critical components in military and industrial warning systems, feedback devices in aerospace vehicles, education and training modules, aids for the handicapped, consumer products, and technologies designed to increase the functional independence of older adults. As a consequence it is important to investigate whether task performance differs when computer speech is substituted for natural speech.
This is a particularly critical question in the case of speech generated by text-to-speech (TTS) synthesizers. TTS systems produce words by combining stored representations of phonemes and other small units of speech according to a set of rules. Research has demonstrated that compared with natural voice, speech produced by TTS synthesizers places an increased burden on perceptual and cognitive resources during the comprehension process. Performance deficits have been discovered at many stages of processing, from phonemes to paragraphs, and have been detected on a variety of measures (see Duffy & Pisoni, 1992, for a review).
One of the most basic yardsticks for comparison between natural and synthetic speech is single-word intelligibility. A typical procedure (the Modified Rhyme Test) is to ask listeners to identify the word they heard from among a set of words differing by only one phonetic feature (House, Williams, Hecker, & Kryter, 1965). On this measure some TTS systems compare favorably with natural speech, with error rates only about 2% higher (e.g., Logan, Greene, & Pisoni, 1989). However, measures of listeners' processing speed consistently reveal substantial differences between natural and synthetic voice input. That is, people take longer to comprehend computer-generated speech, even when single-word intelligibility is essentially equivalent. For example, the latency threshold for recognition of single words is higher for synthetically produced stimuli than for words produced by a human speaker (Manous & Pisoni, 1984). Also, data from sentence verification tasks reflect slower reaction times to judge the truth of sentences spoken by a speech synthesizer than for natural speech sentences (Pisoni, Manous, & Dedina, 1987).
Other measures of processing speed reveal similar differences. Paris, Gilson, Thomas, and Silver (1995) reported that when people were asked to shadow recorded passages (repeat verbatim what the speaker is saying while listening), they were less accurate with synthetic speech than with natural speech. Delogu, Conte, and Sementina (1998) found that when participants listened to passages for comprehension, they were slower to detect auditory clicks as a secondary task when the passages were presented in synthetic speech than when presented in natural speech.
A number of factors could account for the processing speed differences between synthetic and natural speech. First, there are differences in the amount of information conveyed by natural and synthetic speech at the phonemic level. Synthetically generated phonemes are "impoverished" relative to natural speech because many acoustic cues are either poorly represented or not represented at all (Pisoni, 1981). Researchers have generally agreed that natural language cues become more important when bottom-up cues are less robust. For example, classic studies on comprehension of natural speech presented in noise reveal the importance of contextual cues as a compensatory mechanism (e.g., Miller, Heise, & Lichten, 1951). Given that synthetically generated speech conveys less adequate phonemic information, it is not surprising that meaningful context has been demonstrated to be of particular importance for comprehending synthetic speech (Hoover, Reichle, Van Tasell, & Cole, 1987; Mirenda & Beukelman, 1987; Pisoni & Hunnicutt, 1980).
Another important difference between natural speech and speech generated by TTS systems may be the extent to which prosodics (intonation, voicing, stress, durational patterns, rhythm, etc.) are appropriately modeled. Prosodic cues provide perceptual segmentation and redundancy, speeding the real-time processing of continuous speech. They guide expectancies, cause search processes to end when contact is made between an acoustic representation and a cognitive representation, and influence the actual allocation of processing capacity in terms of power, temporal location, and duration (e.g., Darwin, 1975; Grosjean, 1983; Haggard, 1975; Martin, 1972, 1975; Meltzer, Martin, Mills, Imhoff, & Zohar, 1975; Sanderman & Collier, 1997; Wingfield, 1975).
Synthetic speech systems, however, are limited in their prosodic capabilities, particularly with respect to emulating appropriate stress and intonation patterns. Correct use of contrastive stress requires an appreciation of the meaning of an utterance based on an accurate parsing of its syntactic and semantic components. Artificial intelligence sentence parsers exist, but they are too expensive and too computationally slow to be incorporated into affordable, real-time speech synthesizers. Instead, prosody in TTS systems is generally limited to the addition of pitch contours to phrase units marked by punctuation. Because these variations are implemented by a set of rules, the resulting prosodic markers are less robust than for human speech and may even be incorrect (Nusbaum, Francis, & Henly, 1995).
Although meaningful context is known to increase the intelligibility of synthetic speech, little research exists with respect to the influence of prosody. Terken and Lemeer (1988) reported that prosody increases the rated "attractiveness" of synthetic speech, and Slowiaczek and Nusbaum (1985) found that the presence of pitch contour improved the intelligibility of syntactically complex sentences. Also, Sanderman and Collier (1997) reported that appropriate prosody speeded verification response times for synthetically generated statements, as compared with utterances with inappropriate prosody. However, to our knowledge, direct comparisons of the effects of these variables for natural and synthetic speech have not been made.
The present experiment was modeled after a study by Stine and Wingfield (1987) in which the relative importance of prosodic and contextual cues on immediate recall of natural speech was compared for younger and older adults. Immediate recall of spoken discourse is an ecologically valid test that involves speech recognition, analysis for comprehension, and encoding into memory.
Stine and Wingfield (1987) used immediate recall to determine if top-down cues, such as intonation patterns and meaningful context, would be particularly important for older adults who are thought to have limited working memory capacity, slower speeds for processing information in working memory, or both. Four types of word strings were used: (a) normal sentences, (b) normal sentences spoken in monotone (prosodic cues absent), (c) semantically anomalous sentences with normal intonation (contextual cues absent), and (d) meaningless word strings spoken in monotone (both cues absent). When presented with normal sentences, younger and older listeners had roughly equivalent recall abilities. However, the removal of prosodic or contextual cues had a greater impact on older participants than on younger listeners. For younger adults, the removal of prosody from normal sentences had little effect on immediate recall, whereas for older adults, the removal of prosodic cues disrupted performance. For both younger and older adults, recall of semantically anomalous sentences was more difficult than recall of normal sentences. However, older adults were more disadvantaged by the absence of context than were younger adults. Finally, recall of meaningless word strings was poor across all participants, but again, younger adults performed somewhat better than older participants.
By analogy, we reasoned that if processing of synthetic speech requires more working-memory capacity or a longer time in working memory than does natural speech perception, and if prosodic and contextual cues speed processing, then these cues would be especially useful for synthetic speech perception. Accordingly, the present experiment was designed so that participants listened to four types of word strings, similar to those described previously. The stimuli were presented by either a human speaker or a voice generator using a TTS system. It was anticipated that recall would be poorer for synthetic speech than for natural speech, particularly when prosodic cues, contextual cues, or both were absent. The reasoning was that working memory would be taxed when segmenting utterances without prosodic cues, and word recognition itself might suffer with the loss of contextual cues.
A total of 78 students from undergraduate psychology classes participated for course credit. The age range of participants was 18-43 years (M = 24 years, SD = 6.5 years). Usable data were limited to 69 participants (19 males and 50 females) who met the following qualifications: (a) English as their native language or as a primary language learned concurrently with their other language, (b) no hearing impairments, and (c) no extensive prior experience with synthetic speech.
Overview of Design
Participants were assigned randomly in equal numbers to one of three speech modes: Natural Speech, Digital Equipment's DECtalk TTS synthesizer, or Sound Blaster's Monologue for Windows TTS synthesizer. The two synthesizers represent the higher (DECtalk) and lower (Sound Blaster) ends of the cost continuum of current commercial TTS systems. Male voices were used in all speech modes.
In order to measure single-word intelligibility for each speech mode, participants were first administered the Modified Rhyme Test (House et al., 1965). Next, the immediate recall task was presented. Participants heard 80 utterances, 20 of each of four types: normal (prosodic and contextual cues present), no prosody (normal sentences with prosody removed), no context (semantically anomalous sentences with prosody), and unstructured (unrelated words with no prosody). The participants were asked to repeat each string verbatim immediately after its presentation. Thus the experimental design was a 3 (speech mode) x 4 (utterance type) factorial. Speech mode was a between-participants variable, and utterance type was manipulated within participants.
As an additional measure, listeners' subjective ratings of voice "naturalness" and "intelligibility" were collected. Participants heard 24 additional utterances (eight in each speech mode and two of each utterance type), which they rated on these dimensions.
Modified Rhyme Test. The Modified Rhyme Test was used to establish single-word intelligibility for each of the three speech modes. Participants heard 50 words, and for each were asked to circle the word they heard from a set of six rhyming alternatives.
Immediate recall. The stimuli for this study were drawn from sentences developed at the Harvard University Psychoacoustic Laboratory (Egan, 1948; Institute of Electrical and Electronics Engineers, 1969). The Harvard sentences are normal English sentences that avoid clichés, proverbs, stereotyped constructions, and too frequent use of any one word. The basic sentences were combined to create 80 sentences, each containing 15-20 words (M = 17.5 words). Because this length well exceeds the capacity of working memory, ceiling effects were unlikely. In addition, in order to avoid ceiling effects, particularly in conditions in which semantic context was present, the linked sentences were created so that their relationship was not highly predictable (e.g., "The sky that morning was clear and bright, but the junkyard had a moldy smell.").
The 80 sentences were then divided randomly into four groups of 20: normal sentences, no prosody, no context (nouns, verbs, adverbs, and adjectives were interchanged randomly within sentences so that meaning was lost but the syntactic structure was retained), and unstructured (word order was scrambled so that neither meaning nor syntax remained). Examples of each utterance type are as follows: (a) normal -- "Add salt before you fry the egg, then pour the stew from the pot into the plate," (b) no prosody -- a normal sentence delivered without prosodic cues, (c) no context -- "Add house before you find the truck, then pour the wisp from the plate into the cloud," and (d) unstructured -- "In with she plate storm after of proof wide best bring black raft brown the grass find."
This procedure of randomly creating four sets of 20 stimuli was repeated three times, yielding four different groupings of 80 stimuli. Each set was used an approximately equal number of times in each speech mode condition.
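The stimulus preparation above reduces to two simple randomization steps: scrambling word order within a string (for the unstructured condition) and repeatedly partitioning the 80 stimuli into four sets of 20. The sketch below is our own illustration in Python; the function names and placeholder sentence IDs are ours, not the authors'.

```python
import random

def make_unstructured(sentence, rng):
    """Scramble word order so that neither meaning nor syntax remains."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

def make_grouping(stimuli, rng, n_groups=4):
    """Randomly partition the stimuli into equal-sized groups."""
    shuffled = stimuli[:]            # copy, so the master list stays intact
    rng.shuffle(shuffled)
    size = len(shuffled) // n_groups
    return [shuffled[i * size:(i + 1) * size] for i in range(n_groups)]

rng = random.Random(0)
sentences = [f"sentence_{i:02d}" for i in range(80)]   # placeholder IDs

# Four different random groupings of the same 80 stimuli, as in the study.
groupings = [make_grouping(sentences, rng) for _ in range(4)]
```

Each grouping contains all 80 stimuli exactly once, split into four sets of 20, so a given set can be rotated across the three speech mode conditions.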
Speech samples. The natural speech samples were generated by a male speaker. The speech synthesizers were a DECtalk PC, Version 4.2, Paul's Voice (Digital Equipment, 1994) and a Sound Blaster Monologue for Windows, Version 1.5 OEM (First Byte, 1991-92). Both synthesizers were controlled by an IBM-compatible 486 PC with a memory-resident DOS driver and DECtalk Express speech board (for DECtalk) or a Windows 3.1 driver and Creative Lab's Sound Blaster 16 audio card (for Sound Blaster). The sentence stimuli were entered into text files and synthesized using normal spelling and default settings.
Pitch, rate, and amplitude were matched for all three speech modes. Past research has established optimal values for pitch and rate for synthetic speech (Simpson & Marchionda-Frost, 1984). The speech samples were manipulated to conform to these specifications. For pitch, a target value of 100 Hz was selected, and speaking rate was controlled to fall within a range of 150-180 words/min. Amplitude for all three speech modes was controlled at the time of presentation by output settings. The goal was to approximate normal conversational speech levels of approximately 60-63 dB SPL (Pavlovic, 1987).
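As a back-of-the-envelope check on the rate specification, an utterance's speaking rate follows directly from its word count and duration. The sketch below is our own illustration with a hypothetical duration, verifying that a sentence of the mean length used here falls inside the 150-180 words/min window.

```python
def words_per_minute(n_words, duration_s):
    """Speaking rate implied by a word count and an utterance duration."""
    return n_words * 60.0 / duration_s

# A sentence of the mean length used here (17.5 words) spoken in a
# hypothetical 6.5 s lands at roughly 161.5 words/min, inside the
# 150-180 words/min target range.
rate = words_per_minute(17.5, 6.5)
```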
The speech stimuli were sampled and recorded using Creative Technology's WaveStudio, Version 1.1 (1992) software and Creative Lab's Sound Blaster 16 audio card, both controlled by an IBM-compatible 486 PC in the Windows 3.1 environment. WaveStudio is a Windows application for recording, playing, and editing waveforms. Digitized waveform files were created using an 8-bit format and 22-kHz sampling rate.
To create the no prosody and unstructured stimuli in the natural speech condition, individually recorded words were concatenated into strings. In the case of DECtalk, these stimuli were created by inserting commas after each word in the text file to disable the system's parsing function. Comma pause time was 120 ms. For Sound Blaster, prosody could not be controlled by inserting commas because pause length is not adjustable and is unacceptably long. Instead all punctuation was removed from the text and the speed was adjusted as necessary to create an acceptably unstressed delivery. Note that for all three speech modes, any existing within-word ("lexical") prosody was retained.
The participants were tested individually in a quiet room. After giving informed consent, they completed a demographics questionnaire. Next, they were administered the Modified Rhyme Test. Then the immediate recall task was described. Before they began, four samples were presented, one of each of the four stimulus types. Participants were instructed to listen to each stimulus and then repeat as much of it as they could remember. They were told to repeat any and all parts they recalled, even if they were not sure of the word or phrase order. Their responses were recorded on audio tapes.
After completing the immediate-recall task, participants were asked to listen to 24 speech stimuli and to make ratings of each in terms of its intelligibility and naturalness. There were eight strings from each speech mode, representing two tokens of each of the four stimulus types. The stimuli were presented in random order. After listening to each stimulus, participants rated it on a scale from 1 to 10. For intelligibility, a 10 corresponded to "understood all of the words" and a 1 corresponded to "understood none of the words." For naturalness, a 10 corresponded to "sounds just like a human speaker" and a 1 corresponded to "sounds not at all like a human speaker." Experimental sessions lasted about 1 hr. A 3- to 5-min break was allowed halfway through the session.
Four raters scored the immediate recall, each scoring approximately one-fourth of the data set. To achieve consistency in scoring, a set of instructions, guidelines, and examples was given to each rater. Words were counted as correct when they deviated only slightly from the one heard (e.g., incorrect tense, plural instead of singular, contracted forms). Two raters each rescored the data from 8 participants so that reliability of ratings could be estimated. Scores did not vary more than 1% among the raters. Cronbach coefficient alphas for the two sets were both .99.
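The reliability coefficient reported above can be computed directly from the raters' score variances and the variance of the summed scores. The sketch below is our own illustration with made-up scores, not the study's data; two raters who agree closely yield an alpha near 1.

```python
from statistics import pvariance

def cronbach_alpha(ratings):
    """Cronbach's alpha; ratings[r][p] is rater r's score for participant p."""
    k = len(ratings)                                    # number of raters
    item_vars = sum(pvariance(r) for r in ratings)      # sum of rater variances
    totals = [sum(scores) for scores in zip(*ratings)]  # per-participant sums
    return k / (k - 1) * (1 - item_vars / pvariance(totals))

# Two raters rescoring the same 8 participants (hypothetical word counts):
rater_a = [52, 61, 47, 70, 55, 66, 49, 58]
rater_b = [53, 60, 47, 71, 54, 65, 50, 57]
alpha = cronbach_alpha([rater_a, rater_b])   # close agreement, so alpha near 1
```

Because alpha is a ratio of variances, using population variance throughout (as here) gives the same result as using sample variance throughout.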
For all statistical comparisons, an alpha level of p < .01 was used. Comparisons among means following significant F ratios were made by the Newman-Keuls test.
Modified Rhyme Test
Single-word intelligibility scores for the three speech modes were as follows: 93.7% for natural speech, 83.1% for DECtalk, and 85.2% for Sound Blaster. The figure for natural speech is similar to others in the literature (e.g., Ralston, Pisoni, Lively, Greene, & Mullenix, 1991). The score for DECtalk, however, is lower than the 96.7% average reported for an earlier version of this system (e.g., Greene, Manous, & Pisoni, 1984; Pisoni, Nusbaum, & Greene, 1985). We are unaware of any previously published intelligibility data for Sound Blaster.
The total number of words correctly recalled by each participant was analyzed by a 3 (Speech Mode) x 4 (Stimulus Type) mixed-model analysis of variance (ANOVA). Speech mode was a between-participants variable and stimulus type was a within-participants variable. The main effect of speech mode was significant, F(2, 66) = 10.57, MSE = 3646.86, as was the main effect of stimulus type, F(3, 198) = 570.72, MSE = 598.01.
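As a simplified illustration of where such F ratios come from: the full analysis is a 3 x 4 mixed model, which additionally partitions within-participant variance, but a between-participants main effect reduces to F = MS between / MS within. The sketch below is our own, using made-up scores rather than the study's data.

```python
from statistics import mean

def one_way_f(groups):
    """F ratio for a one-way between-subjects ANOVA (illustration only)."""
    scores = [x for g in groups for x in g]
    grand = mean(scores)
    k, n = len(groups), len(scores)
    # Between-groups sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-groups sum of squares: deviations from each group's own mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Three hypothetical "speech mode" groups of recall totals:
f_ratio = one_way_f([[1, 2, 3], [2, 3, 4], [7, 8, 9]])
```

A large F indicates that the group means differ by much more than the scatter within groups would predict.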
The interaction between speech mode and stimulus type was also significant, F(6, 198) = 8.32, MSE = 598.01. Average percentages of words recalled in each condition are presented in Table 1. For both normal sentences and no context stimuli, participants who listened to natural speech performed significantly better than those in either synthetic voice group. However, recall for DECtalk speech did not differ reliably from that for Sound Blaster. The pattern of results for the no prosody stimuli was surprising; performance levels for all voices were virtually identical in this condition. As can be seen in Table 1, it appears that the presence or absence of sentential prosody had no effect on recall accuracy of stimuli generated by TTS synthesizers. In the case of unstructured stimuli, recall was significantly better for natural speech than for Sound Blaster. However, differences between natural speech and DECtalk and between DECtalk and Sound Blaster were not significant.
A second two-way ANOVA was conducted to ascertain whether practice or fatigue effects occurred. The analysis compared overall performance for participants in the three speech mode conditions on two blocks of trials (first half, last half). Although the main effect of speech mode was significant, F(2, 66) = 10.30, MSE = 7336.85, there was no significant effect of trial blocks, nor was the interaction significant. Thus performance patterns did not appear to change over time.
All participants rated stimuli in each speech mode and for all stimulus types. Because they had differential experience by virtue of having listened to one of the voices during the immediate recall test, prior test condition was included as a variable. Intelligibility and naturalness ratings were analyzed by a 3 (Prior Test Condition) x 3 (Speech Mode) x 4 (Stimulus Type) mixed ANOVA. Prior test condition was a between-participants variable and the latter two were within-participants variables.
Neither the main effect of prior test condition nor the interactions involving it were significant. There was a significant main effect of speech mode, F(2, 132) = 165.39, MSE = 2.62, and of stimulus type, F(3, 198) = 98.79, MSE = 1.65. Table 2 illustrates the mean intelligibility judgments for each stimulus type and speech mode.
Overall, participants judged natural speech (M = 8.21) to be the most intelligible, followed by DECtalk (M = 7.03) and Sound Blaster (M = 5.71). As for stimulus type, participants found normal stimuli (M = 8.39) to be the most intelligible, followed by no-prosody stimuli (M = 7.31), no-context stimuli (M = 6.87), and unstructured stimuli (M = 5.36). The significant interaction resulted from a single deviation from this pattern in the natural speech condition: participants considered the no-context stimuli presented by the human voice (M = 9.27) almost as intelligible as the normal stimuli presented in that voice (M = 9.86).
The main effect of prior test condition was marginally significant, F(2, 66) = 4.45, MSE = 17.04, p = .015. Overall, participants who had listened to natural speech during the immediate recall task were more conservative in their ratings of voice naturalness (M = 4.77) than were participants who had earlier heard either DECtalk (M = 5.81) or Sound Blaster (M = 5.42). None of the two-way interactions involving prior test condition were significant.
The main effects of speech mode, F(2, 132) = 228.44, MSE = 4.41, stimulus type, F(3, 198) = 134.53, MSE = 2.09, and their interaction, F(6, 396) = 50.91, MSE = 1.74, were all significant. Table 3 presents the means for each condition.
As with intelligibility, participants judged natural speech (M = 7.50) to be superior in naturalness to both DECtalk (M = 4.73) and Sound Blaster (M = 3.79), and DECtalk was once again preferred to Sound Blaster. Normal stimuli were given the highest ratings (M = 6.70), and unstructured stimuli were given the lowest ratings (M = 4.18). However, unlike intelligibility ratings, overall naturalness ratings for no-context stimuli (M = 5.95) were higher than for no-prosody stimuli (M = 4.54). Note that the higher naturalness ratings for the no-context stimuli versus no-prosody stimuli are caused by the universally higher ratings given to natural speech in this condition. That is, similar to the results for intelligibility, participants considered the no-context stimuli presented by the human voice (M = 9.49) almost as natural as the normal stimuli presented in that voice (M = 9.71). Apparently, for natural speech the presence of sentential prosody is more important in determining both intelligibility and naturalness than is meaningful context.
The three-way interaction of prior test condition, speech mode, and stimulus type was also significant, F(12, 396) = 2.29, MSE = 1.74. Generally, participants who had prior exposure to synthetic speech during the recall testing gave more favorable ratings to the synthetic speech voices than did participants who had been exposed to natural speech during that session. Also, participants who had heard either of the synthetic voices during the recall test deemed DECtalk to be superior to Sound Blaster in naturalness, more so than did participants who had heard natural speech. Apparently, lack of familiarity with synthetic speech was associated with more negative valuations of naturalness and less sensitivity to differences between the two synthetic voices.
As linguistic cues were eliminated, immediate recall of speech declined. The one exception to this pattern was in the case of synthetic speech when sentential prosody was removed. Contrary to expectations and to the case of natural speech, removing prosody did not reduce performance in the synthetic speech conditions. The explanation for this appears to be that prosodic cues, such as they exist in these speech systems, are not helpful, so removing them causes no decrement in performance. The beneficial effects of prosody on recall of natural speech, however, were very apparent. Performance deteriorated when these cues were removed; indeed, in the absence of sentential prosody, recall of natural speech was equivalent to that of synthetic speech. Participants often remarked that the human speaker "sounded like a computer" when prosody was removed. This finding is in contrast to that of Stine and Wingfield (1987), who reported that removal of prosody from normal sentences presented in a natural voice had little impact on immediate recall for young adults.
A possible explanation may lie in the methods used to remove prosody from natural speech. In the present study we concatenated individually recorded words, a procedure that not only guarantees removal of sentential prosodic parameters but also eliminates between-word coarticulation. In normal speech the articulation of initial phonemes in a word is influenced by the articulation of the final phoneme in the preceding word. Like prosody, coarticulation is known to speed processing because it transmits information about more than one phoneme at a time (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). In contrast to our procedure, Stine and Wingfield (1987) used a speaker "who was well practiced in reading words on beats with equal stress monitored by peaking the recording level meter of the tape recorder on each word" (p. 273). We speculate that prosody and coarticulation are very difficult for a human speaker to suppress. Whether they were completely eliminated in Stine and Wingfield's study is uncertain.
It should be noted that lexical prosody was not removed in any of the word strings. Although differences in lexical prosody may have contributed to the superior performance of natural speech on the Modified Rhyme Test, the fact that removal of sentential prosody from normal sentences eliminated the natural speech advantage on immediate recall suggests that prosodic parameters at the sentence level are particularly important in processing longer strings. Sentential prosody serves two functions: phrasing and focusing. Phrasing has a grouping function and helps the listener extract structural information, whereas focus helps the listener locate important information (Sanderman & Collier, 1997; Terken, 1993). These findings, then, point to the phrasing function of prosody as the specific contributor to on-line performance decrements. Thus performance losses should be expected inasmuch as the speed of real-time processing depends upon the success of ongoing parsing mechanisms.
If real-time processing is slowed by the removal of sentential prosody, it may be attributed in part to a reallocation of processing resources. Specifically, listeners who have come to expect the redundancies provided by prosodic cues may shift to a more shallow form of processing (e.g., Craik & Lockhart, 1972), as attention is directed to more superficial acoustical information and drawn away from deeper linguistic analyses. This shift toward superficial processing has been found in a number of studies involving increased resource demands on working memory for both synthetic speech (Luce, 1982; Paris et al., 1995; Ralston et al., 1991) and natural speech (Stine & Wingfield, 1987).
In accordance with findings by Stine and Wingfield (1987), the present results also reflect the importance of contextual cues for language processing. The removal of contextual cues led to even poorer recall performance than did removal of prosody. Performance deteriorated substantially in those conditions in which contextual cues were absent (no context and unstructured). Reliance on semantic context is a compensatory mechanism that listeners frequently utilize, particularly when there are deficiencies in phonemic-acoustic input, as with synthetic speech. Also, the facilitating influence of syntactic cues is reflected in the poor performance on the unstructured stimuli. These stimuli differed from all the others by virtue of the absence of syntactic structure. In this condition immediate recall was a function of working memory span. Natural speech research has demonstrated the interplay of syntax and prosody (Blaauw, 1994; Collier & Hart, 1975; Cooper & Paccia-Cooper, 1980; Jarvella, 1971; Wingfield, 1975). Syntactic units support the perceptual segmentation (chunking) of running speech, and prosodic cues aid in that resolution process by adding information about clausal boundaries. Under normal conditions, syntax, context, and prosody together support language processing at an automatic level.
A second interest in this study was to examine subjective ratings of intelligibility and naturalness. When the patterns of ratings presented in Tables 2 and 3 are compared, it becomes obvious that judgments of both of these aspects, particularly for natural speech, were influenced greatly by the presence or absence of prosody. Whereas intelligibility ratings for the TTS voices appeared to be a primary function of stimulus type, the human voice was rated as highly intelligible even when the utterance was meaningless, so long as prosody remained. For the naturalness attribute, ratings of the human speaker were dictated entirely by the presence or absence of prosody. Although the presence of prosody in DECtalk increased judged naturalness somewhat, this variable had little effect on ratings for Sound Blaster. Although the question has been raised as to whether or not listeners can separate judgments of naturalness from impressions of intelligibility (Nusbaum et al., 1995), it appears that our participants were able to focus on each attribute individually.
A comparison of Tables 2 and 3 reveals that all three voices were ranked higher on intelligibility than on naturalness, and natural speech was rated as superior to either of the synthetic voices, particularly for naturalness. These findings accord well with three observations: (a) intelligibility of high-quality synthetic speech can approach that of natural speech (Greene, Logan, & Pisoni, 1986; Logan et al., 1989); (b) listeners can reliably detect differences in naturalness between synthetic and natural speech at the level of the glottal waveform source, even with the highest-quality systems (Nusbaum et al., 1995); and (c) the rankings of the three voices in this study for both intelligibility and naturalness correspond to those found for recall performance.
The results of this research suggest several design implications:
1. Prosodic cues. Prosodic modeling, as instantiated in the TTS synthesizers used in the present research, does little to facilitate comprehension. This fact may explain why even high-quality synthetic speech still imposes a greater mental workload on listeners than does natural speech. Because performance is adversely affected, the use of synthetic voice in a task that requires rapid response to linguistic content, or in tasks involving linguistically complex or demanding secondary tasks, is questionable (e.g., Luce, Feustel, & Pisoni, 1983; Nusbaum & Pisoni, 1985; Pisoni et al., 1985). Also, appropriate prosodic cues increase listeners' ratings of intelligibility and naturalness. These factors are undoubtedly important in determining user acceptability of synthetic voice.
2. Contextual cues. It is advisable for designers to incorporate as many contextual cues as possible within the limits of the specific task. Simpson and Williams (1980) recommended adding semantic context to synthetic cockpit warnings based on their findings that the additional linguistic redundancy provided by such cues reduced overall attention required for comprehension and did not increase response time. Context becomes increasingly important as intelligibility decreases because listeners rely on other knowledge sources for word recognition. In applications in which few contextual cues are present, or in those situations in which acoustical cues may be masked by high ambient noise, or perhaps in the confusion of emergencies, the use of synthetic speech with prosodic and contextual deficiencies may be less appropriate than messages relayed by other means (e.g., over-learned icons or echoes).
3. Comparison of TTS system quality. Although single-word intelligibility may provide some useful information regarding synthetic voice quality, this measure cannot assess differences that may exist in sentential prosody. Based on our findings, we would suggest that the ability of a TTS system to emulate appropriate prosody is of critical importance. Thus tests such as the Modified Rhyme Test need to be supplemented with comparisons involving larger speech units. Benoit, Grice, and Hazan (1996) recently devised one such technique: using a set of semantically unpredictable sentences to detect subtle differences in TTS intelligibility.
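The logic of the semantically-unpredictable-sentence (SUS) approach is that fixed syntactic frames are filled with randomly chosen words, so each test sentence is grammatical but semantically unpredictable, preventing listeners from using semantic context to guess words. As a toy illustration only (the frame and word pools below are invented placeholders, not the materials or frames of Benoit et al., 1996), such a generator might be sketched as:

```python
import random

# Illustrative word pools (hypothetical; not the SUS test's actual lexicon).
NOUNS = ["table", "river", "doctor", "engine", "window"]
VERBS = ["paints", "carries", "follows", "breaks", "answers"]
ADJS = ["green", "sudden", "hollow", "bright", "quiet"]


def sus_sentence(rng):
    """Fill one fixed grammatical frame ("The ADJ NOUN VERB the ADJ NOUN.")
    with randomly drawn words, yielding a syntactically legal but
    semantically unpredictable sentence."""
    return "The {} {} {} the {} {}.".format(
        rng.choice(ADJS), rng.choice(NOUNS), rng.choice(VERBS),
        rng.choice(ADJS), rng.choice(NOUNS))


def sus_set(n, seed=0):
    """Generate a reproducible list of n test sentences."""
    rng = random.Random(seed)
    return [sus_sentence(rng) for _ in range(n)]


if __name__ == "__main__":
    for sentence in sus_set(3):
        print(sentence)
```

Because listeners cannot predict upcoming words from meaning, transcription accuracy on such sentences isolates acoustic-phonetic intelligibility, which is what makes the method sensitive to subtle differences between TTS systems.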
In conclusion, our findings demonstrate the importance of higher linguistic cues (prosodics, syntax, semantics) for synthetic speech processing and further implicate the prosodic modeling of TTS systems as the source of a major performance differential between natural and synthetic speech processing. Prosodic cues provide a powerful guide for the parsing of speech and provide helpful redundancy. When these cues are not appropriately modeled, the impoverished nature of synthetic speech places an additional burden on working memory that can exceed its capacity to compensate under demanding real-time circumstances. Designers should strive to represent prosodic cues more accurately in order to avoid the high cognitive processing costs associated with even the highest-quality TTS systems.
We gratefully acknowledge the assistance of Charlene Jacks, Rashmi Badri Maharaj, Dayren de Pedro, and Tiffany Steltenkamp in scoring the recall data.
Carol R. Paris is a research psychologist at the Naval Air Warfare Center Training Systems Division, Orlando, FL. She received a Ph.D. in psychology from the University of Central Florida in 1996.
Margaret H. Thomas is a professor of psychology at the University of Central Florida. She received a Ph.D. in psychology from Tulane University in 1971.
Richard D. Gilson is a professor of psychology at the University of Central Florida. He received a Ph.D. in psychology from Princeton University in 1968.
J. Peter Kincaid is a principal scientist at the Institute for Simulation & Training at the University of Central Florida. He received a Ph.D. in psychology from Ohio State University in 1971.
Benoit, C., Grice, M., & Hazan, V. (1996). The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using semantically unpredictable sentences. Speech Communication, 18, 381-392.
Blaauw, E. (1994). The contribution of prosodic boundary markers to the perceptual difference between read and spontaneous speech. Speech Communication, 14, 359-375.
Collier, R., & Hart, J. T. (1975). The role of intonation in speech perception. In A. Cohen & S. G. Nooteboom (Eds.), Structure and process in speech perception (pp. 107-121). New York: Springer-Verlag.
Cooper, W. E., & Paccia-Cooper, J. (1980). Syntax and speech. Cambridge, MA: Harvard University Press.
Craik, F. I. M., & Lockhart, R. S. (1972). Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior, 11, 671-684.
Darwin, C. J. (1975). On the dynamic use of prosody in speech perception. In A. Cohen & S. G. Nooteboom (Eds.), Structure and process in speech perception (pp. 178-194). New York: Springer-Verlag.
Delogu, C., Conte, S., & Sementina, C. (1998). Cognitive factors in the evaluation of synthetic speech. Speech Communication, 24, 153-168.
Duffy, S. A., & Pisoni, D. B. (1992). Comprehension of synthetic speech produced by rule: A review and theoretical interpretation. Language and Speech, 35, 351-389.
Egan, J. P. (1948). Articulation testing methods. Laryngoscope, 58, 955-991.
Greene, B. G., Logan, J. S., & Pisoni, D. B. (1986). Perception of synthetic speech produced by rule: Intelligibility of eight text-to-speech systems. Behavior Research Methods, Instruments, and Computers, 18, 100-107.
Greene, B. G., Manous, L. M., & Pisoni, D. B. (1984). Perceptual evaluation of DECtalk: A final report on version 1.8. Research on Speech Perception (Progress Report No. 10). Bloomington: Indiana University Speech Research Laboratory.
Grosjean, F. (1983). How long is the sentence? Prediction and prosody in the on-line processing of language. Linguistics, 21, 501-529.
Haggard, M. (1975). Understanding speech understanding. In A. Cohen & S. G. Nooteboom (Eds.), Structure and process in speech perception (pp. 3-15). New York: Springer-Verlag.
Hoover, J., Reichle, J., Van Tasell, D., & Cole, D. (1987). The intelligibility of synthesized speech: Echo II versus Votrax. Journal of Speech and Hearing Research, 30, 425-431.
House, A. S., Williams, C. E., Hecker, M. H. L., & Kryter, K. D. (1965). Articulation-testing methods: Consonantal differentiation with a closed-response set. Journal of the Acoustical Society of America, 37, 158-166.
Institute of Electrical and Electronics Engineers. (1969). IEEE recommended practice for speech quality measurements (IEEE No. 297). New York: Author.
Jarvella, R. J. (1971). Syntactic processing of connected speech. Journal of Verbal Learning and Verbal Behavior, 10, 409-416.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
Logan, J. S., Greene, B. G., & Pisoni, D. B. (1989). Segmental intelligibility of synthetic speech produced by rule. Journal of the Acoustical Society of America, 86, 566-581.
Luce, P. A. (1982). Comprehension of fluent synthetic speech produced by rule. Journal of the Acoustical Society of America, 71, UU11.
Luce, P. A., Feustel, T. C., & Pisoni, D. B. (1983). Capacity demands in short-term memory for synthetic and natural speech. Human Factors, 25, 17-32.
Manous, L. M., & Pisoni, D. B. (1984). Effects of signal duration on the perception of natural and synthetic speech. Research on Speech Perception (Progress Report No. 10). Bloomington: Indiana University Speech Research Laboratory.
Martin, J. G. (1972). Rhythmic (hierarchical) versus serial structure in speech and other behavior. Psychological Review, 79, 487-509.
Martin, J. G. (1975). Rhythmic expectancy in continuous speech perception. In A. Cohen & S. G. Nooteboom (Eds.), Structure and process in speech perception (pp. 161-176). New York: Springer-Verlag.
Meltzer, R. H., Martin, J. G., Mills, C. B., Imhoff, D. L., & Zohar, D. (1975). Reaction time to temporally displaced phoneme targets in continuous speech. Journal of Experimental Psychology: Human Perception and Performance, 2, 277-290.
Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of speech as a function of the context of the test materials. Journal of Experimental Psychology, 41, 329-335.
Mirenda, P., & Beukelman, D. R. (1987). A comparison of speech synthesis intelligibility with listeners from three age groups. Augmentative and Alternative Communication, 3, 120-128.
Nusbaum, H. C., Francis, A. L., & Henley, A. S. (1995). Measuring the naturalness of synthetic speech. International Journal of Synthetic Speech, 1, 7-19.
Nusbaum, H. C., & Pisoni, D. B. (1985). Constraints on the perception of synthetic speech generated by rule. Behavior Research Methods, Instruments, & Computers, 17, 235-242.
Paris, C. R., Gilson, R. D., Thomas, M. H., & Silver, N. C. (1995). Effect of synthetic voice intelligibility upon speech comprehension. Human Factors, 37, 335-340.
Pavlovic, C. V. (1987). Derivation of primary parameters and procedures for use in speech intelligibility predictions. Journal of the Acoustical Society of America, 82, 413-422.
Pisoni, D. B. (1981). Speeded classification of natural and synthetic speech in a lexical decision task. Journal of the Acoustical Society of America, 70, S98.
Pisoni, D. B., & Hunnicutt, S. (1980). Perceptual evaluation of MITalk: The MIT unrestricted text-to-speech system. In 1980 IEEE International Conference on Acoustics, Speech, and Signal Processing (pp. 572-575). New York: Institute of Electrical and Electronics Engineers.
Pisoni, D. B., Manous, L. M., & Dedina, M. J. (1987). Comprehension of natural and synthetic speech: Effects of predictability on the verification of sentences controlled for intelligibility. Computer Speech and Language, 2, 303-320.
Pisoni, D. B., Nusbaum, H. C., & Greene, B. G. (1985). Perception of synthetic speech generated by rule. Proceedings of the IEEE, 73, 1665-1676.
Ralston, J. V., Pisoni, D. B., Lively, S. E., Greene, B. G., & Mullennix, J. W. (1991). Comprehension of synthetic speech produced by rule: Word monitoring and sentence-by-sentence listening times. Human Factors, 33, 471-491.
Sanderman, A. A., & Collier, R. (1997). Prosodic phrasing and comprehension. Language and Speech, 40, 391-409.
Simpson, C. A., & Marchionda-Frost, K. (1984). Synthesized speech rate and pitch effects on intelligibility of warning messages for pilots. Human Factors, 26, 509-517.
Simpson, C. A., & Williams, D. H. (1980). Response time effects of alerting tone and semantic context for synthesized voice cockpit warnings. Human Factors, 22, 319-330.
Slowiaczek, L. M., & Nusbaum, H. C. (1985). Effects of speech rate and pitch contour on the perception of synthetic speech. Human Factors, 27, 701-712.
Stine, E. L., & Wingfield, A. (1987). Process and strategy in memory for speech among younger and older adults. Psychology and Aging, 2, 272-279.
Terken, J. (1993). Synthesizing natural-sounding intonation for Dutch: Rules and perceptual evaluation. Computer Speech and Language, 7, 27-48.
Terken, J., & Lemeer, G. (1988). Effects of segmental quality and intonation on quality judgments for texts and utterances. Journal of Phonetics, 16, 453-457.
Wingfield, A. (1975). The intonation-syntax interaction: Prosodic features in perceptual processing of sentences. In A. Cohen & S. G. Nooteboom (Eds.), Structure and process in speech perception (pp. 146-160). New York: Springer-Verlag.
TABLE 1: Mean Percentage of Correct Words as a Function of Speech Mode and Stimulus Type

Speech Mode       Normal Sentence   No Prosody   No Context   Unstructured
Natural speech         .74             .60          .51           .24
DECtalk                .60             .60          .35           .20
Sound Blaster          .58             .58          .34           .16

TABLE 2: Mean Intelligibility Ratings as a Function of Speech Mode and Stimulus Type

Speech Mode       Normal Sentence   No Prosody   No Context   Unstructured
Natural speech        9.86             7.80         9.27          5.90
DECtalk               8.20             7.74         6.13          6.07
Sound Blaster         7.10             6.39         5.22          4.11

TABLE 3: Mean Naturalness Ratings as a Function of Speech Mode and Stimulus Type

Speech Mode       Normal Sentence   No Prosody   No Context   Unstructured
Natural speech        9.71             5.58         9.49          5.23
DECtalk               5.78             4.19         4.70          4.27
Sound Blaster         4.60             3.86         3.67          3.05
Date of publication: September 22, 2000.