
Analysis and modeling of temporal characteristics of speech for Estonian text-to-speech synthesis.


1. Introduction

The task of text-to-speech synthesis is to convert orthographic text into natural-sounding speech. For the artificial speech to sound realistic to the human ear, it should have realistic intonation, rhythm and stress patterns. More specifically, the text-to-speech system must be able to generate durations of sounds and pauses that do not differ notably from those of actual speech.

Currently, prosodic modelling in Estonian text-to-speech synthesis (Mihkla, Meister, Eek 2000) is largely based on generalized measurements of speech units in isolated words and sentences. The resulting synthesized speech, however, is often monotonous and has poor fluency, which limits the applications of the synthesizer. As Nick Campbell has pointed out, the durations of sounds in isolated words or sentences differ considerably from the durations of sounds in fluent speech (Campbell 2000). Speech contains complicated temporal patterns, which the text-to-speech system must be able to imitate for the speech to sound natural. The availability of spoken-language corpora provides an opportunity to achieve the text-to-prosody transformation with the help of statistical models.

In this work the first attempts are made to improve the naturalness of the output speech of an Estonian speech synthesiser with the help of statistical duration models of fluent speech. We applied regression analysis to identify the essential features of sound durations and to build a prediction model. The results of modelling the durations are compared with expert opinions given by Estonian phoneticians. With the aim of giving the output speech a natural rhythm, the relation of pauses and boundary lengthenings to the syntactic parsing of the text is studied.

2. Source material

Because we are concerned with a text-to-speech synthesiser, the source material was a sample of texts read by announcers. On the basis of the one-to-one correspondence of text and speech, it is possible to move from a symbol-based representation of prosody to the acoustic one and also to establish whether and to what extent the syntactic parsing of the text is related to the prosodic parsing of the speech.

The source material consisted of passages of speech from the CD version of a detective story read by an actor (Stout 2003) and passages of speech and texts from longer news items read by announcers of Estonian Radio. Altogether, 12 speech passages were analysed, each 1-2 minutes long. All passages of speech were segmented into sounds and pauses.

3. Analysis of pauses and boundary lengthenings

Prior to the application of a general statistical model, pauses and prepausal lengthenings in speech were analysed on the basis of this material. Pauses and prepausal lengthenings in Estonian speech have so far been studied only cursorily or intermittently, as a by-product of other tasks. Ilse Lehiste (1981) examined whether prepausal lengthenings correlate with the length of the subsequent pauses and found an extremely weak link. Diana Krull (1997) studied prepausal lengthenings in two-syllable words in dialogue in the context of quantity degree. Arvo Eek and Einar Meister (2003) looked at end-of-sentence lengthenings on the basis of a tempo corpus. However, they examined only words of a specific structure and focused on quantity degree features. Therefore the need became evident to measure, for Estonian text-to-speech synthesis, pauses and boundary foot lengthenings on the basis of real speech read aloud from text.

With a view to analysing the pauses and foot lengthenings, the durations of pauses derived from the speech wave were measured, and the foot lengthenings were calculated. For the calculation of foot lengthenings, the durations of sounds comprising the foot were summed, after which the actual duration was compared to the mean duration of the given foot structure in the speech of that announcer. The first hypothesis was to verify whether pauses and foot lengthenings could be classified (for instance, whether the pauses between phrases (1) differ significantly from the sentence end or paragraph end pauses).
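
The calculation of foot lengthenings described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record layout, the field names (`speaker`, `structure`, `sound_durations_ms`) and the sample values are hypothetical, and taking the lengthening as the difference between the actual foot duration and the speaker's mean for that foot structure is an assumption.

```python
from collections import defaultdict
from statistics import mean


def foot_lengthenings(feet):
    """Return, for each foot, its duration minus the mean duration of
    feet with the same structure spoken by the same speaker (in ms)."""
    # Foot duration = sum of the durations of the sounds in the foot.
    totals = [sum(foot["sound_durations_ms"]) for foot in feet]

    # Group the foot durations by (speaker, foot structure).
    groups = defaultdict(list)
    for foot, total in zip(feet, totals):
        groups[(foot["speaker"], foot["structure"])].append(total)
    group_means = {key: mean(values) for key, values in groups.items()}

    # Lengthening is assumed here to be the difference from the group mean.
    return [
        total - group_means[(foot["speaker"], foot["structure"])]
        for foot, total in zip(feet, totals)
    ]


# Hypothetical sample: two feet of the same structure from one speaker.
sample_feet = [
    {"speaker": "Actor1", "structure": "CVCV", "sound_durations_ms": [60, 80, 70, 90]},
    {"speaker": "Actor1", "structure": "CVCV", "sound_durations_ms": [70, 100, 80, 150]},
]
lengthenings = foot_lengthenings(sample_feet)
```

A positive value marks a foot that is longer than usual for that speaker and structure, which is the quantity of interest at phrase, sentence and paragraph boundaries.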

Table 1 presents the mean durations of pauses by speaker and the generalised mean. The generalised means suggest that, for a text read out at a normal speech rate, the classification of speech pauses is fully possible. The statistical analysis of the samples corroborates this surmise. Pairwise analysis of the logarithmic durations of pauses with Student's t-test reveals that the t-statistic values noticeably exceed the two-tailed critical t-values at the significance level p = 0.01 (cf. Table 2), with the probability of the null hypothesis being p < 0.0001. Hence it appears that the mean durations of the pause classes differ and the classification of pauses is fully possible, a fact that could be applied in speech synthesis. The dispersion and variance, however, are large; therefore in speech recognition, for instance, such a classification is of no avail.
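
The pairwise comparison of logarithmic pause durations can be illustrated with a short sketch. It assumes the pooled-variance form of Student's two-sample t-test (the text does not say which variant was used), and the pause durations below are invented for illustration, not the measured data.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Student's two-sample t statistic with pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical pause durations in ms (NOT the data behind Table 2).
phrase_pauses_ms = [280, 310, 350, 300, 330, 290]
sentence_pauses_ms = [640, 700, 580, 720, 690, 750]

# The test is run on logarithmic durations, as in the text.
log_phrase = [math.log(d) for d in phrase_pauses_ms]
log_sentence = [math.log(d) for d in sentence_pauses_ms]

t_stat = pooled_t_statistic(log_sentence, log_phrase)

# Two-tailed critical value for p = 0.01 with 6 + 6 - 2 = 10 degrees
# of freedom, taken from a standard t-table.
T_CRITICAL = 3.169
means_differ = abs(t_stat) > T_CRITICAL
```

When the absolute t statistic exceeds the critical value, the null hypothesis of equal means is rejected, which is the situation Table 2 reports for all three pairs of pause classes.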

When analysing the foot lengthening data with Student's t-test (cf. Table 2), we had to accept the null hypothesis: the foot lengthenings come from samples with the same mean value.

Next we examined whether and to what extent the prosodic parsing of speech correlates with syntactic parsing, where the latter is indicated by punctuation marks and conjunctions. As shown in Table 3, there is invariably a pause in speech (2) at the paragraph end and the sentence end. For the colon and the dash, too, there is a strong correlation between syntax and prosody. Half the commas are related to pauses. The least marked in speech are phrases starting with those co-ordinating conjunctions which do not require a comma.
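
The percentages behind these observations can be recomputed from the counts in Table 3. The sketch below only restates the table; the helper name `percent_half_up` and the integer rounding rule are illustrative choices, not part of the original analysis.

```python
# Counts copied from Table 3:
# boundary type -> (boundaries in text, pauses, foot lengthenings)
TABLE_3 = {
    "Paragraph end": (21, 21, 15),
    "Sentence end": (58, 58, 39),
    "Comma": (80, 41, 42),
    "Conjunction": (22, 7, 12),
    "Colon": (7, 7, 4),
    "Dash": (14, 13, 13),
}


def percent_half_up(count, total):
    """Integer percentage with halves rounded upward.

    Integer arithmetic is used so that values such as 52.5 are not
    rounded to the nearest even number by Python's round().
    """
    return (200 * count + total) // (2 * total)


pause_pct = {
    boundary: percent_half_up(pauses, n)
    for boundary, (n, pauses, _) in TABLE_3.items()
}
lengthening_pct = {
    boundary: percent_half_up(lengthenings, n)
    for boundary, (n, _, lengthenings) in TABLE_3.items()
}
```

Such per-boundary rates could serve as a simple baseline for deciding where a synthesiser inserts a pause, before any context-sensitive model is applied.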

Among the punctuation marks, lengthening is obviously related to the dash. Apparently the connotation is suggested by the shape of the sign: the stretched line prompts the drawl. The English term 'prepausal lengthening' suggests a link between pauses and boundary lengthenings. On the basis of this Estonian speech material, the term applies only 70% of the time (only 143 pauses are preceded by word or foot lengthening). According to perception tests carried out by I. Lehiste (Lehiste, Fox 1993), Estonian speakers expect significantly less final lengthening on the last syllable of the sentence than English speakers do.

But if we wish to lend synthetic speech a natural rhythm, it does not suffice merely to find the mean durations of pauses and lengthenings. Instead, we should model their durations and temporal positions in a context-sensitive way.

4. Statistical modelling of segmental durations

Because we are still seeking the most suitable statistical method to predict the durations, we carried out regression analysis on part of the source material (passages from the detective story read by the actor). The input data to the statistical analysis of durations was the sequence of sounds (phonemes) and sound durations obtained by segmenting the speech wave. On the basis of the text corresponding to the speech, we formed a vector of 17 features for every sound. Those argument features were described on several hierarchical levels (phoneme, syllable, foot, word, phrase and sentence levels). We proceeded from the presumption that every sound has an intrinsic duration; that each sound belongs to a concrete class of sounds (front vowels, plosive consonants, nasals etc.) whose properties carry over to the members of the given class; that adjacent sounds affect one another; and that the duration may be influenced by both the word and the sentence structure. The output of the model, i.e. the response, was the duration expressed logarithmically as ln(duration), because the logarithmic duration conforms more closely to the normal distribution. Because the argument features (explanatory variables) were numerous, an optimum selection had to be made among them, i.e. we had to locate the features most likely to affect the response.
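
One way to picture the per-sound feature vector and the logarithmic response is the sketch below. The 17 field names follow the explanatory variables listed in Table 5; the integer coding of classes and positions, the sample values and the use of seconds as the duration unit are assumptions made only for illustration.

```python
import math
from dataclasses import dataclass, fields


@dataclass
class SoundFeatures:
    """Explanatory variables for one sound (names follow Table 5)."""
    prev_phoneme_class: int
    prev_phoneme_length: int
    cur_phoneme_class: int
    cur_phoneme_length: int
    next_phoneme_class: int
    next_phoneme_length: int
    phoneme_position_in_syllable: int
    syllable_stress: int
    syllable_type: int
    foot_quantity_degree: int
    syllable_position_in_foot: int
    foot_length_in_syllables: int
    foot_position_in_word: int
    word_length_in_feet: int
    word_position_in_phrase: int
    phrase_length_in_words: int
    sentence_length_in_phrases: int


def log_duration(duration_s):
    """Response variable: natural logarithm of the sound duration."""
    return math.log(duration_s)


# Hypothetical sound: every code below is invented for illustration.
example = SoundFeatures(
    prev_phoneme_class=3, prev_phoneme_length=1,
    cur_phoneme_class=1, cur_phoneme_length=2,
    next_phoneme_class=4, next_phoneme_length=1,
    phoneme_position_in_syllable=2, syllable_stress=1, syllable_type=1,
    foot_quantity_degree=2, syllable_position_in_foot=1,
    foot_length_in_syllables=2, foot_position_in_word=1,
    word_length_in_feet=1, word_position_in_phrase=3,
    phrase_length_in_words=5, sentence_length_in_phrases=2,
)
feature_count = len(fields(SoundFeatures))
response = log_duration(0.080)  # a hypothetical 80 ms sound
```

Categorical features such as the phoneme class would normally be expanded into indicator variables before regression, which may explain why the fitted model has more degrees of freedom than there are features.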

The initial results of multiple regression analysis revealed that the model created is statistically significant (cf. Table 4). The analysis of regression coefficients disclosed that the significant features for predicting the duration of a sound were the class and length (short or long) of the current sound, the class of the next sound, the position of the sound in the syllable, the position of the syllable in the foot, the length of the word in feet, and the location of the word in the phrase. Curiously, the quantity degree of the foot, despite being the cornerstone of Estonian word prosody, was not a significant feature for predicting the duration of a sound. Those modelling results, however, have been obtained from only part of the data. Table 5 presents the features estimated as significant by experts and the statistically significant argument features obtained by regression analysis. Six Estonian phoneticians acted as experts. The conclusions of the experts and the results of regression analysis coincided in 49% of cases on average.
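
The agreement figures can be reproduced from Table 5 by counting, for each expert, the explanatory variables on which the expert's 0/1 judgement equals the regression result. The sketch below copies the columns of Table 5 in row order; the variable and function names are illustrative.

```python
# Columns of Table 5, one value per explanatory variable in table order.
REG = [0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]
EXPERTS = {
    "Exp1": [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "Exp2": [0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0],
    "Exp3": [0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0],
    "Exp4": [0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "Exp5": [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "Exp6": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0],
}


def agreement(expert, reference):
    """Number of variables on which the expert matches the regression."""
    return sum(e == r for e, r in zip(expert, reference))


matches = {name: agreement(column, REG) for name, column in EXPERTS.items()}
percentages = {name: round(100 * m / len(REG)) for name, m in matches.items()}
average_pct = 100 * sum(matches.values()) / (len(EXPERTS) * len(REG))
```

Note that a match is counted both when expert and regression agree that a variable is significant and when they agree that it is not.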

The analysis of prediction residuals or errors (cf. Figure 1) showed that in the distribution of errors there were three "data clusters" distanced from one another. A closer look revealed that the two right-hand clusters were constituted by pauses. The residuals may be considered, at a visual estimate, to be homoscedastic.

[FIGURE 1 OMITTED]

5. Conclusions and future work

This paper has described the preliminary results of the first attempts to make the prosody of the output speech of an Estonian text-to-speech synthesiser more natural. The analysis of prediction errors showed that sounds and pauses should be handled separately in the analysis. To predict the durations of sounds and pauses using statistical methods, the volume of analysed material should be expanded and various methods tested (e.g. neural networks).

REFERENCES

Campbell, N. 2000, Timing in Speech. A Multilevel Process.--Prosody. Theory and Experiment, Dordrecht-Boston-London, 281-334.

Eek, A., Meister, E. 2003, Foneetilisi katseid ja arutlusi kvantiteedi alalt (I). Häälikukestusi muutvad kontekstid ja välde.--KK, 815-837.

Krull, D. 1997, Prepausal Lengthening in Estonian: Evidence from Conversational Speech.--Estonian Prosody: Papers from a Symposium. Proceedings of the International Symposium on Estonian Prosody, Tallinn, Estonia, October 29-30, 1996, Tallinn, 136-148.

Lehiste, I. 1981, Sentence and Paragraph Boundaries in Estonian.--CIFU V, Pars VI, 164-169.

Lehiste, I., Fox, R. 1993, Influence of Duration and Amplitude on the Perception of Prominence by Swedish Listeners.--Speech Communication 13, 149-154.

Mihkla, M., Meister, E., Eek, A. 2000, Eesti keele tekst-kõne süntees: grafeem-foneem teisendus ja prosoodia modelleerimine.--Arvutuslingvistikalt inimesele, Tartu (Tartu Ülikooli üldkeeleteaduse õppetooli toimetised 1), 309-320.

Stout, R. 2003, Deemoni surm. CD-versioon. Loeb Andres Ots, Tallinn.

MEELIS MIHKLA, JURI KUUSIK (Tallinn)

* Support from the Estonian Science Foundation, grant No. 5039, and state program "Estonian language and national memory" has made the present work possible.

(1) In this work, a phrase means a clause or an element of an enumeration that is delimited within the sentence by a punctuation mark or conjunction.

(2) In this work we have treated as a prosodic pause an interruption of speech over 50 ms.

Table 1
Durations of pauses and boundary lengthenings (ms) in speech

                     Pauses

Speaker              Phrase end   Sentence end   Paragraph end

Actor1 (m)              352           558            1025
Announcer1 (f)          303           828             902
Announcer2 (m)          286           769            1132
Generalised mean        323           678            1021

                     Lengthenings

Speaker              Phrase end   Sentence end   Paragraph end

Actor1 (m)              200           220             315
Announcer1 (f)          124           112             117
Announcer2 (m)           95            90             122
Generalised mean        155           161             217

Table 2
Student t-test results for comparison of pairs of sample means
(Ph-Se--between phrase and sentence, Ph-Pa--between phrase and
paragraph, Se-Pa--between sentence and paragraph)

                       Pauses

                       Ph-Se      Ph-Pa      Se-Pa

T stat                 8.87       12.25      5.91
T critical two-tail    2.62       2.76       2.72
P (T <= t)             < 0.0001   < 0.0001   < 0.0001

                       Foot lengthenings

                       Ph-Se      Ph-Pa      Se-Pa

T stat                 0.81       0.26       0.65
T critical two-tail    2.65       2.84       2.90
P (T <= t)             0.42       0.79       0.52

Table 3
Connection of pauses and foot lengthenings with the text parsing

                  No. of    No. of corresponding   No. of corresponding
                 parsings   pauses in the speech   foot lengthenings in
                  in the                           the speech
                   text      Cnt       %              Cnt        %

Paragraph end       21        21      100              15       71
Sentence end        58        58      100              39       67
Comma               80        41       51              42       53
Conjunction         22        7        32              12       55
Colon               7         7       100               4       57
Dash                14        13       93              13       93

Table 4
Summary of fit and the analysis of variance for the regression
model of durations

                     Summary of Fit

Mean of Response     -2.753         R-Square        0.5393
Root MSE             0.2886         Adj R-Sq        0.5368

                     Analysis of Variance

Source     DF      Sum of Squares   Mean Square   F Stat    Pr > F
Model      26      478.5            18.403        220.87    <0.0001
Error      4906    408.8            0.0862
C Total    4932    887.2
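
As a reading aid, the fit statistics in Table 4 can be recomputed from the sums of squares and degrees of freedom using the standard definitions for an ordinary least-squares model. This is a consistency sketch under that assumption, not the authors' procedure; the recomputed values agree with the reported R-Square, Adj R-Sq, F Stat and Root MSE only up to rounding of the tabulated inputs.

```python
import math

# Values copied from the analysis-of-variance part of Table 4.
SS_MODEL, SS_ERROR, SS_TOTAL = 478.5, 408.8, 887.2
DF_MODEL, DF_ERROR, DF_TOTAL = 26, 4906, 4932

# Share of the variance of ln(duration) explained by the model.
r_squared = SS_MODEL / SS_TOTAL

# Adjusted R-squared penalises the number of model parameters.
adj_r_squared = 1 - (SS_ERROR / DF_ERROR) / (SS_TOTAL / DF_TOTAL)

# F statistic: model mean square divided by error mean square.
f_stat = (SS_MODEL / DF_MODEL) / (SS_ERROR / DF_ERROR)

# Root mean squared error of the residuals, in ln(duration) units.
root_mse = math.sqrt(SS_ERROR / DF_ERROR)
```

Because the response is logarithmic, the root MSE describes a relative rather than an absolute prediction error.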

Table 5
Expert opinions versus results of regression analysis
(ExpN--N expert, Reg--results of regression analysis,
1--significant explanatory variable, 0--non-significant variable)

Explanatory variable             Exp1   Exp2   Exp3   Exp4   Exp5   Exp6   Reg

Previous phoneme class            0      0      0      0      0      0      0
Previous phoneme length           1      1      1      1      0      0      0
Current phoneme class             1      1      1      0      1      0      1
Current phoneme length            1      1      1      1      0      0      1
Next phoneme class                1      1      0      0      0      0      1
Next phoneme length               1      0      1      1      0      0      0
Phoneme position in syllable      1      1      0      1      0      0      1
Stress of syllable                1      1      1      1      1      1      0
Type of syllable                  1      0      0      1      1      1      0
Quantity degree of foot           1      1      1      1      0      1      0
Syllable position in foot         1      1      1      1      0      1      1
Length of foot in syllables       1      1      0      1      0      1      0
Foot position in word             1      0      0      1      0      1      0
Length of word in feet            1      1      0      1      0      0      1
Word position in phrase           1      1      1      1      0      0      0
Length of phrase in words         1      0      0      1      0      1      1
Length of sentence in phrases     1      0      0      0      0      0      0
Total "correct" answers           8      11     8      7      9      7
%                                47%    65%    47%    41%    53%    41%
Total average 49%
COPYRIGHT 2005 Estonian Academy Publishers
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2005 Gale, Cengage Learning. All rights reserved.

Author: Mihkla, Meelis; Kuusik, Juri
Publication: Linguistica Uralica
Date: Jun 1, 2005