Development of a Hungarian medical dictation system.



This paper reviews the current state of a Hungarian project which seeks to create a speech recognition system for the dictation of thyroid gland medical reports. First, we present the MRBA speech corpus that was assembled to support the training of general-purpose Hungarian speech recognition systems. Then we describe the processing of medical reports that were collected to help the creation of domain-specific language models. At the acoustic modelling level we experimented with two techniques, a conventional HMM one and an ANN-based solution, which are both briefly described in the paper. Finally, we present the language modelling methodology currently applied in the system, and round off with recognition results on test data taken from four speakers. The scores show that on a somewhat restricted sub-domain of the task we are able to produce word accuracies well over 95%.

Keywords: speech recognition, dictation systems, 2D-cepstrum

Povzetek (summary): The paper presents an overview of the current state of a Hungarian project that seeks to build a speech recognition system for dictating medical reports on the thyroid gland.

1 Introduction: state of the art and goals of the project

Automating the dictation of texts is one of the main applications of speech recognition. Mainly because of the huge training corpora, the increased processor speeds and the refined search techniques, dictation systems have reached such a level of sophistication that commercial products now offer sufficiently good accuracy even for arbitrary fluent speech at a normal pace [12]. Experience tells us, however, that for really good performance it is still worth applying some tricks, such as an initial speaker enrollment process in which the machine can adapt to the voice of the speaker, or the restriction of the dictation topic to some specific (e.g. medical or legal) domain. Such dictation systems already exist for the most widely spoken languages, but the situation is not as good for languages that offer only a small market. For Hungarian there is at present no general-purpose large vocabulary continuous speech recognizer (LVCSR). Among university publications even papers that deal with continuous speech recognition are hard to find, and these give results only for restricted vocabularies [15]. Although on the industrial side Philips has adapted its SpeechMagic system to two special application domains in Hungarian, it is sold at a price that only the largest institutions can afford [9]. Experts usually cite two main reasons for the lack of Hungarian LVCSR systems. First, there are no sufficiently large, publicly available speech databases that would allow the training of reliable phone models. The second reason is the special difficulty of language modelling that arises from the highly agglutinative nature of Hungarian.

In 2004 the Research Group on Artificial Intelligence at the University of Szeged and the Laboratory of Speech Acoustics of the Budapest University of Technology and Economics began a project with the aim of collecting and/or creating the basic resources needed for the construction of a continuous dictation system for Hungarian. The project lasted for three years (2004-2006), and was financially supported by the national fund IKTA-056/2003. For the acoustic modelling part, the project included the collection and annotation of a large speech corpus of phonetically rich sentences. As regards the language modelling part, we restricted the target domain to the dictation of some limited types of medical reports. Although this clearly led to a significant simplification compared to a general dictation task, we chose this application area with the intent of assessing the capabilities of our acoustic and language modelling technologies. Depending on the findings, we hope to extend the system later to more general dictation domains. This is why the language resources were chosen to be domain-specific, while the acoustic database contains quite general, domain-independent recordings.

Although both participating teams used the same speech database to train their acoustic models, they focused on two different dictation tasks and experimented with their own acoustic and language modelling technologies. The team at the University of Szeged focused on the dictation of thyroid scintigraphy medical reports, while the Budapest team dealt with gastroenterology reports. This paper describes only the research and development efforts of the Szeged team. The interested reader can find a survey of the research done by the Laboratory of Speech Acoustics in [16].

2 Speech and language resources

In the first phase of the project we designed, assembled and annotated a speech database called the MRBA corpus (the abbreviation stands for the "Hungarian Reference Speech Database") [16]. Our goal was to create a database that allows the training of general-purpose dictation systems which run on personal computers in office environments and operate with continuous, read speech. The contents of the database were designed by the Laboratory of Speech Acoustics. As a starting point, they took a large (1.6 MB) text corpus and, after automatic phonetic transcription, created phone, diphone and triphone statistics from it. Then they selected 1992 different sentences and 1992 different words in such a way that 98.8% of the most frequent diphones had at least one occurrence in them. These sentences and words were recorded from 332 speakers, each reading 12 sentences and 12 words. Thus every sentence and every word has two recordings in the speech corpus (332 × 12 = 3984 = 2 × 1992). Both teams participated in the collection of the recordings, which was carried out in four big cities, mostly in university labs, offices and home environments. In the database the ratio of male to female speakers is 57.5% to 42.5%. About one-third of the speakers were between 16 and 30 years of age, the rest being evenly distributed among the remaining age groups. Both home PCs and laptops were used to make the recordings, and the microphones and sound cards of course varied as well. The sound files were cleaned and annotated at the Laboratory of Speech Acoustics, while the Research Group on Artificial Intelligence manually segmented and labelled one third of the files at the phone level. This part of the corpus is intended to support the initialization of phone models prior to training on the whole corpus.
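
The coverage criterion used for selecting the sentences and words can be illustrated with a simple greedy heuristic. The Python sketch below is not the procedure used by the Laboratory of Speech Acoustics, which is not specified here; the function names, the greedy strategy and the toy phone strings are all assumptions made for illustration.

```python
def diphones(phones):
    """Return the set of adjacent phone pairs in one transcribed item."""
    return {(a, b) for a, b in zip(phones, phones[1:])}


def select_for_coverage(items, target_diphones, max_items):
    """Greedily pick items until the targets are covered or max_items is reached.

    items: dict mapping a (hypothetical) item id to its phone sequence.
    Returns the chosen ids and the set of targets left uncovered.
    """
    uncovered = set(target_diphones)
    remaining = dict(items)
    chosen = []
    while uncovered and remaining and len(chosen) < max_items:
        # Take the item that covers the most still-uncovered diphones.
        best = max(remaining, key=lambda k: len(diphones(remaining[k]) & uncovered))
        gain = diphones(remaining[best]) & uncovered
        if not gain:
            break  # nothing left can improve the coverage
        chosen.append(best)
        uncovered -= gain
        del remaining[best]
    return chosen, uncovered


# Toy data: three "sentences" given as phone lists.
toy_items = {"s1": ["a", "b", "c"], "s2": ["b", "c", "d"], "s3": ["a", "b"]}
targets = {("a", "b"), ("b", "c"), ("c", "d")}
chosen, uncovered = select_for_coverage(toy_items, targets, max_items=10)
print(chosen, uncovered)
```

In a realistic setting the target set would contain the most frequent diphones of the text corpus, and the loop would stop once the desired share of them is covered.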

Besides the general-purpose MRBA corpus, we also collected recordings that are specific to the target domain, namely thyroid scintigraphy medical reports. From these recordings, 20 reports read aloud by each of 4 speakers were used as test data in the experiments reported here.

For the construction of the domain-specific language models, we obtained 9231 written medical reports from the Department of Nuclear Medicine of the University of Szeged. These thyroid scintigraphy reports were written and stored using various software packages that were employed at the department between 1998 and 2004. So first of all we had to convert all the reports to a common format, followed by several steps of routine error correction. Each report consists of 7 fields: header (name, ID number etc. of the patient), clinical observations, request of the referring doctor, a summary of previous examinations (if any), the findings of this examination, a one-sentence summary, and a signature. From the corpus we omitted the first and the last, person-specific fields, for the sake of personal data privacy. Then we discarded those reports that were incomplete, such as those that had missing fields. This left 8546 reports, which, on average, contained 11 sentences and 6 words per sentence. The next step was to remove typographical errors from the database, of which there were surprisingly many (the most frequent words occurred in 10-15 mistyped forms). A special problem was that of unifying the Latin terms, many of which may be written with either a Latin or a Hungarian spelling in medical texts (for example therapia vs. terapia). The abbreviations also had to be resolved. The corpus we got after these steps contained approximately 2500 different word forms (excluding numbers and dates), so we were confronted with a medium-sized vocabulary dictation task.
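
The cleaning steps described above could be organised roughly as follows. This is only a sketch under stated assumptions: the field names, the choice of which fields are mandatory, and the direction of the spelling unification (Latin to Hungarian) are hypothetical, since the paper does not specify them.

```python
# Hypothetical field names for the 7 fields of a report.
PRIVATE_FIELDS = {"header", "signature"}
# Assumption: the summary of previous examinations is optional ("if any").
REQUIRED_FIELDS = {"clinical_observations", "referral_request", "findings", "summary"}
# Assumption: the Hungarian spelling is taken as the canonical form.
SPELLING_VARIANTS = {"therapia": "terapia"}


def clean_report(report):
    """Drop person-specific fields, reject incomplete reports, unify spellings.

    report: dict mapping field name to text. Returns a cleaned dict or None.
    """
    kept = {k: v for k, v in report.items() if k not in PRIVATE_FIELDS}
    if any(not kept.get(field, "").strip() for field in REQUIRED_FIELDS):
        return None  # incomplete report, discarded
    # Variant lookup is case-insensitive; a replaced word comes out in lower case.
    return {
        k: " ".join(SPELLING_VARIANTS.get(w.lower(), w) for w in v.split())
        for k, v in kept.items()
    }


example = {
    "header": "patient name and ID",
    "clinical_observations": "obs",
    "referral_request": "req",
    "findings": "therapia javasolt",
    "summary": "ok",
    "signature": "doctor",
}
print(clean_report(example))
```

Correcting the 10-15 mistyped forms of frequent words would need a much larger variant table or an approximate string matcher, which this sketch does not attempt.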

3 The user interface

Our GUI (Graphical User Interface) was designed with the goal of serving many users on the same computer. The other main design goal was to combine simplicity with good functionality. With our program, only a microphone and a text editor (Microsoft Word or a similar word processing program) are needed for dictating medical reports.

Every user has one or more profiles containing all the special information characterizing his or her voice for a given language and vocabulary. The language models and the acoustic core modules can be installed separately, and the system can optionally adapt to the individual characteristics of the users. The user interface basically consists of a toolbar at the top of the desktop. Through the toolbar the user can access all the main functions related to the initial parameter settings, such as choosing a specific user, choosing the actual task and selecting the output window (Fig. 1). Other functions can only be accessed from the text editor itself. Perhaps the most important of these is that the user can ask the speech recognition system for alternative variants of a recognized sentence when he/she finds a recognized word or sentence to be incorrect.

[FIGURE 1 OMITTED]

4 Acoustic modelling I: HMM phone models over MFCC features

At the level of acoustic modelling we have been experimenting with two quite different technologies. One of these is a conventional Hidden Markov Model (HMM) decoder that works with the usual mel-frequency cepstral coefficient (MFCC) features [4]. More precisely, 13 coefficients are extracted from 25 ms frames, along with their Δ and ΔΔ values, at a rate of 100 frames/sec, which gives 39 features per frame. The phone models applied have the usual 3-state left-to-right topology. Hungarian has the special property that almost all phones have a short and a long counterpart, and their difference is phonologically relevant (i.e. there are word pairs that differ only in the duration of one phone, for example 'tör' vs. 'tőr' or 'szál' vs. 'száll') [14]. However, it is known that such minimal word pairs are relatively rare [14], and inspecting the vocabulary of our specific dictation task we found no such words. Hence most of the long/short consonant labels were fused, and this way we worked with just 44 phone classes. One phone model was associated with each of these classes; that is, we applied monophone modelling, and thus far no context-dependent models have been tested in the system. The decoder built on these HMM phone models performs a combination of Viterbi and multi-stack decoding [4]. For speed efficiency it contains several built-in pruning criteria. First, it applies beam pruning, so only the hypotheses with a score no worse than the best score minus a threshold are kept. Second, the number of hypotheses extended at every time point is limited, corresponding to multi-stack decoding with a stack size constraint. The maximal evaluated phone duration can also be fixed. With the proper choice of these parameters the decoder runs faster than real-time on a typical PC on the medical dictation task.
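
The first two pruning criteria can be sketched in a few lines of Python. This is an illustration of the general idea only, not the decoder's actual code; the function name, the representation of a hypothesis as a (log score, label) pair and the threshold values are assumptions.

```python
def prune(hypotheses, beam_width, max_hyps):
    """Apply beam pruning, then a stack-size limit.

    hypotheses: list of (log_score, label) pairs, higher score is better.
    beam_width: hypotheses scoring below best - beam_width are dropped.
    max_hyps:   at most this many of the surviving hypotheses are kept.
    """
    if not hypotheses:
        return []
    best = max(score for score, _ in hypotheses)
    in_beam = [h for h in hypotheses if h[0] >= best - beam_width]
    in_beam.sort(key=lambda h: h[0], reverse=True)
    return in_beam[:max_hyps]


active = [(-10.0, "a"), (-12.0, "b"), (-30.0, "c"), (-11.0, "d")]
print(prune(active, beam_width=5.0, max_hyps=2))
```

A wider beam or a larger stack lowers the risk of discarding the correct hypothesis at the cost of more computation, which is the trade-off behind the "proper choice of these parameters" mentioned above.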

5 Acoustic modelling II: HMM/ANN phone models over 2D-cepstrum features

Our alternative, more experimental acoustic model employs the HMM/ANN hybrid technology [2]. The basic difference between this and the standard HMM scheme is that here the emission probabilities are modelled by Artificial Neural Networks (ANNs) instead of the conventional Gaussian mixture models (GMMs). In the simplest configuration one can train the neural net over the usual 39 MFCC coefficients, and the result can serve as a baseline for comparison with the conventional HMM. However, ANNs seem to be more capable of modelling the observation context than the GMM technology, so the hybrid models are usually trained over longer time windows. The easiest way of doing this is to specify a couple of neighboring frames as input to the net: in a typical arrangement 4 neighboring frames are used on both sides of the actual frame [2]. Another option is to apply some kind of transformation to the data block of several neighboring frames. Knowing that the modulation components play an important role in human speech perception, performing a frequency analysis over the feature trajectories seems reasonable. When this analysis is applied to the cepstral coefficients, the resulting feature set is usually referred to as the 2D-cepstrum [6]. Research shows that most of the linguistic information is in the modulation frequency components between 1 and 16 Hz, especially between 2 and 10 Hz. This means that not all of the components of a frequency analysis have to be retained, and so the 2D-cepstrum offers a compact representation of a longer temporal context.
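
Both ways of supplying an observation context to the net can be sketched as follows. The Python code is a minimal illustration using plain lists; the function names, the edge padding by frame repetition and the way the analysis window is cut out are assumptions, not details taken from our system.

```python
import cmath
import math


def stack_context(frames, left=4, right=4):
    """Concatenate each frame with its neighbours.

    Edge frames are padded by repeating the first or last frame (assumption).
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


def modulation_features(window, num_components=2):
    """2D-cepstrum style features for one block of N consecutive frames.

    For each coefficient trajectory, DFT components 1..num_components are
    computed over the N frames, and their real and imaginary parts are kept.
    """
    n = len(window)
    dim = len(window[0])
    feats = []
    for d in range(dim):
        for k in range(1, num_components + 1):
            c = sum(window[t][d] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
            feats.extend([c.real, c.imag])
    return feats


# 20 toy frames of 39 features each: 9 stacked frames give 351 inputs.
frames = [[float(t)] * 39 for t in range(20)]
print(len(stack_context(frames)[0]))

# A 16-frame block of 13 coefficients: 13 * 2 components * (Re, Im) = 52 values.
block = [[math.cos(2 * math.pi * t / 16)] + [0.0] * 12 for t in range(16)]
print(len(modulation_features(block)))
```

With 4 neighbours on each side the input size grows ninefold, whereas two complex modulation components add only four values per coefficient. This is the compactness argument made above.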

In the experiments we tried to find the smallest feature set that would give the best recognition results. Running the whole recognition test for each parameter setting would have required too much time, so we used the frame-level classification score as a quick indicator of the efficiency of a feature set. Hence the values given in the following tables are frame-level accuracy values measured on a held-out data set of 20% of the training data.

First of all we tried to extend the data of the 'target' frame with neighboring frames, without applying any transformation. The results shown in Table 1 indicate that training on more than 5 neighboring frames significantly increased the number of features and hidden neurons (and also significantly raised the training time) without bringing any real improvement in the score.

In the experiments with the 2D-cepstrum we first tried to find the optimal size of the temporal window. Hence we tried DFT analysis sizes of 8, 16, 32 and 64 points, always keeping the first and second components (1) (both the real and the imaginary parts), and combined these with the static MFCC coefficients. The results displayed in Table 2 indicate that the optimum is somewhere between 16 and 32 points (corresponding to 160 and 320 milliseconds at 100 frames/sec). This is smaller than the 400 ms value found optimal in [6] and the 310 ms value reported in [13], but this might depend on the amount of training data available (a larger database would cover more of the possible variations and hence would allow a larger window size). Of course, one could also experiment with combining various window sizes as Kanedera did [6], but we did not run such multi-resolution tests.

As the next step we examined whether it was worth retaining more components. In the case of the 16-point DFT we kept 3 components, while for the 32-point DFT we tried retaining 5 components (the highest center frequency being 18.75 Hz and 15.625 Hz, respectively). The results (see Table 3) show that the higher modulation frequency components are less useful, which accords with what is known about the importance of the various modulation frequencies.
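
The centre frequencies quoted above follow directly from the frame rate of 100 frames/sec: the k-th component of an N-point DFT over the feature trajectory corresponds to a modulation frequency of k · 100 / N Hz. A short Python check (the function names are ours, introduced only for this illustration):

```python
FRAME_RATE_HZ = 100.0  # frames per second, as used for feature extraction


def modulation_frequency(k, dft_size, frame_rate=FRAME_RATE_HZ):
    """Centre frequency in Hz of the k-th component of a dft_size-point DFT."""
    return k * frame_rate / dft_size


def window_length_ms(dft_size, frame_rate=FRAME_RATE_HZ):
    """Duration in milliseconds spanned by dft_size consecutive frames."""
    return 1000.0 * dft_size / frame_rate


for size, components in ((16, 3), (32, 5)):
    top = modulation_frequency(components, size)
    print(size, window_length_ms(size), top)
```

For the 16-point DFT the third component lies at 18.75 Hz, and for the 32-point DFT the fifth lies at 15.625 Hz, so both are at or just above the upper end of the 1 to 16 Hz range discussed earlier.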

Finally, we tried varying the type of transformation applied. Motlicek reported that there is no need to keep both the real and imaginary parts of the DFT coefficients: using just one of them is sufficient. Also, he obtained a similar performance when replacing the complex DFT with the DCT (Discrete Cosine Transform) [10]. Our findings agree more with those of Kanedera [6]; that is, we obtained slightly worse results with these modifications (see Table 4). Hence we opted for the complex DFT, using both the real and imaginary coefficients. One advantage of the complex DFT over the DCT might be that when only some of its coefficients are required (as in our case), it can be computed very efficiently using a recursive formula [5].
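
The recursive computation referred to here is the sliding DFT, in which bin k of an N-point DFT is updated from its previous value as S_k(n) = [S_k(n-1) - x(n-N) + x(n)] · exp(j2πk/N). The Python sketch below illustrates this textbook recursion and checks it against a direct evaluation; it is not taken from our implementation, and the function names and the test signal are assumptions.

```python
import cmath
import math
from collections import deque


def direct_dft_bin(window, k):
    """Bin k of the DFT of `window`, evaluated directly from the definition."""
    n = len(window)
    return sum(x * cmath.exp(-2j * math.pi * k * m / n)
               for m, x in enumerate(window))


def sliding_dft(signal, size, k):
    """Bin k of the size-point DFT of the latest `size` samples, per sample.

    The window starts as all zeros, so the first size-1 outputs correspond
    to a partially filled window.
    """
    twiddle = cmath.exp(2j * math.pi * k / size)
    window = deque([0.0] * size, maxlen=size)
    value = 0j
    outputs = []
    for x in signal:
        oldest = window[0]
        window.append(x)  # the oldest sample falls out automatically
        value = (value - oldest + x) * twiddle
        outputs.append(value)
    return outputs


signal = [math.sin(0.3 * t) + 0.1 * t for t in range(50)]
recursive = sliding_dft(signal, size=16, k=2)[-1]
direct = direct_dft_bin(signal[-16:], k=2)
print(abs(recursive - direct))
```

Each update costs one complex multiplication and two additions per retained bin, independently of the window size, which is why keeping only a few components is cheap.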

6 Domain-specific language modelling

A special difficulty of creating language models for Hungarian is the highly agglutinative [3] nature of the language. This means that most words are formed by joining several morphemes together, and those modifications of meaning that other languages express by, for example, pronouns or prepositions are handled in Hungarian by affixes (for example 'in my house' is 'házamban') [7]. Because of this, in a large vocabulary modelling task the application of a morphological analyzer/generator seems inevitable. First, simply listing and storing all the possible word forms would be almost impossible (e.g. an average noun can have about 700 inflected forms). Second, if we simply handled all these inflected forms as different words, then achieving a certain coverage rate in Hungarian would require a text about 5 times bigger than that in German and 20 times bigger than that in English [11]. Hence the training of conventional n-gram models would require significantly larger corpora in Hungarian than in English, or even in German. A possible solution might be to train the n-grams over morphemes instead of word forms, but then again the handling of the morphology would be necessary.

Though decent morphological tools now exist for Hungarian, in our medical dictation system we preferred to avoid the complications incurred by morphology. In fact, the restricted vocabulary is one of the reasons why we opted for the medical dictation task: as we mentioned earlier, the thyroid gland medical reports contain only about 2500 different word forms. Although this many words could easily be managed even by a simple list ('linear lexicon'), we organized the words into a lexical tree where the common prefixes of the lexical entries are shared. Apart from reducing storage, this representation also speeds up decoding, as it eliminates redundant acoustic evaluations [4]. A prefix tree representation is probably far more useful for agglutinative languages than for English because of the many inflected forms of the same stem.
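
The saving obtained from prefix sharing can be seen on a toy example. In the Python sketch below letters stand in for phones and diacritics are omitted; the nested-dictionary representation and the function names are assumptions chosen for brevity, not the data structure of our decoder.

```python
END = "#"  # marker key for a complete word (assumed not to occur as a symbol)


def build_prefix_tree(words):
    """Store the words in a tree of nested dicts, sharing common prefixes."""
    root = {}
    for word in words:
        node = root
        for symbol in word:
            node = node.setdefault(symbol, {})
        node[END] = word
    return root


def count_arcs(node):
    """Number of symbol arcs in the tree, i.e. units to evaluate acoustically."""
    return sum(1 + count_arcs(child)
               for key, child in node.items() if key != END)


# Four inflected forms of the same stem.
words = ["haz", "hazam", "hazamban", "hazban"]
tree = build_prefix_tree(words)
print(sum(len(w) for w in words), count_arcs(tree))
```

A linear lexicon has to evaluate every symbol of every word separately, while the tree evaluates each shared prefix only once, so the gain grows with the number of inflected forms per stem.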

The limited size of the vocabulary and the highly restricted (i.e. low-perplexity) nature of the sentences used in the reports allowed us to create very efficient n-grams. Moreover, we did not really have to worry about out-of-vocabulary words, since we had all the reports from the previous six years, so the risk of encountering unknown words during usage seemed minimal. The system currently applies 3-grams by default, but it is able to 'back off' to smaller n-grams (in the worst case to a small ε constant) when necessary. During the evaluation of the n-grams the system applies a language model look-ahead technique. This means that the language model returns its scores as early as possible, not just at word endings. For this purpose the lexical trees are stored in a factored form, so that when several words share a common prefix, the maximum of their probabilities is associated with that prefix [4]. These techniques allow a more efficient pruning of the search space.
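
The factored storage used for the look-ahead can be illustrated as follows. This Python sketch uses context-independent toy probabilities and letters in place of phones; the class and function names, the vocabulary and the probability values are assumptions, and in the real decoder the scores depend on the n-gram history.

```python
class Node:
    """One node of the lexical tree."""

    def __init__(self):
        self.children = {}
        self.word = None
        self.lookahead = 0.0  # maximum probability of any word below this node


def build_factored_tree(word_probs):
    """Build a prefix tree in which every node stores the look-ahead maximum."""
    root = Node()
    for word, prob in word_probs.items():
        node = root
        node.lookahead = max(node.lookahead, prob)
        for symbol in word:
            node = node.children.setdefault(symbol, Node())
            node.lookahead = max(node.lookahead, prob)
        node.word = word
    return root


def lookahead_score(root, prefix):
    """Best language model score still reachable after the given prefix."""
    node = root
    for symbol in prefix:
        node = node.children.get(symbol)
        if node is None:
            return 0.0  # no word in the lexicon starts with this prefix
    return node.lookahead


toy_probs = {"haz": 0.2, "hazam": 0.05, "hazban": 0.1, "het": 0.3}
tree = build_factored_tree(toy_probs)
for prefix in ("h", "ha", "hazb"):
    print(prefix, lookahead_score(tree, prefix))
```

Because an upper bound on the language model score is available as soon as the first symbols of a word are hypothesised, unpromising partial words can be pruned before their endings are reached.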

Besides word n-grams we also experimented with constructing class n-grams. For this purpose the words were grouped into classes according to their part-of-speech category. The words were categorized using the POS tagger software developed at our university [8]. This software associates one or more MSD (morpho-syntactic description) codes with each word, and we constructed the class n-grams over these codes. With the help of the class n-grams the language model can be made more robust in those cases when the word n-gram encounters an unknown word, so in practice it performs a kind of language model smoothing. In previous experiments we found that the application of the language model look-ahead technique and class n-grams brought about a 30% decrease in the word error rate when applied in combination with our HMM-based fast decoder [1]. Figure 2 shows an example of a prefix tree storing four words, along with their MSD codes.
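
One common way of using class n-grams for smoothing is to fall back on the factorisation P(w_i | w_{i-1}) ≈ P(c_i | c_{i-1}) · P(w_i | c_i) whenever the word n-gram has no estimate. The Python sketch below shows this for bigrams. How the word and class scores are actually combined in our system is not detailed here, so the fallback order, the single class per word, the function name and all numbers are assumptions.

```python
EPSILON = 1e-6  # illustrative floor value for completely unseen events


def smoothed_bigram(prev, word, word_bigrams, word_class,
                    class_bigrams, word_given_class):
    """P(word | prev): word bigram if seen, else class bigram, else EPSILON.

    word_class maps a word to a single class code (a simplification: the
    tagger may assign more than one MSD code to a word).
    """
    if (prev, word) in word_bigrams:
        return word_bigrams[(prev, word)]
    c_prev, c_word = word_class.get(prev), word_class.get(word)
    if c_prev is not None and c_word is not None:
        p_class = class_bigrams.get((c_prev, c_word))
        p_word = word_given_class.get((word, c_word))
        if p_class is not None and p_word is not None:
            return p_class * p_word
    return EPSILON


# Toy model with placeholder words and class codes.
word_bigrams = {("a", "b"): 0.5}
word_class = {"a": "X", "b": "Y", "c": "Y"}
class_bigrams = {("X", "Y"): 0.4}
word_given_class = {("c", "Y"): 0.25}

for w in ("b", "c", "zzz"):
    print(w, smoothed_bigram("a", w, word_bigrams, word_class,
                             class_bigrams, word_given_class))
```

The word pair ("a", "c") was never seen, but since "c" belongs to a class that often follows the class of "a", it still receives a far higher score than the floor value.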

[FIGURE 2 OMITTED]

7 Experimental results and discussion

For testing purposes we recorded 20 medical reports from each of 2 male and 2 female speakers. The language model applied in the tests was constructed from just 500 reports instead of all the 8546 we had collected. This subset contained almost all the sentence types that occur in the reports, so this restriction mostly reduced the dictionary by removing a lot of rarely occurring words (e.g. dates and disease names). Besides the HMM decoder we tested the HMM/ANN hybrid system in three configurations: the net being trained on one frame of data, on five neighboring frames, and on the best 2D-cepstrum feature set (static MFCC features plus 5 modulation components using a 32-point DFT with both Re and Im parts). The results are listed in Table 5. Comparing the first two lines, we see that when using the classic MFCC features the HMM and the HMM/ANN system performed quite similarly on the male speakers. For some reason, however, the HMM system performed worse on the female voices. The remaining rows of the table show that extending the net's input with an observation context, either by neighboring frames or by modulation features, brought only very modest improvements over the baseline results. We think the reason for this is that in the current arrangement the recognizer relies very strongly on the language model, thanks to the high predictability of the sentences. We suspect that the improvement in the acoustic modelling will be more visible in the scores when we apply the system to a linguistically less restricted domain. Pure phone recognition tests (i.e. recognition experiments with no language model support) that could verify this conjecture are currently under development.
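
For reference, word accuracy is conventionally computed as (N - S - D - I) / N, where N is the number of words in the reference and S, D and I are the numbers of substitutions, deletions and insertions in the best alignment of the recognized text with the reference. The Python sketch below implements this conventional definition through the word-level edit distance; that exactly this formula underlies the reported scores is an assumption, and the function name and example strings are ours.

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy of `hypothesis` against a non-empty `reference`.

    Equals 1 - (S + D + I) / N, computed with the Levenshtein distance
    over whitespace-separated words.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return 1.0 - prev[-1] / len(ref)


print(word_accuracy("a b c d", "a x c d"))
```

Note that with this definition insertions are penalised as well, so the score can fall below the plain percentage of correctly recognized words.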

8 Conclusions

This paper reported the current state of a Hungarian project for the automated dictation of medical reports. We described the acoustic and linguistic training data collected and the current state of development in both the acoustic and linguistic modelling areas. Recognition results were also given over a somewhat restricted subset of the full domain. For the next step we plan to extend the vocabulary and language model to cover all the available data, and then to test the system over other dictation domains as well. Our preliminary results indicate that for tasks over larger vocabularies several further improvements will be required. On the acoustic modelling side we intend to implement speaker adaptation and context-dependent models within the HMM system. We also plan to continue our research on observation context modelling within the HMM/ANN system. Finally, the language model will also need to be improved in many respects, especially when handling certain special features like dates and abbreviations.

Received: May 12, 2004

References

[1] A. Banhalmi, A. Kocsor, and D. Paczolay. 2005. Supporting a Hungarian dictation system with novel language models (in Hungarian). In: Proc. of the 3rd Hungarian Conf. on Computational Linguistics, pp. 337-347.

[2] H. Bourlard and N. Morgan. 1994. Connectionist Speech Recognition--A Hybrid Approach. Kluwer Academic.

[3] D. Crystal. 2003. A Dictionary of Linguistics and Phonetics. Blackwell Publishing.

[4] X. Huang, A. Acero, and H.-W. Hon. 2001. Spoken Language Processing. Prentice Hall.

[5] E. Jacobsen and R. Lyons. 2004. An update to the sliding DFT. IEEE Signal Processing Magazine, 21(1):110-111.

[6] N. Kanedera, H. Hermansky, and T. Arai. 1998. Desired characteristics of modulation spectrum for robust automatic speech recognition. In: Proc. ICASSP '98, pp. 613-616.

[7] A. Kornai. 1994. On Hungarian morphology. Hungarian Academy of Sciences.

[8] A. Kuba, A. Hocza, and J. Csirik. 2004. POS tagging of Hungarian with combined statistical and rule-based methods. In: Proc. TSD 2004, pp. 113-121.

[9] Medisoft. 2004. www.medisoftspeech.hu

[10] P. Motlicek. 2003. Modeling of Spectra and Temporal Trajectories in Speech Processing. Ph.D. Thesis, Brno University of Technology.

[11] G. Nemeth and Cs. Zainko. 2001. Word unit based multilingual comparative analysis of text corpora. In: Proc. Eurospeech 2001, pp. 2035-2038.

[12] Nuance. 2007. http://www.nuance.co.uk/naturallyspeaking/

[13] P. Schwarz, P. Matejka, and J. Cernocky. 2003. Recognition of phoneme strings using TRAP technique. In: Proc. Eurospeech 2003, pp. 825-828.

[14] P. Siptar and M. Torkenczy. 2000. The Phonology of Hungarian. Oxford University Press.

[15] M. Szarvas and S. Furui. 2002. Finite-state transducer based Hungarian LVCSR with explicit modeling of phonological changes. In: Proc. ICSLP 2002, pp. 1297-1300.

[16] K. Vicsi, A. Kocsor, Cs. Teleki, and L. Toth. 2004. Hungarian speech database for computer-using environments in offices (in Hungarian). In: Proc. of the 2nd Hungarian Conf. on Computational Linguistics, pp. 315-318.

[17] K. Vicsi, Sz. Velkei, Gy. Szaszak, G. Borostyan, and G. Gordos. 2006. Speech recognizer for preparing medical reports--Development experiences of a Hungarian speaker independent continuous speech recognizer. Hiradastechnika, Vol. 61, No. 7, pp. 22-27.

(1) The DC offset being indexed as the zeroth component.

Andras Banhalmi, Denes Paczolay, Laszlo Toth and Andras Kocsor (†)

Research Group on Artificial Intelligence

Hungarian Academy of Sciences and the University of Szeged

Aradi vertanuk tere 1, H-6720 Szeged

{banhalmi, pdenes, tothl, kocsor}@inf.u-szeged.hu

(†) Applied Intelligence Laboratory Ltd. and Research Group on Artificial Intelligence NPC, Petofi Sgt. 43, H-6723 Szeged, Hungary

Table 1: The effect of varying the observation context size.

Obs. size     Hidden neurons     Frame accuracy

1 frame            150               64.16%
3 frames           200               67.51%
5 frames           250               68.67%
7 frames           300               68.81%
9 frames           350               68.76%
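
The observation context in Table 1 can be pictured as stacking each frame's feature vector with those of its neighbours before the result is passed to the network. The following Python sketch illustrates the idea only. The function name `stack_context`, the use of NumPy, and the repetition of edge frames at utterance boundaries are assumptions made for this example and are not taken from the system described here.

```python
import numpy as np


def stack_context(feats, context=5):
    """Stack each frame with its neighbours (hypothetical helper).

    feats:   array of shape (num_frames, num_coeffs)
    context: odd number of frames in the observation window
    Returns an array of shape (num_frames, context * num_coeffs).
    """
    if context < 1 or context % 2 == 0:
        raise ValueError("context must be a positive odd number")
    half = context // 2
    # Assumption: repeat the first and last frame so that boundary
    # frames also get a full window.
    padded = np.pad(feats, ((half, half), (0, 0)), mode="edge")
    num_frames = feats.shape[0]
    # Column block i holds the frame at relative offset i - half.
    return np.hstack([padded[i:i + num_frames] for i in range(context)])


if __name__ == "__main__":
    demo = np.arange(6, dtype=float).reshape(3, 2)  # 3 frames, 2 coefficients
    print(stack_context(demo, context=3))
```

With a window of 1 frame the function returns the input unchanged, which corresponds to the first row of the table.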

Table 2: Frame-level results at various DFT sizes.

DFT size     Hidden neurons     Frame accuracy

8                 200               64.63%
16                200               67.60%
32                200               67.01%
64                200               64.75%

Table 3: Frame-level results with more DFT components.

DFT size      Components       Hidden neurons     Frame accuracy

16              1, 2, 3             250               68.40%
32           1, 2, 3, 4, 5          300               70.64%
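
One way to read Tables 2 to 4 is that a DFT is taken along the time trajectory of each feature coefficient, and that a few low-order components (with the DC offset counted as component zero) are kept as additional features. The Python sketch below shows such a computation under stated assumptions: a rectangular window roughly centred on each frame, repetition of edge frames at the boundaries, and NumPy's `np.fft.fft`. The function name `modulation_features` and its parameters are hypothetical, and the windowing actually used in the system may differ.

```python
import numpy as np


def modulation_features(feats, dft_size=16, components=(1, 2, 3)):
    """DFT components of feature trajectories (illustrative sketch).

    feats:      array of shape (num_frames, num_coeffs)
    dft_size:   number of frames in each analysis window
    components: DFT indices to keep; index 0 would be the DC offset
    Returns an array of shape
    (num_frames, 2 * len(components) * num_coeffs): the real parts of
    all kept components first, then the imaginary parts.
    """
    num_frames, num_coeffs = feats.shape
    half = dft_size // 2
    # Assumption: repeat edge frames so every frame has a full window.
    padded = np.pad(feats, ((half, dft_size - half - 1), (0, 0)), mode="edge")
    idx = list(components)
    out = np.empty((num_frames, 2 * len(idx) * num_coeffs))
    for t in range(num_frames):
        window = padded[t:t + dft_size]          # (dft_size, num_coeffs)
        spec = np.fft.fft(window, axis=0)[idx]   # (len(idx), num_coeffs)
        out[t] = np.concatenate([spec.real.ravel(), spec.imag.ravel()])
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mfcc_like = rng.normal(size=(100, 13))  # stand-in data, not real MFCCs
    print(modulation_features(mfcc_like, 32, (1, 2, 3, 4, 5)).shape)
```

Dropping `spec.imag` from the concatenation would give a real-only variant, and replacing the DFT with a DCT over the same window would give a DCT variant, which is how the rows of Table 4 could be compared in this setting.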

Table 4: The effect of varying the transformation type.

Transform       Hidden neurons     Frame accuracy

DFT Re + Im        300             70.64%
DFT Re only        220             65.81%
DCT                220             68.00%

Table 5: Word recognition accuracies of the various models and feature sets.

Model type   Feature set                     Male 1   Male 2   Female 1   Female 2

HMM          MFCC + Δ + ΔΔ                   97.75%   98.22%    93.40%     93.39%
HMM/ANN      MFCC + Δ + ΔΔ                   97.65%   97.37%    96.78%     96.91%
HMM/ANN      5 frames * (MFCC + Δ + ΔΔ)      97.65%   97.74%    96.67%     98.05%
HMM/ANN      MFCC + 5 mod. comp. (Re + Im)   97.88%   97.83%    96.86%     96.42%
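
To compare the rows of Table 5 at a glance, the per-speaker scores can be summarised by their unweighted mean and by the spread between the best and worst speaker. The short Python sketch below does this with the values copied from the table. The row labels and the function name `summarize` are chosen for this example, and the unweighted mean is not a figure reported with the table: because the number of test words per speaker is not given here, it need not equal the pooled word accuracy.

```python
from statistics import mean

# Word accuracies (%) copied from Table 5, in the speaker order
# Male 1, Male 2, Female 1, Female 2. Row labels are illustrative.
TABLE_5 = {
    "HMM, MFCC + deltas": [97.75, 98.22, 93.40, 93.39],
    "HMM/ANN, MFCC + deltas": [97.65, 97.37, 96.78, 96.91],
    "HMM/ANN, 5 frames": [97.65, 97.74, 96.67, 98.05],
    "HMM/ANN, MFCC + 5 mod. comp.": [97.88, 97.83, 96.86, 96.42],
}


def summarize(table):
    """Return {row: (unweighted mean, max - min)} for each table row."""
    return {
        name: (mean(scores), max(scores) - min(scores))
        for name, scores in table.items()
    }


if __name__ == "__main__":
    for name, (avg, spread) in summarize(TABLE_5).items():
        print(f"{name:30s} mean {avg:6.2f}%  spread {spread:5.2f}")
```
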

COPYRIGHT 2007 Slovenian Society Informatika

Article Details
Author:Banhalmi, Andras; Paczolay, Denes; Toth, Laszlo; Kocsor, Andras
Publication:Informatica
Article Type:Report
Date:Jun 1, 2007