ASR: looking over the engine, checking under the hood.

The key to success in voice applications today, from CRM applications to mobile devices and automotive solutions, is a well-designed, natural and truly accurate speech interface, which cannot be realized without a first-rate ASR (Automatic Speech Recognition) engine to power it. After all, your voice-enabled service is the face your company presents to your customers, so the naturalness and user-friendliness of your voice interface are the key to enhancing the customer experience or, if you get it wrong, making it a nightmare.

ASR technologies have now reached a high level of maturity, enabling commercial applications to proliferate on the market. For more discerning integrators, however, for whom the quality of their solutions and the satisfaction of their customers are central, identifying which products are capable of delivering a high level of performance is not always easy. It takes knowledge and experience to distinguish the best-of-breed products from those that are frankly not up to the job. Accurate, multilingual speech recognition on large-scale vocabularies, while indispensable, is really just the starting point.

The main requirements for a high-quality ASR are: speaker independence, enabling the recognition of continuous speech from any speaker without prior training; high accuracy, along with the capability to return a set of N-best hypotheses together with reliable confidence values, which is key to building good dialogue-flow management; and the capability to support both grammar-based applications and Statistical Language Models for more complex interactions.
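To illustrate why N-best hypotheses and confidence values matter for dialogue-flow management, here is a minimal Python sketch of a dialogue decision rule. It is not Loquendo's API: the `Hypothesis` class, the `next_dialogue_action` function, the 0-to-1 confidence scale and all threshold values are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hypothesis:
    """One entry of a hypothetical N-best list."""
    text: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def next_dialogue_action(
    n_best: List[Hypothesis],
    accept: float = 0.80,   # illustrative threshold, not a vendor default
    confirm: float = 0.50,  # illustrative threshold, not a vendor default
    margin: float = 0.15,   # required lead over the runner-up
) -> Tuple[str, Optional[str]]:
    """Pick 'accept', 'confirm' or 'reprompt' from an N-best list."""
    if not n_best:
        return ("reprompt", None)

    ranked = sorted(n_best, key=lambda h: h.confidence, reverse=True)
    best = ranked[0]

    # Too unreliable to act on: ask the caller to repeat.
    if best.confidence < confirm:
        return ("reprompt", None)

    # Accept silently only when the top result is both confident
    # and clearly ahead of the second-best alternative.
    runner_up = ranked[1].confidence if len(ranked) > 1 else 0.0
    if best.confidence >= accept and best.confidence - runner_up >= margin:
        return ("accept", best.text)

    # Otherwise ask an explicit confirmation question ("Did you say ...?").
    return ("confirm", best.text)
```

With a rule of this kind, a clear result is accepted without bothering the caller, two close alternatives trigger a confirmation question, and a weak result leads to a reprompt. This only works if the confidence values returned by the engine are reliable.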

These are the essential requirements for any speech recognition technology, but if you take a look 'under the hood' at how an ASR engine has been engineered, you will soon discover that it is common practice to use either HMMs (Hidden Markov Models) or NNs (Neural Networks) for the core algorithms. Loquendo ASR actually combines both of these approaches, resulting in high-performance speech recognition and increased efficiency with large vocabularies (from several hundred up to hundreds of thousands of words).


The efficiency of the ASR is also fundamental to reducing hardware infrastructure costs: an ASR with low computational requirements allows a larger number of recognition channels to run simultaneously on the same hardware. Loquendo ASR has been carefully optimized and is, in fact, so efficient that its core engine can also be used on embedded platforms such as smartphones and navigation devices.
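The link between engine efficiency and channel density can be made concrete with a rough capacity estimate. The Python sketch below assumes a simple model in which each channel costs a fixed "real-time factor" (CPU seconds consumed per second of audio). The function name, the model itself and the figures in the usage note are illustrative assumptions, not measurements of any product.

```python
import math


def max_channels(cores: int, real_time_factor: float, headroom: float = 0.8) -> int:
    """Estimate how many recognition channels a server can run at once.

    cores            -- number of CPU cores available to the recognizer
    real_time_factor -- assumed CPU seconds needed per second of audio, per channel
    headroom         -- fraction of total CPU the operator is willing to use
    """
    if cores <= 0 or real_time_factor <= 0 or not 0 < headroom <= 1:
        raise ValueError("cores and real_time_factor must be positive; headroom in (0, 1]")
    return math.floor(cores * headroom / real_time_factor)
```

Under this model, an 8-core server running at 80% utilization supports 64 channels if each channel needs 0.10 CPU seconds per second of audio, and 128 channels if the engine needs only 0.05, so halving the per-channel cost halves the hardware needed for the same traffic.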

Extended standards support should also be considered when evaluating an ASR: compliance with MRCP (for client-server architectures); complete support for grammar standards such as W3C SRGS and SISR, which enables optimization for VoiceXML applications; and support for AURORA DSR (for distributed speech recognition). Together, these ensure that customer investments are future-proof.

A highly accurate phonetic transcriber is also fundamental since it enables better recognition results. Loquendo ASR is based on the same phonetic transcriber as Loquendo TTS, whose accuracy is tested both automatically and by means of very thorough human listening.

Loquendo Speaker Verification is an extension to the ASR module, and it enables more accurate verification by combining both speaker and knowledge verification (i.e. by matching 'who said it' with 'what was said').

ASR tuning (e.g. to the environment, to the speaker) and the ability to learn from the field are key factors for success (or failure), determined by the availability of the right tools, such as the Acoustic Model Adaptation Tool or the Phonetic Learning Tool, rather than having to rely on costly professional services. An embedded denoising module significantly improves performance in noisy environments by cleaning up the signal while computing spectral parameters.

In addition to providing all the functions described above, Loquendo ASR can also perform more specialized tasks, e.g. the 'word spotting' function, which recognizes keywords within audio streams, and the 'Garbage rules' definition, which matches arbitrary short spoken sequences not modeled by the grammar (expressions like "Um, Er...", "Well", "Let me think", etc.). This latter approach in particular adds more flexibility to the use of traditional grammars, giving the user a more natural interaction experience.
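As an illustration of the garbage-rule idea, the W3C SRGS standard mentioned earlier defines a special `GARBAGE` rule that matches any speech up to the next expected item. The Python sketch below embeds a small SRGS XML grammar that tolerates filler before and after a city name, then inspects it with the standard library. The grammar content, the city names and the `summarize` helper are made up for illustration, and whether Loquendo's 'Garbage rules' are written exactly this way is an assumption.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# Hypothetical grammar: optional filler speech, a city name, optional filler.
GRAMMAR = """<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="city">
  <rule id="city" scope="public">
    <ruleref special="GARBAGE"/>
    <one-of>
      <item>Boston</item>
      <item>Austin</item>
    </one-of>
    <ruleref special="GARBAGE"/>
  </rule>
</grammar>"""


def summarize(grammar_xml: str):
    """Return the keywords in the grammar and the number of GARBAGE slots."""
    root = ET.fromstring(grammar_xml)
    ns = {"g": SRGS_NS}
    keywords = [item.text.strip() for item in root.findall(".//g:item", ns)]
    garbage_slots = [
        ref for ref in root.findall(".//g:ruleref", ns)
        if ref.get("special") == "GARBAGE"
    ]
    return keywords, len(garbage_slots)


if __name__ == "__main__":
    words, slots = summarize(GRAMMAR)
    print("keywords:", words, "garbage slots:", slots)
```

A grammar shaped like this is intended to let a caller say something like "Um, let me think, Boston please" and still have the keyword recognized, instead of having the whole utterance rejected as out of grammar.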

We hope you have found this information useful, and that you'll now have the courage to open up the hood and take a look inside a speech recognition engine. Of course, do not hesitate to contact us to try out Loquendo ASR for yourself. You will find all the features mentioned above in a technology designed both to give voice to your customers and to help you understand them better than ever.
COPYRIGHT 2008 Information Today, Inc.
No portion of this article can be reproduced without the express written permission from the copyright holder.
Copyright 2008 Gale, Cengage Learning. All rights reserved.

Article Details
Title Annotation: Automatic Speech Recognition
Author: Parr, Simon
Publication: Speech Technology Magazine
Geographic Code: 1USA
Date: Oct 1, 2008